This paper considers decentralized stochastic optimization over a network of $n$ nodes, where each node possesses a smooth non-convex local cost function and the goal of the networked nodes is to find an $\epsilon$-accurate first-order stationary point of the sum of the local costs. We focus on an online setting, where each node accesses its local cost only by means of a stochastic first-order oracle that returns a noisy version of the exact gradient. In this context, we propose a novel single-loop decentralized hybrid variance-reduced stochastic gradient method, called GT-HSGD, that outperforms the existing approaches in terms of both the oracle complexity and practical implementation. The GT-HSGD algorithm implements specialized local hybrid stochastic gradient estimators that are fused over the network to track the global gradient. Remarkably, GT-HSGD achieves a network topology-independent oracle complexity of $O(n^{-1}\epsilon^{-3})$ when the required error tolerance $\epsilon$ is small enough, leading to a linear speedup with respect to the centralized optimal online variance-reduced approaches that operate on a single node. Numerical experiments are provided to illustrate our main technical results.
A promising technique for the spectral design of acoustic metamaterials is based on the formulation of suitable constrained nonlinear optimization problems. Unfortunately, the straightforward application of classical gradient-based iterative optimization algorithms to the numerical solution of such problems is typically highly demanding, due to the complexity of the underlying physical models. Nevertheless, supervised machine learning techniques can reduce such a computational effort, e.g., by replacing the original objective functions of such optimization problems with more-easily computable approximations. In this framework, the present article describes the application of a related unsupervised machine learning technique, namely, principal component analysis, to approximate the gradient of the objective function of a band gap optimization problem for an acoustic metamaterial, with the aim of making the successive application of a gradient-based iterative optimization algorithm faster. Numerical results show the effectiveness of the proposed method.
Decentralized optimization and communication compression have exhibited their great potential in accelerating distributed machine learning by mitigating the communication bottleneck in practice. While existing decentralized algorithms with communication compression mostly focus on the problems with only smooth components, we study the decentralized stochastic composite optimization problem with a potentially non-smooth component. A \underline{Prox}imal gradient \underline{L}in\underline{EA}r convergent \underline{D}ecentralized algorithm with compression, Prox-LEAD, is proposed with rigorous theoretical analyses in the general stochastic setting and the finite-sum setting. Our theorems indicate that Prox-LEAD works with arbitrary compression precision, and it tremendously reduces the communication cost almost for free. The superiorities of the proposed algorithms are demonstrated through the comparison with state-of-the-art algorithms in terms of convergence complexities and numerical experiments. Our algorithmic framework also generally enlightens the compressed communication on other primal-dual algorithms by reducing the impact of inexact iterations, which might be of independent interest.
We deal with the problem of parameter estimation in stochastic differential equations (SDEs) in a partially observed framework. We aim to design a method working for both elliptic and hypoelliptic SDEs, the latters being characterized by degenerate diffusion coefficients. This feature often causes the failure of contrast estimators based on Euler Maruyama discretization scheme and dramatically impairs classic stochastic filtering methods used to reconstruct the unobserved states. All of theses issues make the estimation problem in hypoelliptic SDEs difficult to solve. To overcome this, we construct a well-defined cost function no matter the elliptic nature of the SDEs. We also bypass the filtering step by considering a control theory perspective. The unobserved states are estimated by solving deterministic optimal control problems using numerical methods which do not need strong assumptions on the diffusion coefficient conditioning. Numerical simulations made on different partially observed hypoelliptic SDEs reveal our method produces accurate estimate while dramatically reducing the computational price comparing to other methods.
We introduce a methodology for online estimation of smoothing expectations for a class of additive functionals, in the context of a rich family of diffusion processes (that may include jumps) -- observed at discrete-time instances. We overcome the unavailability of the transition density of the underlying SDE by working on the augmented pathspace. The new method can be applied, for instance, to carry out online parameter inference for the designated class of models. Algorithms defined on the infinite-dimensional pathspace have been developed in the last years mainly in the context of MCMC techniques. There, the main benefit is the achievement of mesh-free mixing times for the practical time-discretised algorithm used on a PC. Our own methodology sets up the framework for infinite-dimensional online filtering -- an important positive practical consequence is the construct of estimates with the variance that does not increase with decreasing mesh-size. Besides regularity conditions, our method is, in principle, applicable under the weak assumption -- relatively to restrictive conditions often required in the MCMC or filtering literature of methods defined on pathspace -- that the SDE covariance matrix is invertible.
The local minimax method (LMM) proposed in [Y. Li and J. Zhou, SIAM J. Sci. Comput. 23(3), 840--865 (2001)] and [Y. Li and J. Zhou, SIAM J. Sci. Comput. 24(3), 865--885 (2002)] is an efficient method to solve nonlinear elliptic partial differential equations (PDEs) with certain variational structures for multiple solutions. The steepest descent direction and the Armijo-type step-size search rules are adopted in the above work and playing a significant role in the performance and convergence analysis of traditional LMMs. In this paper, a new algorithm framework of the LMMs is established based on general descent directions and two normalized (strong) Wolfe-Powell-type step-size search rules. The corresponding algorithm named as the normalized Wolfe-Powell-type LMM (NWP-LMM) are introduced with its feasibility and global convergence rigorously justified for general descent directions. As a special case, the global convergence of the NWP-LMM algorithm combined with the preconditioned steepest descent (PSD) directions is also verified. Consequently, it extends the framework of traditional LMMs. In addition, conjugate gradient-type (CG-type) descent directions are utilized to speed up the LMM algorithms. Finally, extensive numerical results for several semilinear elliptic PDEs are reported to profile their multiple unstable solutions and compared for different algorithms in the LMM's family to indicate the effectiveness and robustness of our algorithms. In practice, the NWP-LMM combined with the CG-type direction indeed performs much better among its LMM companions.
Sampling methods (e.g., node-wise, layer-wise, or subgraph) has become an indispensable strategy to speed up training large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage that necessities mitigating both types of variance to obtain faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails a better generalization compared to the existing methods.
We propose accelerated randomized coordinate descent algorithms for stochastic optimization and online learning. Our algorithms have significantly less per-iteration complexity than the known accelerated gradient algorithms. The proposed algorithms for online learning have better regret performance than the known randomized online coordinate descent algorithms. Furthermore, the proposed algorithms for stochastic optimization exhibit as good convergence rates as the best known randomized coordinate descent algorithms. We also show simulation results to demonstrate performance of the proposed algorithms.
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
This work considers the problem of provably optimal reinforcement learning for episodic finite horizon MDPs, i.e. how an agent learns to maximize his/her long term reward in an uncertain environment. The main contribution is in providing a novel algorithm --- Variance-reduced Upper Confidence Q-learning (vUCQ) --- which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where the $T$ is the number of time steps the agent acts in the MDP, $S$ is the number of states, $A$ is the number of actions, and $H$ is the (episodic) horizon time. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve $\epsilon$-average regret for any constant $\epsilon$ is $O(SA)$, which is a number of samples that is far less than that required to learn any non-trivial estimate of the transition model (the transition model is specified by $O(S^2A)$ parameters). The importance of sub-linear algorithms is largely the motivation for algorithms such as $Q$-learning and other "model free" approaches. vUCQ algorithm also enjoys minimax optimal regret in the long run, matching the $\Omega(\sqrt{HSAT})$ lower bound. Variance-reduced Upper Confidence Q-learning (vUCQ) is a successive refinement method in which the algorithm reduces the variance in $Q$-value estimates and couples this estimation scheme with an upper confidence based algorithm. Technically, the coupling of both of these techniques is what leads to the algorithm enjoying both the sub-linear regret property and the asymptotically optimal regret.
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.