We introduce a new algorithm, which we term G-PFSO, for expected log-likelihood maximization in situations where the objective function is multi-modal and/or has saddle points. The key idea underpinning G-PFSO is to define a sequence of probability distributions which (a) is shown to concentrate on the target parameter value and (b) can be efficiently estimated by means of a standard particle filter algorithm. These distributions depend on a learning rate: the faster the learning rate, the faster they concentrate on the desired parameter value, but the weaker the ability of G-PFSO to escape from a local optimum of the objective function. To reconcile the ability to escape from a local optimum with a fast convergence rate, the proposed estimator exploits the acceleration property of averaging, well known in the stochastic gradient literature. Based on challenging estimation problems, our numerical experiments suggest that the estimator introduced in this paper converges at the optimal rate, and illustrate the practical usefulness of G-PFSO for parameter inference in large datasets. Although the focus of this work is expected log-likelihood maximization, the proposed approach and its theory apply more generally to optimizing a function defined through an expectation.
An operator-splitting finite element scheme for the time-dependent, high-dimensional radiative transfer equation is presented in this paper. The streamline upwind Petrov-Galerkin finite element method and discontinuous Galerkin finite element method are used for the spatial-angular discretization of the radiative transfer equation, whereas the implicit backward Euler scheme is used for temporal discretization. Error analysis of the proposed numerical scheme for the fully discrete radiative transfer equation is presented. The stability and convergence estimates for the fully discrete problem are derived. Moreover, an operator-splitting algorithm for numerical simulation of high-dimensional equations is also presented. The validation of the derived estimates and implementation is demonstrated with appropriate numerical experiments.
Learning in stochastic games is arguably the most standard and fundamental setting in multi-agent reinforcement learning (MARL). In this paper, we consider decentralized MARL in stochastic games in the non-asymptotic regime. In particular, we establish the finite-sample complexity of fully decentralized Q-learning algorithms in a significant class of general-sum stochastic games (SGs) - weakly acyclic SGs, which includes the common cooperative MARL setting with an identical reward to all agents (a Markov team problem) as a special case. We focus on the practical yet challenging setting of fully decentralized MARL, where each agent can observe neither the rewards nor the actions of the other agents. In fact, each agent is completely oblivious to the presence of other decision makers. Both the tabular and the linear function approximation cases are considered. In the tabular setting, we analyze the sample complexity for the decentralized Q-learning algorithm to converge to a Markov perfect equilibrium (Nash equilibrium). With linear function approximation, the results are for convergence to a linear approximated equilibrium - a new notion of equilibrium that we propose - in which each agent's policy is a best reply (to other agents) within a linear space. Numerical experiments are also provided for both settings to demonstrate the results.
Federated optimization (FedOpt), which aims to collaboratively train a learning model across a large number of distributed clients, is vital for federated learning. The primary concerns in FedOpt are model divergence and communication efficiency, both of which significantly affect performance. In this paper, we propose a new method, LoSAC, to learn from heterogeneous distributed data more efficiently. Its key algorithmic insight is to locally update the estimate of the global full gradient after each regular local model update. Thus, LoSAC can keep clients' information refreshed in a more compact way. We establish a convergence result for LoSAC. An additional benefit of LoSAC is its ability to defend against information leakage from the recent Deep Leakage from Gradients (DLG) technique. Finally, experiments verify the superiority of LoSAC compared with state-of-the-art FedOpt algorithms. Specifically, LoSAC significantly improves communication efficiency by more than $100\%$ on average, mitigates the model divergence problem, and provides a defense against DLG.
Zeroth-order optimization methods are developed to overcome the practical hurdle of requiring explicit derivatives. Instead, these schemes work with access only to noisy function evaluations. The predominant approach is to mimic first-order methods by means of some gradient estimator. The theoretical limitations are well understood, yet, as most of these methods rely on finite differencing with shrinking differences, numerical cancellation can be catastrophic. The numerical community developed an efficient method to overcome this by passing to the complex domain. This approach has recently been adopted by the optimization community, and in this work we analyze the practically relevant setting of dealing with computational noise. To exemplify the possibilities, we focus on the strongly convex optimization setting and provide a variety of non-asymptotic results, corroborated by numerical experiments, and end with local non-convex optimization.
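The idea of passing to the complex domain can be illustrated with the classical complex-step derivative, which avoids subtracting two nearly equal function values. The following is a minimal sketch under stated assumptions, not the estimator analyzed in the paper: the test function `f`, the step sizes, and the helper names are all chosen here for illustration, and `f` is assumed to be real-analytic and free of noise.

```python
import cmath
import math


def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) by Im(f(x + i*h)) / h (no subtraction of close values)."""
    return f(complex(x, h)).imag / h


def forward_difference(f, x, h):
    """Classical forward difference; suffers from cancellation when h is tiny."""
    return (f(x + h) - f(x)) / h


def f(z):
    # Hypothetical test function; cmath accepts real and complex arguments.
    return cmath.exp(z) * cmath.sin(z)


x0 = 1.0
exact = math.exp(x0) * (math.sin(x0) + math.cos(x0))

# Error of the complex-step estimate with a very small step.
cs_error = abs(complex_step_derivative(f, x0) - exact)

# Error of a forward difference with a step near machine precision.
fd_error = abs(forward_difference(lambda t: f(t).real, x0, 1e-15) - exact)
```

Comparing `cs_error` with `fd_error` shows why shrinking the step is harmless for the complex-step estimate but not for the finite difference.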
In many numerical simulations, stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs), but to this day it remains an open research problem to provide a mathematical convergence analysis which rigorously explains the success of SGD type optimization methods in the training of DNNs. In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation. We first establish general regularity properties for the risk functions and their generalized gradient functions appearing in the training of such DNNs and, thereafter, we investigate the plain vanilla SGD optimization method in the training of such DNNs under the assumption that the target function under consideration is a constant function. Specifically, we prove, under the assumption that the learning rates (the step sizes of the SGD optimization method) are sufficiently small but not $L^1$-summable and under the assumption that the target function is a constant function, that the expectation of the risk of the considered SGD process converges to zero in the training of such DNNs as the number of SGD steps increases to infinity.
Devising optimal interventions for constraining stochastic systems is a challenging endeavour that has to confront the interplay between randomness and nonlinearity. Existing methods for identifying the necessary dynamical adjustments resort either to space discretising solutions of ensuing partial differential equations, or to iterative stochastic path sampling schemes. Yet, both approaches become computationally demanding for increasing system dimension. Here, we propose a generally applicable and practically feasible non-iterative methodology for obtaining optimal dynamical interventions for diffusive nonlinear systems. We estimate the necessary controls from an interacting particle approximation to the logarithmic gradient of two forward probability flows evolved following deterministic particle dynamics. Applied to several biologically inspired models, we show that our method provides the necessary optimal controls in settings with terminal-, transient-, or generalised collective-state constraints and arbitrary system dynamics.
Let $P$ be a set of points in $\mathbb{R}^d$, where each point $p\in P$ has an associated transmission range $\rho(p)$. The range assignment $\rho$ induces a directed communication graph $\mathcal{G}_{\rho}(P)$ on $P$, which contains an edge $(p,q)$ iff $|pq| \leq \rho(p)$. In the broadcast range-assignment problem, the goal is to assign the ranges such that $\mathcal{G}_{\rho}(P)$ contains an arborescence rooted at a designated node and whose cost $\sum_{p \in P} \rho(p)^2$ is minimized. We study trade-offs between the stability of the solution -- the number of ranges that are modified when a point is inserted into or deleted from $P$ -- and its approximation ratio. We introduce $k$-stable algorithms, which are algorithms that modify the range of at most $k$ points when they update the solution. We also introduce the concept of a stable approximation scheme (SAS). A SAS is an update algorithm that, for any given fixed parameter $\varepsilon>0$, is $k(\varepsilon)$-stable and maintains a solution with approximation ratio $1+\varepsilon$, where the stability parameter $k(\varepsilon)$ only depends on $\varepsilon$ and not on the size of $P$. We study such trade-offs in three settings. - In $\mathbb{R}^1$, we present a SAS with $k(\varepsilon)=O(1/\varepsilon)$, which we show is tight in the worst case. We also present a $1$-stable $(6+2\sqrt{5})$-approximation algorithm, a $2$-stable $2$-approximation algorithm, and a $3$-stable $1.97$-approximation algorithm. - In $\mathbb{S}^1$ (where the underlying space is a circle) we prove that no SAS exists, even though an optimal solution can always be obtained by cutting the circle at an appropriate point and solving the resulting problem in $\mathbb{R}^1$. - In $\mathbb{R}^2$, we also prove that no SAS exists, and we present an $O(1)$-stable $O(1)$-approximation algorithm.
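The definitions above (communication graph, broadcast feasibility, cost, and number of modified ranges) can be made concrete with a small sketch for points in $\mathbb{R}^1$. It only evaluates given range assignments and does not implement any of the paper's update algorithms. The function names and example assignments are hypothetical, and counting modified ranges only over points present both before and after the update is an assumption made here.

```python
from collections import deque


def communication_graph(ranges):
    """Directed edges (p, q) with |p - q| <= rho(p); `ranges` maps point -> rho."""
    return {
        p: [q for q in ranges if q != p and abs(p - q) <= rho]
        for p, rho in ranges.items()
    }


def is_broadcast_feasible(ranges, root):
    """True iff every point is reachable from root (an arborescence exists)."""
    graph = communication_graph(ranges)
    reached = {root}
    queue = deque([root])
    while queue:
        p = queue.popleft()
        for q in graph[p]:
            if q not in reached:
                reached.add(q)
                queue.append(q)
    return len(reached) == len(ranges)


def cost(ranges):
    """Cost of an assignment: the sum of squared ranges."""
    return sum(rho ** 2 for rho in ranges.values())


def modified_ranges(old, new):
    """Number of points present in both assignments whose range changed."""
    return sum(1 for p in old if p in new and old[p] != new[p])


# Points 0, 1, 3 on the line, with root 0.
chain = {0: 1, 1: 2, 3: 0}       # 0 reaches 1, then 1 reaches 3
single_hop = {0: 3, 1: 0, 3: 0}  # the root reaches everything directly

# After inserting point 2, a cheaper chain lowers the range of point 1.
chain_after_insert = {0: 1, 1: 1, 2: 1, 3: 0}
```

In this toy instance both assignments are feasible, the chain is cheaper than the single hop, and the update after inserting point 2 modifies the range of one existing point.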
Stochastic gradient descent with momentum (SGDM) is the dominant algorithm in many optimization scenarios, including convex optimization instances and non-convex neural network training. Yet, in the stochastic setting, momentum interferes with gradient noise, often requiring specific step size and momentum choices in order to guarantee convergence, let alone acceleration. Proximal point methods, on the other hand, have gained much attention due to their numerical stability and robustness to imperfect tuning. Their stochastic accelerated variants, though, have received limited attention: how momentum interacts with the stability of (stochastic) proximal point methods remains largely unstudied. To address this, we focus on the convergence and stability of the stochastic proximal point algorithm with momentum (SPPAM), and show that, under proper hyperparameter tuning, SPPAM converges linearly to a neighborhood faster than the stochastic proximal point algorithm (SPPA), with a better contraction factor. In terms of stability, we show that SPPAM depends on problem constants more favorably than SGDM, allowing a wider range of step sizes and momentum values that lead to convergence.
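The stability contrast between explicit and proximal updates can be seen on a toy problem. The sketch below is a noise-free, one-dimensional illustration under stated assumptions, not the algorithm or analysis of the paper. It assumes the proximal update with momentum takes the implicit form $x_{t+1} = x_t - \eta \nabla f(x_{t+1}) + \beta (x_t - x_{t-1})$, which for $f(x) = \tfrac{1}{2} a x^2$ can be solved in closed form. The function names and the values of `a`, `eta` and `beta` are hypothetical.

```python
def sgdm_step(x, x_prev, grad, eta, beta):
    """Heavy-ball step: explicit gradient evaluated at the current iterate."""
    return x - eta * grad(x) + beta * (x - x_prev)


def sppam_step_quadratic(x, x_prev, a, eta, beta):
    """Implicit (proximal) step with momentum for f(x) = 0.5 * a * x**2.

    Solves x_new = x - eta * a * x_new + beta * (x - x_prev) for x_new.
    The update form is an assumption made for this illustration.
    """
    return (x + beta * (x - x_prev)) / (1.0 + eta * a)


def run(step, n_steps=50, x0=1.0):
    """Iterate a two-term recursion starting from x_prev = x = x0."""
    x_prev, x = x0, x0
    for _ in range(n_steps):
        x_prev, x = x, step(x, x_prev)
    return x


# The step size is deliberately far too large for the explicit method.
a, eta, beta = 1.0, 10.0, 0.5

x_sgdm = run(lambda x, xp: sgdm_step(x, xp, lambda t: a * t, eta, beta))
x_sppam = run(lambda x, xp: sppam_step_quadratic(x, xp, a, eta, beta))
```

With these values the explicit iterates blow up while the implicit iterates contract towards the minimizer at zero, which is the kind of wider admissible hyperparameter range the abstract refers to.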
Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy to speed up training large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamics of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and yields better generalization compared to the existing methods.
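Why gradient information helps when choosing sampling probabilities can be seen from a generic importance-sampling identity. The sketch below is not the paper's decoupled strategy and involves no graph or embedding approximation. It assumes a single index $i$ is drawn with probability $p_i$ and the unbiased estimator $g_i / (n p_i)$ of the mean gradient is used. The gradients and function names are hypothetical.

```python
import math


def estimator_variance(grads, probs):
    """Total variance of g_i / (n * p_i) when index i is drawn with prob p_i.

    Equals (1 / n**2) * sum_i ||g_i||**2 / p_i - ||mean gradient||**2.
    """
    n = len(grads)
    dim = len(grads[0])
    mean = [sum(g[k] for g in grads) / n for k in range(dim)]
    second_moment = sum(
        sum(c * c for c in g) / p for g, p in zip(grads, probs)
    ) / n ** 2
    return second_moment - sum(m * m for m in mean)


# Hypothetical per-example gradients with very different magnitudes.
grads = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.5]]
n = len(grads)

# Uniform sampling ignores the gradients.
uniform = [1.0 / n] * n

# Sampling proportionally to the gradient norms.
norms = [math.sqrt(sum(c * c for c in g)) for g in grads]
proportional = [v / sum(norms) for v in norms]

var_uniform = estimator_variance(grads, uniform)
var_proportional = estimator_variance(grads, proportional)
```

By the Cauchy-Schwarz inequality, probabilities proportional to the gradient norms never give a larger variance than uniform sampling for this estimator, and `var_proportional` is indeed below `var_uniform` here.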
Implicit probabilistic models are models defined naturally in terms of a sampling procedure and often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.