We consider minimizing the sum of three convex functions, where the first one F is smooth, the second one is nonsmooth and proximable and the third one is the composition of a nonsmooth proximable function with a linear operator L. This template problem has many applications, for instance, in image processing and machine learning. First, we propose a new primal-dual algorithm, which we call PDDY, for this problem. It is constructed by applying Davis-Yin splitting to a monotone inclusion in a primal-dual product space, where the operators are monotone under a specific metric depending on L. We show that three existing algorithms (the two forms of the Condat-Vu algorithm and the PD3O algorithm) have the same structure, so that PDDY is the fourth missing link in this self-consistent class of primal-dual algorithms. This representation eases the convergence analysis: it allows us to derive sublinear convergence rates in general, and linear convergence results in presence of strong convexity. Moreover, within our broad and flexible analysis framework, we propose new stochastic generalizations of the algorithms, in which a variance-reduced random estimate of the gradient of F is used, instead of the true gradient. Furthermore, we obtain, as a special case of PDDY, a linearly converging algorithm for the minimization of a strongly convex function F under a linear constraint; we discuss its important application to decentralized optimization.
Linear programming (LP) is an extremely useful tool which has been successfully applied to solve various problems in a wide range of areas, including operations research, engineering, economics, or even more abstract mathematical areas such as combinatorics. It is also used in many machine learning applications, such as $\ell_1$-regularized SVMs, basis pursuit, nonnegative matrix factorization, etc. Interior Point Methods (IPMs) are one of the most popular methods to solve LPs both in theory and in practice. Their underlying complexity is dominated by the cost of solving a system of linear equations at each iteration. In this paper, we consider both feasible and infeasible IPMs for the special case where the number of variables is much larger than the number of constraints. Using tools from Randomized Linear Algebra, we present a preconditioning technique that, when combined with the iterative solvers such as Conjugate Gradient or Chebyshev Iteration, provably guarantees that IPM algorithms (suitably modified to account for the error incurred by the approximate solver), converge to a feasible, approximately optimal solution, without increasing their iteration complexity. Our empirical evaluations verify our theoretical results on both real-world and synthetic data.
We consider unconstrained optimization problems with nonsmooth and convex objective function in the form of mathematical expectation. The proposed method approximates the objective function with a sample average function by using different sample size in each iteration. The sample size is chosen in an adaptive manner based on the Inexact Restoration. The method uses line search and assumes descent directions with respect to the current approximate function. We prove the almost sure convergence under the standard assumptions. The convergence rate is also considered and the worst-case complexity of $\mathcal{O} (\varepsilon^{-2})$ is proved. Numerical results for two types of problems, machine learning hinge loss and stochastic linear complementarity problems, show the efficiency of the proposed scheme.
We give algorithms for approximating the partition function of the ferromagnetic Potts model on $d$-regular expanding graphs. We require much weaker expansion than in previous works; for example, the expansion exhibited by the hypercube suffices. The main improvements come from a significantly sharper analysis of standard polymer models, using extremal graph theory and applications of Karger's algorithm to counting cuts that may be of independent interest. It is #BIS-hard to approximate the partition function at low temperatures on bounded-degree graphs, so our algorithm can be seen as evidence that hard instances of #BIS are rare. We believe that these methods can shed more light on other important problems such as sub-exponential algorithms for approximate counting problems.
Recently there is a large amount of work devoted to the study of Markov chain stochastic gradient methods (MC-SGMs) which mainly focus on their convergence analysis for solving minimization problems. In this paper, we provide a comprehensive generalization analysis of MC-SGMs for both minimization and minimax problems through the lens of algorithmic stability in the framework of statistical learning theory. For empirical risk minimization (ERM) problems, we establish the optimal excess population risk bounds for both smooth and non-smooth cases by introducing on-average argument stability. For minimax problems, we develop a quantitative connection between on-average argument stability and generalization error which extends the existing results for uniform stability \cite{lei2021stability}. We further develop the first nearly optimal convergence rates for convex-concave problems both in expectation and with high probability, which, combined with our stability results, show that the optimal generalization bounds can be attained for both smooth and non-smooth cases. To the best of our knowledge, this is the first generalization analysis of SGMs when the gradients are sampled from a Markov process.
This paper introduces a novel approach for the construction of bulk--surface splitting schemes for semi-linear parabolic partial differential equations with dynamic boundary conditions. The proposed construction is based on a reformulation of the system as a partial differential--algebraic equation and the inclusion of certain delay terms for the decoupling. To obtain a fully discrete scheme, the splitting approach is combined with finite elements in space and a BDF discretization in time. Within this paper, we focus on the second-order case, resulting in a $3$-step scheme. We prove second-order convergence under the assumption of a weak CFL-type condition and confirm the theoretical findings by numerical experiments. Moreover, we illustrate the potential for higher-order splitting schemes numerically.
Propose-Test-Release (PTR) is a differential privacy framework that works with local sensitivity of functions, instead of their global sensitivity. This framework is typically used for releasing robust statistics such as median or trimmed mean in a differentially private manner. While PTR is a common framework introduced over a decade ago, using it in applications such as robust SGD where we need many adaptive robust queries is challenging. This is mainly due to the lack of Renyi Differential Privacy (RDP) analysis, an essential ingredient underlying the moments accountant approach for differentially private deep learning. In this work, we generalize the standard PTR and derive the first RDP bound for it when the target function has bounded global sensitivity. We show that our RDP bound for PTR yields tighter DP guarantees than the directly analyzed $(\eps, \delta)$-DP. We also derive the algorithm-specific privacy amplification bound of PTR under subsampling. We show that our bound is much tighter than the general upper bound and close to the lower bound. Our RDP bounds enable tighter privacy loss calculation for the composition of many adaptive runs of PTR. As an application of our analysis, we show that PTR and our theoretical results can be used to design differentially private variants for byzantine robust training algorithms that use robust statistics for gradients aggregation. We conduct experiments on the settings of label, feature, and gradient corruption across different datasets and architectures. We show that PTR-based private and robust training algorithm significantly improves the utility compared with the baseline.
Decentralized learning offers privacy and communication efficiency when data are naturally distributed among agents communicating over an underlying graph. Motivated by overparameterized learning settings, in which models are trained to zero training loss, we study algorithmic and generalization properties of decentralized learning with gradient descent on separable data. Specifically, for decentralized gradient descent (DGD) and a variety of loss functions that asymptote to zero at infinity (including exponential and logistic losses), we derive novel finite-time generalization bounds. This complements a long line of recent work that studies the generalization performance and the implicit bias of gradient descent over separable data, but has thus far been limited to centralized learning scenarios. Notably, our generalization bounds match in order their centralized counterparts. Critical behind this, and of independent interest, is establishing novel bounds on the training loss and the rate-of-consensus of DGD for a class of self-bounded losses. Finally, on the algorithmic front, we design improved gradient-based routines for decentralized learning with separable data and empirically demonstrate orders-of-magnitude of speed-up in terms of both training and generalization performance.
We study differentially private (DP) stochastic optimization (SO) with data containing outliers and loss functions that are not Lipschitz continuous. To date, the vast majority of work on DP SO assumes that the loss is Lipschitz (i.e. stochastic gradients are uniformly bounded), and their error bounds scale with the Lipschitz parameter of the loss. While this assumption is convenient, it is often unrealistic: in many practical problems where privacy is required, data may contain outliers or be unbounded, causing some stochastic gradients to have large norm. In such cases, the Lipschitz parameter may be prohibitively large, leading to vacuous excess risk bounds. Thus, building on a recent line of work [WXDX20, KLZ22], we make the weaker assumption that stochastic gradients have bounded $k$-th moments for some $k \geq 2$. Compared with works on DP Lipschitz SO, our excess risk scales with the $k$-th moment bound instead of the Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). Moreover, in contrast to the prior works [WXDX20, KLZ22], our bounds do not require the loss function to be differentiable/smooth. We also devise an accelerated algorithm that runs in linear time and yields improved (compared to prior works) and nearly optimal excess risk for smooth losses. Additionally, our work is the first to address non-convex non-Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some classes of neural nets, among other practical models. Our Proximal-PL algorithm has nearly optimal excess risk that almost matches the strongly convex lower bound. Lastly, we provide shuffle DP variations of our algorithms, which do not require a trusted curator (e.g. for distributed learning).
Multiple-objective optimization (MOO) aims to simultaneously optimize multiple conflicting objectives and has found important applications in machine learning, such as minimizing classification loss and discrepancy in treating different populations for fairness. At optimality, further optimizing one objective will necessarily harm at least another objective, and decision-makers need to comprehensively explore multiple optima (called Pareto front) to pinpoint one final solution. We address the efficiency of finding the Pareto front. First, finding the front from scratch using stochastic multi-gradient descent (SMGD) is expensive with large neural networks and datasets. We propose to explore the Pareto front as a manifold from a few initial optima, based on a predictor-corrector method. Second, for each exploration step, the predictor solves a large-scale linear system that scales quadratically in the number of model parameters and requires one backpropagation to evaluate a second-order Hessian-vector product per iteration of the solver. We propose a Gauss-Newton approximation that only scales linearly, and that requires only first-order inner-product per iteration. This also allows for a choice between the MINRES and conjugate gradient methods when approximately solving the linear system. The innovations make predictor-corrector possible for large networks. Experiments on multi-objective (fairness and accuracy) misinformation detection tasks show that 1) the predictor-corrector method can find Pareto fronts better than or similar to SMGD with less time; and 2) the proposed first-order method does not harm the quality of the Pareto front identified by the second-order method, while further reduce running time.
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.