We analyse an iterative algorithm to minimize quadratic functions whose Hessian matrix $H$ is the expectation of a random symmetric $d\times d$ matrix. The algorithm is a variant of the stochastic variance reduced gradient (SVRG). In several applications, including least-squares regression, ridge regression, linear discriminant analysis and regularized linear discriminant analysis, the running time of each iteration is proportional to $d$. Under smoothness and convexity conditions, the algorithm converges linearly. When applied to quadratic functions, our analysis improves the state-of-the-art performance of SVRG by up to a logarithmic factor. Furthermore, for well-conditioned quadratic problems, our analysis improves the state-of-the-art running times of accelerated SVRG, and is better than the known matching lower bound by a logarithmic factor. Our theoretical results are backed by numerical experiments.
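As a point of reference for this setting, the following is a minimal sketch of a plain SVRG loop applied to a least-squares objective, whose Hessian $\frac{1}{n}A^\top A$ is the expectation of the random rank-one matrices $a_i a_i^\top$. It illustrates the class of methods being analyzed, not the paper's specific variant; the function name, step size and epoch counts are placeholders.
\begin{verbatim}
import numpy as np

def svrg_least_squares(A, b, step=0.01, epochs=20, inner_iters=None, seed=0):
    """Plain SVRG on f(x) = (1/2n)*||Ax - b||^2, a quadratic whose Hessian is
    the expectation of the rank-one matrices a_i a_i^T.  Illustrative only."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = inner_iters or n
    x = np.zeros(d)
    for _ in range(epochs):
        x_ref = x.copy()
        full_grad = A.T @ (A @ x_ref - b) / n          # anchor (full) gradient
        for _ in range(m):
            i = rng.integers(n)
            a = A[i]
            # variance-reduced stochastic gradient
            g = a * (a @ x - b[i]) - a * (a @ x_ref - b[i]) + full_grad
            x -= step * g
    return x
\end{verbatim}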
We develop a communication-efficient distributed learning algorithm that is robust against Byzantine worker machines. We propose and analyze a distributed gradient-descent algorithm that performs simple thresholding based on gradient norms to mitigate Byzantine failures. We show that the (statistical) error rate of our algorithm matches that of Yin et al.~\cite{dong}, which uses more complicated schemes (coordinate-wise median, trimmed mean). Furthermore, for communication efficiency, we consider a generic class of $\delta$-approximate compressors from Karimireddy et al.~\cite{errorfeed} that encompasses sign-based compressors and top-$k$ sparsification. Our algorithm uses compressed gradients and gradient norms for aggregation and Byzantine removal, respectively. We establish the statistical error rate for non-convex smooth loss functions and show that, in a certain range of the compression factor $\delta$, the (order-wise) rate of convergence is not affected by the compression operation. Moreover, we analyze the compressed gradient descent algorithm with error feedback (proposed in \cite{errorfeed}) in a distributed setting and in the presence of Byzantine worker machines, and show that exploiting error feedback improves the statistical error rate. Finally, we experimentally validate our results and demonstrate good convergence on convex (least-squares regression) and non-convex (neural network training) problems.
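The following is a minimal sketch, under assumed names and interfaces, of the two ingredients described above: a top-$k$ sparsifier (one example of a $\delta$-approximate compressor) and a server-side aggregation step that discards the workers reporting the largest gradient norms before averaging. It is an illustration of the idea, not the paper's implementation.
\begin{verbatim}
import numpy as np

def topk_compress(g, k):
    """delta-approximate compressor: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def robust_aggregate(compressed_grads, grad_norms, trim_frac):
    """Discard the workers reporting the largest gradient norms, then average
    the remaining compressed gradients (illustrative sketch)."""
    m = len(grad_norms)
    keep = int(np.ceil((1 - trim_frac) * m))
    order = np.argsort(grad_norms)          # ascending by reported norm
    kept = order[:keep]
    return np.mean([compressed_grads[i] for i in kept], axis=0)
\end{verbatim}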
The communication cost of distributed optimization algorithms is a major bottleneck in their scalability. This work considers a parameter-server setting in which the worker is constrained to communicate information to the server using only $R$ bits per dimension. We show that $\mathbf{democratic}$ $\mathbf{embeddings}$ from random matrix theory are highly effective for designing efficient and optimal vector quantizers that respect this bit budget. The resulting polynomial-complexity source-coding schemes are used to design distributed optimization algorithms with convergence rates matching the minimax-optimal lower bounds for (i) smooth and strongly convex objectives with access to an exact gradient oracle, as well as (ii) general convex and non-smooth objectives with access to a noisy subgradient oracle. We further propose a relaxation of this coding scheme that is nearly minimax optimal. Numerical simulations validate our theoretical claims.
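As an illustration of the coding idea, the sketch below computes a least-$\ell_\infty$ (democratic) representation of a vector in a random tight frame via a small linear program and then quantizes the nearly flat coefficients with a uniform scalar quantizer. The frame construction, scaling and bit accounting are simplified assumptions for illustration, not the paper's scheme.
\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

def democratic_representation(x, S):
    """Least l-infinity representation y of x in the frame S (S y = x),
    computed as a small LP: min t  s.t.  S y = x, |y_i| <= t.
    Example frame with orthonormal rows (expansion factor lam):
        S = np.linalg.qr(rng.standard_normal((lam * n, n)))[0].T
    """
    n, N = S.shape
    c = np.r_[np.zeros(N), 1.0]                      # minimize t
    A_ub = np.block([[ np.eye(N), -np.ones((N, 1))],
                     [-np.eye(N), -np.ones((N, 1))]])
    b_ub = np.zeros(2 * N)
    A_eq = np.hstack([S, np.zeros((n, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=x,
                  bounds=[(None, None)] * N + [(0, None)])
    return res.x[:N]

def quantize_democratic(x, S, bits):
    """Quantize the (nearly flat) frame coefficients uniformly with `bits`
    bits each, then reconstruct.  The scale t would also need to be sent;
    illustrative sketch of the coding idea only."""
    y = democratic_representation(x, S)
    t = np.abs(y).max() + 1e-12
    levels = 2 ** bits
    yq = np.round((y + t) / (2 * t) * (levels - 1)) / (levels - 1) * 2 * t - t
    return S @ yq                                    # approximate reconstruction of x
\end{verbatim}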
In this paper, we optimize user scheduling, power allocation and beamforming in distributed multiple-input multiple-output (MIMO) networks implementing user-centric clustering. We study both the coherent and non-coherent transmission modes, formulating a weighted sum rate maximization problem for each; finding the optimal solution to these problems is known to be NP-hard. We use tools from fractional programming, block coordinate descent, and compressive sensing to construct an algorithm that optimizes the beamforming weights and user scheduling and converges in a smooth non-decreasing pattern. Since channel state information (CSI) is crucial for the optimization, we highlight the importance of employing a low-overhead pilot assignment policy for the scheduling problem. In this regard, we use a variant of hierarchical agglomerative clustering, which provides a suboptimal, but feasible, pilot assignment scheme; for our cell-free case, we formulate an area-based pilot reuse factor. Our results show that our scheme provides large gains in the long-term network sum spectral efficiency compared to benchmark schemes such as zero-forcing and conjugate beamforming with round-robin scheduling. Furthermore, the results show the superiority of coherent transmission compared to the non-coherent mode under both ideal and imperfect CSI for the area-based pilot-reuse factors we consider.
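As a toy illustration of the pilot-assignment idea, the sketch below clusters users agglomeratively by location and reuses the set of orthogonal pilots across clusters so that co-pilot users tend to be far apart. The similarity metric, cluster count and assignment rule are assumptions made for illustration, not the paper's exact policy.
\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def assign_pilots(user_positions, n_pilots):
    """Toy pilot assignment: agglomeratively cluster users by location into
    n_pilots groups, then hand out the orthogonal pilots round-robin inside
    each cluster, so that users sharing a pilot tend to be far apart.
    Generic illustration of hierarchical-agglomerative pilot assignment."""
    Z = linkage(pdist(user_positions), method='average')
    labels = fcluster(Z, t=n_pilots, criterion='maxclust')   # one group per reuse region
    pilots = np.empty(len(user_positions), dtype=int)
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        pilots[members] = np.arange(len(members)) % n_pilots
    return pilots
\end{verbatim}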
In this work, we study how to implement a distributed power method in a parallel manner. Since the existing distributed power method usually updates the eigenvectors sequentially, it exhibits two obvious disadvantages: 1) when it calculates the $h$th eigenvector, it must wait for the results of the previous $(h-1)$ eigenvectors, which delays acquiring all the eigenvalues; 2) when calculating each eigenvector, it incurs a certain cost of information exchange among neighboring nodes at every power iteration, which can be prohibitive when the number of eigenvectors or the number of nodes is large. This motivates us to propose a parallel distributed power method, which calculates all the eigenvectors simultaneously at each power iteration so that more information can be exchanged in a single communication round. We are particularly interested in the distributed power method for both eigenvalue decomposition (EVD) and singular value decomposition (SVD), wherein the distributed computation proceeds via a gossip algorithm. It can be shown that, under the same conditions, the communication cost of the gossip-based parallel method is only $1/H$ times that of its sequential counterpart, where $H$ is the number of eigenvectors to compute, while the convergence time and error performance of the proposed parallel method are both comparable to those of the sequential counterpart.
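The sketch below illustrates the parallel update: a simultaneous (block) power iteration that refreshes all $H$ eigenvector estimates in each round, with an exact average standing in for the gossip exchange. It is a simplified, centrally simulated illustration rather than the proposed protocol.
\begin{verbatim}
import numpy as np

def parallel_power_method(local_matrices, H, iters=100, seed=0):
    """Simultaneous (block) power iteration for the top-H eigenvectors of
    A = sum_k A_k, where each symmetric A_k lives on node k.  The per-iteration
    averaging stands in for the gossip exchange; illustrative sketch only."""
    rng = np.random.default_rng(seed)
    d = local_matrices[0].shape[0]
    Q = np.linalg.qr(rng.standard_normal((d, H)))[0]    # shared random start
    for _ in range(iters):
        # every node multiplies by its local matrix in parallel ...
        local_products = [A_k @ Q for A_k in local_matrices]
        # ... and gossip averaging recovers (1/K) * A @ Q
        Z = np.mean(local_products, axis=0)
        Q, _ = np.linalg.qr(Z)                          # re-orthonormalize all H vectors at once
    eigvals = np.diag(Q.T @ sum(local_matrices) @ Q)
    return Q, eigvals
\end{verbatim}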
In the estimation of the mean matrix of a multivariate normal distribution, generalized Bayes estimators with closed forms are provided, and sufficient conditions for their minimaxity are derived relative to both matrix and scalar quadratic loss functions. Generalized Bayes estimators of the covariance matrix are also given in closed form, and their dominance properties are discussed under the Stein loss function.
The convex body chasing problem, introduced by Friedman and Linial, is a competitive analysis problem on any normed vector space. In convex body chasing, for each timestep $t\in\mathbb N$, a convex body $K_t\subseteq \mathbb R^d$ is given as a request, and the player picks a point $x_t\in K_t$. The player aims to ensure that the total distance $\sum_{t=0}^{T-1}||x_t-x_{t+1}||$ is within a bounded ratio of the smallest possible offline solution. In this work, we consider the nested version of the problem, in which the sequence $(K_t)$ must be decreasing. For Euclidean spaces, we consider a memoryless algorithm which moves to the so-called Steiner point, and show that in a certain sense it is exactly optimal among memoryless algorithms. For general finite-dimensional normed spaces, we combine the Steiner point with our previous algorithm to obtain a new algorithm which is nearly optimal for all $\ell^p_d$ spaces with $p\geq 1$, closing a polynomial gap.
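For intuition, the sketch below estimates the Steiner point of a polytope $K = \{x : Ax \le b\}$ by Monte Carlo, using the identity that the Steiner point is the average, over uniformly random directions $\theta$, of the maximizer of $\langle\theta, x\rangle$ over $K$. The polytope representation, solver and sample size are assumptions for illustration.
\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

def steiner_point(A, b, n_samples=2000, seed=0):
    """Monte-Carlo estimate of the Steiner point of the (bounded) polytope
    K = {x : Ax <= b}, using s(K) = E_theta[ argmax_{x in K} <theta, x> ]
    with theta uniform on the unit sphere.  Sketch of the memoryless
    'move to the Steiner point' rule."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    total = np.zeros(d)
    for _ in range(n_samples):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        # maximize <theta, x> over K  ==  minimize <-theta, x>
        res = linprog(-theta, A_ub=A, b_ub=b, bounds=[(None, None)] * d)
        total += res.x
    return total / n_samples
\end{verbatim}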
The generalization error of a classifier is related to the complexity of the set of functions among which the classifier is chosen. Roughly speaking, the more complex the family, the greater the potential disparity between the training error and the population error of the classifier. This principle is captured in layman's terms by Occam's razor, which suggests favoring low-complexity hypotheses over complex ones. We study a family of low-complexity classifiers consisting of thresholding the one-dimensional feature obtained by projecting the data onto a random line after embedding it into a higher-dimensional space parametrized by monomials of order up to $k$. More specifically, the extended data is projected $n$ times and the best classifier among those $n$ (based on its performance on the training data) is chosen. We obtain a bound on the generalization error of these low-complexity classifiers. The bound is smaller than that of any classifier with a non-trivial VC dimension, and thus smaller than that of a linear classifier. We also show that, given full knowledge of the class-conditional densities, the error of the classifiers would converge to the optimal (Bayes) error as $k$ and $n$ go to infinity; if only a training dataset is given, we show that the classifiers will perfectly classify all the training points as $k$ and $n$ go to infinity.
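A minimal sketch of this classifier family follows: embed the data with monomial features up to order $k$, draw $n$ random projections, and keep the single threshold classifier with the lowest training error. The feature construction and the brute-force threshold search are simplified for illustration.
\begin{verbatim}
import numpy as np
from itertools import combinations_with_replacement

def monomial_features(X, k):
    """All monomials of the input coordinates up to order k
    (the higher-dimensional embedding)."""
    n, d = X.shape
    cols = [np.ones(n)]
    for order in range(1, k + 1):
        for idx in combinations_with_replacement(range(d), order):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_random_projection_classifier(X, y, k=2, n_proj=100, seed=0):
    """Project the embedded data onto n_proj random lines, threshold each 1-D
    feature, and keep the projection/threshold pair with the lowest training
    error.  Brute-force threshold search, for clarity rather than speed."""
    rng = np.random.default_rng(seed)
    Phi = monomial_features(X, k)
    best = None
    for _ in range(n_proj):
        w = rng.standard_normal(Phi.shape[1])
        z = Phi @ w
        for thr in z:                                 # candidate thresholds at data points
            for sign in (1, -1):
                pred = np.where(sign * (z - thr) > 0, 1, 0)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, w, thr, sign)
    return best   # (training error, projection, threshold, orientation)
\end{verbatim}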
In order to avoid the curse of dimensionality frequently encountered in Big Data analysis, there has been extensive development of linear and nonlinear dimension-reduction techniques in recent years. These techniques (sometimes referred to as manifold learning) assume that the scattered input data lie on a lower-dimensional manifold, so the high-dimensionality problem can be overcome by learning the lower-dimensional behavior. However, in real-life applications, data are often very noisy. In this work, we propose a method to approximate $\mathcal{M}$, a $d$-dimensional $C^{m+1}$ smooth submanifold of $\mathbb{R}^n$ ($d \ll n$), based upon noisy scattered data points (i.e., a data cloud). We assume that the data points are located "near" the lower-dimensional manifold and suggest a non-linear moving least-squares projection onto an approximating $d$-dimensional manifold. Under some mild assumptions, the resulting approximant is shown to be infinitely smooth and of high approximation order (i.e., $O(h^{m+1})$, where $h$ is the fill distance and $m$ is the degree of the local polynomial approximation). The method presented here assumes no analytic knowledge of the approximated manifold, and the approximation algorithm is linear in the large dimension $n$. Furthermore, the approximating manifold can serve as a framework for performing operations directly on the high-dimensional data in a computationally efficient manner. In this way, the preparatory step of dimension reduction, which introduces distortions into the data, can be avoided altogether.
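The sketch below shows, under simplified assumptions, one moving least-squares projection step of a noisy point toward the approximating manifold: Gaussian locality weights, a weighted local PCA for the tangent coordinates, and a weighted degree-2 polynomial fit of the normal components (the method itself allows degree $m$). It is not the paper's implementation; the names and parameters are placeholders.
\begin{verbatim}
import numpy as np

def mls_project(p, cloud, d, h=1.0):
    """One-step sketch of a moving least-squares projection of a noisy point p
    toward a d-dimensional manifold approximated from the point cloud
    (degree-2 local fit for brevity).  Simplified, assumed setup."""
    w = np.exp(-np.sum((cloud - p) ** 2, axis=1) / h ** 2)   # locality weights
    mu = (w[:, None] * cloud).sum(0) / w.sum()               # weighted local origin
    centered = cloud - mu
    sw = np.sqrt(w)[:, None]
    # weighted PCA: leading d right-singular vectors span the approximate tangent space
    _, _, Vt = np.linalg.svd(sw * centered, full_matrices=False)
    T, N = Vt[:d], Vt[d:]                                    # tangent / normal bases
    u = centered @ T.T                                       # local d-dim coordinates
    v = centered @ N.T                                       # normal components to fit
    def feats(U):                                            # 1, linear and quadratic terms
        cols = [np.ones(len(U))] + [U[:, i] for i in range(d)]
        cols += [U[:, i] * U[:, j] for i in range(d) for j in range(i, d)]
        return np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(feats(u) * sw, v * sw, rcond=None)
    u_p = (p - mu) @ T.T                                     # local coordinates of p itself
    return mu + u_p @ T + (feats(u_p[None, :]) @ coef)[0] @ N
\end{verbatim}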
We propose accelerated randomized coordinate descent algorithms for stochastic optimization and online learning. Our algorithms have significantly lower per-iteration complexity than the known accelerated gradient algorithms. The proposed algorithms for online learning achieve better regret performance than the known randomized online coordinate descent algorithms, while the proposed algorithms for stochastic optimization match the convergence rates of the best known randomized coordinate descent algorithms. We also present simulation results demonstrating the performance of the proposed algorithms.
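For context on the per-iteration cost, the sketch below shows the basic (non-accelerated) randomized coordinate update on a least-squares objective, where maintaining the residual makes each step much cheaper than a full gradient evaluation. The proposed algorithms add acceleration on top of updates of this kind, which is not reproduced here.
\begin{verbatim}
import numpy as np

def rcd_least_squares(A, b, iters=10000, seed=0):
    """Randomized coordinate descent on f(x) = 0.5*||Ax - b||^2.  This is the
    basic (non-accelerated) coordinate update; each step touches a single
    column of A instead of forming the full gradient."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    L = np.sum(A ** 2, axis=0)                 # coordinate-wise Lipschitz constants
    x = np.zeros(d)
    r = A @ x - b                              # maintained residual
    for _ in range(iters):
        i = rng.integers(d)
        g = A[:, i] @ r                        # partial derivative along coordinate i
        step = g / L[i]
        x[i] -= step
        r -= step * A[:, i]                    # keep the residual consistent with x
    return x
\end{verbatim}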
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
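The sketch below illustrates the local smoothing underlying DRS: the gradient of the Gaussian-smoothed surrogate $f_\gamma(x) = \mathbb{E}[f(x + \gamma Z)]$ is estimated by averaging subgradients at randomly perturbed points. The function, sample size and interface are assumptions for illustration, and the distributed machinery is omitted.
\begin{verbatim}
import numpy as np

def smoothed_gradient(subgrad, x, gamma, n_samples=10, rng=None):
    """Gradient estimate of the Gaussian-smoothed surrogate
    f_gamma(x) = E[f(x + gamma*Z)]: average subgradients of f at randomly
    perturbed points.  Minimal sketch of the local-smoothing idea behind DRS."""
    rng = rng or np.random.default_rng()
    d = x.shape[0]
    samples = [subgrad(x + gamma * rng.standard_normal(d)) for _ in range(n_samples)]
    return np.mean(samples, axis=0)

# toy usage on the non-smooth convex function f(x) = ||x||_1
if __name__ == "__main__":
    subgrad_l1 = np.sign                      # a subgradient of the l1 norm
    x = np.array([1.0, -2.0, 0.0])
    print(smoothed_gradient(subgrad_l1, x, gamma=0.1))
\end{verbatim}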