Decentralized optimization and communication compression have exhibited their great potential in accelerating distributed machine learning by mitigating the communication bottleneck in practice. While existing decentralized algorithms with communication compression mostly focus on the problems with only smooth components, we study the decentralized stochastic composite optimization problem with a potentially non-smooth component. A \underline{Prox}imal gradient \underline{L}in\underline{EA}r convergent \underline{D}ecentralized algorithm with compression, Prox-LEAD, is proposed with rigorous theoretical analyses in the general stochastic setting and the finite-sum setting. Our theorems indicate that Prox-LEAD works with arbitrary compression precision, and it tremendously reduces the communication cost almost for free. The superiorities of the proposed algorithms are demonstrated through the comparison with state-of-the-art algorithms in terms of convergence complexities and numerical experiments. Our algorithmic framework also generally enlightens the compressed communication on other primal-dual algorithms by reducing the impact of inexact iterations, which might be of independent interest.
We analyze the high-SNR regime of the MxK Network MISO channel in which each transmitter has access to a different channel estimate, possibly with different precision. It has been recently shown that, for some regimes, this setting attains the same Degrees-of-Freedom as the ideal centralized setting with perfect CSI sharing, in which all the transmitters are endowed with the best estimate available at any transmitter. This result is restricted by the limitations of the DoF metric, as it only provides information about the slope of growth of the capacity as a function of the SNR, without any insight about the possible performance at a given SNR. To overcome this limitation, we analyze the affine approximation of the rate on the high-SNR regime for this decentralized Network MISO setting for the antenna configurations in which it achieves the DoF of the centralized setting. We show that, for a regime of antenna configurations, it is possible to asymptotically attain the same achievable rate as in the ideal centralized scenario. Consequently, it is possible to achieve the beamforming gain of the ideal perfect-CSI-sharing setting even if only a subset of transmitters is endowed with precise CSI, which can be exploited in scenarios such as distributed massive MIMO where the number of transmit antennas is much bigger than the number of served users. This outcome is a consequence of the synergistic compromise between CSI precision at the transmitters and consistency between the locally-computed precoders, which is an inherent trade-off of decentralized settings that does not exist in a centralized CSI configuration. We propose a precoding scheme achieving the previous result, which is built on an uneven structure in which some transmitters reduce the precision of their own precoding for the sake of using transmission parameters that can be more easily predicted by the other transmitters.
In this paper, we propose and analyze SQuARM-SGD, a communication-efficient algorithm for decentralized training of large-scale machine learning models over a network. In SQuARM-SGD, each node performs a fixed number of local SGD steps using Nesterov's momentum and then sends sparsified and quantized updates to its neighbors regulated by a locally computable triggering criterion. We provide convergence guarantees of our algorithm for general (non-convex) and convex smooth objectives, which, to the best of our knowledge, is the first theoretical analysis for compressed decentralized SGD with momentum updates. We show that the convergence rate of SQuARM-SGD matches that of vanilla SGD. We empirically show that including momentum updates in SQuARM-SGD can lead to better test performance than the current state-of-the-art which does not consider momentum updates.
We consider distributed stochastic variational inequalities (VIs) on unbounded domain with the problem data being heterogeneous (non-IID) and distributed across many devices. We make very general assumption on the computational network that, in particular, covers the settings of fully decentralized calculations with time-varying networks and centralized topologies commonly used in Federated Learning. Moreover, multiple local updates on the workers can be made for reducing the communication frequency between workers. We extend stochastic extragradient method to this very general setting and theoretically analyze its convergence rate in the strongly monotone, monotone, and non-monotone setting when an Minty solution exists. The provided rates have explicit dependence on\ network characteristics and how it varies with time, data heterogeneity, variance, number of devices, and other standard parameters. As a special case, our method and analysis apply to distributed stochastic saddle-point problems (SPP), e.g., to training Deep Generative Adversarial Networks (GANs) for which the decentralized training has been reported to be extremely challenging. In experiments for decentralized training of GANs we demonstrate the effectiveness of our proposed approach.
This paper considers decentralized optimization with application to machine learning on graphs. The growing size of neural network (NN) models has motivated prior works on decentralized stochastic gradient algorithms to incorporate communication compression. On the other hand, recent works have demonstrated the favorable convergence and generalization properties of overparameterized NNs. In this work, we present an empirical analysis on the performance of compressed decentralized stochastic gradient (DSG) algorithms with overparameterized NNs. Through simulations on an MPI network environment, we observe that the convergence rates of popular compressed DSG algorithms are robust to the size of NNs. Our findings suggest a gap between theories and practice of the compressed DSG algorithms in the existing literature.
There has been a growing need to provide Byzantine-resilience in distributed model training. Existing robust distributed learning algorithms focus on developing sophisticated robust aggregators at the parameter servers, but pay less attention to balancing the communication cost and robustness. In this paper, we propose Solon, an algorithmic framework that exploits gradient redundancy to provide communication efficiency and Byzantine robustness simultaneously. Our theoretical analysis shows a fundamental trade-off among computational load, communication cost, and Byzantine robustness. We also develop a concrete algorithm to achieve the optimal trade-off, borrowing ideas from coding theory and sparse recovery. Empirical experiments on various datasets demonstrate that Solon provides significant speedups over existing methods to achieve the same accuracy, over 10 times faster than Bulyan and 80% faster than Draco. We also show that carefully designed Byzantine attacks break Signum and Bulyan, but do not affect the successful convergence of Solon.
Harnessing a block-sparse prior to recover signals through underdetermined linear measurements has been extensively shown to allow exact recovery in conditions where classical compressed sensing would provably fail. We exploit this result to propose a novel private communication framework where the secrecy is achieved by transmitting instances of an unidentifiable compressed sensing problem over a public channel. The legitimate receiver can attempt to overcome this ill-posedness by leveraging secret knowledge of a block structure that was used to encode the transmitter's message. We study the privacy guarantees of this communication protocol to a single transmission, and to multiple transmissions without refreshing the shared secret. Additionally, we propose an algorithm for an eavesdropper to learn the block structure via the method of moments and highlight the privacy benefits of this framework through numerical experiments.
Variational inequalities in general and saddle point problems in particular are increasingly relevant in machine learning applications, including adversarial learning, GANs, transport and robust optimization. With increasing data and problem sizes necessary to train high performing models across these and other applications, it is necessary to rely on parallel and distributed computing. However, in distributed training, communication among the compute nodes is a key bottleneck during training, and this problem is exacerbated for high dimensional and over-parameterized models models. Due to these considerations, it is important to equip existing methods with strategies that would allow to reduce the volume of transmitted information during training while obtaining a model of comparable quality. In this paper, we present the first theoretically grounded distributed methods for solving variational inequalities and saddle point problems using compressed communication: MASHA1 and MASHA2. Our theory and methods allow for the use of both unbiased (such as Rand$k$; MASHA1) and contractive (such as Top$k$; MASHA2) compressors. We empirically validate our conclusions using two experimental setups: a standard bilinear min-max problem, and large-scale distributed adversarial training of transformers.
We study the MARINA method of Gorbunov et al (2021) -- the current state-of-the-art distributed non-convex optimization method in terms of theoretical communication complexity. Theoretical superiority of this method can be largely attributed to two sources: the use of a carefully engineered biased stochastic gradient estimator, which leads to a reduction in the number of communication rounds, and the reliance on {\em independent} stochastic communication compression operators, which leads to a reduction in the number of transmitted bits within each communication round. In this paper we i) extend the theory of MARINA to support a much wider class of potentially {\em correlated} compressors, extending the reach of the method beyond the classical independent compressors setting, ii) show that a new quantity, for which we coin the name {\em Hessian variance}, allows us to significantly refine the original analysis of MARINA without any additional assumptions, and iii) identify a special class of correlated compressors based on the idea of {\em random permutations}, for which we coin the term Perm$K$, the use of which leads to $O(\sqrt{n})$ (resp. $O(1 + d/\sqrt{n})$) improvement in the theoretical communication complexity of MARINA in the low Hessian variance regime when $d\geq n$ (resp. $d \leq n$), where $n$ is the number of workers and $d$ is the number of parameters describing the model we are learning. We corroborate our theoretical results with carefully engineered synthetic experiments with minimizing the average of nonconvex quadratics, and on autoencoder training with the MNIST dataset.
Learning the structure of a causal graphical model using both observational and interventional data is a fundamental problem in many scientific fields. A promising direction is continuous optimization for score-based methods, which efficiently learn the causal graph in a data-driven manner. However, to date, those methods require constrained optimization to enforce acyclicity or lack convergence guarantees. In this paper, we present ENCO, an efficient structure learning method for directed, acyclic causal graphs leveraging observational and interventional data. ENCO formulates the graph search as an optimization of independent edge likelihoods, with the edge orientation being modeled as a separate parameter. Consequently, we can provide convergence guarantees of ENCO under mild conditions without constraining the score function with respect to acyclicity. In experiments, we show that ENCO can efficiently recover graphs with hundreds of nodes, an order of magnitude larger than what was previously possible, while handling deterministic variables and latent confounders.
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.