Principal Subspace Analysis (PSA) is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA solutions are fast becoming irrelevant in the modern era of big data, in which the number of samples and/or the dimensionality of samples often exceed the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed that can be used for distributed PSA when the data are partitioned across either samples or (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is used to study the interplay between network topology and communication cost, as well as the effect of straggler machines on the proposed algorithms.
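To make the server-free setting concrete, the following is a minimal sketch of a rank-one special case: power iteration in which the global matrix-vector product is replaced by a few rounds of neighbor averaging. It is not the algorithm proposed in the paper. The function names, the three-node path graph, its mixing weights `W`, and the toy local covariance matrices are all assumptions made for illustration.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def consensus(vectors, W, rounds):
    """Each round, node i replaces its vector by a W-weighted average of the
    vectors held by its neighbors (W[i][j] == 0 for non-neighbors)."""
    for _ in range(rounds):
        vectors = [
            [sum(W[i][j] * vectors[j][k] for j in range(len(vectors)))
             for k in range(len(vectors[0]))]
            for i in range(len(vectors))
        ]
    return vectors

def distributed_power_iteration(local_covs, W, iters=50, consensus_rounds=40):
    """Every node keeps its own estimate of the top eigenvector of the sum of
    all local covariance matrices; no node ever sees that sum directly."""
    dim = len(local_covs[0])
    estimates = [normalize([1.0] * dim) for _ in local_covs]
    for _ in range(iters):
        local_products = [matvec(M, v) for M, v in zip(local_covs, estimates)]
        mixed = consensus(local_products, W, consensus_rounds)
        estimates = [normalize(u) for u in mixed]
    return estimates

# Hypothetical toy problem: three nodes on a path graph 0 - 1 - 2.
LOCAL_COVS = [
    [[2.0, 1.0], [1.0, 1.0]],
    [[3.0, 0.0], [0.0, 1.0]],
    [[4.0, -0.5], [-0.5, 0.5]],
]
# Symmetric, doubly stochastic mixing weights that respect the path graph.
W = [
    [2 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 1 / 3, 2 / 3],
]
ESTIMATES = distributed_power_iteration(LOCAL_COVS, W)
```

Because averaging is only approximate after finitely many rounds, the per-node estimates agree with the centralized answer only up to a small consensus error. The number of consensus rounds is the knob that trades communication for accuracy.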
Majority-SAT is the problem of determining whether an input $n$-variable formula in conjunctive normal form (CNF) has at least $2^{n-1}$ satisfying assignments. Majority-SAT and related problems have been studied extensively in various AI communities interested in the complexity of probabilistic planning and inference. Although Majority-SAT has been known to be PP-complete for over 40 years, the complexity of a natural variant has remained open: Majority-$k$SAT, where the input CNF formula is restricted to have clause width at most $k$. We prove that for every $k$, Majority-$k$SAT is in P. In fact, for any positive integer $k$ and rational $\rho \in (0,1)$ with bounded denominator, we give an algorithm that can determine whether a given $k$-CNF has at least $\rho \cdot 2^n$ satisfying assignments, in deterministic linear time (whereas the previous best-known algorithm ran in exponential time). Our algorithms have interesting positive implications for counting complexity and the complexity of inference, significantly reducing the known complexities of related problems such as E-MAJ-$k$SAT and MAJ-MAJ-$k$SAT. At the heart of our approach is an efficient method for solving threshold counting problems by extracting sunflowers found in the corresponding set system of a $k$-CNF. We also show that the tractability of Majority-$k$SAT is somewhat fragile. For the closely related GtMajority-SAT problem (where we ask whether a given formula has greater than $2^{n-1}$ satisfying assignments) which is known to be PP-complete, we show that GtMajority-$k$SAT is in P for $k\le 3$, but becomes NP-complete for $k\geq 4$. These results are counterintuitive, because the ``natural'' classifications of these problems would have been PP-completeness, and because there is a stark difference in the complexity of GtMajority-$k$SAT and Majority-$k$SAT for all $k\ge 4$.
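For reference, the two threshold questions can be stated as a brute-force check. The sketch below enumerates all $2^n$ assignments, so it only illustrates the problem definitions and has nothing to do with the sunflower-based algorithm. The DIMACS-style clause encoding and the function names are assumptions.

```python
from itertools import product

def count_satisfying(clauses, n):
    """Count assignments of n Boolean variables that satisfy a CNF formula.

    clauses: list of clauses, each a list of nonzero ints; literal v means
    variable v is true, literal -v means variable v is false.
    """
    count = 0
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            count += 1
    return count

def is_majority_sat(clauses, n):
    """Majority-SAT: at least 2^(n-1) satisfying assignments."""
    return 2 * count_satisfying(clauses, n) >= 2 ** n

def is_gt_majority_sat(clauses, n):
    """GtMajority-SAT: strictly more than 2^(n-1) satisfying assignments."""
    return 2 * count_satisfying(clauses, n) > 2 ** n
```

The single-clause formula `[[1]]` over two variables has exactly $2^{n-1} = 2$ satisfying assignments, so it is a yes-instance of Majority-SAT and a no-instance of GtMajority-SAT. This is the boundary case on which the two problems differ.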
This paper studies the expressive power of artificial neural networks (NNs) with rectified linear units. To study them as a model of real-valued computation, we introduce the concept of Max-Affine Arithmetic Programs and show equivalence between them and NNs concerning natural complexity measures. We then use this result to show that two fundamental combinatorial optimization problems can be solved with polynomial-size NNs, which is equivalent to the existence of very special strongly polynomial time algorithms. First, we show that for any undirected graph with $n$ nodes, there is an NN of size $\mathcal{O}(n^3)$ that takes the edge weights as input and computes the value of a minimum spanning tree of the graph. Second, we show that for any directed graph with $n$ nodes and $m$ arcs, there is an NN of size $\mathcal{O}(m^2n^2)$ that takes the arc capacities as input and computes a maximum flow. These results imply in particular that the solutions of the corresponding parametric optimization problems where all edge weights or arc capacities are free parameters can be encoded in polynomial space and evaluated in polynomial time, and that such an encoding is provided by an NN.
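A toy illustration of why maxima and affine maps are the natural primitives here: a single rectified linear unit suffices to compute the maximum of two numbers, via $\max(a,b) = a + \mathrm{relu}(b-a)$. For a triangle, the minimum spanning tree value is the total weight minus the heaviest edge, so a tiny ReLU expression computes it. This is only a three-node sketch with assumed function names, not the $\mathcal{O}(n^3)$ construction.

```python
def relu(x):
    return max(0.0, x)

def nn_max(a, b):
    """max(a, b) written with one ReLU unit and affine operations."""
    return a + relu(b - a)

def nn_min(a, b):
    """min(a, b) written with one ReLU unit and affine operations."""
    return a - relu(a - b)

def triangle_mst_value(w1, w2, w3):
    """MST value of a triangle: drop the heaviest of the three edges."""
    heaviest = nn_max(nn_max(w1, w2), w3)
    return w1 + w2 + w3 - heaviest
```

The edge weights enter only through affine maps and ReLUs, so the same expression is valid for all real inputs, which is the sense in which such a network encodes the solution of the parametric problem.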
One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions. While this paradigm has inspired significant research on the properties of large networks, relatively little work has been devoted to the fact that these networks are often used to model large complex datasets, which may themselves contain millions or even billions of constraints. In this work, we focus on this high-dimensional regime in which both the dataset size and the number of features tend to infinity. We analyze the performance of random feature regression with features $F=f(WX+B)$ for a random weight matrix $W$ and random bias vector $B$, obtaining exact formulae for the asymptotic training and test errors for data generated by a linear teacher model. The role of the bias can be understood as parameterizing a distribution over activation functions, and our analysis directly generalizes to such distributions, even those not expressible with a traditional additive bias. Intriguingly, we find that a mixture of nonlinearities can improve both the training and test errors over the best single nonlinearity, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.
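The remark that a bias parameterizes a distribution over activation functions can be checked directly: applying $f$ to $w^\top x + b$ is the same as applying the shifted activation $f_b(z) = f(z+b)$ to $w^\top x$. The sketch below verifies this on small random inputs. The sizes, the Gaussian sampling, and the choice of $\tanh$ are assumptions for illustration only.

```python
import math
import random

def preactivations(W, X):
    """Z[i][j] = sum_k W[i][k] * X[k][j] for weights W and data columns X."""
    return [[sum(w * X[k][j] for k, w in enumerate(row))
             for j in range(len(X[0]))]
            for row in W]

def features_with_bias(W, X, B, f):
    """F = f(WX + B), with one bias entry per feature (row of W)."""
    return [[f(z + b) for z in row] for row, b in zip(preactivations(W, X), B)]

def features_with_activation_family(W, X, activations):
    """Bias-free features in which feature i uses its own activation f_i."""
    return [[f_i(z) for z in row]
            for row, f_i in zip(preactivations(W, X), activations)]

rng = random.Random(0)
N_FEATURES, DIM, N_SAMPLES = 4, 3, 5
W = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_FEATURES)]
X = [[rng.gauss(0.0, 1.0) for _ in range(N_SAMPLES)] for _ in range(DIM)]
B = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]

# One shifted activation per feature; b=b freezes the bias in each closure.
SHIFTED = [lambda z, b=b: math.tanh(z + b) for b in B]

F_BIAS = features_with_bias(W, X, B, math.tanh)
F_FAMILY = features_with_activation_family(W, X, SHIFTED)
```

Replacing `SHIFTED` by an arbitrary list of functions, for example alternating `math.tanh` and a ReLU, gives a mixture of nonlinearities that no additive bias can express.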
We study sparse linear regression over a network of agents, modeled as an undirected graph (with no centralized node). The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty on the consensus constraint -- the latter being instrumental in obtaining distributed solution methods. While penalty-based consensus methods have been extensively studied in the optimization literature, their statistical and computational guarantees in the high-dimensional setting remain unclear. This work provides an answer to this open problem. Our contribution is two-fold. First, we establish statistical consistency of the estimator: under a suitable choice of the penalty parameter, the optimal solution of the penalized problem achieves the near-optimal minimax rate $\mathcal{O}(s \log d/N)$ in $\ell_2$-loss, where $s$ is the sparsity value, $d$ is the ambient dimension, and $N$ is the total sample size in the network -- this matches centralized sample rates. Second, we show that the proximal-gradient algorithm applied to the penalized problem, which naturally leads to distributed implementations, converges linearly up to a tolerance of the order of the centralized statistical error -- the rate scales as $\mathcal{O}(d)$, revealing an unavoidable speed-accuracy dilemma. Numerical results demonstrate the tightness of the derived sample rate and convergence rate scalings.
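The structure of such a method can be sketched on a scalar toy problem. Each agent $i$ holds one observation $y_i \approx a_i x$, the consensus penalty is taken to be $\frac{1}{2\gamma}\sum_{(i,j)}(x_i - x_j)^2$ over the edges, and one proximal-gradient step combines a gradient step on the smooth part with soft-thresholding. The exact penalty form, the data, and all parameter values below are assumptions, not the formulation analyzed in the paper.

```python
import math

def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def objective(x, a, y, edges, lam, gamma):
    """Local least-squares + local l1 terms + quadratic consensus penalty."""
    fit = sum(0.5 * (yi - ai * xi) ** 2 for xi, ai, yi in zip(x, a, y))
    l1 = lam * sum(abs(xi) for xi in x)
    penalty = sum((x[i] - x[j]) ** 2 for i, j in edges) / (2.0 * gamma)
    return fit + l1 + penalty

def prox_grad_step(x, a, y, neighbors, lam, gamma, step):
    """One synchronous step; agent i only reads its neighbors' values."""
    new_x = []
    for i, xi in enumerate(x):
        grad = (-a[i] * (y[i] - a[i] * xi)
                + sum(xi - x[j] for j in neighbors[i]) / gamma)
        new_x.append(soft_threshold(xi - step * grad, step * lam))
    return new_x

# Hypothetical toy network: path graph 0 - 1 - 2.
EDGES = [(0, 1), (1, 2)]
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}
A = [1.0, 1.0, 1.0]
Y = [1.0, 1.2, 0.8]
# STEP is below 1/L, where L = 4 bounds the curvature of the smooth part here.
LAM, GAMMA, STEP = 0.1, 1.0, 0.2

x = [0.0, 0.0, 0.0]
HISTORY = [objective(x, A, Y, EDGES, LAM, GAMMA)]
for _ in range(200):
    x = prox_grad_step(x, A, Y, NEIGHBORS, LAM, GAMMA, STEP)
    HISTORY.append(objective(x, A, Y, EDGES, LAM, GAMMA))
FINAL_X = x
```

With a finite penalty parameter the agents do not reach exact agreement; their values differ slightly around the centralized LASSO solution, which is the price paid for a fully local update.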
We prove several new tight distributed lower bounds for classic symmetry breaking graph problems. As a basic tool, we first provide a new insightful proof that any deterministic distributed algorithm that computes a $\Delta$-coloring on $\Delta$-regular trees requires $\Omega(\log_\Delta n)$ rounds and any randomized algorithm requires $\Omega(\log_\Delta\log n)$ rounds. We prove this result by showing that a natural relaxation of the $\Delta$-coloring problem is a fixed point in the round elimination framework. As a first application, we show that our $\Delta$-coloring lower bound proof directly extends to arbdefective colorings. We exactly characterize which variants of the arbdefective coloring problem are "easy", and which of them instead are "hard". As a second application, which we see as our main contribution, we use the structure of the fixed point as a building block to prove lower bounds as a function of $\Delta$ for a large class of distributed symmetry breaking problems. For example, we obtain a tight lower bound for the fundamental problem of computing a $(2,\beta)$-ruling set. This is an exponential improvement over the best existing lower bound for the problem, which was proven in [FOCS '20]. Our lower bound even applies to a much more general family of problems that allows for almost arbitrary combinations of natural constraints from coloring problems, orientation problems, and independent set problems, and provides a single unified proof for known and new lower bound results for these types of problems. Our lower bounds as a function of $\Delta$ also imply lower bounds as a function of $n$. We obtain, for example, that maximal independent set, on trees, requires $\Omega(\log n / \log \log n)$ rounds for deterministic algorithms, which is tight.
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods, and distributed methods, as well as theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, the lottery ticket hypothesis, and infinite-width analysis.
While Generative Adversarial Networks (GANs) have empirically produced impressive results on learning complex real-world distributions, recent work has shown that they suffer from a lack of diversity, or mode collapse. The theoretical work of Arora et al.~\cite{AroraGeLiMaZh17} suggests a dilemma about GANs' statistical properties: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode collapse. In contrast, we show in this paper that GANs can in principle learn distributions in Wasserstein distance (or KL-divergence in many cases) with polynomial sample complexity, if the discriminator class has strong distinguishing power against the particular generator class (instead of against all possible generators). For various generator classes such as mixtures of Gaussians, exponential families, and invertible neural network generators, we design corresponding discriminators (which are often neural nets of specific architectures) such that the Integral Probability Metric (IPM) induced by the discriminators can provably approximate the Wasserstein distance and/or KL-divergence. This implies that if the training is successful, then the learned distribution is close to the true distribution in Wasserstein distance or KL-divergence, and thus cannot drop modes. Our preliminary experiments show that on synthetic datasets the test IPM is well correlated with KL-divergence, indicating that the lack of diversity may be caused by sub-optimality in optimization rather than by statistical inefficiency.
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
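The smoothing idea can be sketched in one dimension. A non-smooth $f$ is replaced by $f_\gamma(x) = \mathbb{E}[f(x + \gamma Z)]$, and both $f_\gamma$ and its gradient are estimated by averaging over samples of $Z$. The Gaussian choice for $Z$, the function names, and the sample size are assumptions. The code is a Monte Carlo illustration of the smoothing step only, not the DRS algorithm.

```python
import math
import random

def smoothed_value_and_grad(f, subgrad, x, gamma, n_samples, rng):
    """Monte Carlo estimates of f_gamma(x) = E[f(x + gamma * Z)] and of its
    derivative E[subgrad(x + gamma * Z)], with Z standard normal."""
    value, grad = 0.0, 0.0
    for _ in range(n_samples):
        point = x + gamma * rng.gauss(0.0, 1.0)
        value += f(point)
        grad += subgrad(point)
    return value / n_samples, grad / n_samples

def abs_subgrad(z):
    """A subgradient of |z| (0 is chosen at the kink)."""
    return 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)

RNG = random.Random(0)
VALUE_AT_ZERO, GRAD_AT_ZERO = smoothed_value_and_grad(
    abs, abs_subgrad, 0.0, 1.0, 20000, RNG
)
# For f = |.| and gamma = 1, the exact smoothed value at 0 is E|Z|.
EXACT_VALUE_AT_ZERO = math.sqrt(2.0 / math.pi)
```

At the kink of $|x|$ the smoothed function is differentiable with derivative zero, which is what allows accelerated methods for smooth objectives to be applied to the surrogate.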
The field of Multi-Agent Systems (MAS) is an active area of research within Artificial Intelligence, with an increasingly important impact in industrial and other real-world applications. Within a MAS, autonomous agents interact to pursue personal interests and/or to achieve common objectives. Distributed Constraint Optimization Problems (DCOPs) have emerged as one of the prominent agent architectures to govern the agents' autonomous behavior, where both algorithms and communication models are driven by the structure of the specific problem. During the last decade, several extensions to the DCOP model have enabled it to support MAS in complex, real-time, and uncertain environments. This survey provides an overview of the DCOP model, classifies its multiple extensions, and addresses both resolution methods and applications that find a natural mapping within each class of DCOPs. The proposed classification suggests several future perspectives for DCOP extensions, and identifies challenges in the design of efficient resolution algorithms, possibly through the adaptation of strategies from different areas.
In this paper, we study the optimal convergence rate for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely when the function $F(\xb) \triangleq \sum_{i=1}^{m}f_i(\xb)$ is strongly convex and smooth, strongly convex only, smooth only, or merely convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors) with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions to the proposed setup, such as proximal-friendly functions, time-varying graphs, and improvement of the condition numbers.
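One common way to write the communication restriction as affine constraints is through a graph Laplacian: on a connected graph, $Lx = 0$ holds exactly when all agents hold the same value. The sketch below checks this on a path graph with one scalar per agent. Using the Laplacian as the interaction matrix, and the function names, are assumptions for illustration.

```python
def path_laplacian(m):
    """Graph Laplacian of the path graph 0 - 1 - ... - (m - 1)."""
    L = [[0.0] * m for _ in range(m)]
    for i in range(m - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L

def laplacian_times(L, x):
    """Row i only involves agent i and its neighbors, so L x is computable
    with one round of neighbor-to-neighbor communication."""
    return [sum(L[i][j] * x[j] for j in range(len(x))) for i in range(len(L))]

def satisfies_consensus_constraint(L, x, tol=1e-12):
    """True when the affine constraint L x = 0 holds up to tol."""
    return all(abs(r) <= tol for r in laplacian_times(L, x))
```

For instance, `satisfies_consensus_constraint(path_laplacian(4), [2.5] * 4)` holds, while changing any single entry violates the constraint. The sparsity pattern of $L$ is what lets gradient steps on the dual problem be carried out with local communication only.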