We study the mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a log-smooth and strongly log-concave distribution, and we establish its minimax-optimal mixing time under a warm start. Our main contribution is two-fold. First, for a $d$-dimensional log-concave density with condition number $\kappa$, we show that MALA with a warm start mixes in $\tilde O(\kappa \sqrt{d})$ iterations, where $\tilde O(\cdot)$ hides logarithmic factors. This improves upon previous results in the dependency on either the condition number $\kappa$ or the dimension $d$. Our proof relies on comparing the leapfrog integrator with the continuous Hamiltonian dynamics, for which we establish a new concentration bound on the acceptance rate. Second, we prove a spectral-gap-based mixing time lower bound for reversible MCMC algorithms on general state spaces. We apply this lower bound to construct a hard distribution for which MALA requires at least $\tilde \Omega (\kappa \sqrt{d})$ steps to mix. The lower bound for MALA matches our upper bound in terms of both the condition number and the dimension. Finally, numerical experiments are included to validate our theoretical results.
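To make the algorithm under analysis concrete, here is a minimal Python sketch of one MALA iteration; the callables `log_prob` and `grad_log_prob`, the stepsize, and the Gaussian example are illustrative assumptions, not the paper's code.

```python
import numpy as np

def mala_step(x, log_prob, grad_log_prob, step, rng):
    """One Metropolis-adjusted Langevin step targeting exp(log_prob)."""
    # Langevin proposal: gradient drift plus Gaussian noise.
    noise = rng.standard_normal(x.shape)
    y = x + step * grad_log_prob(x) + np.sqrt(2.0 * step) * noise

    # Log-density of proposing b from a under the Gaussian proposal.
    def log_q(b, a):
        diff = b - a - step * grad_log_prob(a)
        return -np.dot(diff, diff) / (4.0 * step)

    # Metropolis-Hastings acceptance step restores exact stationarity.
    log_alpha = log_prob(y) + log_q(x, y) - log_prob(x) - log_q(y, x)
    if np.log(rng.uniform()) < log_alpha:
        return y
    return x

# Example: sample a standard Gaussian in dimension 10 (condition number 1).
rng = np.random.default_rng(0)
x = np.zeros(10)
for _ in range(1000):
    x = mala_step(x, lambda v: -0.5 * v @ v, lambda v: -v, 0.1, rng)
```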
We study the problem of estimating the size of a maximum matching and of a minimum vertex cover in sublinear time. Denoting the number of vertices by $n$ and the average degree of the graph by $\bar{d}$, we obtain the following results for both problems:
* A multiplicative $(2+\epsilon)$-approximation that takes $\tilde{O}(n/\epsilon^2)$ time using adjacency list queries.
* A multiplicative-additive $(2, \epsilon n)$-approximation in $\tilde{O}((\bar{d} + 1)/\epsilon^2)$ time using adjacency list queries.
* A multiplicative-additive $(2, \epsilon n)$-approximation in $\tilde{O}(n/\epsilon^{3})$ time using adjacency matrix queries.
All three results are provably time-optimal up to polylogarithmic factors, settling a long line of work on these problems. Our main contribution, and the key ingredient behind the bounds above, is a new and near-tight analysis of the average query complexity of the randomized greedy maximal matching algorithm, which improves upon a seminal result of Yoshida, Yamamoto, and Ito [STOC'09].
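For reference, a minimal sketch of the randomized greedy maximal matching algorithm whose average query complexity is analyzed; this full-graph version is illustrative only (the sublinear-time algorithms merely simulate its decisions through local oracle queries).

```python
import random

def randomized_greedy_matching(edges):
    """Process edges in a uniformly random order; add each edge
    whose endpoints are both still unmatched."""
    order = edges[:]
    random.shuffle(order)
    matched = set()
    matching = []
    for u, v in order:
        if u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v))
    # The result is maximal, so its size 2-approximates the maximum matching.
    return matching

# Example on a 4-cycle.
print(randomized_greedy_matching([(0, 1), (1, 2), (2, 3), (3, 0)]))
```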
We prove an optimal $\Omega(n^{-1})$ lower bound on the spectral gap of Glauber dynamics for anti-ferromagnetic two-spin systems with $n$ vertices in the tree uniqueness regime. This spectral gap holds for any maximum degree $\Delta$, including unbounded $\Delta$. Consequently, we obtain the following mixing time bounds for models satisfying the uniqueness condition with a slack $\delta\in(0,1)$:
$\bullet$ $C(\delta) n^2\log n$ mixing time for the hardcore model with fugacity $\lambda\le (1-\delta)\lambda_c(\Delta)= (1-\delta)\frac{(\Delta - 1)^{\Delta - 1}}{(\Delta - 2)^\Delta}$;
$\bullet$ $C(\delta) n^2$ mixing time for the Ising model with edge activity $\beta\in\left[\frac{\Delta-2+\delta}{\Delta-\delta},\frac{\Delta-\delta}{\Delta-2+\delta}\right]$;
where the maximum degree $\Delta$ may depend on the number of vertices $n$, and $C(\delta)$ depends only on $\delta$. Our proof builds upon the recently developed connections between the Glauber dynamics for spin systems and high-dimensional expander walks. In particular, we prove a stronger notion of spectral independence, called complete spectral independence, and use a novel Markov chain, called the field dynamics, to connect this stronger spectral independence to the rapid mixing of Glauber dynamics for all degrees.
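As an illustration, a minimal Python sketch of one Glauber update for the hardcore model; the adjacency-dict representation and parameter names are expository assumptions, not tied to the paper.

```python
import random

def glauber_hardcore_step(adj, occupied, lam, rng=random):
    """One Glauber update for the hardcore model with fugacity lam.

    adj: dict mapping each vertex to its list of neighbors;
    occupied: set of occupied vertices (always an independent set).
    """
    v = rng.choice(list(adj))
    occupied.discard(v)
    # Resample v's spin conditional on its neighbors: v may be occupied
    # only if all neighbors are empty, with probability lam / (1 + lam).
    if all(u not in occupied for u in adj[v]):
        if rng.random() < lam / (1.0 + lam):
            occupied.add(v)
    return occupied

# Example: path on 3 vertices, lam = 1 (uniform over independent sets).
adj = {0: [1], 1: [0, 2], 2: [1]}
s = set()
for _ in range(10000):
    s = glauber_hardcore_step(adj, s, 1.0)
```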
We show that the natural Glauber dynamics mixes rapidly and generates a random proper edge-coloring of a graph with maximum degree $\Delta$ whenever the number of colors satisfies $q\geq (\frac{10}{3} + \epsilon)\Delta$, where $\epsilon>0$ is arbitrary and the maximum degree satisfies $\Delta \geq C$ for a constant $C = C(\epsilon)$ depending only on $\epsilon$. For edge-colorings, this improves upon prior work \cite{Vig99, CDMPP19}, which shows rapid mixing when $q\geq (\frac{11}{3}-\epsilon_0 ) \Delta$, where $\epsilon_0 \approx 10^{-5}$ is a small fixed constant. At the heart of our proof, we establish a matrix trickle-down theorem, generalizing Oppenheim's influential result, as a new technique for proving that a high-dimensional simplicial complex is a local spectral expander.
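For concreteness, a minimal sketch of the edge-coloring Glauber dynamics in question; the uniform recoloring rule below is the standard heat-bath update, and all names are illustrative.

```python
import random

def glauber_edge_coloring_step(edges, coloring, q, rng=random):
    """One Glauber update on proper edge-colorings with q colors.

    edges: list of edges (u, v); coloring: dict edge -> color.
    Pick a uniformly random edge and recolor it with a uniformly
    random color not used by any edge sharing an endpoint with it.
    """
    e = rng.choice(edges)
    u, v = e
    blocked = {coloring[f] for f in edges
               if f != e and (u in f or v in f)}
    available = [c for c in range(q) if c not in blocked]
    # available is nonempty whenever q > 2 * (Delta - 1).
    coloring[e] = rng.choice(available)
    return coloring

# Example: star with 3 edges (Delta = 3) and q = 5 colors.
edges = [(0, 1), (0, 2), (0, 3)]
coloring = {e: i for i, e in enumerate(edges)}
for _ in range(1000):
    coloring = glauber_edge_coloring_step(edges, coloring, 5)
```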
We consider nonconvex-concave minimax problems, $\min_{\mathbf{x}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$, where $f$ is nonconvex in $\mathbf{x}$ but concave in $\mathbf{y}$, and $\mathcal{Y}$ is a convex and bounded set. One of the most popular algorithms for solving this problem is the celebrated gradient descent ascent (GDA) algorithm, which has been widely used in machine learning, control theory, and economics. Despite the extensive convergence results for the convex-concave setting, GDA with equal stepsizes can converge to limit cycles or even diverge in the general setting. In this paper, we present complexity results for two-time-scale GDA for solving nonconvex-concave minimax problems, showing that the algorithm can efficiently find a stationary point of the function $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$. To the best of our knowledge, this is the first nonasymptotic analysis of two-time-scale GDA in this setting, shedding light on its superior practical performance in training generative adversarial networks (GANs) and in other real applications.
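A minimal sketch of the two-time-scale GDA iteration, in which the ascent stepsize is taken larger than the descent stepsize; the toy objective below is convex-concave for simplicity (the analyzed setting allows nonconvex dependence on $\mathbf{x}$), and all parameter values are illustrative.

```python
import numpy as np

def two_time_scale_gda(grad_x, grad_y, project_y, x, y,
                       eta_x, eta_y, iters):
    """Two-time-scale GDA: eta_y is chosen much larger than eta_x,
    the stepsize regime under analysis."""
    for _ in range(iters):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = x - eta_x * gx               # gradient descent on x
        y = project_y(y + eta_y * gy)    # projected gradient ascent on y
    return x, y

# Toy instance: f(x, y) = x*y - y^2/2 on Y = [-1, 1] (concave in y);
# the minimizer of Phi(x) = max_y f(x, y) is x = 0.
x, y = two_time_scale_gda(
    grad_x=lambda x, y: y,
    grad_y=lambda x, y: x - y,
    project_y=lambda y: np.clip(y, -1.0, 1.0),
    x=1.0, y=0.5, eta_x=0.01, eta_y=0.1, iters=5000)
```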
The asymptotic behaviour of Linear Spectral Statistics (LSS) of the smoothed periodogram estimator of the spectral coherency matrix of a complex Gaussian high-dimensional time series $(\mathbf{y}_n)_{n \in \mathbb{Z}}$ with independent components is studied in the asymptotic regime where the sample size $N$ converges towards $+\infty$ while the dimension $M$ of $\mathbf{y}$ and the smoothing span of the estimator grow to infinity at the same rate, in such a way that $\frac{M}{N} \rightarrow 0$. It is established that, at each frequency, the estimated spectral coherency matrix is close to the sample covariance matrix of an independent identically $\mathcal{N}_{\mathbb{C}}(0,\mathbf{I}_M)$ distributed sequence, and that its empirical eigenvalue distribution converges towards the Marcenko-Pastur distribution. This allows us to conclude that each LSS has a deterministic behaviour that can be evaluated explicitly. Using concentration inequalities, it is shown that the supremum over the frequencies of the deviation of each LSS from its deterministic approximation is of order $\frac{1}{M} + \frac{\sqrt{M}}{N}+ \left(\frac{M}{N}\right)^{3}$. Numerical simulations support our results.
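A schematic Python sketch of the smoothed periodogram estimator of the spectral coherency matrix; the normalization conventions follow standard definitions and may differ in detail from the paper's.

```python
import numpy as np

def smoothed_coherency(Y, B):
    """Smoothed-periodogram estimate of the spectral coherency matrix.

    Y: (M, N) array, one row per component of the time series;
    B: smoothing half-span, so 2B + 1 Fourier frequencies are averaged.
    Returns an (N, M, M) array: one coherency matrix per frequency.
    """
    M, N = Y.shape
    xi = np.fft.fft(Y, axis=1) / np.sqrt(N)   # renormalized DFT, shape (M, N)
    C = np.empty((N, M, M), dtype=complex)
    for k in range(N):
        idx = (k + np.arange(-B, B + 1)) % N  # frequencies in the span
        # Smoothed periodogram at frequency k/N.
        S = xi[:, idx] @ xi[:, idx].conj().T / (2 * B + 1)
        d = 1.0 / np.sqrt(np.real(np.diag(S)))
        C[k] = d[:, None] * S * d[None, :]    # normalize to unit diagonal
    return C

# Example: M = 8 independent white-noise components, N = 512, B = 32.
rng = np.random.default_rng(0)
Y = (rng.standard_normal((8, 512)) + 1j * rng.standard_normal((8, 512))) / np.sqrt(2)
C = smoothed_coherency(Y, B=32)
```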
We investigate the equilibrium behavior of the decentralized cheap talk problem for real random variables and quadratic cost criteria, in which an encoder and a decoder have misaligned objective functions. In prior work, it has been shown that the number of bins in any equilibrium has to be countable, generalizing a classical result due to Crawford and Sobel, who considered sources with density supported on $[0,1]$. In this paper, we first refine this result in the context of log-concave sources. For sources with two-sided unbounded support, we prove that, for any finite number of bins, there exists a unique equilibrium. In contrast, for sources with semi-unbounded support, there may be a finite upper bound on the number of bins in equilibrium, depending on certain conditions stated explicitly. Moreover, we prove that for log-concave sources, the expected costs of the encoder and the decoder in equilibrium decrease as the number of bins increases. Furthermore, for strictly log-concave sources with two-sided unbounded support, we prove convergence to the unique equilibrium under best-response dynamics initialized with a given number of bins, making a connection with the classical theory of optimal quantization and with convergence results for Lloyd's method. In addition, we consider more general sources satisfying certain assumptions on the tail(s) of the distribution, and we show that there exist equilibria with infinitely many bins for sources with two-sided unbounded support. Further explicit characterizations are provided for sources with exponential, Gaussian, and compactly-supported probability distributions.
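A hypothetical sketch of the best-response dynamics in the quadratic setting with a standard Gaussian source: the decoder responds with conditional means (as in Lloyd's method), and the encoder places each boundary where the bias-shifted source point is equidistant from the two adjacent decoder actions. The bias value, initialization, and iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def best_response_dynamics(num_bins, bias, iters=200):
    """Best-response (Lloyd-type) iteration for quadratic cheap talk
    with a standard Gaussian source and encoder bias `bias`."""
    # Interior bin boundaries, initialized at Gaussian quantiles.
    b = norm.ppf(np.linspace(0, 1, num_bins + 1)[1:-1])
    for _ in range(iters):
        edges = np.concatenate(([-np.inf], b, [np.inf]))
        lo, hi = edges[:-1], edges[1:]
        # Decoder action per bin: mean of N(0,1) truncated to [lo, hi],
        # i.e. (phi(lo) - phi(hi)) / (Phi(hi) - Phi(lo)).
        mass = norm.cdf(hi) - norm.cdf(lo)
        m = (norm.pdf(lo) - norm.pdf(hi)) / mass
        # Encoder indifference: x + bias equidistant from adjacent actions.
        b = (m[:-1] + m[1:]) / 2.0 - bias
    return b, m

boundaries, actions = best_response_dynamics(num_bins=4, bias=0.1)
```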
We present a novel method for reducing the computational complexity of rigorously estimating the partition functions (normalizing constants) of Gibbs (Boltzmann) distributions, which arise ubiquitously in probabilistic graphical models. A major obstacle to practical applications of Gibbs distributions is the need to estimate their partition functions. The state of the art in addressing this problem is multi-stage algorithms consisting of a cooling schedule and a mean estimator at each step of the schedule. While the cooling schedule in these algorithms is adaptive, the mean estimation computations use MCMC as a black box to draw approximate samples. We develop a doubly adaptive approach, combining the adaptive cooling schedule with an adaptive MCMC mean estimator whose number of Markov chain steps adapts dynamically to the underlying chain. Through rigorous theoretical analysis, we prove that our method outperforms state-of-the-art algorithms in several respects: (1) the computational complexity of our method is smaller; (2) our method is less sensitive to loose bounds on mixing times, an inherent component of these algorithms; and (3) the improvement obtained by our method is particularly significant in the most challenging regime of high-precision estimation. We demonstrate the advantage of our method in experiments run on classic factor graphs, such as voting models and Ising models.
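A minimal sketch of the multi-stage product structure common to these algorithms; `sample_at` stands in for the MCMC black box, and the paper's adaptive schedule and adaptive mean estimator are not reproduced here.

```python
import numpy as np

def product_estimator(betas, sample_at, hamiltonian, num_samples, rng):
    """Multi-stage estimate of log Z(beta_K) - log Z(beta_0) for a Gibbs
    distribution mu_beta(x) proportional to exp(-beta * H(x)), using the
    telescoping identity Z(b')/Z(b) = E_{x ~ mu_b}[exp(-(b' - b) H(x))]."""
    log_ratio = 0.0
    for b, b_next in zip(betas[:-1], betas[1:]):
        xs = sample_at(b, num_samples, rng)
        weights = np.exp(-(b_next - b) * np.array([hamiltonian(x) for x in xs]))
        log_ratio += np.log(weights.mean())  # one mean estimate per stage
    return log_ratio

# Toy usage: single spin with H(x) = x, x in {0, 1}; the "MCMC sampler"
# is replaced by exact sampling purely for illustration.
def sample_at(beta, n, rng):
    p1 = np.exp(-beta) / (1.0 + np.exp(-beta))  # P(x = 1) under mu_beta
    return (rng.uniform(size=n) < p1).astype(int)

rng = np.random.default_rng(0)
est = product_estimator(np.linspace(0.0, 2.0, 21), sample_at,
                        lambda x: float(x), 200, rng)
# Exact value: log(1 + e^{-2}) - log 2.
```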
We study sparse linear regression over a network of agents, modeled as an undirected graph (with no centralized node). The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty on the consensus constraint -- the latter being instrumental to obtaining distributed solution methods. While penalty-based consensus methods have been extensively studied in the optimization literature, their statistical and computational guarantees in the high-dimensional setting remain unclear. This work provides an answer to this open problem. Our contribution is two-fold. First, we establish statistical consistency of the estimator: under a suitable choice of the penalty parameter, the optimal solution of the penalized problem achieves the near-optimal minimax rate $\mathcal{O}(s \log d/N)$ in $\ell_2$-loss, where $s$ is the sparsity level, $d$ is the ambient dimension, and $N$ is the total sample size in the network -- this matches centralized sample rates. Second, we show that the proximal-gradient algorithm applied to the penalized problem, which naturally leads to distributed implementations, converges linearly up to a tolerance of the order of the centralized statistical error -- the rate scales as $\mathcal{O}(d)$, revealing an unavoidable speed-accuracy dilemma. Numerical results demonstrate the tightness of the derived sample-rate and convergence-rate scalings.
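A minimal sketch of the proximal-gradient iteration on a consensus-penalized LASSO; the Laplacian-based penalty and all parameter choices are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def penalized_prox_grad(A, b, L, lam, rho, step, iters):
    """Proximal gradient on the consensus-penalized LASSO (sketch).

    A[i], b[i]: local design matrix and responses of agent i;
    L: graph Laplacian of the communication network. The penalty
    (rho/2) * tr(X^T L X) couples neighbors only, so each gradient
    step needs one round of neighbor-to-neighbor communication.
    """
    m, d = len(A), A[0].shape[1]
    X = np.zeros((m, d))  # row i holds agent i's estimate
    for _ in range(iters):
        grad = np.stack([A[i].T @ (A[i] @ X[i] - b[i]) / len(b[i])
                         for i in range(m)])
        grad += rho * (L @ X)  # gradient of the consensus penalty
        X = soft_threshold(X - step * grad, step * lam)
    return X

# Example: 3 agents on a path graph, a shared 3-sparse signal.
rng = np.random.default_rng(0)
A = [rng.standard_normal((20, 50)) for _ in range(3)]
x_true = np.zeros(50); x_true[:3] = 1.0
b = [Ai @ x_true for Ai in A]
L = np.array([[1, -1, 0], [-1, 2, -1], [0, -1, 1]], dtype=float)
X = penalized_prox_grad(A, b, L, lam=0.05, rho=1.0, step=0.01, iters=500)
```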
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
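A minimal sketch of the smoothing device behind DRS, assuming a subgradient oracle: the gradient of the Gaussian-smoothed objective $f_\gamma(x) = \mathbb{E}[f(x + \gamma Z)]$ is estimated by averaging subgradients at randomly perturbed points. Function names and parameters are illustrative.

```python
import numpy as np

def smoothed_subgradient(subgrad, x, gamma, num_samples, rng):
    """Monte Carlo estimate of the gradient of the Gaussian smoothing
    f_gamma(x) = E[f(x + gamma * Z)], Z ~ N(0, I): for Lipschitz f this
    equals E[subgrad(x + gamma * Z)], averaged over random perturbations."""
    g = np.zeros_like(x)
    for _ in range(num_samples):
        g += subgrad(x + gamma * rng.standard_normal(x.shape))
    return g / num_samples

# Example: f(x) = ||x||_1, whose Gaussian smoothing is differentiable
# everywhere; np.sign is a valid subgradient oracle for f.
rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, -2.0])
g = smoothed_subgradient(np.sign, x, gamma=0.1, num_samples=100, rng=rng)
```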
In this paper, we study the optimal convergence rates for distributed convex optimization problems over networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely: the function $F(\mathbf{x}) \triangleq \sum_{i=1}^{m}f_i(\mathbf{x})$ is strongly convex and smooth, strongly convex only, smooth only, or just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors), with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions of the proposed setup, such as proximal-friendly functions, time-varying graphs, and improvement of the condition numbers.
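A hypothetical sketch of accelerated gradient ascent on the dual of a consensus-constrained problem; the gossip matrix, stepsizes, momentum, and quadratic example are illustrative assumptions, not the paper's algorithm verbatim.

```python
import numpy as np

def dual_accelerated_consensus(grad_conj, W, Lam, eta, momentum, iters):
    """Nesterov-type accelerated ascent on the dual of
        min_X sum_i f_i(x_i)  subject to  W X = 0  (consensus),
    where W is a gossip (Laplacian-like) matrix. grad_conj maps the dual
    signals to the primal minimizers x_i = grad f_i^*(.), computed locally;
    every multiplication by W costs one communication round."""
    Lam_prev = Lam.copy()
    for _ in range(iters):
        V = Lam + momentum * (Lam - Lam_prev)   # extrapolation step
        X = grad_conj(-(W @ V))                 # local conjugate computations
        Lam_prev, Lam = Lam, V + eta * (W @ X)  # dual gradient is W X (ascent)
    return X

# Example: f_i(x) = ||x - c_i||^2 / 2, so grad f_i^*(y) = y + c_i.
c = np.array([[1.0], [2.0], [3.0]])
W = np.array([[1, -1, 0], [-1, 2, -1], [0, -1, 1]], dtype=float)  # path graph
X = dual_accelerated_consensus(lambda Z: Z + c, W, np.zeros((3, 1)),
                               eta=0.1, momentum=0.3, iters=3000)
# Each row of X approaches the consensus minimizer, the mean of the c_i (2.0).
```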