In this paper, we study the connection between entropic optimal transport and the entropy power inequality (EPI). First, we prove an HWI-type inequality, making use of the infinitesimal displacement convexity of the optimal transport map. Second, we derive two Talagrand-type inequalities using the saturation of the EPI, which corresponds to a numerical term in our expression. We evaluate this term numerically for a wide variety of distributions, and obtain it in explicit form for Gaussian and i.i.d. Cauchy distributions. We show that our results extend the previous Gaussian Talagrand inequality for the Sinkhorn distance to the strongly log-concave case.
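For context, the classical Talagrand inequality that results of this type refine (quoted here for orientation, not taken from the paper) bounds the quadratic Wasserstein distance to the standard Gaussian measure $\gamma$ on $\mathbb{R}^d$ by relative entropy:
\[
  W_2^2(\mu, \gamma) \;\le\; 2\, H(\mu \,\|\, \gamma),
  \qquad
  H(\mu \,\|\, \gamma) = \int \log\frac{d\mu}{d\gamma}\, d\mu,
\]
and the Talagrand-type inequalities above play the analogous role for the entropic (Sinkhorn) transport cost.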
Numerous studies have demonstrated that deep neural networks are easily misled by adversarial examples. Effectively evaluating the adversarial robustness of a model is important for its deployment in practical applications. Currently, a common type of evaluation approximates the adversarial risk of a model, as a robustness indicator, by constructing malicious instances and executing attacks. Unfortunately, there is an error (gap) between the approximate value and the true value. Previous studies manually design attack methods to achieve a smaller error, which is inefficient and may miss a better solution. In this paper, we formulate the tightening of the approximation error as an optimization problem and solve it algorithmically. More specifically, we first show that replacing the non-convex and discontinuous 0-1 loss with a surrogate loss, a necessary compromise when computing the approximation, is one of the main reasons for the error. Then we propose AutoLoss-AR, the first method for searching loss functions that tighten the approximation error of adversarial risk. Extensive experiments are conducted in multiple settings. The results demonstrate the effectiveness of the proposed method: the best-discovered loss functions outperform the handcrafted baseline by 0.9%-2.9% and 0.7%-2.0% on MNIST and CIFAR-10, respectively. We also verify that the searched losses transfer to other settings, and we explore why they outperform the baseline by visualizing the local loss landscape.
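To make the compromise concrete, the following is a minimal sketch of how the adversarial risk is typically approximated, assuming a PyTorch classifier `model` with inputs in $[0,1]$; the PGD parameters and the cross-entropy surrogate are illustrative placeholders, not the searched losses of AutoLoss-AR.

```python
import torch
import torch.nn.functional as F

def approx_adversarial_risk(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Approximate the adversarial risk of `model` on a batch (x, y):
    run a PGD attack that maximizes a differentiable surrogate loss
    (cross-entropy here, standing in for the 0-1 loss), then report
    the 0-1 loss on the perturbed points."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)       # surrogate, not 0-1
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = x + (x_adv - x).clamp(-eps, eps)      # project onto L_inf ball
        x_adv = x_adv.clamp(0.0, 1.0)                 # keep valid input range
    preds = model(x_adv).argmax(dim=1)
    return (preds != y).float().mean().item()         # 0-1 loss at the end
```

Any gap between the surrogate's maximizers and those of the 0-1 loss feeds directly into the approximation error that the search aims to tighten.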
Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem, using observations of agent actions. As already observed by Russell [1998], the problem is ill-posed: the reward function is not identifiable, even in the presence of perfect information about optimal behavior. We provide a resolution to this non-identifiability for problems with entropy regularization. For a given environment, we fully characterize the reward functions leading to a given policy, and we demonstrate that, given demonstrations of actions for the same reward under two distinct discount factors, or under sufficiently different environments, the unobserved reward can be recovered up to a constant. We also give general necessary and sufficient conditions for the reconstruction of time-homogeneous rewards on finite horizons, and of action-independent rewards, generalizing recent results of Kim et al. [2021] and Fu et al. [2018].
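One standard way to see the degeneracy (a known fact about entropy-regularized control, stated here for context rather than quoted from the paper): the optimal policy depends on the reward only through the soft $Q$-function,
\[
  \pi(a \mid s) \propto \exp Q(s,a),
  \qquad
  Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\Big[\log \sum_{a'} \exp Q(s',a')\Big],
\]
and potential-based shaping, $r'(s,a) = r(s,a) + \phi(s) - \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}[\phi(s')]$, shifts $Q(s,a)$ by the action-independent amount $\phi(s)$ and hence leaves $\pi$ unchanged. Varying the discount factor changes the shaping correction, which gives some intuition for why demonstrations under two distinct discount factors suffice for recovery up to a constant.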
In this work we undertake a thorough study of the non-asymptotic properties of vanilla generative adversarial networks (GANs). We derive theoretical guarantees for density estimation with GANs under a proper choice of the classes of deep neural networks representing the generators and discriminators. In particular, we prove that the resulting estimate converges to the true density $\mathsf{p}^*$ in terms of the Jensen-Shannon (JS) divergence at the rate $(\log{n}/n)^{2\beta/(2\beta+d)}$, where $n$ is the sample size and $\beta$ determines the smoothness of $\mathsf{p}^*$. To the best of our knowledge, this is the first result in the literature on density estimation using vanilla GANs with JS convergence rates faster than $n^{-1/2}$ in the regime $\beta > d/2$. Moreover, we show that the obtained rate is minimax optimal (up to logarithmic factors) for the considered class of densities.
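For reference, the Jensen-Shannon divergence in which the rate is stated is
\[
  \mathrm{JS}(\mathsf{p}, \mathsf{q})
  = \tfrac{1}{2}\,\mathrm{KL}\Big(\mathsf{p} \,\Big\|\, \tfrac{\mathsf{p}+\mathsf{q}}{2}\Big)
  + \tfrac{1}{2}\,\mathrm{KL}\Big(\mathsf{q} \,\Big\|\, \tfrac{\mathsf{p}+\mathsf{q}}{2}\Big),
\]
so the claim reads $\mathrm{JS}(\hat{\mathsf{p}}_n, \mathsf{p}^*) \lesssim (\log n / n)^{2\beta/(2\beta+d)}$, where $\hat{\mathsf{p}}_n$ denotes the GAN estimate (our notation, introduced here for readability).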
We study combinatorial inequalities for various classes of set systems: matroids, polymatroids, poset antimatroids, and interval greedoids. We prove log-concavity inequalities for the counts of certain weighted feasible words, which generalize and extend several previous results establishing Mason's conjectures for the numbers of independent sets of matroids. Notably, we prove matching equality conditions both for the earlier inequalities and for our extensions. In contrast with much of the previous work, our proofs are combinatorial, employing nothing but linear algebra. We use the language formulation of greedoids, which admits a linear-algebraic setup that can in turn be analyzed recursively. The underlying non-commutative nature of the matrices associated with greedoids allows us to proceed beyond polymatroids and to prove the equality conditions. As a further application of our tools, we rederive both Stanley's inequality on the number of certain linear extensions and its equality conditions, which we then also extend to the weighted case.
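For orientation, the strongest of the Mason conjectures referred to above (a known statement, quoted for context rather than taken from the paper) asserts ultra-log-concavity of the independent-set counts of a matroid on $n$ elements: writing $I_k$ for the number of independent sets of size $k$,
\[
  \frac{I_k^2}{\binom{n}{k}^2} \;\ge\; \frac{I_{k-1}}{\binom{n}{k-1}} \cdot \frac{I_{k+1}}{\binom{n}{k+1}},
\]
and the weighted feasible-word inequalities above generalize results of this shape.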
We introduce a framework for obtaining tight mixing times for Markov chains, based on what we call restricted modified log-Sobolev inequalities. Modified log-Sobolev inequalities (MLSI) quantify the rate of relative entropy contraction for a Markov operator and are notoriously difficult to establish. However, infinitesimally close to stationarity, entropy contraction becomes equivalent to variance contraction, a.k.a. a Poincaré inequality, which is significantly easier to establish through, e.g., spectral analysis. Motivated by this observation, we study restricted modified log-Sobolev inequalities, which guarantee entropy contraction not for all starting distributions but for those in a large neighborhood of the stationary distribution. We show how to sample from the hardcore and Ising models on $n$-node graphs that have a constant relative gap $\delta$ to the tree-uniqueness threshold, in nearly linear time $\widetilde O_{\delta}(n)$. Notably, our bound does not depend on the maximum degree $\Delta$ and is therefore optimal even for high-degree graphs. This improves on prior mixing time bounds of $\widetilde O_{\delta, \Delta}(n)$ and $\widetilde O_{\delta}(n^2)$, established via (non-restricted) modified log-Sobolev and Poincaré inequalities, respectively. We further show that optimal concentration inequalities can still be derived from the restricted form of modified log-Sobolev inequalities. To establish restricted entropy contraction, we extend the entropic independence framework of Anari, Jain, Koehler, Pham, and Vuong to distributions that are spectrally independent under a restricted set of external fields. We also develop an orthogonal trick that might be of independent interest: utilizing Bernoulli factories, we show how to implement Glauber dynamics updates on high-degree graphs in $O(1)$ time, assuming a standard adjacency-array representation of the graph.
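As a point of reference, here is a minimal sketch of single-site Glauber dynamics for the Ising model, the chain whose mixing is being bounded; this is the textbook dynamics, not the paper's Bernoulli-factory implementation, and the naive update below costs $O(\deg(v))$ per step, which is precisely the cost the $O(1)$ trick avoids on high-degree graphs.

```python
import math
import random

def glauber_ising(adj, beta, steps, spins=None):
    """Single-site Glauber dynamics for the Ising model on a graph given as
    an adjacency list (dict: node -> list of neighbors), at inverse
    temperature beta. Each step picks a uniformly random vertex and
    resamples its spin from the conditional distribution given its
    neighbors."""
    nodes = list(adj)
    if spins is None:
        spins = {v: random.choice((-1, 1)) for v in nodes}
    for _ in range(steps):
        v = random.choice(nodes)
        field = sum(spins[u] for u in adj[v])                # O(deg(v)) work
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * field)) # P(spin = +1 | neighbors)
        spins[v] = 1 if random.random() < p_plus else -1
    return spins
```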
We introduce a notion called entropic independence, an entropic analog of spectral notions of high-dimensional expansion. Informally, entropic independence of a background distribution $\mu$ on $k$-sized subsets of a ground set of elements says that for any (possibly randomly chosen) set $S$, the relative entropy of a single element of $S$ drawn uniformly at random carries at most an $O(1/k)$ fraction of the relative entropy of $S$. Entropic independence is the analog of spectral independence with variance replaced by entropy. We use entropic independence to derive tight mixing time bounds, overcoming the lossy nature of spectral analysis of Markov chains on exponential-size state spaces. In our main technical result, we give a general way of deriving entropy contraction, a.k.a. modified log-Sobolev inequalities, for down-up random walks from spectral notions: we show that spectral independence of a distribution under arbitrary external fields automatically implies entropic independence. To derive our results, we relate entropic independence to properties of polynomials: $\mu$ is entropically independent exactly when a transformed version of its generating polynomial is upper bounded by its linear tangent; this property is implied by concavity of the said transformation, which was shown by prior work to be locally equivalent to spectral independence. We apply our results to obtain tight modified log-Sobolev inequalities and mixing times for multi-step down-up walks on fractionally log-concave distributions. As our flagship application, we establish the tight mixing time of $O(n\log n)$ for Glauber dynamics on Ising models whose interaction matrix has eigenspectrum lying within an interval of length smaller than $1$, improving upon the prior quadratic dependence on $n$.
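In symbols, and in the spirit of the informal description above (notation ours, possibly differing from the paper's in minor ways): $\mu$ on size-$k$ subsets is entropically independent with constant $\alpha$ when, for every distribution $\nu$ on the same subsets,
\[
  \mathcal{D}_{\mathrm{KL}}\big(\nu D_{k\to 1} \,\big\|\, \mu D_{k\to 1}\big)
  \;\le\; \frac{1}{\alpha k}\, \mathcal{D}_{\mathrm{KL}}\big(\nu \,\big\|\, \mu\big),
\]
where $D_{k\to 1}$ maps a distribution on $k$-sets to the law of a uniformly random element of the sampled set; taking $\alpha = \Omega(1)$ recovers the $O(1/k)$ fraction stated above.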
Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers; it consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what such methods avoid. In this paper, we bridge this gap and present Stochastic Positional Encoding (SPE), a way to generate positional encodings that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. Our main theoretical contribution is a connection between positional encoding and the cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
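A minimal sketch of the Bochner-type random-feature idea underlying such constructions (illustrative only; the Gaussian frequency distribution below is an arbitrary placeholder for whatever the actual method uses or learns): sinusoidal features with random frequencies and phases have inner products that concentrate around a function of the lag $m - n$ alone, which is exactly the stationarity that RPE requires.

```python
import numpy as np

def stochastic_pe(length, num_feats=64, rng=None):
    """Random sinusoidal positional features: for positions m, n, the inner
    product feats[m] . feats[n] averages to E_w[cos(w * (m - n))], i.e. a
    function of the lag only (a stationary kernel, by Bochner's theorem)."""
    rng = rng or np.random.default_rng(0)
    w = rng.normal(size=num_feats)                     # random frequencies (placeholder density)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=num_feats)
    pos = np.arange(length)[:, None]                   # shape (length, 1)
    return np.sqrt(2.0 / num_feats) * np.cos(pos * w[None, :] + phase[None, :])
```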
While Generative Adversarial Networks (GANs) have empirically produced impressive results on learning complex real-world distributions, recent work has shown that they suffer from a lack of diversity, or mode collapse. The theoretical work of Arora et al.~\cite{AroraGeLiMaZh17} suggests a dilemma about GANs' statistical properties: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode collapse. In contrast, we show in this paper that GANs can in principle learn distributions in Wasserstein distance (or KL divergence, in many cases) with polynomial sample complexity, if the discriminator class has strong distinguishing power against the particular generator class (instead of against all possible generators). For various generator classes, such as mixtures of Gaussians, exponential families, and invertible neural network generators, we design corresponding discriminators (which are often neural nets of specific architectures) such that the Integral Probability Metric (IPM) induced by the discriminators provably approximates the Wasserstein distance and/or the KL divergence. This implies that if the training is successful, then the learned distribution is close to the true distribution in Wasserstein distance or KL divergence, and thus cannot drop modes. Our preliminary experiments show that on synthetic datasets the test IPM is well correlated with the KL divergence, indicating that the lack of diversity may be caused by sub-optimality in optimization rather than statistical inefficiency.
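As a crude illustration of the quantity involved (assuming PyTorch; in practice the supremum over the discriminator class is approximated by training a critic, not by enumerating a finite family as below):

```python
import torch

def empirical_ipm_lower_bound(real, fake, critics):
    """Lower-bound the Integral Probability Metric between two samples by
    taking the best mean discrepancy over a finite family of critic
    functions (callables mapping a batch to per-example scores); assumes
    the family is closed under negation, so no absolute value is needed."""
    gaps = [c(real).mean() - c(fake).mean() for c in critics]
    return torch.stack(gaps).max()
```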
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of the local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm, called multi-step primal-dual (MSPD), together with its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate, even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm, called distributed randomized smoothing (DRS), based on a local smoothing of the objective function, and we show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
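The smoothing underlying DRS can be sketched as follows (a minimal NumPy illustration under our own naming, not the paper's algorithm; `subgrad_f` may return any subgradient): Gaussian smoothing $f_\gamma(x) = \mathbb{E}_{Z \sim \mathcal{N}(0, I)}[f(x + \gamma Z)]$ produces a differentiable surrogate of a Lipschitz $f$, and averaging subgradients at perturbed points estimates $\nabla f_\gamma(x)$.

```python
import numpy as np

def smoothed_value_and_grad(f, subgrad_f, x, gamma=0.1, samples=32, rng=None):
    """Monte Carlo estimate of the Gaussian-smoothed objective
    f_gamma(x) = E[f(x + gamma Z)] and of its gradient
    grad f_gamma(x) = E[subgrad_f(x + gamma Z)]."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal((samples, x.size))
    pts = x[None, :] + gamma * z                     # perturbed query points
    value = float(np.mean([f(p) for p in pts]))
    grad = np.mean([subgrad_f(p) for p in pts], axis=0)
    return value, grad
```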
In this paper, we study optimal convergence rates for distributed convex optimization problems over networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely when the function $F(x) \triangleq \sum_{i=1}^{m}f_i(x)$ is (i) strongly convex and smooth, (ii) strongly convex, (iii) smooth, or (iv) just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors), with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss extensions of the proposed setup, such as proximal-friendly functions, time-varying graphs, and improved condition numbers.
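Schematically, the affine-constraint model can be rendered as follows (our own sketch of the standard consensus reformulation, not a quotation from the paper): each node $i$ holds a local copy $x_i$, and agreement is encoded through a gossip matrix $W$ whose kernel is spanned by the consensus directions,
\[
  \min_{x_1, \dots, x_m} \; \sum_{i=1}^{m} f_i(x_i)
  \quad \text{subject to} \quad \sqrt{W}\, X = 0,
  \qquad X = (x_1, \dots, x_m).
\]
Running Nesterov's accelerated gradient on the dual of this problem alternates local computations with communication rounds governed by $W$, and the conditioning of the dual is controlled by the spectral gap of $W$, which is where the additional cost mentioned above enters.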