We consider \emph{Gibbs distributions}, which are families of probability distributions over a discrete space $\Omega$ with probability mass function of the form $\mu^\Omega_\beta(\omega) \propto e^{\beta H(\omega)}$ for $\beta$ in an interval $[\beta_{\min}, \beta_{\max}]$ and $H( \omega ) \in \{0 \} \cup [1, n]$. The \emph{partition function} is the normalization factor $Z(\beta)=\sum_{\omega \in\Omega}e^{\beta H(\omega)}$. Two important parameters of these distributions are the log partition ratio $q = \log \tfrac{Z(\beta_{\max})}{Z(\beta_{\min})}$ and the counts $c_x = |H^{-1}(x)|$. These are correlated with system parameters in a number of physical applications and sampling algorithms. Our first main result is to estimate the counts $c_x$ using roughly $\tilde O( \frac{q}{\varepsilon^2})$ samples for general Gibbs distributions and $\tilde O( \frac{n^2}{\varepsilon^2} )$ samples for integer-valued distributions (ignoring some second-order terms and parameters), and we show this is optimal up to logarithmic factors. We illustrate with improved algorithms for counting connected subgraphs and perfect matchings in a graph. We develop a key subroutine to estimate the partition function $Z$. Specifically, it generates a data structure to estimate $Z(\beta)$ for \emph{all} values $\beta$, without further samples. Constructing the data structure requires $O(\frac{q \log n}{\varepsilon^2})$ samples for general Gibbs distributions and $O(\frac{n^2 \log n}{\varepsilon^2} + n \log q)$ samples for integer-valued distributions. This improves over a prior algorithm of Huber (2015) which computes a single point estimate $Z(\beta_\max)$ using $O( q \log n( \log q + \log \log n + \varepsilon^{-2}))$ samples. We show matching lower bounds, demonstrating that this complexity is optimal as a function of $n$ and $q$ up to logarithmic terms.
It was recently shown that under smoothness conditions, the squared Wasserstein distance between two distributions could be efficiently computed with appealing statistical error upper bounds. However, rather than the distance itself, the object of interest for applications such as generative modeling is the underlying optimal transport map. Hence, computational and statistical guarantees need to be obtained for the estimated maps themselves. In this paper, we propose the first tractable algorithm for which the statistical $L^2$ error on the maps nearly matches the existing minimax lower-bounds for smooth map estimation. Our method is based on solving the semi-dual formulation of optimal transport with an infinite-dimensional sum-of-squares reformulation, and leads to an algorithm which has dimension-free polynomial rates in the number of samples, with potentially exponentially dimension-dependent constants.
This study concerns probability distribution estimation of sample maximum. The traditional approach is the parametric fitting to the limiting distribution - the generalized extreme value distribution; however, the model in finite cases is misspecified to a certain extent. We propose a plug-in type of the kernel distribution estimator which does not need model specification. It is proved that both asymptotic convergence rates depend on the tail index and the second order parameter. As the tail gets light, the degree of misspecification of the parametric fitting becomes large, that means the convergence rate becomes slow. In the Weibull cases, which can be seen as the limit of tail-lightness, only the nonparametric distribution estimator keeps its consistency. Finally, we report results of numerical experiments and two real case studies.
Let $Q_{n}^{r}$ be the graph with vertex set $\{-1,1\}^{n}$ in which two vertices are joined if their Hamming distance is at most $r$. The edge-isoperimetric problem for $Q_{n}^{r}$ is that: For every $(n,r,M)$ such that $1\le r\le n$ and $1\le M\le2^{n}$, determine the minimum edge-boundary size of a subset of vertices of $Q_{n}^{r}$ with a given size $M$. In this paper, we apply two different approaches to prove bounds for this problem. The first approach is a linear programming approach and the second is a probabilistic approach. Our bound derived by the first approach generalizes the tight bound for $M=2^{n-1}$ derived by Kahn, Kalai, and Linial in 1989. Moreover, our bound is also tight for $M=2^{n-2}$ and $r\le\frac{n}{2}-1$. Our bounds derived by the second approach are expressed in terms of the \emph{noise stability}, and they are shown to be asymptotically tight as $n\to\infty$ when $r=2\lfloor\frac{\beta n}{2}\rfloor+1$ and $M=\lfloor\alpha2^{n}\rfloor$ for fixed $\alpha,\beta\in(0,1)$, and is tight up to a factor $2$ when $r=2\lfloor\frac{\beta n}{2}\rfloor$ and $M=\lfloor\alpha2^{n}\rfloor$. In fact, the edge-isoperimetric problem is equivalent to a ball-noise stability problem which is a variant of the traditional (i.i.d.-) noise stability problem. Our results can be interpreted as bounds for the ball-noise stability problem.
Modern deep neural networks are often too large to use in many practical scenarios. Neural network pruning is an important technique for reducing the size of such models and accelerating inference. Gibbs pruning is a novel framework for expressing and designing neural network pruning methods. Combining approaches from statistical physics and stochastic regularization methods, it can train and prune a network simultaneously in such a way that the learned weights and pruning mask are well-adapted for each other. It can be used for structured or unstructured pruning and we propose a number of specific methods for each. We compare our proposed methods to a number of contemporary neural network pruning methods and find that Gibbs pruning outperforms them. In particular, we achieve a new state-of-the-art result for pruning ResNet-56 with the CIFAR-10 dataset.
This work studies an experimental design problem where $x$'s are to be selected with the goal of estimating a function $m(x)$, which is observed with noise. A linear model is fitted to $m(x)$ but it is not assumed that the model is correctly specified. It follows that the quantity of interest is the best linear approximation of $m(x)$, which is denoted by $\ell(x)$. It is shown that in this framework the ordinary least squares estimator typically leads to an inconsistent estimation of $\ell(x)$, and rather weighted least squares should be considered. An asymptotic minimax criterion is formulated for this estimator, and a design that minimizes the criterion is constructed. An important feature of this problem is that the $x$'s should be random, rather than fixed. Otherwise, the minimax risk is infinite. It is shown that the optimal random minimax design is different from its deterministic counterpart, which was studied previously, and a simulation study indicates that it generally performs better when $m(x)$ is a quadratic or a cubic function. Another finding is that when the variance of the noise goes to infinity, the random and deterministic minimax designs coincide. The results are illustrated for polynomial regression models and different generalizations are presented.
In this paper, we want to clarify the Gibbs phenomenon when continuous and discontinuous finite elements are used to approximate discontinuous or nearly discontinuous PDE solutions from the approximation point of view. For a simple discontinuous function, we explicitly compute its continuous and discontinuous piecewise constant or linear projections on discontinuity matched or non-matched meshes. For the simple discontinuity-aligned mesh case, piecewise discontinuous approximations are always good. For the general non-matched case, we explain that the piecewise discontinuous constant approximation combined with adaptive mesh refinements is a good choice to achieve accuracy without overshoots. For discontinuous piecewise linear approximations, non-trivial overshoots will be observed unless the mesh is matched with discontinuity. For continuous piecewise linear approximations, the computation is based on a "far-away assumption", and non-trivial overshoots will always be observed under regular meshes. We calculate the explicit overshoot values for several typical cases. Numerical tests are conducted for a singularly-perturbed reaction-diffusion equation and linear hyperbolic equations to verify our findings in the paper. Also, we discuss the $L^1$-minimization based methods and do not recommend such methods due to their similar behavior as $L^2$-based methods and more complicated implementations.
We show that for the problem of testing if a matrix $A \in F^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ queries. Our algorithm works for any field $F$. This improves upon the previous $O(d^2/\epsilon^2)$ bound (SODA'03), and bypasses an $\Omega(d^2/\epsilon^2)$ lower bound of (KDD'14) which holds if the algorithm is required to read a submatrix. Our algorithm is the first such algorithm which does not read a submatrix, and instead reads a carefully selected non-adaptive pattern of entries in rows and columns of $A$. We complement our algorithm with a matching query complexity lower bound for non-adaptive testers over any field. We also give tight bounds of $\widetilde{\Theta}(d^2)$ queries in the sensing model for which query access comes in the form of $\langle X_i, A\rangle:=tr(X_i^\top A)$; perhaps surprisingly these bounds do not depend on $\epsilon$. We next develop a novel property testing framework for testing numerical properties of a real-valued matrix $A$ more generally, which includes the stable rank, Schatten-$p$ norms, and SVD entropy. Specifically, we propose a bounded entry model, where $A$ is required to have entries bounded by $1$ in absolute value. We give upper and lower bounds for a wide range of problems in this model, and discuss connections to the sensing model above.
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and, then, normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop an Markov Chain Monte Carlo sampler for Bayesian inferences. A test for distributional homogeneity across groups is obtained as a by product. The results and their inferential implications are showcased on synthetic and real data.
In this paper, we study the optimal convergence rate for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely: the function $F(\xb) \triangleq \sum_{i=1}^{m}f_i(\xb)$ is strongly convex and smooth, either strongly convex or smooth or just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors) with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions to the proposed setup such as proximal friendly functions, time-varying graphs, improvement of the condition numbers.