The Johnson-Lindenstrauss lemma states that random projections can be used as a topology-preserving embedding technique for fixed vectors. In this paper, we try to understand how random projections affect probabilistic properties of random vectors. In particular, we prove that the distribution of the inner product of two independent random vectors $X, Z \in \mathbb{R}^n$ is preserved by a random projection $S:\mathbb{R}^n \to \mathbb{R}^m$. More precisely, \[ \sup_t \left| \text{P}\left(\frac{1}{C_{m,n}} X^TS^TSZ <t\right) - \text{P}\left(\frac{1}{\sqrt{n}} X^TZ<t\right) \right| \le O\left(\frac{1}{\sqrt{n}}+ \frac{1}{\sqrt{m}} \right). \] This is achieved by proving a general central limit theorem (product-CLT) for $\sum_{k=1}^{n} X_k Y_k$, where $\{X_k\}$ is a martingale difference sequence and $\{Y_k\}$ has dependency within the sequence. We also obtain the rate of convergence in the spirit of the Berry-Esseen theorem.
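As a rough numerical illustration of the quantity $X^TS^TSZ$ above, the following sketch averages the projected inner product over many independent draws of $S$. It assumes, purely for illustration, that $S$ has i.i.d. $N(0, 1/m)$ entries and that the vectors are held fixed; the abstract does not specify the projection or the normalization $C_{m,n}$, and the function names are hypothetical. It checks only that the projected inner product is unbiased for $x^Tz$ under this assumption, not the distributional bound itself.

```python
import math
import random


def gaussian_projection(m, n, rng):
    """Draw an m x n matrix with i.i.d. N(0, 1/m) entries (an assumed choice of S)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(m)]


def project(S, v):
    """Compute the matrix-vector product S v."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def mean_projected_inner_product(x, z, m, trials, seed=0):
    """Average (S x)^T (S z) over `trials` independent draws of S."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        S = gaussian_projection(m, len(x), rng)
        total += dot(project(S, x), project(S, z))
    return total / trials
```

With this scaling, each draw has expectation $x^Tz$ and variance $(\|x\|^2\|z\|^2 + (x^Tz)^2)/m$, so the average over draws settles near the unprojected inner product.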
Studying the properties of stochastic noise to optimize complex non-convex functions has been an active area of research in the field of machine learning. Prior work has shown that the noise of stochastic gradient descent improves optimization by overcoming undesirable obstacles in the landscape. Moreover, injecting artificial Gaussian noise has become a popular idea to quickly escape saddle points. Indeed, in the absence of reliable gradient information, the noise is used to explore the landscape, but it is unclear what type of noise is optimal in terms of exploration ability. In order to narrow this gap in our knowledge, we study a general type of continuous-time non-Markovian process, based on fractional Brownian motion, that allows for the increments of the process to be correlated. This generalizes processes based on Brownian motion, such as the Ornstein-Uhlenbeck process. We demonstrate how to discretize such processes, which gives rise to a new algorithm, fPGD. This method is a generalization of the known algorithms PGD and Anti-PGD. We study the properties of fPGD both theoretically and empirically, demonstrating that it possesses exploration abilities that, in some cases, compare favorably with those of PGD and Anti-PGD. These results open the field to novel ways to exploit noise for training machine learning models.
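To make the idea of correlated noise increments concrete, here is a minimal sketch that samples fractional Gaussian noise (the unit-step increments of fractional Brownian motion) by Cholesky factorization and injects it into a one-dimensional gradient descent. The update rule, the function names, and the parameters `lr`, `sigma`, and `hurst` are assumptions made for illustration; the abstract does not give the exact form of fPGD, so this should not be read as the authors' algorithm.

```python
import math
import random


def fgn_autocov(k, hurst):
    """Autocovariance at lag k of unit-variance fractional Gaussian noise."""
    k = abs(k)
    p = 2.0 * hurst
    return 0.5 * ((k + 1) ** p - 2.0 * k ** p + abs(k - 1) ** p)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_fgn(steps, hurst, rng):
    """Draw `steps` correlated Gaussian increments with Hurst parameter `hurst`."""
    cov = [[fgn_autocov(i - j, hurst) for j in range(steps)] for i in range(steps)]
    low = cholesky(cov)
    white = [rng.gauss(0.0, 1.0) for _ in range(steps)]
    return [sum(low[i][k] * white[k] for k in range(i + 1)) for i in range(steps)]


def perturbed_gd(grad, x0, lr, sigma, steps, hurst, seed=0):
    """Gradient descent with additive correlated noise (a hypothetical update rule)."""
    rng = random.Random(seed)
    noise = sample_fgn(steps, hurst, rng)
    x = x0
    path = [x]
    for xi in noise:
        x = x - lr * grad(x) + sigma * xi
        path.append(x)
    return path
```

In this sketch, a Hurst parameter of 0.5 gives uncorrelated increments (ordinary Gaussian perturbations), values above 0.5 give positively correlated increments, and values below 0.5 give negatively correlated ones.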
We study the problem of unbiased estimation of expectations with respect to (w.r.t.) $\pi$, a given general probability measure on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ that is absolutely continuous with respect to a standard Gaussian measure. We focus on simulation associated with a particular class of diffusion processes, sometimes termed the Schr\"odinger-F\"ollmer Sampler, which is a simulation technique that approximates the law of a particular diffusion bridge process $\{X_t\}_{t\in [0,1]}$ on $\mathbb{R}^d$, $d\in \mathbb{N}_0$. This latter process is constructed such that, starting at $X_0=0$, one has $X_1\sim \pi$. Typically, the drift of the diffusion is intractable and, even if it were not, exact sampling of the associated diffusion is not possible. As a result, \cite{sf_orig,jiao} consider a stochastic Euler-Maruyama scheme that allows the development of biased estimators for expectations w.r.t.~$\pi$. We show that for this methodology to achieve a mean square error of $\mathcal{O}(\epsilon^2)$, for arbitrary $\epsilon>0$, the associated cost is $\mathcal{O}(\epsilon^{-5})$. We then introduce an alternative approach that provides unbiased estimates of expectations w.r.t.~$\pi$, that is, it does not suffer from the time discretization bias or the bias associated with the approximation of the drift function. We prove that to achieve a mean square error of $\mathcal{O}(\epsilon^2)$, the associated cost is, with high probability, $\mathcal{O}(\epsilon^{-2}|\log(\epsilon)|^{2+\delta})$, for any $\delta>0$. We implement our method on several examples including Bayesian inverse problems.
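The biased baseline described above (Euler-Maruyama in time plus a Monte Carlo approximation of the drift) can be sketched as follows in one dimension. The sketch assumes the standard Schrödinger-Föllmer drift $b(x,t) = \mathbb{E}[\nabla f(x+\sqrt{1-t}\,Z)] / \mathbb{E}[f(x+\sqrt{1-t}\,Z)]$ with $f = d\pi/d\gamma$, $\gamma$ the standard Gaussian and $Z\sim\gamma$, which is not spelled out in the abstract. The toy target $\pi = N(\mu, 1)$ and all function names are illustrative choices; this is not the unbiased method introduced in the paper.

```python
import math
import random


def density_ratio(x, mu):
    """f = d(pi)/d(gamma) for the toy target pi = N(mu, 1), gamma = N(0, 1)."""
    return math.exp(mu * x - 0.5 * mu * mu)


def density_ratio_grad(x, mu):
    """Derivative of the density ratio with respect to x."""
    return mu * density_ratio(x, mu)


def mc_drift(x, t, mu, n_mc, rng):
    """Monte Carlo estimate of E[f'(x + sqrt(1-t) Z)] / E[f(x + sqrt(1-t) Z)]."""
    scale = math.sqrt(1.0 - t)
    num = 0.0
    den = 0.0
    for _ in range(n_mc):
        y = x + scale * rng.gauss(0.0, 1.0)
        num += density_ratio_grad(y, mu)
        den += density_ratio(y, mu)
    return num / den


def sample_x1(mu, n_steps, n_mc, rng):
    """Euler-Maruyama on [0, 1] from X_0 = 0 using the estimated drift."""
    h = 1.0 / n_steps
    x = 0.0
    for k in range(n_steps):
        t = k * h
        x += h * mc_drift(x, t, mu, n_mc, rng) + math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x
```

For this Gaussian toy target the drift is constant and equal to $\mu$, so the scheme is easy to sanity-check: the simulated $X_1$ should have mean close to $\mu$ and variance close to 1. For general targets both the step size and the number of Monte Carlo samples contribute bias, which is what motivates the cost analysis in the abstract.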
The argmax theorem is a useful result for deriving the limiting distribution of estimators in many applications. The conclusion of the argmax theorem states that the argmax of a sequence of stochastic processes converges in distribution to the argmax of a limiting stochastic process. This paper generalizes the argmax theorem to allow the maximization to take place over a sequence of subsets of the domain. If the sequence of subsets converges to a limiting subset, then the conclusion of the argmax theorem continues to hold. We demonstrate the usefulness of this generalization in three applications: estimating a structural break, estimating a parameter on the boundary of the parameter space, and estimating a weakly identified parameter. The generalized argmax theorem simplifies the proofs for existing results and can be used to prove new results in these literatures.
Federated learning, where algorithms are trained across multiple decentralized devices without sharing local data, is increasingly popular in distributed machine learning practice. Typically, a graph structure $G$ exists behind local devices for communication. In this work, we consider parameter estimation in federated learning with data distribution and communication heterogeneity, as well as limited computational capacity of local devices. We encode the distribution heterogeneity by parametrizing distributions on local devices with a set of distinct $p$-dimensional vectors. We then propose to jointly estimate parameters of all devices under the $M$-estimation framework with the fused Lasso regularization, encouraging an equal estimate of parameters on connected devices in $G$. We provide a general result for our estimator depending on $G$, which can be further calibrated to obtain convergence rates for various specific problem setups. Surprisingly, our estimator attains the optimal rate under a certain graph fidelity condition on $G$, as if we could aggregate all samples sharing the same distribution. If the graph fidelity condition is not met, we propose an edge selection procedure via multiple testing to ensure optimality. To ease the burden of local computation, a decentralized stochastic version of ADMM is provided, with convergence rate $O(T^{-1}\log T)$, where $T$ denotes the number of iterations. We highlight that our algorithm transmits only parameters along edges of $G$ at each iteration, without requiring a central machine, which preserves privacy. We further extend it to the case where devices are randomly inaccessible during the training process, with a similar algorithmic convergence guarantee. The computational and statistical efficiency of our method is evidenced by simulation experiments and the 2020 US presidential election data set.
Reaching a consensus in a swarm of robots is one of the fundamental problems in swarm robotics, examining the possibility of reaching an agreement within the swarm members. The recently introduced contamination problem offers a new perspective on the problem, in which swarm members should reach a consensus in spite of the existence of adversarial members that intentionally act to divert the swarm members towards a different consensus. In this paper, we search for a consensus-reaching algorithm under the contamination problem setting by taking a top-down approach: we transform the problem to a centralized two-player game in which each player controls the behavior of a subset of the swarm, trying to force the entire swarm to converge to an agreement on its own value. We define a performance metric for each player, proving a correlation between this metric and the chances of the player to win the game. We then present the globally optimal solution to the game and prove that, unfortunately, it is unattainable in a distributed setting, due to the challenging characteristics of the swarm members. We therefore examine the problem on a simplified swarm model, and compare the performance of the globally optimal strategy with locally optimal strategies, demonstrating its superiority in rigorous simulation experiments.
Learning sketching matrices for fast and accurate low-rank approximation (LRA) has gained increasing attention. Recently, Bartlett, Indyk, and Wagner (COLT 2022) presented a generalization bound for learning-based LRA. Specifically, for rank-$k$ approximation using an $m \times n$ learned sketching matrix with $s$ non-zeros in each column, they proved an $\tilde{\mathrm{O}}(nsm)$ bound on the \emph{fat shattering dimension} ($\tilde{\mathrm{O}}$ hides logarithmic factors). We build on their work and make two contributions. 1. We present a better $\tilde{\mathrm{O}}(nsk)$ bound ($k \le m$). En route to obtaining the bound, we give a low-complexity \emph{Goldberg--Jerrum algorithm} for computing pseudo-inverse matrices, which may be of independent interest. 2. We alleviate an assumption of the previous study that the sparsity pattern of sketching matrices is fixed. We prove that learning the positions of non-zeros increases the fat shattering dimension only by ${\mathrm{O}}(ns\log n)$. In addition, experiments confirm the practical benefit of learning sparsity patterns.
The $h$-index is a metric used to measure the impact of a user in a publication setting, such as a member of a social network with many highly liked posts or a researcher in an academic domain with many highly cited publications. Specifically, the $h$-index of a user is the largest integer $h$ such that at least $h$ publications of the user have at least $h$ units of positive feedback. We design an algorithm that, given query access to the $n$ publications of a user and each publication's corresponding positive feedback number, outputs a $(1\pm \varepsilon)$-approximation of the $h$-index of this user with probability at least $1-\delta$ in time \[ O\left(\frac{n \cdot \ln{(1/\delta)}}{\varepsilon^2 \cdot h}\right), \] where $h$ is the actual $h$-index, which is unknown to the algorithm a priori. We then design a novel lower bound technique that allows us to prove that this bound is in fact asymptotically optimal for this problem in all parameters $n,h,\varepsilon,$ and $\delta$. Our work is one of the first in sublinear time algorithms that addresses obtaining asymptotically optimal bounds, especially in terms of the error and confidence parameters. As such, we focus on designing novel techniques for this task. In particular, our lower bound technique seems quite general -- to showcase this, we also use our approach to prove an asymptotically optimal lower bound for the problem of estimating the number of triangles in a graph in sublinear time, which is now also optimal in the error and confidence parameters. This result improves upon prior lower bounds of Eden, Levi, Ron, and Seshadhri (FOCS'15) for this problem, as well as multiple follow-ups that extended this lower bound to other subgraph counting problems.
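The definition of the $h$-index above can be made concrete with a short exact computation. This is only the straightforward baseline that reads all $n$ feedback counts, not the sublinear-time approximation algorithm of the paper; the function name is illustrative.

```python
def h_index(feedback_counts):
    """Largest h such that at least h items have at least h units of feedback."""
    ranked = sorted(feedback_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` items all have at least `rank` feedback
        else:
            break
    return h
```

For example, feedback counts of 3, 0, 6, 1, and 5 give an $h$-index of 3: three publications have at least 3 units of feedback, but fewer than four have at least 4.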
A support vector machine (SVM) is an algorithm that finds a hyperplane which optimally separates labeled data points in $\mathbb{R}^n$ into positive and negative classes. The data points on the margin of this separating hyperplane are called support vectors. We connect the possible configurations of support vectors to Radon's theorem, which provides guarantees for when a set of points can be divided into two classes (positive and negative) whose convex hulls intersect. If the convex hulls of the positive and negative support vectors are projected onto a separating hyperplane, then the projections intersect if and only if the hyperplane is optimal. Further, with a particular type of general position, we show that (a) the projected convex hulls of the support vectors intersect in exactly one point, (b) the support vectors are stable under perturbation, (c) there are at most $n+1$ support vectors, and (d) every number of support vectors from 2 up to $n+1$ is possible. Finally, we perform computer simulations studying the expected number of support vectors, and their configurations, for randomly generated data. We observe that as the distance between classes of points increases for this type of randomly generated data, configurations with fewer support vectors become more likely.
Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically, that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays, thereby causing the algorithm to believe that the arm is sub-optimal. To alleviate the fragility issues exposed, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also provide a sharp trade-off between the amount of UCB exploration and the tail exponent of the resulting regret distribution.
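As a point of reference for the UCB designs discussed above, the following is a minimal sketch of a UCB-style index policy on Bernoulli arms, with a tunable coefficient on the exploration bonus. The index form, the `exploration` parameter, and the function name are assumptions chosen for illustration; the abstract does not specify the exact designs or the modification the paper analyzes.

```python
import math
import random


def run_ucb(means, horizon, exploration=2.0, seed=0):
    """Simulate a UCB-style policy on Bernoulli arms; assumes horizon >= len(means).

    Returns the number of pulls per arm and the pseudo-regret.
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play each arm once to initialise the estimates
        else:
            # Index = empirical mean + exploration bonus that shrinks with pulls.
            arm = max(
                range(k),
                key=lambda i: sums[i] / pulls[i]
                + math.sqrt(exploration * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    best = max(means)
    regret = sum((best - m) * n for m, n in zip(means, pulls))
    return pulls, regret
```

Running this over many seeds and comparing the spread of `regret` for different values of `exploration` is one simple way to look at the tail behaviour that the abstract describes.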
Learning on big data has driven the success of artificial intelligence (AI), but annotation and training are expensive. In the future, learning on small data will be one of the ultimate goals of AI, requiring machines to recognize objectives and scenarios from small data as humans do. A series of machine learning models, such as active learning, few-shot learning, and deep clustering, is moving in this direction. However, there are few theoretical guarantees for their generalization performance. Moreover, most of their settings are passive, that is, the label distribution is explicitly controlled by one specified sampling scenario. This survey follows agnostic active sampling under a PAC (Probably Approximately Correct) framework to analyze the generalization error and label complexity of learning on small data in both supervised and unsupervised fashion. With these theoretical analyses, we categorize the small data learning models from two geometric perspectives: the Euclidean and non-Euclidean (hyperbolic) mean representation, where their optimization solutions are also presented and discussed. Some potential learning scenarios that may benefit from small data learning are then summarized and analyzed. Finally, some challenging applications, such as computer vision and natural language processing, that may benefit from learning on small data are also surveyed.