We present a new finite-sample analysis of M-estimators of locations in $\mathbb{R}^d$ using the tool of the influence function. In particular, we show that the deviations of an M-estimator can be controlled thanks to its influence function (or its score function) and then, we use concentration inequality on M-estimators to investigate the robust estimation of the mean in high dimension in a corrupted setting (adversarial corruption setting) for bounded and unbounded score functions. For a sample of size $n$ and covariance matrix $\Sigma$, we attain the minimax speed $\sqrt{Tr(\Sigma)/n}+\sqrt{\|\Sigma\|_{op}\log(1/\delta)/n}$ with probability larger than $1-\delta$ in a heavy-tailed setting. One of the major advantages of our approach compared to others recently proposed is that our estimator is tractable and fast to compute even in very high dimension with a complexity of $O(nd\log(Tr(\Sigma)))$ where $n$ is the sample size and $\Sigma$ is the covariance matrix of the inliers. In practice, the code that we make available for this article proves to be very fast.
In Statistical Relational Artificial Intelligence, a branch of AI and machine learning which combines the logical and statistical schools of AI, one uses the concept {\em para\-metrized probabilistic graphical model (PPGM)} to model (conditional) dependencies between random variables and to make probabilistic inferences about events on a space of "possible worlds". The set of possible worlds with underlying domain $D$ (a set of objects) can be represented by the set $\mathbf{W}_D$ of all first-order structures (for a suitable signature) with domain $D$. Using a formal logic we can describe events on $\mathbf{W}_D$. By combining a logic and a PPGM we can also define a probability distribution $\mathbb{P}_D$ on $\mathbf{W}_D$ and use it to compute the probability of an event. We consider a logic, denoted $PLA$, with truth values in the unit interval, which uses aggregation functions, such as arithmetic mean, geometric mean, maximum and minimum instead of quantifiers. However we face the problem of computational efficiency and this problem is an obstacle to the wider use of methods from Statistical Relational AI in practical applications. We address this problem by proving that the described probability will, under certain assumptions on the PPGM and the sentence $\varphi$, converge as the size of $D$ tends to infinity. The convergence result is obtained by showing that every formula $\varphi(x_1, \ldots, x_k)$ which contains only "admissible" aggregation functions (e.g. arithmetic and geometric mean, max and min) is asymptotically equivalent to a formula $\psi(x_1, \ldots, x_k)$ without aggregation functions.
Although robust learning and local differential privacy are both widely studied fields of research, combining the two settings is just starting to be explored. We consider the problem of estimating a discrete distribution in total variation from $n$ contaminated data batches under a local differential privacy constraint. A fraction $1-\epsilon$ of the batches contain $k$ i.i.d. samples drawn from a discrete distribution $p$ over $d$ elements. To protect the users' privacy, each of the samples is privatized using an $\alpha$-locally differentially private mechanism. The remaining $\epsilon n $ batches are an adversarial contamination. The minimax rate of estimation under contamination alone, with no privacy, is known to be $\epsilon/\sqrt{k}+\sqrt{d/kn}$, up to a $\sqrt{\log(1/\epsilon)}$ factor. Under the privacy constraint alone, the minimax rate of estimation is $\sqrt{d^2/\alpha^2 kn}$. We show that combining the two constraints leads to a minimax estimation rate of $\epsilon\sqrt{d/\alpha^2 k}+\sqrt{d^2/\alpha^2 kn}$ up to a $\sqrt{\log(1/\epsilon)}$ factor, larger than the sum of the two separate rates. We provide a polynomial-time algorithm achieving this bound, as well as a matching information theoretic lower bound.
Graph Convolutional Networks (GCNs) are one of the most popular architectures that are used to solve classification problems accompanied by graphical information. We present a rigorous theoretical understanding of the effects of graph convolutions in multi-layer networks. We study these effects through the node classification problem of a non-linearly separable Gaussian mixture model coupled with a stochastic block model. First, we show that a single graph convolution expands the regime of the distance between the means where multi-layer networks can classify the data by a factor of at least $1/\sqrt[4]{\mathbb{E}{\rm deg}}$, where $\mathbb{E}{\rm deg}$ denotes the expected degree of a node. Second, we show that with a slightly stronger graph density, two graph convolutions improve this factor to at least $1/\sqrt[4]{n}$, where $n$ is the number of nodes in the graph. Finally, we provide both theoretical and empirical insights into the performance of graph convolutions placed in different combinations among the layers of a network, concluding that the performance is mutually similar for all combinations of the placement. We present extensive experiments on both synthetic and real-world data that illustrate our results.
The monotone variational inequality is a central problem in mathematical programming that unifies and generalizes many important settings such as smooth convex optimization, two-player zero-sum games, convex-concave saddle point problems, etc. The extragradient method by Korpelevich [1976] is one of the most popular methods for solving monotone variational inequalities. Despite its long history and intensive attention from the optimization and machine learning community, the following major problem remains open. What is the last-iterate convergence rate of the extragradient method for monotone and Lipschitz variational inequalities with constraints? We resolve this open problem by showing a tight $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence rate for arbitrary convex feasible sets, which matches the lower bound by Golowich et al. [2020]. Our rate is measured in terms of the standard gap function. The technical core of our result is the monotonicity of a new performance measure -- the tangent residual, which can be viewed as an adaptation of the norm of the operator that takes the local constraints into account. To establish the monotonicity, we develop a new approach that combines the power of the sum-of-squares programming with the low dimensionality of the update rule of the extragradient method. We believe our approach has many additional applications in the analysis of iterative methods.
Covariance estimation for matrix-valued data has received an increasing interest in applications. Unlike previous works that rely heavily on matrix normal distribution assumption and the requirement of fixed matrix size, we propose a class of distribution-free regularized covariance estimation methods for high-dimensional matrix data under a separability condition and a bandable covariance structure. Under these conditions, the original covariance matrix is decomposed into a Kronecker product of two bandable small covariance matrices representing the variability over row and column directions. We formulate a unified framework for estimating bandable covariance, and introduce an efficient algorithm based on rank one unconstrained Kronecker product approximation. The convergence rates of the proposed estimators are established, and the derived minimax lower bound shows our proposed estimator is rate-optimal under certain divergence regimes of matrix size. We further introduce a class of robust covariance estimators and provide theoretical guarantees to deal with heavy-tailed data. We demonstrate the superior finite-sample performance of our methods using simulations and real applications from a gridded temperature anomalies dataset and a S&P 500 stock data analysis.
Let ${\mathcal M}\subset {\mathbb R}^n$ be a $C^2$-smooth compact submanifold of dimension $d$. Assume that the volume of ${\mathcal M}$ is at most $V$ and the reach (i.e. the normal injectivity radius) of ${\mathcal M}$ is greater than $\tau$. Moreover, let $\mu$ be a probability measure on ${\mathcal M}$ whose density on ${\mathcal M}$ is a strictly positive Lipschitz-smooth function. Let $x_j\in {\mathcal M}$, $j=1,2,\dots,N$ be $N$ independent random samples from distribution $\mu$. Also, let $\xi_j$, $j=1,2,\dots, N$ be independent random samples from a Gaussian random variable in ${\mathbb R}^n$ having covariance $\sigma^2I$, where $\sigma$ is less than a certain specified function of $d, V$ and $\tau$. We assume that we are given the data points $y_j=x_j+\xi_j,$ $j=1,2,\dots,N$, modelling random points of ${\mathcal M}$ with measurement noise. We develop an algorithm which produces from these data, with high probability, a $d$ dimensional submanifold ${\mathcal M}_o\subset {\mathbb R}^n$ whose Hausdorff distance to ${\mathcal M}$ is less than $Cd\sigma^2/\tau$ and whose reach is greater than $c{\tau}/d^6$ with universal constants $C,c > 0$. The number $N$ of random samples required depends almost linearly on $n$, polynomially on $\sigma^{-1}$ and exponentially on $d$.
The dynamic response of the legged robot locomotion is non-Lipschitz and can be stochastic due to environmental uncertainties. To test, validate, and characterize the safety performance of legged robots, existing solutions on observed and inferred risk can be incomplete and sampling inefficient. Some formal verification methods suffer from the model precision and other surrogate assumptions. In this paper, we propose a scenario sampling based testing framework that characterizes the overall safety performance of a legged robot by specifying (i) where (in terms of a set of states) the robot is potentially safe, and (ii) how safe the robot is within the specified set. The framework can also help certify the commercial deployment of the legged robot in real-world environment along with human and compare safety performance among legged robots with different mechanical structures and dynamic properties. The proposed framework is further deployed to evaluate a group of state-of-the-art legged robot locomotion controllers from various model-based, deep neural network involved, and reinforcement learning based methods in the literature. Among a series of intended work domains of the studied legged robots (e.g. tracking speed on sloped surface, with abrupt changes on demanded velocity, and against adversarial push-over disturbances), we show that the method can adequately capture the overall safety characterization and the subtle performance insights. Many of the observed safety outcomes, to the best of our knowledge, have never been reported by the existing work in the legged robot literature.
The stochastic gradient Langevin Dynamics is one of the most fundamental algorithms to solve sampling problems and non-convex optimization appearing in several machine learning applications. Especially, its variance reduced versions have nowadays gained particular attention. In this paper, we study two variants of this kind, namely, the Stochastic Variance Reduced Gradient Langevin Dynamics and the Stochastic Recursive Gradient Langevin Dynamics. We prove their convergence to the objective distribution in terms of KL-divergence under the sole assumptions of smoothness and Log-Sobolev inequality which are weaker conditions than those used in prior works for these algorithms. With the batch size and the inner loop length set to $\sqrt{n}$, the gradient complexity to achieve an $\epsilon$-precision is $\tilde{O}((n+dn^{1/2}\epsilon^{-1})\gamma^2 L^2\alpha^{-2})$, which is an improvement from any previous analyses. We also show some essential applications of our result to non-convex optimization.
In this paper we study the finite sample and asymptotic properties of various weighting estimators of the local average treatment effect (LATE), several of which are based on Abadie (2003)'s kappa theorem. Our framework presumes a binary endogenous explanatory variable ("treatment") and a binary instrumental variable, which may only be valid after conditioning on additional covariates. We argue that one of the Abadie estimators, which we show is weight normalized, is likely to dominate the others in many contexts. A notable exception is in settings with one-sided noncompliance, where certain unnormalized estimators have the advantage of being based on a denominator that is bounded away from zero. We use a simulation study and three empirical applications to illustrate our findings. In applications to causal effects of college education using the college proximity instrument (Card, 1995) and causal effects of childbearing using the sibling sex composition instrument (Angrist and Evans, 1998), the unnormalized estimates are clearly unreasonable, with "incorrect" signs, magnitudes, or both. Overall, our results suggest that (i) the relative performance of different kappa weighting estimators varies with features of the data-generating process; and that (ii) the normalized version of Tan (2006)'s estimator may be an attractive alternative in many contexts. Applied researchers with access to a binary instrumental variable should also consider covariate balancing or doubly robust estimators of the LATE.
Bayesian model selection provides a powerful framework for objectively comparing models directly from observed data, without reference to ground truth data. However, Bayesian model selection requires the computation of the marginal likelihood (model evidence), which is computationally challenging, prohibiting its use in many high-dimensional Bayesian inverse problems. With Bayesian imaging applications in mind, in this work we present the proximal nested sampling methodology to objectively compare alternative Bayesian imaging models for applications that use images to inform decisions under uncertainty. The methodology is based on nested sampling, a Monte Carlo approach specialised for model comparison, and exploits proximal Markov chain Monte Carlo techniques to scale efficiently to large problems and to tackle models that are log-concave and not necessarily smooth (e.g., involving l_1 or total-variation priors). The proposed approach can be applied computationally to problems of dimension O(10^6) and beyond, making it suitable for high-dimensional inverse imaging problems. It is validated on large Gaussian models, for which the likelihood is available analytically, and subsequently illustrated on a range of imaging problems where it is used to analyse different choices of dictionary and measurement model.