The reach of a set $M \subset \mathbb R^d$, also known as the condition number when $M$ is a manifold, was introduced by Federer in 1959 and is a central concept in geometric measure theory, set estimation, and manifold learning, among other areas. We introduce a universally consistent estimator of the reach, assuming only that the reach is positive. A necessary condition for the universal convergence of the reach estimator is that the Hausdorff distance between the sample and the set converges to zero; without further assumptions, we show that the convergence rate of this distance can be arbitrarily slow. Under a weak additional assumption, however, we provide rates of convergence for the reach estimator. We also show that it is not possible to determine, from a finite sample, whether the reach of the support of a density is zero or not. We provide a small simulation study and a bias correction method for the case where $M$ is a manifold.
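As a hedged illustration of the quantity being estimated, the sketch below evaluates a standard plug-in estimator based on Federer's pairwise characterization $\mathrm{reach}(M) = \inf_{p \neq q} \|q-p\|^2 / (2\, d(q-p, T_p M))$ on a noiseless sample from a circle of radius $r$, whose reach is exactly $r$. The tangent lines are taken as known for simplicity (in practice they would be estimated, e.g. by local PCA), and this is not necessarily the estimator proposed in the abstract.

```python
# Minimal sketch: plug-in reach estimate on a noiseless circle sample (true reach = r).
import numpy as np

rng = np.random.default_rng(0)
r, n = 2.0, 400
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = r * np.column_stack([np.cos(theta), np.sin(theta)])    # sample on the circle
normals = np.column_stack([np.cos(theta), np.sin(theta)])  # unit normals (orthogonal to T_p)

# Plug-in estimate: min over p != q of ||q - p||^2 / (2 * dist(q - p, T_p)).
est = np.inf
for i in range(n):
    diff = X - X[i]                               # q - p for all q
    dist_sq = np.sum(diff ** 2, axis=1)
    ortho = np.abs(diff @ normals[i])             # distance from q - p to the tangent line at p
    mask = dist_sq > 1e-12                        # exclude q = p
    est = min(est, float(np.min(dist_sq[mask] / (2.0 * ortho[mask]))))

print(f"estimated reach ~ {est:.3f}, true reach = {r}")
```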
This paper introduces SPDE bridges with observation noise and analyzes their spatially semidiscrete approximations. The SPDEs are treated as mild solutions in an abstract Hilbert space framework suitable for parabolic equations; they are assumed to be linear with additive noise in the form of a cylindrical Wiener process. The observation noise is also cylindrical, and SPDE bridges are formulated via conditional distributions of Gaussian random variables in Hilbert spaces. A general framework for the spatial discretization of these bridge processes is introduced, and explicit convergence rates are derived for a spectral and a finite element based method. It is shown that, for sufficiently rough observation noise, the rates are essentially the same as those of the corresponding discretization of the original SPDE.
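As a rough, hedged illustration of the spectral discretization mentioned above, the sketch below simulates the unconditioned linear SPDE $\mathrm{d}u = \Delta u\,\mathrm{d}t + \mathrm{d}W$ on $(0,\pi)$ with Dirichlet boundary conditions and cylindrical Wiener noise, using $N$ spectral modes and the exact Ornstein-Uhlenbeck update for each mode. The bridge construction itself requires an additional Gaussian conditioning step on the noisy observation, which is not reproduced here; the equation, domain, and truncation level are assumptions of this sketch.

```python
# Minimal sketch: spectral Galerkin simulation of du = u_xx dt + dW (no bridge conditioning).
import numpy as np

rng = np.random.default_rng(1)
N = 64                          # spectral modes e_k(x) = sqrt(2/pi) * sin(k x)
k = np.arange(1, N + 1)
lam = k.astype(float) ** 2      # eigenvalues of the negative Dirichlet Laplacian on (0, pi)

T, steps = 1.0, 200
dt = T / steps
X = np.zeros(N)                 # spectral coefficients, zero initial condition

# Each mode is an independent Ornstein-Uhlenbeck process; use its exact one-step update.
decay = np.exp(-lam * dt)
noise_std = np.sqrt((1.0 - np.exp(-2.0 * lam * dt)) / (2.0 * lam))
for _ in range(steps):
    X = decay * X + noise_std * rng.standard_normal(N)

# Reconstruct the truncated field u(T, x) on a grid.
x = np.linspace(0.0, np.pi, 201)
u = np.sqrt(2.0 / np.pi) * np.sin(np.outer(x, k)) @ X
print("u(T, x) at a few grid points:", np.round(u[::50], 3))
```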
In finance, insurance, system reliability, and other fields, it is often of interest to measure the dependence among variables by modeling a multivariate distribution using a copula. Parametric copula models are easy to estimate but can be highly biased when the parametric assumptions are false, while empirical copulas are non-smooth and often not genuine copulas, making inference about dependence challenging in practice. As a compromise, the empirical Bernstein copula provides a smooth estimator, but the selection of its tuning parameters remains elusive. In this paper, using the so-called empirical checkerboard copula, we build a hierarchical empirical Bayes model that enables the estimation of a smooth copula function in arbitrary dimensions. The proposed estimator, based on multivariate Bernstein polynomials, is itself a genuine copula, and the selection of its dimension-varying degrees is data-dependent. We also show that the proposed copula estimator provides a more accurate estimate of several multivariate dependence measures, which can be obtained in closed form. We investigate the asymptotic and finite-sample performance of the proposed estimator and compare it with some nonparametric estimators through simulation studies. An application to portfolio risk management is presented, along with a quantification of estimation uncertainty.
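For orientation, the sketch below computes the classical bivariate empirical Bernstein copula with a fixed, user-chosen degree $m$ from pseudo-observations; the degree, sample size, and Gaussian example are assumptions of this sketch, and it does not reproduce the paper's hierarchical empirical Bayes estimator with data-dependent, dimension-varying degrees.

```python
# Minimal sketch: classical empirical Bernstein copula in two dimensions, fixed degree m.
import numpy as np
from scipy.stats import binom, rankdata

def empirical_copula(U, u, v):
    """Empirical copula C_n(u, v) from pseudo-observations U (n x 2)."""
    return np.mean((U[:, 0] <= u) & (U[:, 1] <= v))

def bernstein_copula(U, u, v, m=10):
    """Degree-m empirical Bernstein copula at (u, v)."""
    j = np.arange(m + 1)
    grid = np.array([[empirical_copula(U, a / m, b / m) for b in j] for a in j])
    wu = binom.pmf(j, m, u)          # Bernstein weights in u
    wv = binom.pmf(j, m, v)          # Bernstein weights in v
    return wu @ grid @ wv

# Pseudo-observations from a sample with positive dependence.
rng = np.random.default_rng(2)
Z = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=300)
U = np.column_stack([rankdata(Z[:, 0]), rankdata(Z[:, 1])]) / (Z.shape[0] + 1)

print("C_B(0.5, 0.5) ~", bernstein_copula(U, 0.5, 0.5))   # typically > 0.25 under positive dependence
```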
We define a bivariate copula that captures the scale-invariant extent of dependence of a single random variable $Y$ on a set of potential explanatory random variables $X_1, \dots, X_d$. The copula itself encodes whether $Y$ is completely dependent on $X_1, \dots, X_d$ and whether $Y$ and $X_1, \dots, X_d$ are independent. Evaluating this copula uniformly along the diagonal, i.e. calculating Spearman's footrule, leads to the so-called 'simple measure of conditional dependence' recently introduced by Azadkia and Chatterjee [1]. On the other hand, evaluating this copula uniformly over the unit square, i.e. calculating Spearman's rho, leads to a distribution-free coefficient of determination. Applying the techniques introduced in [1], we construct an estimator of this copula and show that it is strongly consistent. Since, for $d=1$, the copula under consideration coincides with the well-known Markov product of copulas, we also obtain, as a by-product, a strongly consistent estimator of the Markov product. A simulation study illustrates the small-sample performance of the proposed estimator.
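As a small illustration of the case of a single explanatory variable ($d=1$), where the Azadkia-Chatterjee measure reduces to Chatterjee's rank correlation (equivalently, the Dette-Siburg-Stoimenov measure), the sketch below computes that coefficient directly from ranks, assuming no ties; the copula-based estimator constructed in the paper is not reproduced here.

```python
# Minimal sketch: Chatterjee's rank correlation xi_n, the d = 1 special case (no ties assumed).
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n(X, Y), assuming no ties."""
    order = np.argsort(x)                       # sort pairs by X
    r = np.argsort(np.argsort(y[order])) + 1    # ranks of Y in that order
    n = len(x)
    return 1.0 - 3.0 * np.sum(np.abs(np.diff(r))) / (n ** 2 - 1)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 2000)
print("independent:", chatterjee_xi(x, rng.uniform(-1, 1, 2000)))  # close to 0
print("Y = X^2    :", chatterjee_xi(x, x ** 2))                    # close to 1 (complete dependence)
```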
A constraint satisfaction problem (CSP), $\textsf{Max-CSP}(\mathcal{F})$, is specified by a finite set of constraints $\mathcal{F} \subseteq \{[q]^k \to \{0,1\}\}$ for positive integers $q$ and $k$. An instance of the problem on $n$ variables is given by $m$ applications of constraints from $\mathcal{F}$ to subsequences of the $n$ variables, and the goal is to find an assignment to the variables that satisfies the maximum number of constraints. In the $(\gamma,\beta)$-approximation version of the problem for parameters $0 \leq \beta < \gamma \leq 1$, the goal is to distinguish instances where at least $\gamma$ fraction of the constraints can be satisfied from instances where at most $\beta$ fraction of the constraints can be satisfied. In this work we consider the approximability of this problem in the context of sketching algorithms and give a dichotomy result. Specifically, for every family $\mathcal{F}$ and every $\beta < \gamma$, we show that either a linear sketching algorithm solves the problem in polylogarithmic space, or the problem is not solvable by any sketching algorithm in $o(\sqrt{n})$ space.
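To make the $(\gamma,\beta)$-gap formulation concrete, here is a toy brute-force example (the instance, constraint, and thresholds are made up for illustration, and this has nothing to do with sketching algorithms): a single DICUT-type constraint $f(x,y) = \mathbf{1}[x = 1 \wedge y = 0]$ with $q = 2$, $k = 2$, applied to a handful of variable pairs.

```python
# Toy illustration of the Max-CSP(F) definition and the (gamma, beta)-gap decision task.
from itertools import product

def f(x, y):                       # the single constraint in F (DICUT-type)
    return int(x == 1 and y == 0)

constraints = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 1)]   # m applications to variable pairs
n, m = 4, len(constraints)

# Brute force over all q^n assignments (only feasible for tiny toy instances).
best = max(sum(f(a[i], a[j]) for i, j in constraints) for a in product([0, 1], repeat=n))
value = best / m
gamma, beta = 0.8, 0.5
print(f"optimum fraction of satisfied constraints = {value:.2f}")
print("gap answer:", "at least gamma" if value >= gamma else
      ("at most beta" if value <= beta else "outside the promise"))
```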
Discrepancy measures between probability distributions, often termed statistical distances, are ubiquitous in probability theory, statistics, and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of this framework to high dimensions, we investigate the structural and statistical behavior of the Gaussian-smoothed $p$-Wasserstein distance $\mathsf{W}_p^{(\sigma)}$, for arbitrary $p\geq 1$. After establishing basic metric and topological properties of $\mathsf{W}_p^{(\sigma)}$, we explore the asymptotic statistical behavior of $\mathsf{W}_p^{(\sigma)}(\hat{\mu}_n,\mu)$, where $\hat{\mu}_n$ is the empirical distribution of $n$ independent observations from $\mu$. We prove that $\mathsf{W}_p^{(\sigma)}$ enjoys a parametric empirical convergence rate of $n^{-1/2}$, which contrasts with the $n^{-1/d}$ rate for unsmoothed $\mathsf{W}_p$ when $d \geq 3$. Our proof relies on controlling $\mathsf{W}_p^{(\sigma)}$ by a $p$th-order smooth Sobolev distance $\mathsf{d}_p^{(\sigma)}$ and deriving the limit distribution of $\sqrt{n}\,\mathsf{d}_p^{(\sigma)}(\hat{\mu}_n,\mu)$, for all dimensions $d$. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation using $\mathsf{W}_p^{(\sigma)}$, with experiments for $p=2$ using a maximum mean discrepancy formulation of $\mathsf{d}_2^{(\sigma)}$.
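As a one-dimensional numerical illustration of the definition (the dimension, $p=1$, and the Gaussian example are assumptions of this sketch; the paper works in general dimension $d$), recall that for $p=1$ in one dimension the Wasserstein distance is the $L^1$ distance between CDFs, and convolving the empirical measure with $\mathcal N(0,\sigma^2)$ simply averages Gaussian CDFs centered at the sample points.

```python
# Minimal sketch: Gaussian-smoothed W_1 between an empirical measure and mu = N(0,1) in 1D.
import numpy as np
from scipy.stats import norm

def smoothed_w1(sample, smoothed_true_cdf, sigma, grid):
    """W_1 between (empirical law of sample)*N(0, sigma^2) and mu*N(0, sigma^2), via CDFs on a grid."""
    emp_cdf = norm.cdf((grid[:, None] - sample[None, :]) / sigma).mean(axis=1)
    return np.sum(np.abs(emp_cdf - smoothed_true_cdf(grid))) * (grid[1] - grid[0])

rng = np.random.default_rng(4)
sigma = 0.5
grid = np.linspace(-8.0, 8.0, 2001)

def true_cdf(t):
    # mu = N(0,1), so the smoothed measure mu * N(0, sigma^2) is N(0, 1 + sigma^2).
    return norm.cdf(t / np.sqrt(1.0 + sigma ** 2))

for n in (100, 1000):
    vals = [smoothed_w1(rng.standard_normal(n), true_cdf, sigma, grid) for _ in range(20)]
    print(f"n = {n:5d}: average smoothed W1 ~ {np.mean(vals):.4f}")   # should shrink as n grows
```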
We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for $n$ data points requires $\Omega(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nystr\"{o}m method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix, producing a rank-$s$ approximation with just $O(ns)$ similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices in the downstream tasks of document classification, sentence similarity, and cross-document coreference.
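For context, the sketch below implements the classical Nyström approximation from $s$ uniformly sampled landmark columns, which already exhibits the $O(ns)$ similarity-computation budget; the kernel, point set, and sampling scheme are assumptions of this sketch, and the paper's indefinite-aware generalization and its CUR variant are not reproduced.

```python
# Minimal sketch: classical Nystrom approximation from s landmark columns.
import numpy as np

def nystrom(similarity, n, s, rng):
    """Rank-<=s approximation that touches only s columns of the n x n similarity matrix."""
    idx = rng.choice(n, size=s, replace=False)
    C = np.array([[similarity(i, j) for j in idx] for i in range(n)])  # n x s block, O(n s) calls
    W = C[idx, :]                                                      # s x s landmark block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(5)
pts = rng.standard_normal((300, 2))
sim = lambda i, j: float(np.exp(-np.sum((pts[i] - pts[j]) ** 2) / 4.0))  # toy PSD similarity

# The full matrix is formed here only to measure the approximation error.
A = np.array([[sim(i, j) for j in range(300)] for i in range(300)])
A_hat = nystrom(sim, 300, 40, rng)
print("relative Frobenius error:", np.linalg.norm(A - A_hat) / np.linalg.norm(A))
```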
We study the problem of estimating the density $f(\boldsymbol x)$ of a random vector ${\boldsymbol X}$ in $\mathbb R^d$. For a spanning tree $T$ defined on the vertex set $\{1,\dots ,d\}$, the tree density $f_{T}$ is a product of bivariate conditional densities. An optimal spanning tree minimizes the Kullback-Leibler divergence between $f$ and $f_{T}$. From i.i.d. data we identify an optimal tree $T^*$ and efficiently construct a tree density estimate $f_n$ such that, without any regularity conditions on the density $f$, one has $\lim_{n\to \infty} \int |f_n(\boldsymbol x)-f_{T^*}(\boldsymbol x)|d\boldsymbol x=0$ a.s. For Lipschitz $f$ with bounded support, $\mathbb E \left\{ \int |f_n(\boldsymbol x)-f_{T^*}(\boldsymbol x)|d\boldsymbol x\right\}=O\big(n^{-1/4}\big)$, a dimension-free rate.
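As a rough illustration of the optimal spanning tree idea, the sketch below performs a Chow-Liu-style selection: estimate pairwise mutual information by two-dimensional histograms and take a maximum-weight spanning tree. The histogram estimator, bin count, and synthetic chain example are assumptions of this sketch, not the estimator analyzed in the abstract.

```python
# Minimal sketch: Chow-Liu-style tree selection from histogram mutual-information estimates.
import numpy as np

def mi_histogram(x, y, bins=15):
    """Plug-in mutual information estimate from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def max_spanning_tree(weights):
    """Prim's algorithm for a maximum-weight spanning tree; returns an edge list."""
    d = weights.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        best = max(((weights[i, j], i, j) for i in in_tree for j in range(d)
                    if j not in in_tree), key=lambda t: t[0])
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

# Synthetic data with a chain dependence 0 -> 1 -> 2 -> 3.
rng = np.random.default_rng(6)
x0 = rng.standard_normal(5000)
x1 = x0 + 0.5 * rng.standard_normal(5000)
x2 = x1 + 0.5 * rng.standard_normal(5000)
x3 = x2 + 0.5 * rng.standard_normal(5000)
X = np.column_stack([x0, x1, x2, x3])

d = X.shape[1]
W = np.zeros((d, d))
for i in range(d):
    for j in range(i + 1, d):
        W[i, j] = W[j, i] = mi_histogram(X[:, i], X[:, j])

print("selected tree edges:", max_spanning_tree(W))   # typically recovers the chain (0,1),(1,2),(2,3)
```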
We prove various approximation theorems with polynomials whose coefficients with respect to the Bernstein basis of a given order are all integers. In the extreme, we draw the coefficients from the set $\{ \pm 1\}$ only. We show that for any Lipschitz function $f:[0,1] \to [-1,1]$ and for any positive integer $n$, there are signs $\sigma_0,\dots,\sigma_n \in \{\pm 1\}$ such that $$\left |f(x) - \sum_{k=0}^n \sigma_k \, \binom{n}{k} x^k (1-x)^{n-k} \right | \leq \frac{C (1+|f|_{\mathrm{Lip}})}{1+\sqrt{nx(1-x)}} ~\mbox{ for all } x \in [0,1].$$ These polynomial approximations are not constrained by saturation of Bernstein polynomials, and we show that higher accuracy is indeed achievable for smooth functions: If $f$ has a Lipschitz $(s{-}1)$st derivative, then accuracy of order $O(n^{-s/2})$ is achievable with $\pm 1$ coefficients provided $\|f \|_\infty < 1$, and accuracy of order $O(n^{-s})$ is achievable with unrestricted integer coefficients. Our approximations are constructive in nature.
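One natural way to realize such $\pm 1$ approximations, shown in the hedged sketch below, is to choose the signs by first-order noise shaping (a Sigma-Delta-type quantization of the target Bernstein coefficients $f(k/n)$); the paper's constructive scheme and its error analysis are not reproduced here, and the test function and the evaluation interval away from the endpoints are assumptions of this sketch.

```python
# Minimal sketch: +-1 Bernstein approximation with signs chosen by first-order Sigma-Delta quantization.
import numpy as np
from scipy.stats import binom

def pm1_bernstein(f, n, xs):
    """Evaluate sum_k sigma_k * C(n,k) x^k (1-x)^(n-k) with Sigma-Delta signs sigma_k in {+-1}."""
    coeffs = f(np.arange(n + 1) / n)        # target coefficients f(k/n), assumed in [-1, 1]
    sigma, u = np.empty(n + 1), 0.0
    for k, c in enumerate(coeffs):          # first-order noise shaping keeps the running error bounded
        sigma[k] = 1.0 if c + u >= 0 else -1.0
        u += c - sigma[k]
    basis = binom.pmf(np.arange(n + 1)[None, :], n, xs[:, None])   # Bernstein basis values
    return basis @ sigma

f = lambda x: 0.8 * np.sin(2 * np.pi * x)   # Lipschitz, values in [-1, 1]
xs = np.linspace(0.1, 0.9, 401)             # away from the endpoints, where the pointwise bound degrades
for n in (50, 200, 800):
    err = np.max(np.abs(f(xs) - pm1_bernstein(f, n, xs)))
    print(f"n = {n:4d}: max error on [0.1, 0.9] = {err:.3f}")   # should decrease roughly like n^(-1/2)
```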
While Generative Adversarial Networks (GANs) have empirically produced impressive results on learning complex real-world distributions, recent work has shown that they suffer from a lack of diversity, or mode collapse. The theoretical work of Arora et al.~\cite{AroraGeLiMaZh17} suggests a dilemma about GANs' statistical properties: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode collapse. In contrast, we show in this paper that GANs can in principle learn distributions in Wasserstein distance (or KL divergence in many cases) with polynomial sample complexity, if the discriminator class has strong distinguishing power against the particular generator class (instead of against all possible generators). For various generator classes, such as mixtures of Gaussians, exponential families, and invertible neural network generators, we design corresponding discriminators (often neural nets of specific architectures) such that the Integral Probability Metric (IPM) induced by the discriminators can provably approximate the Wasserstein distance and/or KL divergence. This implies that if training is successful, then the learned distribution is close to the true distribution in Wasserstein distance or KL divergence, and thus cannot drop modes. Our preliminary experiments on synthetic datasets show that the test IPM is well correlated with KL divergence, indicating that the lack of diversity may be caused by suboptimal optimization rather than by statistical inefficiency.
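As a toy one-dimensional illustration of a discriminator-induced IPM (the distributions, discriminator class, and parameter grid are assumptions of this sketch, unrelated to the paper's generator-specific constructions), the code below compares the IPM over a small class of 1-Lipschitz ReLU discriminators with the empirical Wasserstein-1 distance, which it can never exceed.

```python
# Toy sketch: IPM over a restricted 1-Lipschitz discriminator class vs. empirical W_1 in 1D.
import numpy as np

rng = np.random.default_rng(7)
mu = rng.standard_normal(5000)            # "real" samples
nu = rng.standard_normal(5000) + 2.0      # "generated" samples with a shifted mean

# Exact empirical W_1 in one dimension: average distance between sorted samples.
w1 = np.mean(np.abs(np.sort(mu) - np.sort(nu)))

# IPM over discriminators f(x) = max(w x + b, 0) with |w| <= 1, maximized on a parameter grid.
ws = np.linspace(-1.0, 1.0, 41)
bs = np.linspace(-5.0, 5.0, 101)
ipm = max(abs(np.maximum(w * mu + b, 0).mean() - np.maximum(w * nu + b, 0).mean())
          for w in ws for b in bs)

print(f"empirical W1 = {w1:.3f}, restricted-class IPM = {ipm:.3f}")   # IPM <= W1 by 1-Lipschitzness
```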
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds on techniques from distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that, in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.
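As a minimal numerical sketch of the variance surrogate (the constraint normalization, radius $\rho$, and loss distribution are assumptions here, not necessarily those of the paper), the code below evaluates, for a fixed loss vector, the chi-square-ball distributionally robust objective and compares it with the closed-form value $\bar\ell_n + \sqrt{2\rho\,\widehat{\mathrm{Var}}_n/n}$, which coincides with it whenever the nonnegativity constraints are inactive.

```python
# Minimal sketch: chi-square-ball robust objective vs. its mean-plus-standard-deviation form.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n, rho = 50, 2.0
losses = rng.exponential(scale=1.0, size=n)

# Closed-form robust value (biased variance makes the identity exact when all weights stay >= 0).
closed_form = losses.mean() + np.sqrt(2.0 * rho * losses.var() / n)

# Direct numerical solution of   max_p  p @ losses   s.t.  p >= 0, sum(p) = 1, sum((n p - 1)^2) <= 2 rho.
res = minimize(
    lambda p: -(p @ losses),
    x0=np.full(n, 1.0 / n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                 {"type": "ineq", "fun": lambda p: 2.0 * rho - np.sum((n * p - 1.0) ** 2)}],
)
print(f"closed form = {closed_form:.4f}, numerical = {-res.fun:.4f}")   # should roughly agree
```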