Estimating the unknown density from which a given independent sample originates is more difficult than estimating the mean, in the sense that for the best popular non-parametric density estimators, the mean integrated square error converges more slowly than at the canonical rate of $\mathcal{O}(1/n)$. When the sample is generated from a simulation model and we have control over how this is done, we can do better. We examine an approach in which conditional Monte Carlo yields, under certain conditions, a random conditional density which is an unbiased estimator of the true density at any point. By averaging independent replications, we obtain a density estimator that converges at a faster rate than the usual ones. Moreover, combining this new type of estimator with randomized quasi-Monte Carlo to generate the samples typically brings a larger improvement on the error and convergence rate than for the usual estimators, because the new estimator is smoother as a function of the underlying uniform random numbers.
We consider neural network approximation spaces that classify functions according to the rate at which they can be approximated (with error measured in $L^p$) by ReLU neural networks with an increasing number of coefficients, subject to bounds on the magnitude of the coefficients and the number of hidden layers. We prove embedding theorems between these spaces for different values of $p$. Furthermore, we derive sharp embeddings of these approximation spaces into H\"older spaces. We find that, analogous to the case of classical function spaces (such as Sobolev spaces, or Besov spaces) it is possible to trade "smoothness" (i.e., approximation rate) for increased integrability. Combined with our earlier results in [arXiv:2104.02746], our embedding theorems imply a somewhat surprising fact related to "learning" functions from a given neural network space based on point samples: if accuracy is measured with respect to the uniform norm, then an optimal "learning" algorithm for reconstructing functions that are well approximable by ReLU neural networks is simply given by piecewise constant interpolation on a tensor product grid.
We consider the problem of approximating the arboricity of a graph $G= (V,E)$, which we denote by $\mathsf{arb}(G)$, in sublinear time, where the arboricity of a graph is the minimal number of forests required to cover its edges. An algorithm for this problem may perform degree and neighbor queries, and is allowed a small error probability. We design an algorithm that outputs an estimate $\hat{\alpha}$, such that with probability $1-1/\textrm{poly}(n)$, $\mathsf{arb}(G)/c\log^2 n \leq \hat{\alpha} \leq \mathsf{arb}(G)$, where $n=|V|$ and $c$ is a constant. The expected query complexity and running time of the algorithm are $O(n/\mathsf{arb}(G))\cdot \textrm{poly}(\log n)$, and this upper bound also holds with high probability. %($\widetilde{O}(\cdot)$ is used to suppress $\textrm{poly}(\log n)$ dependencies). This bound is optimal for such an approximation up to a $\textrm{poly}(\log n)$ factor.
We study prediction of future outcomes with supervised models that use privileged information during learning. The privileged information comprises samples of time series observed between the baseline time of prediction and the future outcome; this information is only available at training time which differs from the traditional supervised learning. Our question is when using this privileged data leads to more sample-efficient learning of models that use only baseline data for predictions at test time. We give an algorithm for this setting and prove that when the time series are drawn from a non-stationary Gaussian-linear dynamical system of fixed horizon, learning with privileged information is more efficient than learning without it. On synthetic data, we test the limits of our algorithm and theory, both when our assumptions hold and when they are violated. On three diverse real-world datasets, we show that our approach is generally preferable to classical learning, particularly when data is scarce. Finally, we relate our estimator to a distillation approach both theoretically and empirically.
Given a large set $U$ where each item $a\in U$ has weight $w(a)$, we want to estimate the total weight $W=\sum_{a\in U} w(a)$ to within factor of $1\pm\varepsilon$ with some constant probability $>1/2$. Since $n=|U|$ is large, we want to do this without looking at the entire set $U$. In the traditional setting in which we are allowed to sample elements from $U$ uniformly, sampling $\Omega(n)$ items is necessary to provide any non-trivial guarantee on the estimate. Therefore, we investigate this problem in different settings: in the \emph{proportional} setting we can sample items with probabilities proportional to their weights, and in the \emph{hybrid} setting we can sample both proportionally and uniformly. These settings have applications, for example, in sublinear-time algorithms and distribution testing. Sum estimation in the proportional and hybrid setting has been considered before by Motwani, Panigrahy, and Xu [ICALP, 2007]. In their paper, they give both upper and lower bounds in terms of $n$. Their bounds are near-matching in terms of $n$, but not in terms of $\varepsilon$. In this paper, we improve both their upper and lower bounds. Our bounds are matching up to constant factors in both settings, in terms of both $n$ and $\varepsilon$. No lower bounds with dependency on $\varepsilon$ were known previously. In the proportional setting, we improve their $\tilde{O}(\sqrt{n}/\varepsilon^{7/2})$ algorithm to $O(\sqrt{n}/\varepsilon)$. In the hybrid setting, we improve $\tilde{O}(\sqrt[3]{n}/ \varepsilon^{9/2})$ to $O(\sqrt[3]{n}/\varepsilon^{4/3})$. Our algorithms are also significantly simpler and do not have large constant factors. We also investigate the previously unexplored setting where $n$ is unknown to the algorithm. Finally, we show how our techniques apply to the problem of edge counting in graphs.
We consider \emph{Gibbs distributions}, which are families of probability distributions over a discrete space $\Omega$ with probability mass function of the form $\mu^\Omega_\beta(\omega) \propto e^{\beta H(\omega)}$ for $\beta$ in an interval $[\beta_{\min}, \beta_{\max}]$ and $H( \omega ) \in \{0 \} \cup [1, n]$. The \emph{partition function} is the normalization factor $Z(\beta)=\sum_{\omega \in\Omega}e^{\beta H(\omega)}$. Two important parameters of these distributions are the log partition ratio $q = \log \tfrac{Z(\beta_{\max})}{Z(\beta_{\min})}$ and the counts $c_x = |H^{-1}(x)|$. These are correlated with system parameters in a number of physical applications and sampling algorithms. Our first main result is to estimate the counts $c_x$ using roughly $\tilde O( \frac{q}{\varepsilon^2})$ samples for general Gibbs distributions and $\tilde O( \frac{n^2}{\varepsilon^2} )$ samples for integer-valued distributions (ignoring some second-order terms and parameters), and we show this is optimal up to logarithmic factors. We illustrate with improved algorithms for counting connected subgraphs and perfect matchings in a graph. We develop a key subroutine to estimate the partition function $Z$. Specifically, it generates a data structure to estimate $Z(\beta)$ for \emph{all} values $\beta$, without further samples. Constructing the data structure requires $O(\frac{q \log n}{\varepsilon^2})$ samples for general Gibbs distributions and $O(\frac{n^2 \log n}{\varepsilon^2} + n \log q)$ samples for integer-valued distributions. This improves over a prior algorithm of Huber (2015) which computes a single point estimate $Z(\beta_\max)$ using $O( q \log n( \log q + \log \log n + \varepsilon^{-2}))$ samples. We show matching lower bounds, demonstrating that this complexity is optimal as a function of $n$ and $q$ up to logarithmic terms.
The paper concerns convergence and asymptotic statistics for stochastic approximation driven by Markovian noise: $$ \theta_{n+1}= \theta_n + \alpha_{n + 1} f(\theta_n, \Phi_{n+1}) \,,\quad n\ge 0, $$ in which each $\theta_n\in\Re^d$, $ \{ \Phi_n \}$ is a Markov chain on a general state space X with stationary distribution $\pi$, and $f:\Re^d\times \text{X} \to\Re^d$. In addition to standard Lipschitz bounds on $f$, and conditions on the vanishing step-size sequence $\{\alpha_n\}$, it is assumed that the associated ODE is globally asymptotically stable with stationary point denoted $\theta^*$, where $\bar f(\theta)=E[f(\theta,\Phi)]$ with $\Phi\sim\pi$. Moreover, the ODE@$\infty$ defined with respect to the vector field, $$ \bar f_\infty(\theta):= \lim_{r\to\infty} r^{-1} \bar f(r\theta) \,,\qquad \theta\in\Re^d, $$ is asymptotically stable. The main contributions are summarized as follows: (i) The sequence $\theta$ is convergent if $\Phi$ is geometrically ergodic, and subject to compatible bounds on $f$. The remaining results are established under a stronger assumption on the Markov chain: A slightly weaker version of the Donsker-Varadhan Lyapunov drift condition known as (DV3). (ii) A Lyapunov function is constructed for the joint process $\{\theta_n,\Phi_n\}$ that implies convergence of $\{ \theta_n\}$ in $L_4$. (iii) A functional CLT is established, as well as the usual one-dimensional CLT for the normalized error $z_n:= (\theta_n-\theta^*)/\sqrt{\alpha_n}$. Moment bounds combined with the CLT imply convergence of the normalized covariance, $$ \lim_{n \to \infty} E [ z_n z_n^T ] = \Sigma_\theta, $$ where $\Sigma_\theta$ is the asymptotic covariance appearing in the CLT. (iv) An example is provided where the Markov chain $\Phi$ is geometrically ergodic but it does not satisfy (DV3). While the algorithm is convergent, the second moment is unbounded.
We study approximation methods for a large class of mixed models with a probit link function that includes mixed versions of the binomial model, the multinomial model, and generalized survival models. The class of models is special because the marginal likelihood can be expressed as Gaussian weighted integrals or as multivariate Gaussian cumulative density functions. The latter approach is unique to the probit link function models and has been proposed for parameter estimation in complex, mixed effects models. However, it has not been investigated in which scenarios either form is preferable. Our simulations and data example show that neither form is preferable in general and give guidance on when to approximate the cumulative density functions and when to approximate the Gaussian weighted integrals and, in the case of the latter, which general purpose method to use among a large list of methods.
Consider the task of matrix estimation in which a dataset $X \in \mathbb{R}^{n\times m}$ is observed with sparsity $p$, and we would like to estimate $\mathbb{E}[X]$, where $\mathbb{E}[X_{ui}] = f(\alpha_u, \beta_i)$ for some Holder smooth function $f$. We consider the setting where the row covariates $\alpha$ are unobserved yet the column covariates $\beta$ are observed. We provide an algorithm and accompanying analysis which shows that our algorithm improves upon naively estimating each row separately when the number of rows is not too small. Furthermore when the matrix is moderately proportioned, our algorithm achieves the minimax optimal nonparametric rate of an oracle algorithm that knows the row covariates. In simulated experiments we show our algorithm outperforms other baselines in low data regimes.
We consider the Bayesian analysis of models in which the unknown distribution of the outcomes is specified up to a set of conditional moment restrictions. The nonparametric exponentially tilted empirical likelihood function is constructed to satisfy a sequence of unconditional moments based on an increasing (in sample size) vector of approximating functions (such as tensor splines based on the splines of each conditioning variable). For any given sample size, results are robust to the number of expanded moments. We derive Bernstein-von Mises theorems for the behavior of the posterior distribution under both correct and incorrect specification of the conditional moments, subject to growth rate conditions (slower under misspecification) on the number of approximating functions. A large-sample theory for comparing different conditional moment models is also developed. The central result is that the marginal likelihood criterion selects the model that is less misspecified. We also introduce sparsity-based model search for high-dimensional conditioning variables, and provide efficient MCMC computations for high-dimensional parameters. Along with clarifying examples, the framework is illustrated with real-data applications to risk-factor determination in finance, and causal inference under conditional ignorability.
In this paper several related estimation problems are addressed from a Bayesian point of view and optimal estimators are obtained for each of them when some natural loss functions are considered. Namely, we are interested in estimating a regression curve. Simultaneously, the estimation problems of a conditional distribution function, or a conditional density, or even the conditional distribution itself, are considered. All these problems are posed in a sufficiently general framework to cover continuous and discrete, univariate and multivariate, parametric and non-parametric cases, without the need to use a specific prior distribution. The loss functions considered come naturally from the quadratic error loss function comonly used in estimating a real function of the unknown parameter. The cornerstone of the mentioned Bayes estimators is the posterior predictive distribution. Some examples are provided to illustrate these results.