We consider $L^2$-approximation on weighted reproducing kernel Hilbert spaces of functions depending on infinitely many variables. We focus on unrestricted linear information, admitting evaluations of arbitrary continuous linear functionals. We distinguish between ANOVA and non-ANOVA spaces, where, by ANOVA spaces, we refer to function spaces whose norms are induced by an underlying ANOVA function decomposition. In ANOVA spaces, we prove that there is an optimal algorithm to solve the approximation problem using linear information. This way, we can determine the exact polynomial convergence rate of $n$-th minimal worst-case errors. For non-ANOVA spaces, we also establish upper and lower error bounds. Even though the bounds do not match in this case, they reveal that for weights with a moderate decay behavior, the convergence rate of $n$-th minimal errors is strictly higher in ANOVA than in non-ANOVA spaces.
It is a common phenomenon that for high-dimensional and nonparametric statistical models, rate-optimal estimators balance squared bias and variance. Although this balancing is widely observed, little is known about whether methods exist that could avoid the trade-off between bias and variance. We propose a general strategy to obtain lower bounds on the variance of any estimator with bias smaller than a prespecified bound. This shows the extent to which the bias-variance trade-off is unavoidable and allows us to quantify the loss of performance for methods that do not obey it. The approach is based on a number of abstract lower bounds for the variance involving the change of expectation with respect to different probability measures as well as information measures such as the Kullback-Leibler or $\chi^2$-divergence. In the second part of the article, the abstract lower bounds are applied to several statistical models, including the Gaussian white noise model, a boundary estimation problem, the Gaussian sequence model and the high-dimensional linear regression model. For these specific statistical applications, different types of bias-variance trade-offs occur that vary considerably in their strength. For the trade-off between integrated squared bias and integrated variance in the Gaussian white noise model, we propose to combine the general strategy for lower bounds with a reduction technique. This allows us to reduce the original problem to a lower bound on the bias-variance trade-off for estimators with additional symmetry properties in a simpler statistical model. In the Gaussian sequence model, different phase transitions of the bias-variance trade-off occur. Although there is a non-trivial interplay between bias and variance, the rates of the squared bias and the variance do not have to be balanced in order to achieve the minimax estimation rate.
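One simple instance of such an abstract bound (a standard change-of-measure argument, stated here only for illustration and assuming the estimator $\hat\theta$ is square-integrable under $P$ and that $Q$ is absolutely continuous with respect to $P$) follows from the Cauchy-Schwarz inequality:
$$\mathbb{E}_Q[\hat\theta]-\mathbb{E}_P[\hat\theta]=\mathbb{E}_P\Big[\big(\hat\theta-\mathbb{E}_P[\hat\theta]\big)\Big(\frac{dQ}{dP}-1\Big)\Big]\leq\sqrt{\operatorname{Var}_P(\hat\theta)\,\chi^2(Q,P)},\qquad\text{hence}\qquad\operatorname{Var}_P(\hat\theta)\geq\frac{\big(\mathbb{E}_Q[\hat\theta]-\mathbb{E}_P[\hat\theta]\big)^2}{\chi^2(Q,P)}.$$
If the bias of $\hat\theta$ is at most $B$ under both measures, the numerator is at least $(|\theta(Q)-\theta(P)|-2B)^2$ whenever this quantity is nonnegative, which yields a variance lower bound of the advertised type for estimators with small bias.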
Variational Monte Carlo (VMC) is a promising approach for computing the ground state energy of many-body quantum problems and has attracted increasing interest due to developments in machine learning. Recent VMC paradigms construct neural networks as trial wave functions, sample quantum configurations using Markov chain Monte Carlo (MCMC), and train the neural networks with stochastic gradient descent (SGD). However, the theoretical convergence of VMC is still unknown when SGD interacts with MCMC sampling given a well-designed trial wave function. While MCMC reduces the difficulty of estimating gradients, it introduces an unavoidable bias in practice. Moreover, the local energy may be unbounded, which makes it harder to analyze the error of MCMC sampling. Therefore, we assume that the local energy is sub-exponential and use the Bernstein inequality for non-stationary Markov chains to derive error bounds for the MCMC estimator. Consequently, VMC is proven to have a first-order convergence rate $O(\log K/\sqrt{n K})$ with $K$ iterations and a sample size $n$. This partially explains how MCMC influences the behavior of SGD. Furthermore, we verify the so-called correlated negative curvature condition and relate it to the zero-variance phenomenon in solving eigenvalue problems. It is shown that VMC escapes from saddle points and reaches $(\epsilon,\epsilon^{1/4})$-approximate second-order stationary points or $\epsilon^{1/2}$-variance points within $O(\epsilon^{-11/2}\log^{2}(1/\epsilon))$ steps with high probability. Our analysis enriches the understanding of how VMC converges efficiently and can be applied to general variational methods in physics and statistics.
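A minimal sketch of the VMC loop described above, using a one-dimensional harmonic oscillator with a single-parameter trial wave function $\psi_\alpha(x)=e^{-\alpha x^2}$ in place of a neural network ansatz (all modeling choices below are illustrative, not those of the analysis): configurations are sampled with Metropolis MCMC, the local energy is averaged to form a stochastic gradient, and the parameter is updated by SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_psi(x, alpha):
    # log of the trial wave function psi_alpha(x) = exp(-alpha * x^2)
    return -alpha * x**2

def local_energy(x, alpha):
    # E_loc = (H psi)/psi for the 1D harmonic oscillator H = -0.5 d^2/dx^2 + 0.5 x^2
    return alpha + x**2 * (0.5 - 2.0 * alpha**2)

def metropolis_sample(alpha, n, step=1.0, burn=200):
    # MCMC sampling from |psi_alpha|^2; the chain is only asymptotically distributed
    # as |psi|^2, which is the source of the sampling bias discussed above.
    x, samples = 0.0, []
    for t in range(burn + n):
        prop = x + step * rng.standard_normal()
        if np.log(rng.uniform()) < 2.0 * (log_psi(prop, alpha) - log_psi(x, alpha)):
            x = prop
        if t >= burn:
            samples.append(x)
    return np.array(samples)

alpha, lr, n = 0.2, 0.05, 500
for k in range(200):                      # K iterations of SGD
    xs = metropolis_sample(alpha, n)      # sample size n per iteration
    e_loc = local_energy(xs, alpha)
    dlog = -xs**2                         # d/dalpha of log psi_alpha(x)
    grad = 2.0 * (np.mean(e_loc * dlog) - np.mean(e_loc) * np.mean(dlog))
    alpha -= lr * grad                    # stochastic gradient descent step
print(alpha, np.mean(e_loc))              # approaches alpha = 0.5, ground energy 0.5
```

At the exact ground state ($\alpha=0.5$) the local energy is constant, so the gradient estimator has zero variance; this is the zero-variance phenomenon the correlated negative curvature condition is related to above.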
A novel elastic time distance for sparse multivariate functional data is proposed and used to develop a robust distance-based two-layer partition clustering method. With this proposed distance, the new approach not only detects the correct clusters for sparse multivariate functional data under outlier settings but also identifies those outliers that do not belong to any cluster. Classical distance-based clustering methods such as density-based spatial clustering of applications with noise (DBSCAN), agglomerative hierarchical clustering, and $K$-medoids are extended to the sparse multivariate functional case based on the newly proposed distance. Numerical experiments on simulated data highlight that the performance of the proposed algorithm is superior to that of existing model-based and extended distance-based methods. The effectiveness of the proposed approach is demonstrated using Northwest Pacific cyclone track data as an example.
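A minimal sketch of the distance-based pipeline described above, with a placeholder interpolation-based $L^2$ distance standing in for the proposed elastic time distance, and DBSCAN run on the precomputed distance matrix so that label $-1$ flags outlying curves (the simulated curves and tuning values are illustrative only):

```python
import numpy as np
from scipy.interpolate import interp1d
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

def curve_distance(t1, y1, t2, y2, grid=np.linspace(0, 1, 50)):
    # Placeholder distance: linearly interpolate each sparsely observed curve onto a
    # common grid and take the L2 difference. The paper's elastic time distance
    # would replace this function.
    f1 = interp1d(t1, y1, bounds_error=False, fill_value="extrapolate")
    f2 = interp1d(t2, y2, bounds_error=False, fill_value="extrapolate")
    return np.sqrt(np.mean((f1(grid) - f2(grid)) ** 2))

# Sparsely observed curves from two groups plus one outlying curve.
curves = []
for i in range(20):
    t = np.sort(rng.uniform(0, 1, rng.integers(5, 10)))   # sparse, irregular design
    mean = np.sin(2 * np.pi * t) if i < 10 else np.cos(2 * np.pi * t)
    curves.append((t, mean + 0.1 * rng.standard_normal(t.size)))
curves.append((np.sort(rng.uniform(0, 1, 6)), 5 + rng.standard_normal(6)))  # outlier

n = len(curves)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = curve_distance(*curves[i], *curves[j])

# DBSCAN on the precomputed distance matrix; label -1 marks curves flagged as outliers.
labels = DBSCAN(eps=0.4, min_samples=3, metric="precomputed").fit_predict(D)
print(labels)
```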
We consider network games where a large number of agents interact according to a network sampled from a random network model represented by a graphon. By exploiting previous results on the convergence of such large network games to graphon games, we examine a procedure for estimating unknown payoff parameters from observations of equilibrium actions, without requiring exact network information. We prove smoothness and local convexity of the optimization problem involved in computing the proposed estimator. Additionally, under a notion of graphon parameter identifiability, we show that the optimal estimator is globally unique. We present several examples of identifiable homogeneous and heterogeneous parameters in different classes of linear quadratic network games with numerical simulations to validate the proposed estimator.
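As a stylized illustration of estimating payoff parameters from equilibrium actions (a simplified linear quadratic example with a hypothetical network model, not the graphon-based estimator studied above), suppose each agent's best response is $x_i = a + \alpha \sum_j P_{ij} x_j$ for a sampled network $P$; the equilibrium is then $x = (I - \alpha P)^{-1} a \mathbf{1}$, and $(a,\alpha)$ can be recovered by least squares from observed equilibrium actions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample an Erdos-Renyi-type network as a stand-in for a graphon-sampled network,
# and row-normalize so that (P x)_i aggregates neighbors' actions.
n = 200
A = (rng.uniform(size=(n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T
P = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)

a_true, alpha_true = 1.0, 0.4
x = np.linalg.solve(np.eye(n) - alpha_true * P, a_true * np.ones(n))  # equilibrium actions
x_obs = x + 0.01 * rng.standard_normal(n)                             # noisy observations

# Least-squares fit of the best-response equations x_i ~ a + alpha * (P x)_i.
design = np.column_stack([np.ones(n), P @ x_obs])
a_hat, alpha_hat = np.linalg.lstsq(design, x_obs, rcond=None)[0]
print(a_hat, alpha_hat)
```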
The Wasserstein distance between mixing measures has come to occupy a central place in the statistical analysis of mixture models. This work proposes a new canonical interpretation of this distance and provides tools to perform inference on the Wasserstein distance between mixing measures in topic models. We consider the general setting of an identifiable mixture model consisting of mixtures of distributions from a set $\mathcal{A}$ equipped with an arbitrary metric $d$, and show that the Wasserstein distance between mixing measures is uniquely characterized as the most discriminative convex extension of the metric $d$ to the set of mixtures of elements of $\mathcal{A}$. The Wasserstein distance between mixing measures has been widely used in the study of such models, but without axiomatic justification. Our results establish this metric to be a canonical choice. Specializing our results to topic models, we consider estimation and inference of this distance. Though upper bounds for its estimation have been recently established elsewhere, we prove the first minimax lower bounds for the estimation of the Wasserstein distance in topic models. We also establish fully data-driven inferential tools for the Wasserstein distance in the topic model context. Our results apply to potentially sparse mixtures of high-dimensional discrete probability distributions. These results allow us to obtain the first asymptotically valid confidence intervals for the Wasserstein distance in topic models.
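To make the quantity being estimated concrete (a toy example with hypothetical topics and mixing weights, using total variation as the ground metric $d$), the Wasserstein distance between two mixing measures supported on a common set of topics is the value of a small optimal transport linear program, solvable here with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_mixing(a, A, b, B):
    """Wasserstein distance between mixing measure a over topics A and b over topics B,
    with ground metric d = total variation between the topic distributions."""
    K, L = len(a), len(b)
    # Cost matrix: ground distance between mixture components (topics).
    C = np.array([[0.5 * np.abs(A[i] - B[j]).sum() for j in range(L)] for i in range(K)])
    # Transport LP: minimize <C, P> subject to marginal constraints, P >= 0.
    A_eq = np.zeros((K + L, K * L))
    for i in range(K):
        A_eq[i, i * L:(i + 1) * L] = 1.0          # row sums equal a
    for j in range(L):
        A_eq[K + j, j::L] = 1.0                   # column sums equal b
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

# Toy topic model: 3 topics over a vocabulary of 4 words, two documents' mixing measures.
topics = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.7, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 0.7]])
a = np.array([0.5, 0.5, 0.0])    # mixing weights of document 1
b = np.array([0.2, 0.3, 0.5])    # mixing weights of document 2
print(wasserstein_mixing(a, topics, b, topics))
```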
This paper considers statistical inference of time-varying network vector autoregression models for large-scale time series. A latent group structure is imposed on the heterogeneous and node-specific time-varying momentum and network spillover effects so that the number of unknown time-varying coefficients to be estimated can be reduced considerably. A classic agglomerative clustering algorithm with normalized distance matrix estimates is combined with a generalized information criterion to consistently estimate the latent group number and membership. A post-grouping local linear smoothing method is proposed to estimate the group-specific time-varying momentum and network effects, substantially improving the convergence rates of the preliminary estimates which ignore the latent structure. In addition, a post-grouping specification test is conducted to verify the validity of the parametric model assumption for group-specific time-varying coefficient functions, and the asymptotic theory is derived for the test statistic constructed via a kernel weighted quadratic form under the null and alternative hypotheses. Numerical studies including Monte Carlo simulation and an empirical application to global trade flow data are presented to examine the finite-sample performance of the developed model and methodology.
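A minimal sketch of the grouping step described above: preliminary node-specific coefficient curves are compared through a normalized distance matrix and grouped by agglomerative clustering (the generalized information criterion for choosing the group number and the post-grouping local linear estimation are omitted; the curves below are simulated placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Simulated preliminary estimates of node-specific time-varying coefficients
# evaluated on a common time grid: two latent groups plus estimation noise.
T, n_nodes = 50, 30
grid = np.linspace(0, 1, T)
true_group = np.repeat([0, 1], n_nodes // 2)
curves = np.where(true_group[:, None] == 0, 0.5 * np.sin(np.pi * grid), 0.2 + 0.3 * grid)
curves = curves + 0.05 * rng.standard_normal((n_nodes, T))

# Normalized (root-mean-square) distance matrix between estimated coefficient curves.
diff = curves[:, None, :] - curves[None, :, :]
D = np.sqrt((diff ** 2).mean(axis=2))

# Agglomerative clustering; the number of groups (2 here) would be selected by an
# information criterion in the actual procedure.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```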
Interleaving is an online evaluation approach for information retrieval systems that compares the effectiveness of ranking functions by interpreting users' implicit feedback. Previous work, such as Hofmann et al. (2011), evaluated the most promising interleaving methods at the time on uniform distributions of queries. In the real world, there is typically an unbalanced distribution of repeated queries that follows a long-tailed user search demand curve. The more often a query is executed, by different users (or in different sessions), the higher the probability of collecting implicit feedback (interactions/clicks) on the related search results. This paper first aims to replicate the Team Draft Interleaving accuracy evaluation on uniform query distributions and then focuses on assessing how this method generalizes to long-tailed real-world scenarios. The reproducibility work raised interesting considerations on how the winning ranking function for each query should impact the overall winner for the entire evaluation. Based on these observations, we propose that not all queries should contribute to the final decision in equal proportion. As a result of these insights, we designed two variations of the $\Delta_{AB}$ score winner estimator that assign to each query a credit based on statistical hypothesis testing. To replicate, reproduce and extend the original work, we have developed from scratch a system that simulates a search engine and users' interactions using industrial datasets. Our experiments confirm our intuition and show that our methods are promising in terms of accuracy, sensitivity, and robustness to noise.
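A minimal sketch of Team Draft Interleaving and a per-query winner decision (the baseline building block only; the $\Delta_{AB}$ credit variations based on hypothesis testing proposed above are not implemented here), assuming both rankings cover the same set of documents:

```python
import random

rng = random.Random(0)

def team_draft_interleave(rank_a, rank_b):
    """Team Draft Interleaving (simplified): the team that has contributed fewer
    documents picks next (coin flip on ties), taking its highest-ranked unused document.
    Assumes both rankings contain the same documents."""
    interleaved, team = [], {}
    docs = set(rank_a) | set(rank_b)
    while len(interleaved) < len(docs):
        n_a = sum(1 for d in interleaved if team[d] == "A")
        n_b = len(interleaved) - n_a
        pick_a = n_a < n_b or (n_a == n_b and rng.random() < 0.5)
        source = rank_a if pick_a else rank_b
        doc = next(d for d in source if d not in team)
        team[doc] = "A" if pick_a else "B"
        interleaved.append(doc)
    return interleaved, team

def query_winner(clicked_docs, team):
    # Credit each click to the team of the clicked document.
    a = sum(1 for d in clicked_docs if team.get(d) == "A")
    b = sum(1 for d in clicked_docs if team.get(d) == "B")
    return "A" if a > b else "B" if b > a else "tie"

# Toy impression: two rankings of the same documents, the user clicks document "d2".
ranking, team = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d1", "d3"])
print(ranking, query_winner(["d2"], team))
```

In the plain $\Delta_{AB}$ estimator every query impression would contribute such a per-query outcome with equal weight; the variations above instead assign each query a credit before aggregation.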
Learning precise surrogate models of complex computer simulations and physical machines often requires long-lasting or expensive experiments. Furthermore, the modeled physical dependencies exhibit nonlinear and nonstationary behavior. Machine learning methods used to produce the surrogate model should therefore address these problems by providing a scheme to keep the number of queries small, e.g. by using active learning, and by being able to capture the nonlinear and nonstationary properties of the system. One way of modeling the nonstationarity is to induce an input partitioning, a principle that has proven to be advantageous in active learning for Gaussian processes. However, these methods either assume a known partitioning, need to introduce complex sampling schemes, or rely on very simple geometries. In this work, we present a simple yet powerful kernel family that incorporates a partitioning that: i) is learnable via gradient-based methods, and ii) uses a geometry that is more flexible than previous ones, while still being applicable in the low-data regime. Thus, it provides a good prior for active learning procedures. We empirically demonstrate excellent performance on various active learning tasks.
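A minimal sketch of one way a partition-inducing kernel can be built: a softmax gating over learnable region centers combined with per-region RBF kernels (this particular construction is an illustrative assumption, not the kernel family proposed above). It is written in NumPy for brevity; gradient-based learning of the centers, sharpness and lengthscales would be done in an autodiff framework. The sum of products of the rank-one gating terms and the per-region RBF kernels is itself a valid positive semidefinite kernel.

```python
import numpy as np

def soft_partition_kernel(X1, X2, centers, sharpness, lengthscales):
    """k(x, x') = sum_m sigma_m(x) sigma_m(x') k_m(x, x'): a softmax gating over region
    centers softly assigns points to regions, each region with its own RBF lengthscale."""
    def gate(X):
        logits = -sharpness * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        logits -= logits.max(axis=1, keepdims=True)
        w = np.exp(logits)
        return w / w.sum(axis=1, keepdims=True)            # (n, M) soft memberships
    G1, G2 = gate(X1), gate(X2)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)   # (n1, n2) squared distances
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for m in range(centers.shape[0]):
        K += np.outer(G1[:, m], G2[:, m]) * np.exp(-0.5 * sq / lengthscales[m] ** 2)
    return K

# Toy 1-D example: two regions with different smoothness.
X = np.linspace(0, 1, 8)[:, None]
centers = np.array([[0.25], [0.75]])     # learnable region centers
K = soft_partition_kernel(X, X, centers, sharpness=50.0,
                          lengthscales=np.array([0.3, 0.05]))
print(np.all(np.linalg.eigvalsh(K) > -1e-10))   # numerical PSD check
```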
We discuss Bayesian inference for a known-mean Gaussian model with a compound symmetric variance-covariance matrix. Since the space of such matrices is a linear subspace of that of positive definite matrices, we utilize the methods of Pisano (2022) to decompose the usual Wishart conjugate prior and derive a closed-form, three-parameter, bivariate conjugate prior distribution for the compound-symmetric half-precision matrix. The off-diagonal entry is found to have a non-central Kummer-Beta distribution conditioned on the diagonal, which is shown to have a gamma distribution generalized with Gauss's hypergeometric function. Such considerations yield a treatment of maximum a posteriori estimation for such matrices in Gaussian settings, including the Bayesian evidence and flexibility penalty attributable to Rougier and Priebe (2019). We also demonstrate how the prior may be utilized to naturally test for the positivity of a common within-class correlation in a random-intercept model using two data-driven examples.
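To make the compound-symmetric structure concrete (a small generic sketch using the standard parameterization $\Sigma = \sigma^2[(1-\rho)I_p + \rho J_p]$; the conjugate prior, evidence and flexibility penalty discussed above are not implemented here), the matrix is positive definite exactly when $-1/(p-1) < \rho < 1$, with eigenvalues $\sigma^2(1+(p-1)\rho)$ and $\sigma^2(1-\rho)$, which gives a closed-form known-mean Gaussian likelihood:

```python
import numpy as np

def compound_symmetric(p, sigma2, rho):
    """Compound symmetric covariance: common variance sigma2, common correlation rho."""
    return sigma2 * ((1 - rho) * np.eye(p) + rho * np.ones((p, p)))

def log_likelihood(X, sigma2, rho):
    """Known-mean (zero) Gaussian log-likelihood, using the closed-form eigenstructure
    of the compound symmetric matrix instead of a generic matrix inverse."""
    n, p = X.shape
    lam1 = sigma2 * (1 + (p - 1) * rho)          # eigenvalue along the all-ones direction
    lam2 = sigma2 * (1 - rho)                    # eigenvalue on its orthogonal complement
    logdet = np.log(lam1) + (p - 1) * np.log(lam2)
    row_means = X.mean(axis=1)
    q1 = p * np.sum(row_means ** 2)              # quadratic form in the all-ones direction
    q2 = np.sum(X ** 2) - q1                     # remaining quadratic form
    return -0.5 * (n * p * np.log(2 * np.pi) + n * logdet + q1 / lam1 + q2 / lam2)

rng = np.random.default_rng(0)
Sigma = compound_symmetric(p=4, sigma2=2.0, rho=0.3)
X = rng.multivariate_normal(np.zeros(4), Sigma, size=500)
# Profile the likelihood over rho on its admissible range (-1/(p-1), 1).
grid = np.linspace(-0.3, 0.95, 100)
print(grid[np.argmax([log_likelihood(X, 2.0, r) for r in grid])])
```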
Approximating convex bodies is a fundamental question in geometry and has a wide variety of applications. Consider a convex body $K$ of diameter $\Delta$ in $\textbf{R}^d$ for fixed $d$. The objective is to minimize the number of vertices (alternatively, the number of facets) of an approximating polytope for a given Hausdorff error $\varepsilon$. It is known from classical results of Dudley (1974) and Bronshteyn and Ivanov (1976) that $\Theta((\Delta/\varepsilon)^{(d-1)/2})$ vertices (alternatively, facets) are both necessary and sufficient. While this bound is tight in the worst case, that of Euclidean balls, it is far from optimal for skinny convex bodies. A natural way to characterize a convex object's skinniness is in terms of its relationship to the Euclidean ball. Given a convex body $K$, define its \emph{volume diameter} $\Delta_d$ to be the diameter of a Euclidean ball of the same volume as $K$, and define its \emph{surface diameter} $\Delta_{d-1}$ analogously for surface area. It follows from generalizations of the isoperimetric inequality that $\Delta \geq \Delta_{d-1} \geq \Delta_d$. Arya, da Fonseca, and Mount (SoCG 2012) demonstrated that the diameter-based bound could be made surface-area sensitive, improving the above bound to $O((\Delta_{d-1}/\varepsilon)^{(d-1)/2})$. In this paper, we strengthen this by proving the existence of an approximation with $O((\Delta_d/\varepsilon)^{(d-1)/2})$ facets.
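To illustrate the diameter notions defined above (a small numerical sketch; the axis-aligned box is an arbitrary example), the volume diameter and surface diameter follow directly from the body's volume and surface area by matching a Euclidean ball: $\Delta_d = 2(\mathrm{vol}(K)/\omega_d)^{1/d}$ and $\Delta_{d-1} = 2(\mathrm{area}(K)/s_d)^{1/(d-1)}$, where $\omega_d$ and $s_d$ are the volume and surface area of the unit ball in $\textbf{R}^d$:

```python
import math

def unit_ball_volume(d):
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def unit_ball_surface(d):
    # Surface area of the unit (d-1)-sphere in R^d.
    return d * unit_ball_volume(d)

def volume_diameter(volume, d):
    return 2.0 * (volume / unit_ball_volume(d)) ** (1.0 / d)

def surface_diameter(area, d):
    return 2.0 * (area / unit_ball_surface(d)) ** (1.0 / (d - 1))

# Example: a skinny axis-aligned box [0,10] x [0,1] x [0,1] in R^3.
d, sides = 3, [10.0, 1.0, 1.0]
vol = math.prod(sides)
area = 2 * (sides[0] * sides[1] + sides[0] * sides[2] + sides[1] * sides[2])
diam = math.sqrt(sum(s ** 2 for s in sides))
print(diam, surface_diameter(area, d), volume_diameter(vol, d))
# The ordering Delta >= Delta_{d-1} >= Delta_d from the isoperimetric inequality holds,
# and for such skinny bodies the volume-sensitive bound is much smaller.
```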