It is a common phenomenon that for high-dimensional and nonparametric statistical models, rate-optimal estimators balance squared bias and variance. Although this balancing is widely observed, little is known about whether methods exist that could avoid the trade-off between bias and variance. We propose a general strategy to obtain lower bounds on the variance of any estimator with bias smaller than a prespecified bound. This shows to what extent the bias-variance trade-off is unavoidable and allows one to quantify the loss of performance for methods that do not obey it. The approach is based on a number of abstract lower bounds for the variance involving the change of expectation with respect to different probability measures as well as information measures such as the Kullback-Leibler or chi-square divergence. Some of these inequalities rely on a new concept of information matrices. In the second part of the article, the abstract lower bounds are applied to several statistical models including the Gaussian white noise model, a boundary estimation problem, the Gaussian sequence model and the high-dimensional linear regression model. For these specific statistical applications, different types of bias-variance trade-offs occur that vary considerably in their strength. For the trade-off between integrated squared bias and integrated variance in the Gaussian white noise model, we combine the general strategy for lower bounds with a reduction technique. This allows us to link the original problem to the bias-variance trade-off for estimators with additional symmetry properties in a simpler statistical model. In the Gaussian sequence model, different phase transitions of the bias-variance trade-off occur. Although there is a non-trivial interplay between bias and variance, the rates of the squared bias and the variance do not have to be balanced in order to achieve the minimax estimation rate.
The angular measure on the unit sphere characterizes the first-order dependence structure of the components of a random vector in extreme regions and is defined in terms of standardized margins. Its statistical recovery is an important step in learning problems involving observations far away from the center. In the common situation that the components of the vector have different distributions, the rank transformation offers a convenient and robust way of standardizing data in order to build an empirical version of the angular measure based on the most extreme observations. However, the study of the sampling distribution of the resulting empirical angular measure is challenging. It is the purpose of the paper to establish finite-sample bounds for the maximal deviations between the empirical and true angular measures, uniformly over classes of Borel sets of controlled combinatorial complexity. The bounds are valid with high probability and, up to logarithmic factors, scale as the square root of the effective sample size. The bounds are applied to provide performance guarantees for two statistical learning procedures tailored to extreme regions of the input space and built upon the empirical angular measure: binary classification in extreme regions through empirical risk minimization and unsupervised anomaly detection through minimum-volume sets of the sphere.
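The rank-based construction described above can be sketched as follows. This is a minimal illustration, not the paper's exact estimator: the function names, the `rank / (n + 1)` convention for the empirical distribution function, the choice of the sup-norm, and the rule of keeping the `k` points with the largest radius are all assumptions made for the example.

```python
import numpy as np

def rank_standardize(X):
    """Map each column to approximately unit-Pareto margins using ranks.

    Assumption: the empirical CDF is taken as rank / (n + 1) so that the
    transformed values stay finite.
    """
    n = X.shape[0]
    # Ranks 1..n within each column (ties broken by order of appearance).
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    return 1.0 / (1.0 - ranks / (n + 1.0))

def extreme_angles(X, k, ord=np.inf):
    """Angles (points on the unit sphere) of the k most extreme observations."""
    V = rank_standardize(X)
    radii = np.linalg.norm(V, ord=ord, axis=1)
    top = np.argsort(radii)[-k:]          # indices of the k largest radii
    return V[top] / radii[top, None]

def empirical_angular_measure(angles, indicator):
    """Fraction of extreme angles falling in the set described by `indicator`."""
    return float(np.mean([indicator(w) for w in angles]))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
angles = extreme_angles(X, k=20)
# Mass assigned to the region where the first coordinate dominates.
mass = empirical_angular_measure(angles, lambda w: w[0] >= max(w[1], w[2]))
```

Here `k` plays the role of the effective sample size mentioned in the abstract: only the `k` most extreme points contribute to the estimate.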
A common problem in data analysis is the separation of signal and background. We revisit and generalise the so-called $sWeights$ method, which allows one to calculate an empirical estimate of the signal density of a control variable using a fit of a mixed signal and background model to a discriminating variable. We show that $sWeights$ are a special case of a larger class of Custom Orthogonal Weight functions (COWs), which can be applied to a more general class of problems in which the discriminating and control variables are not necessarily independent and still achieve close to optimal performance. We also investigate the properties of parameters estimated from fits of statistical models to $sWeights$ and provide closed formulas for the asymptotic covariance matrix of the fitted parameters. To illustrate our findings, we discuss several practical applications of these techniques.
In this paper, we propose a variationally consistent technique for decreasing the maximum eigenfrequencies of structural dynamics related finite element formulations. Our approach is based on adding a symmetric positive-definite term to the mass matrix that follows from the integral of the traction jump across element boundaries. The added term is weighted by a small factor, for which we derive a suitable, and simple, element-local parameter choice. For linear problems, we show that our mass-scaling method produces no adverse effects in terms of spatial accuracy and orders of convergence. We illustrate these properties in one, two and three spatial dimensions, for quadrilateral elements and triangular elements, and for polynomial basis functions of up to fourth order. To extend the method to non-linear problems, we introduce a linear approximation and show that a sizeable increase in critical time-step size can be achieved while only causing minor (even beneficial) influences on the dynamic response.
The paper provides three results for SVARs under the assumption that the primitive shocks are mutually independent. First, a framework is proposed to accommodate a disaster-type variable with infinite variance into a VAR. We show that the least squares estimates of the VAR are consistent but have non-standard properties. Second, the disaster shock is identified as the component with the largest kurtosis and whose impact effect is negative. An estimator that is robust to infinite variance is used to recover the mutually independent components. Third, an independence test on the residuals pre-whitened by Choleski decomposition is proposed to test the restrictions imposed on a SVAR. The test can be applied whether the data have fat or thin tails, and to over-identified as well as exactly identified models. Three applications are considered. In the first, the independence test is used to shed light on the conflicting evidence regarding the role of uncertainty in economic fluctuations. In the second, disaster shocks are shown to have a short-term economic impact arising mostly from feedback dynamics. The third application uses the framework to study the dynamic effects of economic shocks post-covid.
The recursive and hierarchical structure of full rooted trees can be used to represent statistical models in various areas, such as data compression, image processing, and machine learning. In most of these cases, the full rooted tree is not a random variable; as such, model selection to avoid overfitting becomes problematic. One way to solve this problem is to assume a prior distribution on the full rooted trees. This enables optimal model selection based on Bayes decision theory. For example, by assigning a low prior probability to a complex model, the maximum a posteriori estimator prevents the selection of the complex one. Furthermore, we can average all the models weighted by their posteriors. In this paper, we propose a probability distribution on a set of full rooted trees. Its parametric representation is suitable for calculating the properties of our distribution, such as the mode, expectation, and posterior distribution, using recursive functions. Although such distributions have been proposed in previous studies, they are only applicable to specific applications. Therefore, we extract their mathematically essential components and derive new generalized methods to calculate the expectation, posterior distribution, etc.
In countries where population census and sample survey data are limited, generating accurate subnational estimates of health and demographic indicators is challenging. Existing model-based geostatistical methods leverage covariate information and spatial smoothing to reduce the variability of estimates but often assume the survey design is ignorable, which may be inappropriate given the complex design of household surveys typically used in this context. On the other hand, small area estimation approaches common in the survey statistics literature do not incorporate both unit-level covariate information and spatial smoothing in a design-consistent way. We propose a new smoothed model-assisted estimator that accounts for survey design and leverages both unit-level covariates and spatial smoothing, bridging the survey statistics and model-based geostatistics perspectives. Under certain assumptions, the new estimator can be viewed as both design-consistent and model-consistent, offering potential benefits from both perspectives. We demonstrate our estimator's performance using both real and simulated data, comparing it with existing design-based and model-based estimators.
The design of privacy mechanisms is studied for two scenarios, in which the private data is either hidden or observable. In the first scenario, an agent observes useful data $Y$, which is correlated with private data $X$, and wants to disclose the useful information to a user. A privacy mechanism is employed to generate data $U$ that maximizes the revealed information about $Y$ while satisfying a privacy criterion. In the second scenario, the agent additionally has access to the private data. To this end, the Functional Representation Lemma and Strong Functional Representation Lemma are extended by relaxing the independence condition, thereby allowing a certain leakage. Lower bounds on the privacy-utility trade-off are derived for the second scenario, as well as upper bounds for both scenarios. In particular, for the case where no leakage is allowed, our upper and lower bounds improve previous bounds.
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex with respect to the size of the training dataset, which results in them perfectly fitting (i.e., interpolating) the training data, which is usually noisy. Such interpolation of noisy data is traditionally associated with detrimental overfitting, and yet a wide range of interpolating models -- from simple linear models to deep neural networks -- have recently been observed to generalize extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon has revealed that highly overparameterized models often improve over the best underparameterized model in test performance. Understanding learning in this overparameterized regime requires new theory and foundational empirical studies, even for the simplest case of the linear model. The underpinnings of this understanding have been laid in very recent analyses of overparameterized linear regression and related statistical learning tasks, which resulted in precise analytic characterizations of double descent. This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective. We emphasize the unique aspects that define the TOPML research area as a subfield of modern ML theory and outline interesting open questions that remain.
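As an illustration of interpolation in the simplest linear setting mentioned above, the following sketch fits a minimum-norm least-squares solution with more parameters than training points. The dimensions, noise level, and the choice of the minimum-norm interpolator are assumptions made for the example; the abstract does not prescribe a specific estimator.

```python
import numpy as np

def min_norm_least_squares(X, y):
    """Minimum Euclidean-norm solution of X @ beta = y via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

rng = np.random.default_rng(0)
n, p = 20, 100                      # overparameterized: p > n
beta_true = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = X @ beta_true + 0.1 * rng.standard_normal(n)   # noisy training labels

beta_hat = min_norm_least_squares(X, y)

# The fit interpolates the noisy training data (residual is numerically zero).
train_residual = np.max(np.abs(X @ beta_hat - y))

# Any other interpolating solution differs by a null-space vector of X
# and therefore has a norm at least as large.
z = rng.standard_normal(p)
null_component = z - np.linalg.pinv(X) @ (X @ z)
other_interpolator = beta_hat + null_component
```

Sweeping `p` across `n` and recording the test error of `beta_hat` is one way to examine double descent empirically in this setting.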
Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and better generalization than existing methods.
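The idea of using gradient information to sample with low variance can be illustrated with generic importance sampling, sketched below. This is not the paper's algorithm: it ignores the graph structure and the embedding approximation step entirely, the function name is hypothetical, and it assumes per-example gradients are available and have non-zero norm.

```python
import numpy as np

def importance_sampled_gradient(grads, batch_size, rng):
    """Unbiased estimate of the mean gradient, sampling rows with
    probability proportional to their gradient norm.

    Assumption: every row of `grads` has a non-zero norm.
    """
    n = len(grads)
    norms = np.linalg.norm(grads, axis=1)
    probs = norms / norms.sum()
    idx = rng.choice(n, size=batch_size, p=probs)
    # Reweight by 1 / (n * p_i) so that the expectation equals the mean gradient.
    return np.mean(grads[idx] / (n * probs[idx, None]), axis=0)

rng = np.random.default_rng(0)
# Toy case: all gradients share one direction, so the estimate has zero variance.
grads = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.0])
estimate = importance_sampled_gradient(grads, batch_size=2, rng=rng)
```

In the toy case every reweighted sample equals the true mean gradient, which shows why norm-proportional probabilities are attractive; with gradients in different directions the estimator remains unbiased but is no longer exact.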
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.
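To make the trade-off between variance and performance on the training sample concrete, the sketch below evaluates a variance-penalized empirical risk of the form mean loss plus a multiple of the empirical standard deviation. The specific form `mean + sqrt(2 * rho * var / n)` and the parameter name `rho` are assumptions made for illustration; this penalized quantity is generally non-convex in the model parameters, which is the motivation for the convex surrogate the abstract refers to.

```python
import numpy as np

def variance_penalized_risk(losses, rho):
    """Empirical mean loss plus a standard-deviation penalty.

    Assumption: the penalty is sqrt(2 * rho * Var / n), with Var the
    empirical (population-style, ddof=0) variance of the losses.
    """
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    return losses.mean() + np.sqrt(2.0 * rho * losses.var() / n)

# Two hypothetical models with the same mean training loss but different spread.
stable_losses = np.array([1.0, 1.0, 1.0, 1.0])
erratic_losses = np.array([0.0, 2.0, 0.0, 2.0])

stable_score = variance_penalized_risk(stable_losses, rho=1.0)
erratic_score = variance_penalized_risk(erratic_losses, rho=1.0)
```

Both models have the same empirical risk, but the penalized criterion prefers the one whose losses vary less across the sample.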