We establish precise structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators. Specifically, we prove that linear and quadratic functionals of subsample ridge estimators, when fitted with different ridge regularization levels $\lambda$ and subsample aspect ratios $\psi$, are asymptotically equivalent along specific paths in the $(\lambda, \psi)$-plane (where $\psi$ is the ratio of the feature dimension to the subsample size). Our results require only bounded moment assumptions on the feature and response distributions and allow for arbitrary joint distributions. Furthermore, we provide a data-dependent method to determine the equivalent paths of $(\lambda, \psi)$. An indirect implication of our equivalences is that optimally tuned ridge regression exhibits prediction risk that is monotonic in the data aspect ratio. This resolves a recent open problem raised by Nakkiran et al. under general data distributions and a mild regularity condition that maintains regression hardness through linearized signal-to-noise ratios.
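As an informal illustration of such an equivalence path, the following simulation sketch (ours, not the paper's data-dependent method) compares an ensemble of subsample ridge estimators at one $(\lambda, \psi)$ against full-sample ridge estimators across a grid of penalties, and reports the penalty whose estimator is closest to the ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 600, 300                                   # full-data aspect ratio p/n = 0.5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)

def ridge(X, y, lam):
    # per-sample scaling: solve (X'X/k + lam*I) b = X'y/k
    k, p = X.shape
    return np.linalg.solve(X.T @ X / k + lam * np.eye(p), X.T @ y / k)

# ensemble of B subsample ridge fits at (lam1, psi = p/k)
lam1, k, B = 0.1, 200, 500
fits = []
for _ in range(B):
    idx = rng.choice(n, size=k, replace=False)
    fits.append(ridge(X[idx], y[idx], lam1))
ens = np.mean(fits, axis=0)

# search for the full-data ridge penalty closest to the ensemble estimator
grid = np.linspace(0.05, 2.0, 200)
lam_star = grid[int(np.argmin([np.linalg.norm(ridge(X, y, lam) - ens)
                               for lam in grid]))]
print(f"(lam={lam1}, psi={p/k:.2f}) ensemble ~ full-data ridge at lam*={lam_star:.2f}")
```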
We consider the problem of learning a sparse graph underlying an undirected Gaussian graphical model, a key problem in statistical machine learning. Given $n$ samples from a multivariate Gaussian distribution with $p$ variables, the goal is to estimate the $p \times p$ inverse covariance matrix (aka precision matrix), assuming it is sparse (i.e., has a few nonzero entries). We propose GraphL0BnB, a new estimator based on an $\ell_0$-penalized version of the pseudolikelihood function, whereas most earlier approaches are based on the $\ell_1$-relaxation. Our estimator can be formulated as a convex mixed integer program (MIP), which can be difficult to solve at scale using off-the-shelf commercial solvers. To solve the MIP, we propose a custom nonlinear branch-and-bound (BnB) framework that solves node relaxations with tailored first-order methods. As a by-product of our BnB framework, we propose large-scale solvers for obtaining good primal solutions that are of independent interest. We derive novel statistical guarantees (estimation and variable selection) for our estimator and discuss how our approach improves upon existing estimators. Our numerical experiments on real/synthetic datasets suggest that our method can solve, to near-optimality, problem instances with $p = 10^4$ -- corresponding to a symmetric matrix of size $p \times p$ with $p^2/2$ binary variables. We demonstrate the usefulness of GraphL0BnB versus various state-of-the-art approaches on a range of datasets.
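For concreteness, a schematic form of such an $\ell_0$-penalized pseudolikelihood estimator (our notation; the paper's exact objective may include additional terms) is

```latex
\widehat{\Theta} \;\in\; \operatorname*{arg\,min}_{\Theta \in \mathbb{R}^{p \times p}}
\; \mathcal{L}_{\mathrm{PL}}(\Theta; X_1, \dots, X_n)
\;+\; \lambda_0 \sum_{i < j} \mathbf{1}\{\theta_{ij} \neq 0\}.
```

The indicator terms can be modeled with binary variables $z_{ij} \in \{0,1\}$ and big-$M$ constraints $|\theta_{ij}| \le M z_{ij}$, the standard route to a MIP formulation that a BnB framework can attack.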
Nonparametric estimators for the mean and the covariance functions of functional data are proposed. The setup covers a wide range of practical situations. The random trajectories are not necessarily differentiable, have unknown regularity, and are measured with error at discrete design points. The measurement error could be heteroscedastic. The design points could be either randomly drawn or common for all curves. The estimators depend on the local regularity of the stochastic process generating the functional data. We consider a simple estimator of this local regularity which exploits the replication and regularization features of functional data. Next, we use the ``smoothing first, then estimate'' approach for the mean and the covariance functions. The resulting estimators can be applied to both sparsely and densely sampled curves, are easy to calculate and to update, and perform well in simulations. Simulations built upon a real data example illustrate the effectiveness of the new approach.
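A minimal sketch of the ``smoothing first, then estimate'' pipeline (with a fixed bandwidth for simplicity; the paper instead adapts the smoothing to the estimated local regularity):

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 100, 40                          # N curves, m noisy observations each
grid = np.linspace(0, 1, 101)           # common evaluation grid

def nw_smooth(t_obs, y_obs, t_eval, h):
    # Nadaraya-Watson smoother with a Gaussian kernel
    w = np.exp(-0.5 * ((t_eval[:, None] - t_obs[None, :]) / h) ** 2)
    return (w @ y_obs) / w.sum(axis=1)

smoothed = np.empty((N, grid.size))
for i in range(N):
    t = np.sort(rng.uniform(0, 1, m))               # random design points
    x = np.sin(2 * np.pi * t) * rng.normal(1, 0.3)  # a toy random trajectory
    y = x + rng.normal(0, 0.2, m)                   # i.i.d. noise for simplicity
    smoothed[i] = nw_smooth(t, y, grid, h=0.08)     # smooth first ...

mean_hat = smoothed.mean(axis=0)                    # ... then estimate the mean
cov_hat = np.cov(smoothed, rowvar=False)            # and the covariance function
```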
In this article, we perform an asymptotic analysis of parallel Bayesian logspline density estimators. Such estimators are useful for the analysis of datasets that are partitioned into subsets and stored in separate databases, without the capability of accessing the full dataset from a single computer. The parallel estimator we introduce is in the spirit of a kernel density estimator introduced in recent studies. We provide a numerical procedure that produces the normalized density estimator itself, in place of a sampling algorithm. We then derive an error bound for the mean integrated squared error of the full-dataset posterior estimator. The error bound depends on the parameters that arise in logspline density estimation and in the numerical approximation procedure. In our analysis, we identify the parameter choices for which the error bound scales optimally with the number of samples. This provides our method with increased estimation accuracy while also minimizing the computational cost.
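A toy sketch of the parallel aggregation idea, using a Gaussian-mean model in place of the logspline family: each machine computes its subset log-posterior on a shared grid, and the full-data posterior is recovered by summing logs and renormalizing numerically, with no sampling algorithm required.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=1.0, scale=1.0, size=4000)
subsets = np.array_split(data, 4)          # stored on 4 separate machines
theta = np.linspace(0.5, 1.5, 1001)        # shared grid over the unknown mean

def subset_log_posterior(x, theta):
    # flat prior + Gaussian likelihood with known variance, computed locally
    return -0.5 * ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)

# full-data posterior is proportional to the product of subset posteriors
# (sum of logs), renormalized numerically on the grid -- no sampling needed
log_post = sum(subset_log_posterior(s, theta) for s in subsets)
log_post -= log_post.max()                 # stabilize before exponentiating
post = np.exp(log_post)
post /= np.trapz(post, theta)              # numerical normalization
```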
This paper introduces a novel self-consistency clustering algorithm ($K$-Tensors) designed for partitioning a distribution of positive semi-definite matrices based on their eigenstructures. As positive semi-definite matrices can be represented as ellipsoids in $\mathbb{R}^p$, $p \ge 2$, it is critical to maintain their structural information to perform effective clustering. However, traditional clustering algorithms applied to matrices often involve vectorization of the matrices, resulting in a loss of essential structural information. To address this issue, we propose a distance metric for clustering that is specifically based on the structural information of positive semi-definite matrices. This distance metric enables the clustering algorithm to consider the differences between positive semi-definite matrices and their projections onto a common space spanned by orthonormal vectors defined from a set of positive semi-definite matrices. This approach to clustering positive semi-definite matrices has broad applications in several domains, including financial and biomedical research, such as analyzing functional connectivity data. By maintaining the structural information of positive semi-definite matrices, our proposed algorithm promises to cluster them in a more meaningful way, thereby facilitating deeper insights into the underlying data in various applications.
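A hedged sketch of one way such a self-consistency iteration could look (our simplification, not the paper's exact algorithm): each cluster is summarized by an orthonormal basis, and matrices are assigned by how well projection onto that basis preserves them.

```python
import numpy as np

def k_tensors_sketch(mats, K, r, iters=50, seed=0):
    # K-means-style loop over PSD matrices: a cluster is summarized by an
    # orthonormal basis V (top-r eigenvectors of the summed members), and a
    # matrix S is scored by the projection residual ||S - V V' S V V'||_F
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(mats))
    for _ in range(iters):
        bases = []
        for k in range(K):
            member = [S for S, l in zip(mats, labels) if l == k]
            M = sum(member) if member else rng.standard_normal((mats[0].shape[0],) * 2)
            w, U = np.linalg.eigh((M + M.T) / 2)
            bases.append(U[:, np.argsort(w)[::-1][:r]])
        new = np.array([np.argmin([np.linalg.norm(S - V @ V.T @ S @ V @ V.T)
                                   for V in bases]) for S in mats])
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# toy usage: two groups of noisy PSD matrices with different principal axes
rng = np.random.default_rng(7)
noise = lambda: 0.05 * (lambda E: E @ E.T)(rng.standard_normal((3, 3)))
A = [np.diag([5., 1., 1.]) + noise() for _ in range(10)]
B = [np.diag([1., 1., 5.]) + noise() for _ in range(10)]
print(k_tensors_sketch(A + B, K=2, r=1))
```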
We study subsampling-based ridge ensembles in the proportional asymptotics regime, where the feature size grows proportionally with the sample size such that their ratio converges to a constant. By analyzing the squared prediction risk of ridge ensembles as a function of the explicit penalty $\lambda$ and the limiting subsample aspect ratio $\phi_s$ (the ratio of the feature size to the subsample size), we characterize contours in the $(\lambda, \phi_s)$-plane at any achievable risk. As a consequence, we prove that the risk of the optimal full ridgeless ensemble (fitted on all possible subsamples) matches that of the optimal ridge predictor. In addition, we prove strong uniform consistency of generalized cross-validation (GCV) over the subsample sizes for estimating the prediction risk of ridge ensembles. This allows for GCV-based tuning of full ridgeless ensembles without sample splitting and yields a predictor whose risk matches optimal ridge risk.
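To make the GCV component concrete, here is the textbook single-fit version of GCV for ridge regression (the paper's contribution is its strong uniform consistency over subsample sizes for ensembles, which this sketch does not capture):

```python
import numpy as np

def ridge_gcv(X, y, lam):
    # standard GCV: mean squared residual divided by (1 - df/n)^2, where
    # df is the trace of the ridge smoother matrix
    n, p = X.shape
    G = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)
    resid = y - G @ y
    df = np.trace(G)
    return (resid @ resid / n) / (1 - df / n) ** 2

# tune lambda by minimizing GCV on the full sample -- no sample splitting
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 50))
y = X @ rng.standard_normal(50) + rng.standard_normal(200)
lams = np.logspace(-3, 1, 40)
lam_hat = lams[int(np.argmin([ridge_gcv(X, y, l) for l in lams]))]
print(f"GCV-selected lambda: {lam_hat:.4f}")
```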
Sequential tests and their implied confidence sequences, which are valid at arbitrary stopping times, promise flexible statistical inference and on-the-fly decision making. However, strong guarantees are limited to parametric sequential tests that under-cover in practice or concentration-bound-based sequences that over-cover and have suboptimal rejection times. In this work, we consider \cite{robbins1970boundary}'s delayed-start normal-mixture sequential probability ratio tests, and we provide the first asymptotic type-I-error and expected-rejection-time guarantees under general non-parametric data generating processes, where the asymptotics are indexed by the test's burn-in time. The type-I-error results primarily leverage a martingale strong invariance principle and establish that these tests (and their implied confidence sequences) have type-I error rates approaching a desired $\alpha$-level. The expected-rejection-time results primarily leverage an identity inspired by It\^o's lemma and imply that, in certain asymptotic regimes, the expected rejection time approaches the minimum possible among $\alpha$-level tests. We show how to apply our results to sequential inference on parameters defined by estimating equations, such as average treatment effects. Together, our results establish these (ostensibly parametric) tests as general-purpose, non-parametric, and near-optimal. We illustrate this via numerical experiments.
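As a concrete instance, the two-sided normal-mixture boundary (obtained from Robbins's mixture martingale via Ville's inequality) can be run on the cumulative sum of centered observations, using the cumulative empirical variance as the clock; the sketch below is illustrative, with arbitrary burn-in and mixing parameters rather than the paper's tuned choices.

```python
import numpy as np

def mixture_boundary(v, rho, alpha):
    # two-sided normal-mixture boundary: under the null, the (scaled)
    # cumulative sum crosses it at any time with probability <= alpha
    return np.sqrt((v + rho) * np.log((v + rho) / (rho * alpha ** 2)))

rng = np.random.default_rng(4)
alpha, rho, burn_in = 0.05, 10.0, 50
x = rng.exponential(1.0, size=5000) - 0.9   # non-parametric data, true mean 0.1
s, v = np.cumsum(x), np.cumsum(x ** 2)      # running sum; variance-based clock
t = np.arange(1, x.size + 1)
reject = (np.abs(s) >= mixture_boundary(v, rho, alpha)) & (t >= burn_in)
print("first rejection time:", t[reject][0] if reject.any() else None)
```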
In recent decades, a number of ways of dealing with causality in practice, such as propensity score matching, the PC algorithm, and invariant causal prediction, have been introduced. Besides its interpretational appeal, the causal model provides the best out-of-sample prediction guarantees. In this paper, we study the identification of causal-like models from in-sample data that provide out-of-sample risk guarantees when predicting a target variable from a set of covariates. Whereas ordinary least squares provides the best in-sample risk with limited out-of-sample guarantees, causal models have the best out-of-sample guarantees but achieve an inferior in-sample risk. By defining a trade-off between these properties, we introduce $\textit{causal regularization}$. As the regularization is increased, it yields estimators whose risk is more stable across sub-samples, at the cost of an increased overall in-sample risk. The increased risk stability is shown to lead to out-of-sample risk guarantees. We provide finite-sample risk bounds for all models and prove the adequacy of cross-validation for attaining these bounds.
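Schematically (our notation, not necessarily the paper's exact criterion), causal regularization can be pictured as a penalized objective of the form

```latex
\hat{\beta}_{\gamma} \;\in\; \operatorname*{arg\,min}_{\beta}\;
\underbrace{\widehat{R}(\beta)}_{\text{in-sample risk}}
\;+\; \gamma \,
\underbrace{\max_{g}\,\bigl(\widehat{R}_{g}(\beta) - \widehat{R}(\beta)\bigr)}_{\text{instability of the risk across sub-samples } g},
```

with $\gamma = 0$ recovering ordinary least squares and large $\gamma$ favoring estimators whose risk is stable across sub-samples.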
Despite widespread adoption in practice, guarantees for the LASSO and Group LASSO are strikingly lacking in settings beyond statistical problems, and these algorithms are usually considered heuristics in the context of sparse convex optimization on deterministic inputs. We give the first recovery guarantees for the Group LASSO for sparse convex optimization with vector-valued features. We show that if a sufficiently large Group LASSO regularization is applied when minimizing a strictly convex function $l$, then the minimizer is a sparse vector supported on the vector-valued features with the largest $\ell_2$ norm of the gradient. Thus, repeating this procedure selects the same set of features as the Orthogonal Matching Pursuit algorithm, which admits recovery guarantees for any function $l$ with restricted strong convexity and smoothness via weak submodularity arguments. This answers open questions of Tibshirani et al. and Yasuda et al. Our result is the first to theoretically explain the empirical success of the Group LASSO for convex functions under general input instances, assuming only restricted strong convexity and smoothness. It also generalizes provable guarantees for the Sequential Attention algorithm of Yasuda et al., a feature selection algorithm inspired by the attention mechanism. As an application of our result, we give new results for the column subset selection problem, which is well studied when the loss is the Frobenius norm or another entrywise matrix loss. We give the first result for general loss functions for this problem that requires only restricted strong convexity and smoothness.
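The OMP-style selection rule that the Group LASSO is shown to match can be sketched in a few lines; here a toy least-squares instance stands in for the general loss $l$:

```python
import numpy as np

def group_omp(grad, refit, groups, k):
    # greedily add the group whose gradient block has the largest l2 norm
    # at the current restricted minimizer; `grad` and `refit` are oracles
    # for the convex loss l
    S, w = [], refit([])
    for _ in range(k):
        g = grad(w)
        scores = [np.linalg.norm(g[idx]) if j not in S else -np.inf
                  for j, idx in enumerate(groups)]
        S.append(int(np.argmax(scores)))
        w = refit(S)          # re-minimize l over the selected groups only
    return S, w

# toy instance: l(w) = 0.5*||Xw - y||^2 with 10 feature groups of size 3
rng = np.random.default_rng(5)
X = rng.standard_normal((100, 30))
w_true = np.zeros(30); w_true[[3, 4, 5, 12, 13, 14]] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(100)
groups = [list(range(3 * j, 3 * j + 3)) for j in range(10)]

def refit(S):
    w = np.zeros(30)
    idx = [i for j in S for i in groups[j]]
    if idx:
        w[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    return w

S, _ = group_omp(lambda w: X.T @ (X @ w - y), refit, groups, k=2)
print(S)   # expect groups 1 and 4 (true support {3,4,5} and {12,13,14})
```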
Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such a data collection mechanism often introduces complexities into the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing debiased estimators that remedy this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that, in the context of multi-armed bandits, it retains the non-asymptotic performance of the least squares estimator while attaining asymptotic normality. Consequently, this work helps connect two fruitful paradigms of adaptive inference: (a) non-asymptotic inference using concentration inequalities and (b) asymptotic inference via asymptotic normality.
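A schematic of the estimating-equation idea in a two-armed bandit (the specific $1/\sqrt{\cdot}$ weights are an illustrative choice of predictable weights, not necessarily the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, T, eps = np.array([0.0, 0.1]), 10_000, 0.1
counts, sums = np.zeros(2), np.zeros(2)
w_num, w_den = np.zeros(2), np.zeros(2)

for t in range(T):
    # epsilon-greedy bandit: the adaptive arm choice biases plain sample means
    if t < 2:
        a = t % 2
    elif rng.random() < eps:
        a = int(rng.integers(2))
    else:
        a = int(np.argmax(sums / counts))
    y = mu[a] + rng.standard_normal()
    counts[a] += 1; sums[a] += y
    w = 1.0 / np.sqrt(counts[a])        # predictable weight: depends only on
    w_num[a] += w * y; w_den[a] += w    # the past, restoring a martingale CLT

print("plain sample means:        ", sums / counts)
print("weighted estimating-eq sol:", w_num / w_den)
```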
The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most $n$, simple dynamic programming computes their edit distance exactly in $O(n^2)$ time, which is also the best possible (up to subpolynomial factors) assuming the Strong Exponential Time Hypothesis (SETH). The last few decades have seen tremendous progress in edit distance approximation, where the runtime has been brought down to subquadratic, near-linear, and even sublinear at the cost of approximation. In this paper, we study the dynamic edit distance problem, where the strings change dynamically as characters are substituted, inserted, or deleted over time. Each change may happen at any location of either of the two strings. The goal is to maintain the (exact or approximate) edit distance of such dynamic strings while minimizing the update time. The exact edit distance can be maintained in $\tilde{O}(n)$ time per update (Charalampopoulos, Kociumaka, Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the unprecedented progress in edit distance approximation in the static setting, strikingly little is known about dynamic edit distance approximation. Using off-the-shelf tools, it is possible to achieve an $O(n^{c})$-approximation in $n^{0.5-c+o(1)}$ update time for any constant $c\in [0,\frac16]$. Improving upon this trade-off remains open. The contribution of this work is a dynamic $n^{o(1)}$-approximation algorithm with amortized expected update time of $n^{o(1)}$. In other words, we bring the approximation-ratio and update-time product down to $n^{o(1)}$. Our solution utilizes an elegant framework of precision sampling trees for edit distance approximation (Andoni, Krauthgamer, Onak; 2010).
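For reference, the textbook quadratic-time baseline mentioned at the start is just a small dynamic program:

```python
def edit_distance(a: str, b: str) -> int:
    # O(|a|*|b|) dynamic program with a rolling 1-D array:
    # dp[j] holds the distance between the current prefix of a and b[:j]
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # (mis)match
    return dp[len(b)]

print(edit_distance("kitten", "sitting"))   # 3
```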