Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of $L_2$Boosting, which is tailored for regression, in a high-dimensional setting. Moreover, we introduce so-called \textquotedblleft post-Boosting\textquotedblright. This is a post-selection estimator which applies ordinary least squares to the variables selected in the first stage by $L_2$Boosting. Another variant is \textquotedblleft Orthogonal Boosting\textquotedblright\ where after each step an orthogonal projection is conducted. We show that both post-$L_2$Boosting and the orthogonal boosting achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. We show that the rate of convergence of the classical $L_2$Boosting depends on the design matrix described by a sparse eigenvalue constant. To show the latter results, we derive new approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of $L_2$Boosting. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Our results also allow a direct comparison between LASSO and boosting which has been missing from the literature. Finally, we present simulation studies and applications to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, post-$L_2$Boosting clearly outperforms LASSO.
This paper studies the impact of bootstrap procedure on the eigenvalue distributions of the sample covariance matrix under the high-dimensional factor structure. We provide asymptotic distributions for the top eigenvalues of bootstrapped sample covariance matrix under mild conditions. After bootstrap, the spiked eigenvalues which are driven by common factors will converge weakly to Gaussian limits via proper scaling and centralization. However, the largest non-spiked eigenvalue is mainly determined by order statistics of bootstrap resampling weights, and follows extreme value distribution. Based on the disparate behavior of the spiked and non-spiked eigenvalues, we propose innovative methods to test the number of common factors. According to the simulations and a real data example, the proposed methods are the only ones performing reliably and convincingly under the existence of both weak factors and cross-sectionally correlated errors. Our technical details contribute to random matrix theory on spiked covariance model with convexly decaying density and unbounded support, or with general elliptical distributions.
Mixtures of experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.
Consider words of length $n$. The set of all periods of a word of length $n$ is a subset of $\{0,1,2,\ldots,n-1\}$. However, any subset of $\{0,1,2,\ldots,n-1\}$ is not necessarily a valid set of periods. In a seminal paper in 1981, Guibas and Odlyzko have proposed to encode the set of periods of a word into an $n$ long binary string, called an autocorrelation, where a one at position $i$ denotes a period of $i$. They considered the question of recognizing a valid period set, and also studied the number of valid period sets for length $n$, denoted $\kappa_n$. They conjectured that $\ln(\kappa_n)$ asymptotically converges to a constant times $\ln^2(n)$. If improved lower bounds for $\ln(\kappa_n)/\ln^2(n)$ were proposed in 2001, the question of a tight upper bound has remained opened since Guibas and Odlyzko's paper. Here, we exhibit an upper bound for this fraction, which implies its convergence and closes this long standing conjecture. Moreover, we extend our result to find similar bounds for the number of correlations: a generalization of autocorrelations which encodes the overlaps between two strings.
Recently, high dimensional vector auto-regressive models (VAR), have attracted a lot of interest, due to novel applications in the health, engineering and social sciences. The presence of temporal dependence poses additional challenges to the theory of penalized estimation techniques widely used in the analysis of their iid counterparts. However, recent work (e.g., [Basu and Michailidis, 2015, Kock and Callot, 2015]) has established optimal consistency of $\ell_1$-LASSO regularized estimates applied to models involving high dimensional stable, Gaussian processes. The only price paid for temporal dependence is an extra multiplicative factor that equals 1 for independent and identically distributed (iid) data. Further, [Wong et al., 2020] extended these results to heavy tailed VARs that exhibit "$\beta$-mixing" dependence, but the rates rates are sub-optimal, while the extra factor is intractable. This paper improves these results in two important directions: (i) We establish optimal consistency rates and corresponding finite sample bounds for the underlying model parameters that match those for iid data, modulo a price for temporal dependence, that is easy to interpret and equals 1 for iid data. (ii) We incorporate more general penalties in estimation (which are not decomposable unlike the $\ell_1$ norm) to induce general sparsity patterns. The key technical tool employed is a novel, easy-to-use concentration bound for heavy tailed linear processes, that do not rely on "mixing" notions and give tighter bounds.
Two aspects of neural networks that have been extensively studied in the recent literature are their function approximation properties and their training by gradient descent methods. The approximation problem seeks accurate approximations with a minimal number of weights. In most of the current literature these weights are fully or partially hand-crafted, showing the capabilities of neural networks but not necessarily their practical performance. In contrast, optimization theory for neural networks heavily relies on an abundance of weights in over-parametrized regimes. This paper balances these two demands and provides an approximation result for shallow networks in $1d$ with non-convex weight optimization by gradient descent. We consider finite width networks and infinite sample limits, which is the typical setup in approximation theory. Technically, this problem is not over-parametrized, however, some form of redundancy reappears as a loss in approximation rate compared to best possible rates.
As a special infinite-order vector autoregressive (VAR) model, the vector autoregressive moving average (VARMA) model can capture much richer temporal patterns than the widely used finite-order VAR model. However, its practicality has long been hindered by its non-identifiability, computational intractability, and relative difficulty of interpretation. This paper introduces a novel infinite-order VAR model which, with only a little sacrifice of generality, inherits the essential temporal patterns of the VARMA model but avoids all of the above drawbacks. As another attractive feature, the temporal and cross-sectional dependence structures of this model can be interpreted separately, since they are characterized by different sets of parameters. For high-dimensional time series, this separation motivates us to impose sparsity on the parameters determining the cross-sectional dependence. As a result, greater statistical efficiency and interpretability can be achieved, while no loss of temporal information is incurred by the imposed sparsity. We introduce an $\ell_1$-regularized estimator for the proposed model and derive the corresponding nonasymptotic error bounds. An efficient block coordinate descent algorithm and a consistent model order selection method are developed. The merit of the proposed approach is supported by simulation studies and a real-world macroeconomic data analysis.
We introduce two new tools to assess the validity of statistical distributions. These tools are based on components derived from a new statistical quantity, the $comparison$ $curve$. The first tool is a graphical representation of these components on a $bar$ $plot$ (B plot), which can provide a detailed appraisal of the validity of the statistical model, in particular when supplemented by acceptance regions related to the model. The knowledge gained from this representation can sometimes suggest an existing $goodness$-$of$-$fit$ test to supplement this visual assessment with a control of the type I error. Otherwise, an adaptive test may be preferable and the second tool is the combination of these components to produce a powerful $\chi^2$-type goodness-of-fit test. Because the number of these components can be large, we introduce a new selection rule to decide, in a data driven fashion, on their proper number to take into consideration. In a simulation, our goodness-of-fit tests are seen to be powerwise competitive with the best solutions that have been recommended in the context of a fully specified model as well as when some parameters must be estimated. Practical examples show how to use these tools to derive principled information about where the model departs from the data.
We investigate $L_2$ boosting in the context of kernel regression. Kernel smoothers, in general, lack appealing traits like symmetry and positive definiteness, which are critical not only for understanding theoretical aspects but also for achieving good practical performance. We consider a projection-based smoother (Huang and Chen, 2008) that is symmetric, positive definite, and shrinking. Theoretical results based on the orthonormal decomposition of the smoother reveal additional insights into the boosting algorithm. In our asymptotic framework, we may replace the full-rank smoother with a low-rank approximation. We demonstrate that the smoother's low-rank ($d(n)$) is bounded above by $O(h^{-1})$, where $h$ is the bandwidth. Our numerical findings show that, in terms of prediction accuracy, low-rank smoothers may outperform full-rank smoothers. Furthermore, we show that the boosting estimator with low-rank smoother achieves the optimal convergence rate. Finally, to improve the performance of the boosting algorithm in the presence of outliers, we propose a novel robustified boosting algorithm which can be used with any smoother discussed in the study. We investigate the numerical performance of the proposed approaches using simulations and a real-world case.
This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.