Heavy-tailed error distributions and predictors with anomalous values are ubiquitous in high-dimensional regression problems and can seriously jeopardize the validity of statistical analyses if not properly addressed. For more reliable estimation under these adverse conditions, we propose a new robust regularized estimator for simultaneous variable selection and coefficient estimation. This estimator, called adaptive PENSE, possesses the oracle property without prior knowledge of the scale of the residuals and without any moment conditions on the error distribution. The proposed estimator gives reliable results even under very heavy-tailed error distributions and aberrant contamination in the predictors or residuals. Importantly, even in these challenging settings, variable selection by adaptive PENSE remains stable. Numerical studies on simulated and real data sets highlight superior finite-sample performance in a vast range of settings compared with other robust regularized estimators in contaminated samples, and competitive performance relative to classical regularized estimators in clean samples.
We present new estimators for the statistical analysis of the dependence of the mean gap time between consecutive recurrent events on a set of explanatory random variables in the presence of right censoring. The dependence is expressed through regression-like and overdispersion parameters, estimated via conditional estimating equations. The mean and variance of the length of each gap time, conditioned on the observed history of prior events and other covariates, are known functions of parameters and covariates. Under certain conditions on censoring, we construct normalized estimating functions that are asymptotically unbiased and contain only observed data. We discuss the existence, consistency and asymptotic normality of a sequence of estimators of the parameters, which are roots of these estimating equations. Simulations suggest that our estimators can be used successfully with a relatively small sample size in a study of short duration.
In more and more applications, a quantity of interest may depend on several covariates, with at least one of them infinite-dimensional (e.g. a curve). To select the relevant covariates in this context, we propose an adaptation of the Lasso method. Two estimation methods are defined. The first consists in minimizing a criterion inspired by classical Lasso inference under group sparsity (Yuan and Lin, 2006; Lounici et al., 2011) over the whole multivariate functional space H. The second minimizes the same criterion over a finite-dimensional subspace of H whose dimension is chosen by a penalized least-squares method based on the work of Barron et al. (1999). Sparsity oracle inequalities are proven for both fixed and random designs in our infinite-dimensional context. To compute the solutions of both criteria, we propose a coordinate-wise descent algorithm inspired by the glmnet algorithm (Friedman et al., 2007). A numerical study on simulated and experimental datasets illustrates the behavior of the estimators.
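To fix ideas, the sketch below shows cyclic block coordinate descent for a finite-dimensional group Lasso with a group soft-thresholding update. It assumes each group's design has been orthonormalized so that the block update is available in closed form; the function names are illustrative and this is not the paper's criterion on the multivariate functional space H.

```python
import numpy as np

def group_soft_threshold(z, t):
    """Block soft-thresholding: shrink the group vector z toward zero by t."""
    norm = np.linalg.norm(z)
    if norm <= t:
        return np.zeros_like(z)
    return (1.0 - t / norm) * z

def group_lasso_cd(X_groups, y, lam, n_iter=100):
    """Cyclic block coordinate descent for the group Lasso.

    Assumes each X_j has orthonormal columns (X_j.T @ X_j / n = I), so that
    each block update is an exact group soft-threshold of the partial
    correlation vector.
    """
    n = y.shape[0]
    betas = [np.zeros(X.shape[1]) for X in X_groups]
    for _ in range(n_iter):
        for j, X_j in enumerate(X_groups):
            # partial residual excluding group j
            r_j = y - sum(X_k @ b_k for k, (X_k, b_k) in
                          enumerate(zip(X_groups, betas)) if k != j)
            z_j = X_j.T @ r_j / n
            betas[j] = group_soft_threshold(z_j, lam)
    return betas
```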
We introduce and illustrate through numerical examples the R package \texttt{SIHR}, which performs statistical inference for (1) linear and quadratic functionals in high-dimensional linear regression and (2) linear functionals in high-dimensional logistic regression. The proposed algorithms focus on point estimation, confidence interval construction and hypothesis testing. The inference methods are extended to multiple regression models. We include real data applications to demonstrate the package's performance and practicality.
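As a rough illustration of the kind of inference involved (and not the algorithm implemented in \texttt{SIHR}), the Python sketch below forms a bias-corrected estimate and confidence interval for a linear functional loading^T beta in a sparse linear model, using a ridge-regularized direction in place of the constrained projection direction used by dedicated high-dimensional methods; all function names and tuning constants are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from scipy.stats import norm

def debiased_linear_functional(X, y, loading, lam=0.1, ridge=0.05, level=0.05):
    """Simplified debiased estimate and CI for loading^T beta in a sparse
    linear model (conceptual sketch, not the SIHR procedure)."""
    n, p = X.shape
    beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    resid = y - X @ beta_hat
    Sigma_hat = X.T @ X / n
    # crude projection direction: ridge-regularized precision applied to the loading
    u = np.linalg.solve(Sigma_hat + ridge * np.eye(p), loading)
    theta_hat = loading @ beta_hat + u @ X.T @ resid / n   # bias correction
    sigma2 = resid @ resid / n
    se = np.sqrt(sigma2 * (u @ Sigma_hat @ u) / n)
    z = norm.ppf(1 - level / 2)
    return theta_hat, (theta_hat - z * se, theta_hat + z * se)
```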
This paper aims to address two fundamental challenges arising in eigenvector estimation and inference for a low-rank matrix from noisy observations: (1) how to estimate an unknown eigenvector when the eigen-gap (i.e. the spacing between the associated eigenvalue and the rest of the spectrum) is particularly small; (2) how to perform estimation and inference on linear functionals of an eigenvector -- a sort of "fine-grained" statistical reasoning that goes far beyond the usual $\ell_2$ analysis. We investigate how to address these challenges in a setting where the unknown $n\times n$ matrix is symmetric and the additive noise matrix contains independent (and non-symmetric) entries. Based on eigen-decomposition of the asymmetric data matrix, we propose estimation and uncertainty quantification procedures for an unknown eigenvector, which further allow us to reason about linear functionals of an unknown eigenvector. The proposed procedures and the accompanying theory enjoy several important features: (1) distribution-free (i.e. prior knowledge about the noise distributions is not needed); (2) adaptive to heteroscedastic noise; (3) minimax optimal under Gaussian noise. Along the way, we establish optimal procedures to construct confidence intervals for the unknown eigenvalues. All this is guaranteed even in the presence of a small eigen-gap (up to $O(\sqrt{n/\mathrm{poly}\log (n)})$ times smaller than the requirement in prior theory), which goes significantly beyond what generic matrix perturbation theory has to offer.
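A minimal numerical illustration of the setting is given below: it compares the alignment with the truth of the leading eigenvector of the raw (asymmetric) data matrix and of its symmetrized version when the noise entries are independent and non-symmetric. The rank-one signal, the noise level and the matrix size are assumptions made for illustration only; the paper's actual procedures additionally provide uncertainty quantification.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
u_star = rng.standard_normal(n); u_star /= np.linalg.norm(u_star)
M = 3.0 * np.outer(u_star, u_star)            # rank-1 symmetric signal
H = 0.1 * rng.standard_normal((n, n))         # independent, non-symmetric noise
A = M + H

# leading eigenvector of the *asymmetric* matrix A (take real parts)
vals, vecs = np.linalg.eig(A)
u_asym = np.real(vecs[:, np.argmax(np.real(vals))])
u_asym /= np.linalg.norm(u_asym)

# baseline: leading eigenvector after symmetrizing the data
vals_s, vecs_s = np.linalg.eigh((A + A.T) / 2)
u_sym = vecs_s[:, -1]

print("asymmetric  |<u, u*>| =", abs(u_asym @ u_star))
print("symmetrized |<u, u*>| =", abs(u_sym @ u_star))
```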
Regression models in survival analysis based on homogeneous and inhomogeneous phase-type distributions are proposed. The intensity function in this setting plays the role of the hazard function, having the additional benefit of being interpreted in terms of a hidden Markov structure. For unidimensional intensity matrices, we recover the proportional hazards and accelerated failure time models, among others. However, when considering higher dimensions, the proposed methods are only asymptotically equivalent to their classical counterparts and enjoy greater flexibility in the body of the distribution. For their estimation, the latent path representation of inhomogeneous Markov models is exploited. Consequently, an adapted EM algorithm is provided for which the likelihood increases at each iteration. Several examples of practical significance and relevant extensions are examined. The practical feasibility of the models is illustrated on simulated and real-world datasets.
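The sketch below evaluates the survival, density and intensity (hazard) functions of a time-homogeneous phase-type distribution from a sub-intensity matrix, the object that plays the role of the hazard function in these regression models; the matrix, initial distribution and evaluation times are illustrative values, not taken from the paper.

```python
import numpy as np
from scipy.linalg import expm

# Sub-intensity matrix T and initial distribution pi of a phase-type law;
# the exit-rate vector is t0 = -T @ 1.  Values are illustrative only.
T = np.array([[-2.0, 1.0],
              [0.5, -1.5]])
pi = np.array([0.7, 0.3])
t0 = -T @ np.ones(2)

def survival(t):
    return pi @ expm(T * t) @ np.ones(2)

def density(t):
    return pi @ expm(T * t) @ t0

def intensity(t):
    # intensity (hazard) = density / survival
    return density(t) / survival(t)

for t in (0.5, 1.0, 2.0):
    print(t, survival(t), intensity(t))
```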
Estimating the unknown density from which a given independent sample originates is more difficult than estimating the mean, in the sense that for the best popular non-parametric density estimators, the mean integrated square error converges more slowly than at the canonical rate of $\mathcal{O}(1/n)$. When the sample is generated from a simulation model and we have control over how this is done, we can do better. We examine an approach in which conditional Monte Carlo yields, under certain conditions, a random conditional density which is an unbiased estimator of the true density at any point. By averaging independent replications, we obtain a density estimator that converges at a faster rate than the usual ones. Moreover, combining this new type of estimator with randomized quasi-Monte Carlo to generate the samples typically brings a larger improvement in the error and convergence rate than for the usual estimators, because the new estimator is smoother as a function of the underlying uniform random numbers.
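As a minimal sketch of the conditional Monte Carlo idea (plain Monte Carlo, without the randomized quasi-Monte Carlo refinement), consider the toy model Y = X1 + X2 with independent exponential components: conditioning on X1 and "hiding" X2 makes f_{X2}(y - X1) an unbiased estimator of the density of Y at every point y, and averaging over replications of X1 gives a smooth density estimate. The model and rates below are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10_000
ys = np.linspace(0.1, 8.0, 50)

# Toy model: Y = X1 + X2, X1 ~ Exp(rate 1), X2 ~ Exp(rate 2).
# Given X1, the conditional density of Y at y is f_{X2}(y - X1), which is an
# unbiased estimator of f_Y(y); averaging over replications of X1 estimates f_Y.
x1 = rng.exponential(scale=1.0, size=n)
cond_dens = stats.expon(scale=0.5).pdf(ys[None, :] - x1[:, None])  # shape (n, len(ys))
f_hat = cond_dens.mean(axis=0)   # conditional Monte Carlo density estimate

print(f_hat[:5])
```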
The aim of this paper is to study the asymptotic behavior of a particular multivariate risk measure, the Covariate-Conditional-Tail-Expectation (CCTE), based on a multivariate statistical depth function. Depth functions have become increasingly powerful tools in nonparametric inference for multivariate data, as they measure the degree of centrality of a point with respect to a distribution. A multivariate risk scenario is then represented by a depth-based lower level set of the risk factors, meaning that we consider a non-compact setting. More precisely, given a multivariate depth function D associated to a fixed probability measure, we are interested in the lower level set based on D. First, we present a plug-in approach to estimate the depth-based level set. Second, we provide a consistent estimator of our CCTE for a general depth function with a rate of convergence, and we consider the particular case of the Mahalanobis depth. A simulation study illustrates the performance of our estimator.
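The sketch below gives a crude plug-in version of this idea for the Mahalanobis depth: estimate the depth empirically, form the depth-based lower level set (the non-central region), and average a risk factor over that set. The bivariate log-normal model and the threshold alpha are illustrative assumptions, and the sketch ignores the asymptotic refinements studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=50_000)
X = np.exp(Z)                         # skewed, dependent risk factors

# Empirical Mahalanobis depth: D(x) = 1 / (1 + (x - mu)' Sigma^{-1} (x - mu))
mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
Sinv = np.linalg.inv(Sigma)
c = X - mu
depth = 1.0 / (1.0 + np.einsum("ij,jk,ik->i", c, Sinv, c))

# Depth-based lower level set {D(x) <= alpha}: the non-central "risky" region.
# Plug-in conditional tail expectation of the first risk factor on that set.
alpha = 0.05
in_tail = depth <= alpha
print("P(depth <= alpha) ~", in_tail.mean())
print("plug-in CCTE of X1:", X[in_tail, 0].mean())
```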
Network data are increasingly available in various research fields, motivating statistical analysis for populations of networks where a network as a whole is viewed as a data point. Due to the non-Euclidean nature of networks, basic statistical tools available for scalar and vector data are no longer applicable when one aims to relate networks as outcomes to Euclidean covariates, while the study of how a network changes in dependence on covariates is often of paramount interest. This motivates extending the notion of regression to the case of responses that are network data. Here we propose to adopt conditional Fr\'{e}chet means implemented with both global least squares regression and local weighted least squares smoothing, extending the Fr\'{e}chet regression concept to networks that are quantified by their graph Laplacians. The challenge is to characterize the space of graph Laplacians so as to justify the application of Fr\'{e}chet regression. This characterization then leads to asymptotic rates of convergence for the corresponding M-estimators by applying empirical process methods. We demonstrate the usefulness and good practical performance of the proposed framework with simulations and with network data arising from NYC taxi records, as well as resting-state fMRI in neuroimaging.
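A highly simplified sketch of global Fréchet regression for graph Laplacians under the Frobenius metric is given below: it computes the usual global Fréchet regression weights, averages the observed Laplacians with those weights, and applies a crude projection back onto Laplacian-like matrices (clip off-diagonals to be non-positive and reset the diagonal so rows sum to zero). The toy data-generating mechanism and the projection step are assumptions made for illustration and do not reproduce the constrained characterization developed in the paper.

```python
import numpy as np

def frechet_weights(Xc, x0):
    """Global Fréchet regression weights s_i(x0) = 1 + (x0 - xbar)' Sigma^{-1} (X_i - xbar)."""
    xbar = Xc.mean(axis=0)
    Sinv = np.linalg.inv(np.atleast_2d(np.cov(Xc, rowvar=False)))
    return 1.0 + (Xc - xbar) @ Sinv @ np.atleast_1d(x0 - xbar)

def project_to_laplacian(M):
    """Crude projection onto Laplacian-like matrices: symmetrize, clip
    off-diagonals to be non-positive, reset diagonal so rows sum to zero."""
    L = (M + M.T) / 2
    off = np.minimum(L - np.diag(np.diag(L)), 0.0)
    return off - np.diag(off.sum(axis=1))

# toy data: Laplacians of weighted graphs whose edge weights grow with a scalar covariate
rng = np.random.default_rng(3)
n_nodes, n_obs = 5, 200
X = rng.uniform(0, 1, size=(n_obs, 1))
Ls = []
for x in X[:, 0]:
    W = np.triu(rng.uniform(0, 1, (n_nodes, n_nodes)) * x, 1)
    W = W + W.T
    Ls.append(np.diag(W.sum(axis=1)) - W)
Ls = np.stack(Ls)

w = frechet_weights(X, np.array([0.8]))
L_hat = project_to_laplacian(np.tensordot(w, Ls, axes=1) / w.sum())
print(L_hat)
```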
Most existing statistical theories on deep neural networks have sample complexities cursed by the data dimension and therefore cannot adequately explain the empirical success of deep learning on high-dimensional data. To bridge this gap, we propose to exploit the low-dimensional geometric structures of real-world data sets. We establish theoretical guarantees for convolutional residual networks (ConvResNets) in terms of function approximation and statistical estimation for binary classification. Specifically, given data lying on a $d$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, we prove that if the network architecture is properly chosen, ConvResNets can (1) approximate Besov functions on manifolds with arbitrary accuracy, and (2) learn a classifier by minimizing the empirical logistic risk, achieving an excess risk of the order $n^{-\frac{s}{2s+2(s\vee d)}}$, where $s$ is a smoothness parameter. This implies that the sample complexity depends on the intrinsic dimension $d$, instead of the data dimension $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.
Heatmap-based methods dominate the field of human pose estimation by modelling the output distribution through likelihood heatmaps. In contrast, regression-based methods are more efficient but suffer from inferior performance. In this work, we explore maximum likelihood estimation (MLE) to develop an efficient and effective regression-based method. From the perspective of MLE, adopting different regression losses amounts to making different assumptions about the output density function. A density function closer to the true distribution leads to better regression performance. In light of this, we propose a novel regression paradigm with Residual Log-likelihood Estimation (RLE) to capture the underlying output distribution. Concretely, RLE learns the change of the distribution instead of the unreferenced underlying distribution to facilitate the training process. With the proposed reparameterization design, our method is compatible with off-the-shelf flow models. The proposed method is effective, efficient and flexible. We show its potential in various human pose estimation tasks with comprehensive experiments. Compared to the conventional regression paradigm, regression with RLE brings a 12.4 mAP improvement on MSCOCO without any test-time overhead. Moreover, for the first time, our regression-based method outperforms heatmap-based methods, especially on multi-person pose estimation. Our code is available at //github.com/Jeff-sjtu/res-loglikelihood-regression
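To make the likelihood view of regression concrete, the PyTorch sketch below implements a head that predicts a per-coordinate mean and scale and minimizes the negative log-likelihood of the standardized residual under a simple Laplace base density; the learned residual term supplied by a flow model in RLE is deliberately omitted, and the dimensions and class name are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class LaplaceNLLHead(nn.Module):
    """Regression head predicting a mean and a scale per output dimension.

    Minimal sketch of the likelihood view of regression that RLE builds on:
    the loss is -log q((target - mu) / sigma) + log sigma for a simple base
    density q (Laplace here); RLE's flow-learned residual term is omitted.
    """
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, out_dim)
        self.log_sigma = nn.Linear(in_dim, out_dim)

    def forward(self, feats, target):
        mu = self.mu(feats)
        log_sigma = self.log_sigma(feats)
        z = (target - mu) / log_sigma.exp()   # standardized residual
        nll = z.abs() + log_sigma             # -log Laplace(z) + log sigma (constants dropped)
        return nll.mean()

# toy usage with illustrative dimensions
head = LaplaceNLLHead(in_dim=256, out_dim=2)   # e.g. (x, y) joint coordinates
feats = torch.randn(8, 256)
target = torch.randn(8, 2)
loss = head(feats, target)
loss.backward()
```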