Modern statistical analyses often encounter datasets that are both massive in size and heavy-tailed. For such massive datasets, traditional estimation methods can hardly be applied directly to estimate the extreme value index. To address this issue, we propose a subsampling-based method. Specifically, multiple subsamples are drawn from the full dataset by simple random subsampling with replacement. Based on each subsample, an approximate maximum likelihood estimator is computed. The resulting estimators are then averaged to form a more accurate one. Under appropriate regularity conditions, we show theoretically that the proposed estimator is consistent and asymptotically normal. With the help of the estimated extreme value index, high-level quantiles and tail probabilities of a heavy-tailed random variable can be estimated consistently. Extensive simulation experiments demonstrate the promising performance of our method. A real data analysis is also presented for illustration.
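As a rough illustration of the subsampling idea, the sketch below averages Hill estimates (the approximate maximum likelihood estimator of the extreme value index under a Pareto-type tail) over simple random subsamples drawn with replacement; the subsample size, the number of subsamples and the number k of upper order statistics are illustrative choices, and whether the paper uses the Hill estimator specifically is an assumption.

import numpy as np

def hill_estimator(x, k):
    """Hill estimate of the extreme value index from the k largest observations."""
    x = np.sort(x)[::-1]                              # descending order
    return np.mean(np.log(x[:k]) - np.log(x[k]))

def subsampled_tail_index(data, n_sub=200, sub_size=2000, k=100, seed=None):
    """Average Hill estimates over simple random subsamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    estimates = [hill_estimator(rng.choice(data, size=sub_size, replace=True), k)
                 for _ in range(n_sub)]
    return np.mean(estimates)

# Toy usage: exact Pareto data with extreme value index 0.5.
data = np.random.default_rng(0).pareto(2.0, size=10**6) + 1.0
print(subsampled_tail_index(data, seed=1))            # should be close to 0.5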
Consider a $p$-dimensional population ${\mathbf x} \in\mathbb{R}^p$ with iid coordinates in the domain of attraction of a stable distribution with index $\alpha\in (0,2)$. Since the variance of ${\mathbf x}$ is infinite, the sample covariance matrix ${\mathbf S}_n=n^{-1}\sum_{i=1}^n {{\mathbf x}_i}{\mathbf x}'_i$ based on a sample ${\mathbf x}_1,\ldots,{\mathbf x}_n$ from the population is not well behaved, and it is of interest to use instead the sample correlation matrix ${\mathbf R}_n= \{\operatorname{diag}({\mathbf S}_n)\}^{-1/2}\, {\mathbf S}_n \{\operatorname{diag}({\mathbf S}_n)\}^{-1/2}$. This paper finds the limiting distributions of the eigenvalues of ${\mathbf R}_n$ when both the dimension $p$ and the sample size $n$ grow to infinity such that $p/n\to \gamma \in (0,\infty)$. The family of limiting distributions $\{H_{\alpha,\gamma}\}$ is new and depends on the two parameters $\alpha$ and $\gamma$. The moments of $H_{\alpha,\gamma}$ are fully identified as the sum of two contributions: the first coming from the classical Mar\v{c}enko-Pastur law and the second due to heavy tails. Moreover, the family $\{H_{\alpha,\gamma}\}$ has continuous extensions at the boundaries $\alpha=2$ and $\alpha=0$, leading to the Mar\v{c}enko-Pastur law and a modified Poisson distribution, respectively. Our proofs use the method of moments, the path-shortening algorithm developed in [18], and some novel graph-counting combinatorics. As a consequence, the moments of $H_{\alpha,\gamma}$ are expressed in terms of combinatorial objects such as Stirling numbers of the second kind. A simulation study on these limiting distributions $H_{\alpha,\gamma}$ is also provided for comparison with the Mar\v{c}enko-Pastur law.
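The following sketch (not taken from the paper) simulates the eigenvalues of ${\mathbf R}_n$ for iid symmetric $\alpha$-stable coordinates with $p/n$ close to a fixed $\gamma$, so that the resulting histogram can be compared with the Mar\v{c}enko-Pastur law; the sample sizes and the value of $\alpha$ are illustrative.

import numpy as np
from scipy.stats import levy_stable

def correlation_spectrum(n=1000, gamma=0.5, alpha=1.5, seed=0):
    """Eigenvalues of the sample correlation matrix R_n for iid alpha-stable coordinates."""
    p = int(gamma * n)
    X = levy_stable.rvs(alpha, 0.0, size=(p, n), random_state=seed)   # symmetric, infinite variance
    S = X @ X.T / n                                   # sample covariance S_n
    d = 1.0 / np.sqrt(np.diag(S))
    R = d[:, None] * S * d[None, :]                   # sample correlation R_n
    return np.linalg.eigvalsh(R)

eigs = correlation_spectrum()
print(eigs.min(), eigs.max())   # the histogram of eigs can be compared with the Marcenko-Pastur law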
In this work, we study the problem of estimating the 4 degree-of-freedom (3D position and heading) robot-to-robot relative frame transformation using onboard odometry and inter-robot distance measurements. First, we present a theoretical analysis of the problem, namely the derivation and interpretation of the Cram\'er-Rao lower bound (CRLB), the Fisher information matrix (FIM) and its determinant. Second, we propose optimization-based methods to solve the problem, including a quadratically constrained quadratic programming (QCQP) formulation and the corresponding semidefinite programming (SDP) relaxation. Moreover, we address practical issues that are ignored in previous works, such as accounting for spatio-temporal offsets between the ultra-wideband (UWB) and odometry sensors, rejecting UWB outliers, and checking for singular configurations before commencing operation. Lastly, extensive simulations and real-life experiments with aerial robots show that the proposed QCQP and SDP methods outperform state-of-the-art methods, especially under geometrically poor configurations or large measurement noise. In general, the QCQP method provides the best results at the expense of computation time, while the SDP method runs much faster and is sufficiently accurate in most cases.
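For intuition about the measurement model, the sketch below estimates the 4-DOF transform from odometry positions and range measurements by plain nonlinear least squares; this is a generic baseline, not the proposed QCQP or SDP method, and the variable names and residual form are illustrative assumptions.

import numpy as np
from scipy.optimize import least_squares

def yaw_rotation(psi):
    """Rotation about the vertical axis by yaw angle psi."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def residuals(theta, p, q, d):
    """theta = (tx, ty, tz, yaw); p, q: (N, 3) odometry positions; d: (N,) range measurements."""
    t, psi = theta[:3], theta[3]
    q_in_p = q @ yaw_rotation(psi).T + t              # robot 2 positions expressed in robot 1's frame
    return np.linalg.norm(p - q_in_p, axis=1) - d

def estimate_transform(p, q, d, theta0=np.zeros(4)):
    """Least-squares estimate of the 4-DOF relative frame transformation."""
    return least_squares(residuals, theta0, args=(p, q, d)).x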
Recently, high-dimensional vector autoregressive (VAR) models have attracted a lot of interest, due to novel applications in the health, engineering and social sciences. The presence of temporal dependence poses additional challenges to the theory of penalized estimation techniques widely used in the analysis of their iid counterparts. However, recent work (e.g., [Basu and Michailidis, 2015, Kock and Callot, 2015]) has established consistency at optimal rates for $\ell_1$-LASSO regularized estimates applied to models involving high-dimensional stable Gaussian processes. The only price paid for temporal dependence is an extra multiplicative factor that equals 1 for independent and identically distributed (iid) data. Further, [Wong et al., 2020] extended these results to heavy-tailed VARs that exhibit "$\beta$-mixing" dependence, but the rates are sub-optimal, while the extra factor is intractable. This paper improves these results in two important directions: (i) we establish optimal consistency rates and corresponding finite-sample bounds for the underlying model parameters that match those for iid data, modulo a multiplicative price for temporal dependence that is easy to interpret and equals 1 for iid data; (ii) we incorporate more general penalties in estimation (which, unlike the $\ell_1$ norm, need not be decomposable) to induce general sparsity patterns. The key technical tool is a novel, easy-to-use concentration bound for heavy-tailed linear processes that does not rely on "mixing" notions and gives tighter bounds.
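As a minimal illustration of $\ell_1$-regularized VAR estimation (not the paper's more general penalties), the sketch below fits a VAR(1) by row-wise lasso regressions on the lagged design matrix; the tuning parameter and the toy data-generating process are illustrative.

import numpy as np
from sklearn.linear_model import Lasso

def lasso_var1(X, lam=0.1):
    """X: (T, p) series; returns the (p, p) transition matrix estimated by row-wise lasso."""
    Y, Z = X[1:], X[:-1]                              # regress x_t on x_{t-1}
    p = X.shape[1]
    A_hat = np.zeros((p, p))
    for j in range(p):
        A_hat[j] = Lasso(alpha=lam, fit_intercept=False).fit(Z, Y[:, j]).coef_
    return A_hat

# Toy usage: a sparse, stable VAR(1) with Gaussian innovations.
rng = np.random.default_rng(0)
p, T = 20, 500
A = 0.4 * np.eye(p); A[0, 1] = 0.3
X = np.zeros((T, p))
for t in range(1, T):
    X[t] = X[t - 1] @ A.T + rng.standard_normal(p)
print(np.round(lasso_var1(X, lam=0.05)[0, :3], 2))    # recovers the sparse pattern up to shrinkage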
In data science, vector autoregression (VAR) models are popular for modeling multivariate time series in the environmental sciences and other applications. However, these models are computationally demanding, with the number of parameters scaling quadratically with the number of component time series. In this work, we propose a so-called neighborhood vector autoregression (NVAR) model to efficiently analyze large-dimensional multivariate time series. We assume that the component time series have underlying neighborhood relationships, e.g., spatial or network-based, arising from the inherent setting of the problem. When this neighborhood information is available or can be summarized through a distance matrix, we demonstrate that the proposed NVAR method provides a computationally efficient and theoretically sound estimation of the model parameters. The performance of the proposed method is compared with existing approaches in both simulation studies and a real application to a stream nitrogen study.
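A hedged sketch of the neighborhood idea is given below: each series is regressed only on the lagged values of series within a distance threshold, so the number of coefficients grows with the neighborhood size rather than quadratically in the dimension. The exact NVAR parameterization used in the paper may differ; this restriction-by-distance form is an assumption.

import numpy as np

def neighborhood_var1(X, D, radius):
    """X: (T, p) series; D: (p, p) distance matrix; returns a neighborhood-sparse (p, p) coefficient matrix."""
    T, p = X.shape
    Y, Z = X[1:], X[:-1]
    A_hat = np.zeros((p, p))
    for j in range(p):
        nb = np.where(D[j] <= radius)[0]              # neighbors of series j (including itself)
        coef, *_ = np.linalg.lstsq(Z[:, nb], Y[:, j], rcond=None)
        A_hat[j, nb] = coef
    return A_hat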
The $h$-index is a metric used to measure the impact of a user in a publication setting, such as a member of a social network with many highly liked posts or a researcher in an academic domain with many highly cited publications. Specifically, the $h$-index of a user is the largest integer $h$ such that at least $h$ publications of the user have at least $h$ units of positive feedback. We design an algorithm that, given query access to the $n$ publications of a user and each publication's corresponding positive feedback number, outputs a $(1\pm \varepsilon)$-approximation of the $h$-index of this user with probability at least $1-\delta$ in time \[ O(\frac{n \cdot \ln{(1/\delta)}}{\varepsilon^2 \cdot h}), \] where $h$ is the actual $h$-index, which is unknown to the algorithm a priori. We then design a novel lower bound technique that allows us to prove that this bound is in fact asymptotically optimal for this problem in all parameters $n,h,\varepsilon,$ and $\delta$. Our work is among the first on sublinear-time algorithms to address obtaining asymptotically optimal bounds, especially in terms of the error and confidence parameters. As such, we focus on designing novel techniques for this task. In particular, our lower bound technique seems quite general -- to showcase this, we also use our approach to prove an asymptotically optimal lower bound for the problem of estimating the number of triangles in a graph in sublinear time, which is now also optimal in the error and confidence parameters. This result improves upon prior lower bounds of Eden, Levi, Ron, and Seshadhri (FOCS'15) for this problem, as well as multiple follow-ups that extended this lower bound to other subgraph counting problems.
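For intuition, the sketch below estimates the $h$-index by combining uniform sampling with a binary search over candidate thresholds; this simple baseline is not the asymptotically optimal algorithm analysed in the paper, and the per-threshold sample size is an illustrative choice.

import numpy as np

def approx_h_index(citations, n, m=2000, seed=None):
    """citations(i) returns the positive-feedback count of publication i; m queries per threshold tested."""
    rng = np.random.default_rng(seed)
    def est_count_at_least(t):
        idx = rng.integers(0, n, size=m)              # uniform queries with replacement
        return n * np.mean([citations(i) >= t for i in idx])
    lo, hi = 0, n                                     # search for the largest t with estimated count >= t
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if est_count_at_least(mid) >= mid:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Toy usage on a synthetic citation profile.
cites = np.random.default_rng(1).geometric(0.02, size=10000)
print(approx_h_index(lambda i: cites[i], len(cites), seed=2))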
A goodness-of-fit test for one-parameter count distributions with finite second moment is proposed. The test statistic is derived from the $L_1$-distance between a function of the probability generating function of the model under the null hypothesis and that of the random variable actually generating the data, when the latter belongs to a suitably wide class of alternatives. The test statistic has a rather simple form and is asymptotically normally distributed under the null hypothesis, allowing a straightforward implementation of the test. Moreover, the test is consistent not only for alternative distributions belonging to the class, but also for all alternative distributions whose probability of zero differs from that under the null hypothesis. Thus, the use of the test is also proposed and investigated for alternatives outside the class. The finite-sample properties of the test are assessed by means of an extensive simulation study.
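Purely for intuition, the sketch below compares the empirical probability generating function with the null one through an $L_1$-type distance on $[0,1]$ and calibrates it by a parametric bootstrap; the paper's statistic is a specific function of the pgf with an asymptotically normal null distribution, so this sketch is only a loose stand-in.

import numpy as np

def empirical_pgf(x, u):
    """Estimate of E[u^X] on a grid u from the sample x."""
    return np.mean(np.power.outer(u, x), axis=1)

def l1_pgf_distance(x, null_pgf, u):
    return np.mean(np.abs(empirical_pgf(x, u) - null_pgf(u)))   # Riemann approximation on [0, 1]

def pgf_gof_pvalue(x, null_pgf, null_sampler, n_boot=500, seed=0):
    """Parametric-bootstrap p-value for the L1 distance between empirical and null pgfs."""
    rng = np.random.default_rng(seed)
    u = np.linspace(0.0, 1.0, 101)
    t_obs = l1_pgf_distance(x, null_pgf, u)
    t_boot = [l1_pgf_distance(null_sampler(len(x), rng), null_pgf, u) for _ in range(n_boot)]
    return float(np.mean(np.array(t_boot) >= t_obs))

# Toy usage: testing a Poisson(2) null on Poisson(2) data.
x = np.random.default_rng(1).poisson(2.0, size=300)
print(pgf_gof_pvalue(x, lambda u: np.exp(2.0 * (u - 1.0)), lambda n, r: r.poisson(2.0, size=n)))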
Interval-censored multi-state data arise in many studies of chronic diseases, where the health status of a subject can be characterized by a finite number of disease states and the transition between any two states is only known to occur over a broad time interval. We formulate the effects of potentially time-dependent covariates on multi-state processes through semiparametric proportional intensity models with random effects. We adopt nonparametric maximum likelihood estimation (NPMLE) under general interval censoring and develop a stable expectation-maximization (EM) algorithm. We show that the resulting parameter estimators are consistent and that the finite-dimensional components are asymptotically normal with a covariance matrix that attains the semiparametric efficiency bound and can be consistently estimated through profile likelihood. In addition, we demonstrate through extensive simulation studies that the proposed numerical and inferential procedures perform well in realistic settings. Finally, we provide an application to a major epidemiologic cohort study.
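To illustrate the E-step/M-step flavor for interval-censored transition times, here is a toy EM for a single transition with constant intensity, no covariates and no random effects; it is far simpler than the semiparametric proportional intensity model with random effects studied in the paper.

import numpy as np

def em_exponential_interval_censored(L, R, n_iter=200, lam=1.0):
    """L, R: arrays with each transition known to occur in the interval (L_i, R_i]."""
    for _ in range(n_iter):
        # E-step: E[T | L < T <= R] under an Exp(lam) transition time
        num = (L + 1.0 / lam) * np.exp(-lam * L) - (R + 1.0 / lam) * np.exp(-lam * R)
        den = np.exp(-lam * L) - np.exp(-lam * R)
        expected_T = num / den
        # M-step: complete-data MLE of the exponential rate
        lam = len(L) / np.sum(expected_T)
    return lam

# Toy usage: true rate 0.5, transition times observed only on a unit-width grid.
T = np.random.default_rng(0).exponential(scale=2.0, size=5000)
L = np.floor(T); R = L + 1.0
print(em_exponential_interval_censored(L, R))         # close to 0.5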
In this paper, we investigate the matrix estimation problem in the multi-response regression model with measurement errors. A nonconvex error-corrected estimator, based on a combination of the amended loss function and the nuclear norm regularizer, is proposed to estimate the matrix parameter. Under a (near) low-rank assumption, we analyse the statistical and computational properties of global solutions of the nonconvex regularized estimator from a general point of view. On the statistical side, we establish a nonasymptotic recovery bound for any global solution of the nonconvex estimator under restricted strong convexity of the loss function. On the computational side, we solve the nonconvex optimization problem via the proximal gradient method. The algorithm is proved to converge to a near-global solution at a linear rate. In addition, we verify sufficient conditions for the general results to hold, in order to obtain probabilistic consequences for specific types of measurement errors, including additive noise and missing data. Finally, the theoretical consequences are demonstrated by several numerical experiments on corrupted errors-in-variables multi-response regression models. Simulation results show excellent agreement with our theory under high-dimensional scaling.
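A sketch of the proximal gradient iteration for nuclear-norm-regularized problems is given below: a gradient step on a quadratic loss followed by singular-value soft-thresholding. The plain least-squares loss stands in for the paper's amended (error-corrected) loss and is an assumption; the step size is set to the reciprocal of the gradient's Lipschitz constant.

import numpy as np

def svd_soft_threshold(M, tau):
    """Proximal operator of tau times the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def proximal_gradient(X, Y, lam, n_iter=500):
    """Minimize (1/2n)||Y - X B||_F^2 + lam ||B||_* by proximal gradient descent."""
    n = X.shape[0]
    step = n / np.linalg.norm(X.T @ X, 2)             # reciprocal of the gradient's Lipschitz constant
    B = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y) / n
        B = svd_soft_threshold(B - step * grad, step * lam)
    return B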
Clustering is an important exploratory data analysis technique for grouping objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a smaller number of groups. In Euclidean space, the centroid-based and distance-based formulations of $K$-means are equivalent. In modern machine learning applications, data often arise as probability distributions, and a natural generalization to handle measure-valued data is to use the optimal transport metric. Due to the non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. These peculiar behaviors of Wasserstein barycenters may make the centroid-based formulation fail to represent the within-cluster data points, while the more direct distance-based $K$-means approach and its semidefinite programming (SDP) relaxation are capable of recovering the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP-relaxed Wasserstein $K$-means achieves exact recovery provided that the clusters are well separated under the $2$-Wasserstein metric. Our simulation and real data examples also demonstrate that distance-based $K$-means can achieve better classification performance than the standard centroid-based $K$-means for clustering probability distributions and images.
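For illustration, the sketch below computes the squared $2$-Wasserstein distance between Gaussians via the Bures formula and feeds the resulting squared distance matrix into a Peng-Wei-type SDP relaxation of distance-based $K$-means; whether this matches the paper's exact SDP formulation is an assumption, and cvxpy with its default SDP solver is used only for small illustrative instances.

import numpy as np
from scipy.linalg import sqrtm
import cvxpy as cp

def w2_gaussian_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2) (Bures formula)."""
    cross = sqrtm(sqrtm(S2) @ S1 @ sqrtm(S2))
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * np.real(cross)))

def sdp_kmeans(D2, K):
    """Peng-Wei-type SDP relaxation of K-means on a squared distance matrix D2 (small n only)."""
    n = D2.shape[0]
    Z = cp.Variable((n, n), PSD=True)
    constraints = [Z >= 0, cp.sum(Z, axis=1) == 1, cp.trace(Z) == K]
    cp.Problem(cp.Minimize(cp.sum(cp.multiply(D2, Z))), constraints).solve()
    return Z.value        # cluster labels can be read off by rounding or spectral clustering of Z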
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model's memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies different from that of the original teacher model. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT_BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques.
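A minimal sketch of the shared-projection component is shown below (PyTorch): teacher hidden states are projected down to the student width and matched layer-wise with an MSE loss. The dimensions, the plain MSE objective and the layer alignment are illustrative assumptions; the full recipe in the paper also includes dual training of teacher and student over the student vocabulary.

import torch
import torch.nn as nn

class SharedProjectionLoss(nn.Module):
    """Project teacher hidden states to the student width and match them layer-wise."""
    def __init__(self, teacher_dim=768, student_dim=192):
        super().__init__()
        self.proj = nn.Linear(teacher_dim, student_dim, bias=False)   # single projection shared across layers

    def forward(self, teacher_layers, student_layers):
        losses = [nn.functional.mse_loss(self.proj(t), s)
                  for t, s in zip(teacher_layers, student_layers)]
        return torch.stack(losses).mean()

# Toy usage with random hidden states for 4 aligned layers (batch, seq_len, dim).
teacher = [torch.randn(8, 16, 768) for _ in range(4)]
student = [torch.randn(8, 16, 192) for _ in range(4)]
print(SharedProjectionLoss()(teacher, student).item())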