In this paper, we establish minimax optimal rates of convergence for prediction in a semi-functional linear model that consists of a functional component and a less smooth nonparametric component. Our results reveal that the smoother functional component can be learned with the minimax rate as if the nonparametric component were known. More specifically, a double-penalized least squares method is adopted to estimate both the functional and nonparametric components within the framework of reproducing kernel Hilbert spaces. By virtue of the representer theorem, an efficient algorithm that requires no iterations is proposed to solve the corresponding optimization problem, where the regularization parameters are selected by the generalized cross validation criterion. Numerical studies are provided to demonstrate the effectiveness of the method and to verify the theoretical analysis.
Let $X$ be a random variable with unknown mean and finite variance. We present a new estimator of the mean of $X$ that is robust with respect to the possible presence of outliers in the sample, provides tight sub-Gaussian deviation guarantees without any additional assumptions on the shape or tails of the distribution, and moreover is asymptotically efficient. This is the first estimator that provably combines all these qualities in one package. Our construction is inspired by robustness properties possessed by the self-normalized sums. Theoretical findings are supplemented by numerical simulations highlighting strong performance of the proposed estimator in comparison with previously known techniques.
We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates. We offer a finite sample guarantee on the estimation error of our method and discuss certain advantages it has over Expectation-Maximization (EM), which is the main approach proposed by prior work.
The cost of both generalized least squares (GLS) and Gibbs sampling in a crossed random effects model can easily grow faster than $N^{3/2}$ for $N$ observations. Ghosh et al. (2020) develop a backfitting algorithm that reduces the cost to $O(N)$. Here we extend that method to a generalized linear mixed model for logistic regression. We use backfitting within an iteratively reweighted penalized least square algorithm. The specific approach is a version of penalized quasi-likelihood due to Schall (1991). A straightforward version of Schall's algorithm would also cost more than $N^{3/2}$ because it requires the trace of the inverse of a large matrix. We approximate that quantity at cost $O(N)$ and prove that this substitution makes an asymptotically negligible difference. Our backfitting algorithm also collapses the fixed effect with one random effect at a time in a way that is analogous to the collapsed Gibbs sampler of Papaspiliopoulos et al. (2020). We use a symmetric operator that facilitates efficient covariance computation. We illustrate our method on a real dataset from Stitch Fix. By properly accounting for crossed random effects we show that a naive logistic regression could underestimate sampling variances by several hundred fold.
This study concerns probability distribution estimation of sample maximum. The traditional approach is the parametric fitting to the limiting distribution - the generalized extreme value distribution; however, the model in finite cases is misspecified to a certain extent. We propose a plug-in type of the kernel distribution estimator which does not need model specification. It is proved that both asymptotic convergence rates depend on the tail index and the second order parameter. As the tail gets light, the degree of misspecification of the parametric fitting becomes large, that means the convergence rate becomes slow. In the Weibull cases, which can be seen as the limit of tail-lightness, only the nonparametric distribution estimator keeps its consistency. Finally, we report results of numerical experiments and two real case studies.
This work studies an experimental design problem where $x$'s are to be selected with the goal of estimating a function $m(x)$, which is observed with noise. A linear model is fitted to $m(x)$ but it is not assumed that the model is correctly specified. It follows that the quantity of interest is the best linear approximation of $m(x)$, which is denoted by $\ell(x)$. It is shown that in this framework the ordinary least squares estimator typically leads to an inconsistent estimation of $\ell(x)$, and rather weighted least squares should be considered. An asymptotic minimax criterion is formulated for this estimator, and a design that minimizes the criterion is constructed. An important feature of this problem is that the $x$'s should be random, rather than fixed. Otherwise, the minimax risk is infinite. It is shown that the optimal random minimax design is different from its deterministic counterpart, which was studied previously, and a simulation study indicates that it generally performs better when $m(x)$ is a quadratic or a cubic function. Another finding is that when the variance of the noise goes to infinity, the random and deterministic minimax designs coincide. The results are illustrated for polynomial regression models and different generalizations are presented.
Modern-day problems in statistics often face the challenge of exploring and analyzing complex non-Euclidean object data that do not conform to vector space structures or operations. Examples of such data objects include covariance matrices, graph Laplacians of networks, and univariate probability distribution functions. In the current contribution a new concurrent regression model is proposed to characterize the time-varying relation between an object in a general metric space (as a response) and a vector in $\reals^p$ (as a predictor), where concepts from Fr\'echet regression is employed. Concurrent regression has been a well-developed area of research for Euclidean predictors and responses, with many important applications for longitudinal studies and functional data. However, there is no such model available so far for general object data as responses. We develop generalized versions of both global least squares regression and locally weighted least squares smoothing in the context of concurrent regression for responses that are situated in general metric spaces and propose estimators that can accommodate sparse and/or irregular designs. Consistency results are demonstrated for sample estimates of appropriate population targets along with the corresponding rates of convergence. The proposed models are illustrated with human mortality data and resting state functional Magnetic Resonance Imaging data (fMRI) as responses.
For supervised classification problems, this paper considers estimating the query's label probability through local regression using observed covariates. Well-known nonparametric kernel smoother and $k$-nearest neighbor ($k$-NN) estimator, which take label average over a ball around the query, are consistent but asymptotically biased particularly for a large radius of the ball. To eradicate such bias, local polynomial regression (LPoR) and multiscale $k$-NN (MS-$k$-NN) learn the bias term by local regression around the query and extrapolate it to the query itself. However, their theoretical optimality has been shown for the limit of the infinite number of training samples. For correcting the asymptotic bias with fewer observations, this paper proposes a local radial regression (LRR) and its logistic regression variant called local radial logistic regression (LRLR), by combining the advantages of LPoR and MS-$k$-NN. The idea is simple: we fit the local regression to observed labels by taking the radial distance as the explanatory variable and then extrapolate the estimated label probability to zero distance. Our numerical experiments, including real-world datasets of daily stock indices, demonstrate that LRLR outperforms LPoR and MS-$k$-NN.
Counterfactual explanations are usually generated through heuristics that are sensitive to the search's initial conditions. The absence of guarantees of performance and robustness hinders trustworthiness. In this paper, we take a disciplined approach towards counterfactual explanations for tree ensembles. We advocate for a model-based search aiming at "optimal" explanations and propose efficient mixed-integer programming approaches. We show that isolation forests can be modeled within our framework to focus the search on plausible explanations with a low outlier score. We provide comprehensive coverage of additional constraints that model important objectives, heterogeneous data types, structural constraints on the feature space, along with resource and actionability restrictions. Our experimental analyses demonstrate that the proposed search approach requires a computational effort that is orders of magnitude smaller than previous mathematical programming algorithms. It scales up to large data sets and tree ensembles, where it provides, within seconds, systematic explanations grounded on well-defined models solved to optimality.
Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.