Partial least squares (PLS) is a dimensionality reduction technique used as an alternative to ordinary least squares (OLS) when the data are collinear or high-dimensional. Both PLS and OLS provide mean-based estimates, which are extremely sensitive to outliers or heavy-tailed distributions. In contrast, quantile regression is an alternative to OLS that computes robust quantile-based estimates. In this work, multivariate PLS is extended to the quantile regression framework, yielding a theoretical formulation of the problem and a robust dimensionality reduction technique, which we call fast partial quantile regression (fPQR), that provides quantile-based estimates. An efficient implementation of fPQR is also derived, and its performance is studied through simulation experiments and the well-known biscuit dough dataset from chemometrics, a real high-dimensional example.
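As a rough illustration only (not the fPQR algorithm itself), the following Python sketch chains ordinary PLS dimension reduction with a quantile regression on the resulting scores; the simulated data, the number of components, and the 0.9 quantile are arbitrary assumptions.

```python
# Naive two-step baseline (NOT the fPQR algorithm): reduce X with ordinary PLS,
# then run quantile regression on the resulting low-dimensional scores.
import numpy as np
import statsmodels.api as sm
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, :3].sum(axis=1) + rng.standard_t(df=3, size=n)  # heavy-tailed noise

pls = PLSRegression(n_components=3).fit(X, y)
T = pls.transform(X)                        # PLS scores (dimension reduction step)

tau = 0.9
fit = sm.QuantReg(y, sm.add_constant(T)).fit(q=tau)
print(fit.params)                           # quantile-based coefficients on the scores
```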
Motivated by investigating the relationship between progesterone and the days in a menstrual cycle in a longitudinal study, we propose a multi-kink quantile regression model for longitudinal data analysis. It relaxes the linearity condition and assumes different regression forms in different regions of the domain of the threshold covariate. In this paper, we first propose a multi-kink quantile regression for longitudinal data. Two estimation procedures are proposed to estimate the regression coefficients and the kink point locations: one is a computationally efficient profile estimator under the working-independence framework, while the other accounts for within-subject correlations through an unbiased generalized estimating equation approach. The selection consistency of the number of kink points and the asymptotic normality of the two proposed estimators are established. Second, we construct a rank score test based on partial subgradients for the existence of a kink effect in longitudinal studies. Both the null distribution and the local alternative distribution of the test statistic are derived. Simulation studies show that the proposed methods have excellent finite-sample performance. In the application to the longitudinal progesterone data, we identify two kink points in the progesterone curves over different quantiles and observe that the progesterone level remains stable before the day of ovulation, increases rapidly for five to six days after ovulation, and then stabilizes again or even drops slightly.
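A minimal sketch of the kink idea, under strong simplifying assumptions: a single kink, independent errors, and a plain grid search over the kink location rather than the paper's profile or GEE estimators.

```python
# Single-kink (continuous piecewise-linear) quantile regression via grid search.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.2 * x + 1.5 * np.maximum(x - 6.0, 0.0) + rng.normal(size=n)

tau = 0.5

def check_loss(u, tau):
    """Pinball (check) loss of residuals u at quantile level tau."""
    return np.mean(u * (tau - (u < 0)))

best = None
for t in np.linspace(1, 9, 81):            # candidate kink locations
    Z = sm.add_constant(np.column_stack([x, np.maximum(x - t, 0.0)]))
    res = sm.QuantReg(y, Z).fit(q=tau)
    loss = check_loss(y - res.predict(Z), tau)
    if best is None or loss < best[0]:
        best = (loss, t, res.params)

print("estimated kink location:", best[1])
print("coefficients:", best[2])
```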
We introduce a procedure for conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for statistical learning. On standard examples, this bound scales as $d/n$, with $d$ the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches that reduce to the sequential problem, our bounds remove suboptimal $\log n$ factors and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by the leverage scores of the covariates, nearly matching the optimal risk in the well-specified case without conditions on the noise variance or the approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to calibration of probabilistic predictions relying on virtual samples, and can be computed by solving two logistic regressions. It achieves a non-asymptotic excess risk of $O((d + B^2R^2)/n)$, where $R$ bounds the norm of the features and $B$ that of the comparison parameter; by contrast, no within-model estimator can achieve a better rate than $\min(BR/\sqrt{n},\, d e^{BR}/n)$ in general. This provides a more practical alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly addressing a question raised by Foster et al. (2018).
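A hedged sketch of the virtual-sample construction for logistic regression: for a test point, one logistic regression is fitted with each possible virtual label appended, and the resulting in-model predictions are renormalized. The problem sizes and the near-unpenalized fit are illustrative assumptions, and the exact SMP definition may differ in detail.

```python
# Virtual-sample idea: two augmented logistic fits per test point, then renormalize.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ w))).astype(int)

def smp_predict(X, y, x_new, C=1e6):
    """Probability of class 1 at x_new via two augmented logistic regressions."""
    probs = []
    for virtual_label in (0, 1):
        Xa = np.vstack([X, x_new])
        ya = np.append(y, virtual_label)
        clf = LogisticRegression(C=C, max_iter=1000).fit(Xa, ya)  # near-unpenalized MLE
        probs.append(clf.predict_proba(x_new.reshape(1, -1))[0, virtual_label])
    return probs[1] / (probs[0] + probs[1])                        # renormalize over labels

x_new = rng.normal(size=d)
print(smp_predict(X, y, x_new))
```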
This paper provides extended results on estimating the parameter matrix of a high-dimensional regression model when the covariates or the response satisfy only weak moment conditions. We investigate the $M$-estimator of Fan et al. (Ann Stat 49(3):1239--1266, 2021) for the matrix completion model under $(1+\epsilon)$-th moment conditions and observe the corresponding phase transition phenomenon: when $\epsilon \geq 1$, the robust estimator attains the same convergence rate as in the previous literature, while for $0 < \epsilon < 1$ the rate is slower. For the high-dimensional multiple-index coefficient model, we also apply the element-wise truncation method to construct a robust estimator that handles missing and heavy-tailed data with finite fourth moments.
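A toy sketch of element-wise truncation followed by a standard estimator; the clipping levels and the least-squares plug-in below are illustrative assumptions, not the tuning analyzed in the paper.

```python
# Element-wise truncation (winsorization) of heavy-tailed data before estimation.
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 10
X = rng.standard_t(df=4.5, size=(n, p))          # heavy tails, finite fourth moment
beta = np.zeros(p); beta[:3] = 1.0
y = X @ beta + rng.standard_t(df=4.5, size=n)

tau_x = np.quantile(np.abs(X), 0.995)            # data-driven clipping levels (assumed)
tau_y = np.quantile(np.abs(y), 0.995)
Xt = np.clip(X, -tau_x, tau_x)                   # element-wise truncation
yt = np.clip(y, -tau_y, tau_y)

beta_hat = np.linalg.lstsq(Xt, yt, rcond=None)[0]
print(np.round(beta_hat, 2))
```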
Performing exact Bayesian inference for complex models is computationally intractable. Markov chain Monte Carlo (MCMC) algorithms can provide reliable approximations of the posterior distribution but are expensive for large datasets and high-dimensional models. A standard approach to mitigate this complexity consists in using subsampling techniques or distributing the data across a cluster. However, these approaches are typically unreliable in high-dimensional scenarios. We focus here on a recent alternative class of MCMC schemes exploiting a splitting strategy akin to the one used by the celebrated alternating direction method of multipliers (ADMM) optimization algorithm. These methods appear to provide empirically state-of-the-art performance, but their theoretical behavior in high dimension is currently unknown. In this paper, we propose a detailed theoretical study of one of these algorithms, known as the split Gibbs sampler. Under regularity conditions, we establish explicit convergence rates for this scheme using Ricci curvature and coupling ideas. We support our theory with numerical illustrations.
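A toy split Gibbs sampler for a Gaussian model, where the quadratic coupling makes both conditionals exactly Gaussian; the problem sizes, coupling parameter rho, and Gaussian prior are illustrative choices, not the general setting studied in the paper.

```python
# Split Gibbs: target pi_rho(x, z) ~ exp(-f(x) - g(z) - ||x - z||^2 / (2 rho^2)),
# alternately sampling x | z and z | x (both Gaussian in this toy example).
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
sigma, lam, rho = 1.0, 1.0, 0.5
y = A @ x_true + sigma * rng.normal(size=n)

Q = A.T @ A / sigma**2 + np.eye(d) / rho**2      # precision of x | z
L = np.linalg.cholesky(np.linalg.inv(Q))

x = np.zeros(d); z = np.zeros(d)
samples = []
for it in range(2000):
    # x | z ~ N(Q^{-1}(A^T y / sigma^2 + z / rho^2), Q^{-1})
    m = np.linalg.solve(Q, A.T @ y / sigma**2 + z / rho**2)
    x = m + L @ rng.normal(size=d)
    # z | x ~ N(x / (1 + lam * rho^2), (lam + 1/rho^2)^{-1} I)
    prec_z = lam + 1.0 / rho**2
    z = (x / rho**2) / prec_z + rng.normal(size=d) / np.sqrt(prec_z)
    samples.append(x.copy())

print(np.mean(samples[500:], axis=0))            # posterior mean estimate of x
```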
Stochastic differential equations projected onto manifolds occur widely in physics, chemistry, biology, engineering, nanotechnology, and optimization theory. In some problems one can use an intrinsic coordinate system on the manifold, but this is often computationally impractical, and numerical projections are preferable in many cases. We derive an algorithm to solve these equations using adiabatic elimination and a constraining potential, and we also review earlier proposed algorithms. Our hybrid midpoint projection algorithm uses a midpoint projection on a tangent manifold combined with a normal projection to satisfy the constraints. Numerical examples on spheroidal and hyperboloidal surfaces show that this greatly reduces errors compared with earlier methods that use either a hybrid Euler with tangential and normal projections or purely tangential derivative methods. Our technique can handle multiple constraints, which allows, for example, the treatment of manifolds that embody several conserved quantities. The resulting algorithm is accurate, relatively simple to implement, and efficient.
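For orientation, here is a sketch of the simpler baseline scheme mentioned above (a tangential Euler step followed by a normal projection), for Brownian motion on the unit sphere; it is not the hybrid midpoint projection method itself, and the step size and diffusion constant are arbitrary choices.

```python
# Tangential Euler step + normal projection back onto the constraint |x| = 1.
import numpy as np

rng = np.random.default_rng(5)
dt, steps, D = 1e-3, 10_000, 1.0
x = np.array([1.0, 0.0, 0.0])

for _ in range(steps):
    dW = np.sqrt(2 * D * dt) * rng.normal(size=3)
    dW_tan = dW - np.dot(dW, x) * x          # project increment onto the tangent plane
    x = x + dW_tan                           # tangential Euler step
    x = x / np.linalg.norm(x)                # normal projection back to the sphere

print(np.linalg.norm(x))                     # remains on the unit sphere
```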
Cross-validation is the standard approach for tuning parameter selection in many nonparametric regression problems. However, its use is less common in change-point regression, perhaps because its prediction-error-based criterion may appear to permit small spurious changes and hence be less well suited to estimating the number and locations of change-points. We show that, in fact, the problems of cross-validation with squared error loss are more severe and can lead to systematic under- or over-estimation of the number of change-points, and to highly suboptimal estimation of the mean function, in simple settings where changes are easily detectable. We propose two simple remedies: the first uses absolute error rather than squared error loss, and the second modifies the holdout sets used. For the latter, we provide conditions that permit consistent estimation of the number of change-points for a general change-point estimation procedure. We show these conditions are satisfied for optimal partitioning using new results on its performance when supplied with the incorrect number of change-points. Numerical experiments show that the absolute error approach in particular is competitive with common change-point methods using classical tuning parameter choices when the error distribution is well specified, but can substantially outperform them in misspecified models. An implementation of our methodology is available in the R package crossvalidationCP on CRAN.
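A toy re-implementation of the cross-validation idea with absolute error loss: fit on odd-indexed observations by a simple optimal-partitioning dynamic program, then score on even-indexed observations. This is not the crossvalidationCP package, and the simulated signal is an arbitrary assumption.

```python
# Cross-validated choice of the number of change-points, scored with absolute error.
import numpy as np

rng = np.random.default_rng(6)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100),
                    rng.normal(1, 1, 100)])

def seg_cost(csum, csum2, i, j):
    """Squared-error cost of fitting a constant mean to y[i:j]."""
    s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
    return s2 - s * s / m

def optimal_partition(y, n_cpts):
    """Segment end indices minimizing total squared error (O(K n^2) DP)."""
    n = len(y)
    csum = np.concatenate([[0.0], np.cumsum(y)])
    csum2 = np.concatenate([[0.0], np.cumsum(y ** 2)])
    K = n_cpts + 1
    C = np.full((K + 1, n + 1), np.inf)
    arg = np.zeros((K + 1, n + 1), dtype=int)
    C[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            costs = [C[k - 1, i] + seg_cost(csum, csum2, i, j) for i in range(k - 1, j)]
            best = int(np.argmin(costs))
            C[k, j] = costs[best]
            arg[k, j] = best + (k - 1)
    bkps, j = [], n
    for k in range(K, 0, -1):
        bkps.append(j); j = arg[k, j]
    return sorted(bkps)                      # includes n as the last segment end

y_train, y_test = y[::2], y[1::2]            # odd/even holdout split
for n_cpts in range(0, 5):
    bkps = optimal_partition(y_train, n_cpts)
    pred, start = np.empty_like(y_test), 0
    for end in bkps:                         # piecewise-constant holdout prediction
        pred[start:min(end, len(y_test))] = y_train[start:end].mean()
        start = end
    print(n_cpts, "change-points, holdout absolute error:",
          round(np.mean(np.abs(y_test - pred)), 3))
```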
A main difficulty in actuarial claim size modeling is that there is no simple off-the-shelf distribution that simultaneously provides a good model for both the main body and the tail of the data. In particular, covariates may have different effects for small and for large claim sizes. To cope with this problem, we introduce a deep composite regression model whose splicing point is given in terms of a quantile of the conditional claim size distribution rather than a constant. To facilitate M-estimation for such models, we introduce and characterize the class of strictly consistent scoring functions for the triplet consisting of a quantile and the lower and upper expected shortfalls beyond that quantile. In a second step, this elicitability result is applied to fit deep neural network regression models. We demonstrate the applicability of our approach and its superiority over classical approaches on a real accident insurance dataset.
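A naive two-step sketch of a composite model with a quantile splicing point (quantile fit, then separate body and tail fits). The paper instead estimates the quantile and both expected shortfalls jointly with a strictly consistent score for the whole triplet via deep networks, so this is only a crude illustration with assumed data.

```python
# Two-step proxy: fit the splicing quantile (pinball loss, the standard strictly
# consistent score for a quantile), then crude body/tail mean fits on each side.
import numpy as np
from sklearn.linear_model import QuantileRegressor, LinearRegression

rng = np.random.default_rng(7)
n = 2000
X = rng.uniform(0, 1, size=(n, 1))
claims = np.exp(1.0 + 2.0 * X[:, 0] + rng.standard_t(df=4, size=n))   # skewed "claims"

tau = 0.9
q_model = QuantileRegressor(quantile=tau, alpha=0.0).fit(X, claims)   # splicing point
q_hat = q_model.predict(X)

body = claims <= q_hat
body_model = LinearRegression().fit(X[body], claims[body])            # lower-ES proxy
tail_model = LinearRegression().fit(X[~body], claims[~body])          # upper-ES proxy

x0 = np.array([[0.5]])
print("splicing quantile:", q_model.predict(x0))
print("body / tail means:", body_model.predict(x0), tail_model.predict(x0))
```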
We propose three novel consistent specification tests for quantile regression models that generalize earlier tests in three ways. First, we allow the covariate effects to be quantile-dependent and nonlinear. Second, we parameterize the conditional quantile functions by appropriate basis functions rather than a fixed parametric form, and are hence able to test for functional forms beyond linearity while retaining linear effects as special cases. In both cases, the induced class of conditional distribution functions is tested with a Cram\'{e}r-von Mises type test statistic, for which we derive the theoretical limit distribution and propose a bootstrap method. Third, to increase the power of the tests, we suggest a modified test statistic. We highlight the merits of our tests in a detailed Monte Carlo study and two real data examples. Our first application, to conditional income distributions in Germany, indicates that significant differences between East and West remain, not only overall but also across the quantiles of the conditional income distributions, when conditioning on age and year. The second application, to data from the Australian national electricity market, reveals the importance of interaction effects for modelling the highly skewed and heavy-tailed distributions of energy prices conditional on day, time of day, and demand.
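One common Cram\'{e}r-von Mises-type construction for checking a linear quantile-regression specification, shown for orientation only; it is not necessarily the exact statistic of the paper, and critical values would come from the limit distribution or bootstrap scheme developed there.

```python
# Marked empirical process of quantile scores, indexed by indicators 1{x_i <= x_j}.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 300
x = rng.uniform(0, 1, size=n)
y = 1.0 + 2.0 * x + 0.5 * np.sin(4 * np.pi * x) + rng.normal(size=n)  # nonlinear truth

tau = 0.5
fit = sm.QuantReg(y, sm.add_constant(x)).fit(q=tau)
u = y - fit.predict(sm.add_constant(x))
psi = tau - (u < 0)                                   # quantile score of residuals

# R_n(t) = n^{-1/2} * sum_i psi_i 1{x_i <= t}, evaluated at t = x_j
R = (psi[None, :] * (x[None, :] <= x[:, None])).sum(axis=1) / np.sqrt(n)
cvm_stat = np.mean(R ** 2)
print("Cramer-von Mises-type statistic:", round(cvm_stat, 3))
```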
Dynamic treatment regimes (DTRs) consist of a sequence of decision rules, one per stage of intervention, that find effective treatments for individual patients according to patient information history. DTRs can be estimated from models that include interactions between treatment and a small number of covariates, which are often chosen a priori. However, with increasingly large and complex data being collected, it is difficult to know which prognostic factors might be relevant in the treatment rule. A more data-driven approach to selecting these covariates may therefore improve the estimated decision rules and simplify models to make them easier to interpret. We propose a variable selection method for DTR estimation using penalized dynamic weighted least squares. Our method has the strong heredity property, that is, an interaction term can be included in the model only if the corresponding main terms have also been selected. Through simulations, we show that our method has both the double robustness property and the oracle property, and the newly proposed methods compare favorably with other variable selection approaches.
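A rough single-stage sketch of two ingredients, dWOLS-style balancing weights and a lasso over treatment interactions. A plain lasso does not enforce the strong heredity property that the paper's penalty provides, and all model choices below are illustrative assumptions.

```python
# Weighted lasso with balancing weights |A - pi_hat(x)| on main effects and
# treatment interactions (weights applied by scaling rows with sqrt(w)).
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(8)
n, p = 500, 10
X = rng.normal(size=(n, p))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))           # confounded treatment
y = X[:, 0] + A * (1.0 + 2.0 * X[:, 1]) + rng.normal(size=n)

pi_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
w = np.abs(A - pi_hat)                                     # dWOLS-style balancing weights

# Design: [intercept, X, A, A*X]; intercept column is penalized here (rough sketch).
Z = np.column_stack([np.ones(n), X, A[:, None] * np.column_stack([np.ones(n), X])])
sw = np.sqrt(w)[:, None]
lasso = Lasso(alpha=0.02, fit_intercept=False).fit(Z * sw, y * sw.ravel())
blip = lasso.coef_[p + 1:]                                 # coefficients of A and A*X
print(np.round(blip, 2))
```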
Data augmentation has been widely used for training deep learning systems for medical image segmentation and plays an important role in obtaining robust and transformation-invariant predictions. However, it has seldom been used at test time for segmentation and has not been formulated in a consistent mathematical framework. In this paper, we first propose a theoretical formulation of test-time augmentation for deep learning in image recognition, where the prediction is obtained by estimating its expectation via Monte Carlo simulation with prior distributions of parameters in an image acquisition model that involves image transformations and noise. We then propose a novel uncertainty estimation method based on this formulation of test-time augmentation. Experiments on segmentation of fetal brains and brain tumors from 2D and 3D magnetic resonance images (MRI) show that 1) our test-time augmentation outperforms a single-prediction baseline and dropout-based multiple predictions, and 2) it provides better uncertainty estimation than computing model-based uncertainty alone and helps reduce overconfident incorrect predictions.
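A minimal sketch of the Monte Carlo test-time augmentation idea, with a dummy segmenter standing in for a trained network; the flip-plus-noise acquisition model and all sizes are illustrative assumptions.

```python
# Test-time augmentation: sample transforms and noise, predict, invert the
# transforms, then use the Monte Carlo mean as the prediction and the
# per-pixel variance as an uncertainty estimate.
import numpy as np

rng = np.random.default_rng(9)

def toy_segmenter(img):
    """Dummy 'network': foreground where intensity exceeds a threshold."""
    return (img > 0.5).astype(float)

def tta_predict(img, n_samples=20, noise_std=0.05):
    preds = []
    for _ in range(n_samples):
        flip = rng.integers(0, 2, size=2).astype(bool)        # random flips per axis
        aug = img.copy()
        for ax, f in enumerate(flip):
            if f:
                aug = np.flip(aug, axis=ax)
        aug = aug + rng.normal(0, noise_std, size=aug.shape)   # acquisition noise
        pred = toy_segmenter(aug)
        for ax, f in enumerate(flip):                          # invert the transforms
            if f:
                pred = np.flip(pred, axis=ax)
        preds.append(pred)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.var(axis=0)               # prediction, uncertainty

img = rng.uniform(size=(64, 64))
mean_pred, uncertainty = tta_predict(img)
print(mean_pred.shape, float(uncertainty.mean()))
```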