We propose a penalized least-squares method to fit the linear regression model with fitted values that are invariant to invertible linear transformations of the design matrix. This invariance is important, for example, when practitioners have categorical predictors and interactions. Our method has the same computational cost as ridge-penalized least squares, which lacks this invariance. We derive the expected squared distance between the vector of population fitted values and its shrinkage estimator as well as the tuning parameter value that minimizes this expectation. In addition to using cross validation, we construct two estimators of this optimal tuning parameter value and study their asymptotic properties. Our numerical experiments and data examples show that our method performs similarly to ridge-penalized least-squares.
Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume to be given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, and to be given access to a `surrogate model' that can predict labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, to be denoted by $\{{\boldsymbol x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and we use them to train a model via regularized empirical risk minimization. By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high- dimensional asymptotics, we show that: $(i)$~Data selection can be very effective, in particular beating training on the full sample in some cases; $(ii)$~Certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.
We discuss recently developed methods that quantify the stability and generalizability of statistical findings under distributional changes. In many practical problems, the data is not drawn i.i.d. from the target population. For example, unobserved sampling bias, batch effects, or unknown associations might inflate the variance compared to i.i.d. sampling. For reliable statistical inference, it is thus necessary to account for these types of variation. We discuss and review two methods that allow quantifying distribution stability based on a single dataset. The first method computes the sensitivity of a parameter under worst-case distributional perturbations to understand which types of shift pose a threat to external validity. The second method treats distributional shifts as random which allows assessing average robustness (instead of worst-case). Based on a stability analysis of multiple estimators on a single dataset, it integrates both sampling and distributional uncertainty into a single confidence interval.
Approximated forms of the RII and RIII redistribution matrices are frequently applied to simplify the numerical solution of the radiative transfer problem for polarized radiation, taking partial frequency redistribution (PRD) effects into account. A widely used approximation for RIII is to consider its expression under the assumption of complete frequency redistribution (CRD) in the observer frame (RIII CRD). The adequacy of this approximation for modeling the intensity profiles has been firmly established. By contrast, its suitability for modeling scattering polarization signals has only been analyzed in a few studies, considering simplified settings. In this work, we aim at quantitatively assessing the impact and the range of validity of the RIII CRD approximation in the modeling of scattering polarization. Methods. We first present an analytic comparison between RIII and RIII CRD. We then compare the results of radiative transfer calculations, out of local thermodynamic equilibrium, performed with RIII and RIII CRD in realistic 1D atmospheric models. We focus on the chromospheric Ca i line at 4227 A and on the photospheric Sr i line at 4607 A.
The equioscillation condition is extended to multivariate approximation. To this end, it is reformulated as the synchronized oscillations between the error maximizers and the components of a related Haar matrix kernel vector. This new condition gives rise to a multivariate equioscillation theorem where the Haar condition is not assumed and hence the existence and the characterization by equioscillation become independent of uniqueness. This allows the theorem to be applicable to problems with no strong uniqueness or even no uniqueness. A technical additional requirement on the involved Haar matrix and its kernel vector is proved to be sufficient for strong uniqueness. Instances of multivariate problems with strongly unique, unique and nonunique solutions are presented to illustrate the scope of the theorem.
Generalized cross-validation (GCV) is a widely-used method for estimating the squared out-of-sample prediction risk that employs a scalar degrees of freedom adjustment (in a multiplicative sense) to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least squares estimators. We show that GCV is inconsistent for any finite ensemble of size greater than one. Towards repairing this shortcoming, we identify a correction that involves an additional scalar correction (in an additive sense) based on degrees of freedom adjusted training errors from each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires neither sample splitting, model refitting, or out-of-bag risk estimation. The estimator stems from a finer inspection of ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of the CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. In the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, which establishes model-free uniform consistency of CGCV.
In this paper, we propose nonlocal diffusion models with Dirichlet boundary. These nonlocal diffusion models preserve the maximum principle and also have corresponding variational form. With these good properties, It is relatively easy to prove the well-posedness and the vanishing nonlocality convergence. Furthermore, by specifically designed weight function, we can get a nonlocal diffusion model with second order convergence which is optimal for nonlocal diffusion models.
Branching process inspired models are widely used to estimate the effective reproduction number -- a useful summary statistic describing an infectious disease outbreak -- using counts of new cases. Case data is a real-time indicator of changes in the reproduction number, but is challenging to work with because cases fluctuate due to factors unrelated to the number of new infections. We develop a new model that incorporates the number of diagnostic tests as a surveillance model covariate. Using simulated data and data from the SARS-CoV-2 pandemic in California, we demonstrate that incorporating tests leads to improved performance over the state-of-the-art.
I propose an alternative algorithm to compute the MMS voting rule. Instead of using linear programming, in this new algorithm the maximin support value of a committee is computed using a sequence of maximum flow problems.
We introduce a flexible method to simultaneously infer both the drift and volatility functions of a discretely observed scalar diffusion. We introduce spline bases to represent these functions and develop a Markov chain Monte Carlo algorithm to infer, a posteriori, the coefficients of these functions in the spline basis. A key innovation is that we use spline bases to model transformed versions of the drift and volatility functions rather than the functions themselves. The output of the algorithm is a posterior sample of plausible drift and volatility functions that are not constrained to any particular parametric family. The flexibility of this approach provides practitioners a powerful investigative tool, allowing them to posit a variety of parametric models to better capture the underlying dynamics of their processes of interest. We illustrate the versatility of our method by applying it to challenging datasets from finance, paleoclimatology, and astrophysics. In view of the parametric diffusion models widely employed in the literature for those examples, some of our results are surprising since they call into question some aspects of these models.
We propose a variant of an algorithm introduced by Schewe and also studied by Luttenberger for solving parity or mean-payoff games. We present it over energy games and call our variant the ESL algorithm (for Energy-Schewe-Luttenberger). We find that using potential reductions as introduced by Gurvich, Karzanov and Khachiyan allows for a simple and elegant presentation of the algorithm, which repeatedly applies a natural generalisation of Dijkstra's algorithm to the two-player setting due to Khachiyan, Gurvich and Zhao.