In this article, we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. To obtain our results, we rely on random variables whose distributions satisfy the logarithmic Sobolev inequality, giving us a relatively rich class of distributions. We illustrate our method on several interesting examples, including linear regression, kernel density estimation, and stabilized/truncated estimators such as stabilized kernel regression.
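As an illustration of the leave-one-out cross-validation risk estimate that such bounds concern, here is a minimal Python sketch; the ridge-regression fitter and squared loss are illustrative assumptions, not the paper's setting.

import numpy as np

def loo_cv_risk(X, y, fit, loss):
    """Leave-one-out CV risk: refit the estimator with each point held out
    and average the loss incurred on the held-out point."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        model = fit(X[mask], y[mask])        # estimator trained without point i
        errs[i] = loss(y[i], model(X[i]))    # loss on the held-out point
    return errs.mean()

# Illustrative use with ridge regression and squared loss.
def ridge_fit(X, y, lam=1.0):
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return lambda x: x @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(size=200)
print(loo_cv_risk(X, y, ridge_fit, lambda a, b: (a - b) ** 2))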
In prevalent cohort studies with follow-up, the time-to-event outcome is subject to left truncation, leading to selection bias. For estimation of the distribution of the time to event, conventional methods that adjust for left truncation typically rely on the (quasi-)independence assumption that the truncation time and the event time are "independent" on the observed region. This assumption is violated when there is dependence between the truncation time and the event time, possibly induced by measured covariates. Inverse probability of truncation weighting that leverages covariate information can be used in this case, but it is sensitive to misspecification of the truncation model. In this work, we apply semiparametric theory to derive the efficient influence curve of an expected (arbitrarily transformed) survival time in the presence of covariate-induced dependent left truncation. We then use it to construct estimators that are shown to enjoy double-robustness properties. Our work represents the first attempt to construct doubly robust estimators in the presence of left truncation, a setting that does not fall under the established framework of coarsened data in which doubly robust approaches have been developed. We provide technical conditions for the asymptotic properties that appear not to have been carefully examined in the literature for time-to-event data, and study the estimators via extensive simulation. We apply the estimators to two data sets from practice with different right-censoring patterns.
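For concreteness, the following sketch shows the simple inverse-probability-of-truncation-weighted estimate of an expected (transformed) survival time that the abstract refers to as the covariate-leveraging baseline; the fitted conditional truncation distribution F_trunc is an assumed input, and this is not the proposed doubly robust estimator.

import numpy as np

def ipw_truncation_mean(event_times, covariates, F_trunc, nu=lambda t: t):
    """Inverse-probability-of-truncation-weighted estimate of E[nu(T)].

    F_trunc(q, z) should return an estimate of P(Q <= q | Z = z), the
    conditional distribution of the truncation time Q. Subjects are observed
    only if Q <= T, so each observed subject is reweighted by
    1 / P(observed | T, Z) = 1 / F_trunc(T, Z)."""
    t = np.asarray(event_times, dtype=float)
    probs = np.array([F_trunc(ti, zi) for ti, zi in zip(t, covariates)])
    w = 1.0 / np.clip(probs, 1e-12, None)   # guard against tiny probabilities
    return np.sum(w * nu(t)) / np.sum(w)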
We present a unified technique for sequential estimation of convex divergences between distributions, including integral probability metrics like the kernel maximum mean discrepancy, $\varphi$-divergences like the Kullback-Leibler divergence, and optimal transport costs, such as powers of Wasserstein distances. This is achieved by observing that empirical convex divergences are (partially ordered) reverse submartingales with respect to the exchangeable filtration, coupled with maximal inequalities for such processes. These techniques appear to be complementary and powerful additions to the existing literature on both confidence sequences and convex divergences. We construct an offline-to-sequential device that converts a wide array of existing offline concentration inequalities into time-uniform confidence sequences that can be continuously monitored, providing valid tests or confidence intervals at arbitrary stopping times. The resulting sequential bounds pay only an iterated logarithmic price over the corresponding fixed-time bounds, retaining the same dependence on problem parameters (like dimension or alphabet size, if applicable). These results are also applicable to more general convex functionals -- like the negative differential entropy, suprema of empirical processes, and V-statistics -- and to more general processes satisfying a key leave-one-out property.
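One standard offline-to-sequential device of the kind described is epoch-based stitching: spend the error budget over geometrically spaced epochs and, via the maximal inequality for the reverse submartingale, carry the offline bound from the start of an epoch across the whole epoch. The sketch below is a hedged illustration of that recipe with a Hoeffding radius as the offline input; the epoch schedule and the example radius are assumptions, not the paper's exact construction.

import numpy as np

def sequential_radius(offline_radius, t, delta, base=2.0):
    """Time-uniform radius at time t built from an offline radius eps(n, delta).

    The offline bound is applied at the start n_k = base**k of the epoch
    containing t with budget delta_k = delta / ((k + 1) * (k + 2)), so that
    sum_k delta_k <= delta; a union bound over epochs then yields coverage
    simultaneously over all t at an iterated-logarithmic price. Carrying the
    epoch-start bound across the epoch is what the reverse-submartingale
    maximal inequality justifies."""
    k = int(np.floor(np.log(max(t, 1)) / np.log(base)))   # epoch index of time t
    n_k = base ** k                                        # epoch start, n_k <= t
    delta_k = delta / ((k + 1) * (k + 2))
    return offline_radius(n_k, delta_k)

# Example: Hoeffding radius for the mean of [0, 1]-valued observations.
hoeffding = lambda n, d: np.sqrt(np.log(2.0 / d) / (2.0 * n))
print(sequential_radius(hoeffding, t=1000, delta=0.05))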
This paper presents a new approach to estimation and inference for the regression parameters in a panel data model with interactive fixed effects. It relies on the assumption that the factor loadings can be expressed as an unknown smooth function of the time average of the covariates plus an idiosyncratic error term. Compared to existing approaches, our estimator has a simple partial least squares form and requires neither iterative procedures nor a preliminary estimation of the factors. We derive its asymptotic properties and find that the limiting distribution has a discontinuity that depends on the explanatory power of our basis functions, as measured by the variance of the error in the factor loadings. As a result, the usual ``plug-in'' methods based on estimates of the asymptotic covariance are only valid pointwise and may produce either over- or under-coverage probabilities. We show that uniformly valid inference can be achieved by using the cross-sectional bootstrap. A Monte Carlo study indicates good performance in terms of mean squared error. We apply our methodology to analyze the determinants of growth rates in OECD countries.
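One plausible reading of such an estimator, sketched below under stated assumptions, is a partialled-out least squares: approximate the loadings by sieve basis functions of the time-averaged covariates, project them out cross-sectionally period by period, and run pooled OLS on the residuals. The basis choice and the exact form are illustrative and not necessarily the paper's estimator.

import numpy as np

def partialled_out_beta(Y, X, basis):
    """Sketch for y_it = x_it' beta + lambda_i' f_t + e_it with
    lambda_i approximately g(xbar_i).

    Y: (N, T) outcomes; X: (N, T, p) regressors; basis: maps the (N, q)
    time-averaged covariates to an (N, K) sieve basis matrix. Since
    lambda_i' f_t is then roughly a period-specific function of xbar_i,
    projecting the basis out within each period removes the approximated
    factor structure before pooled OLS."""
    N, T, p = X.shape
    B = basis(X.mean(axis=1))                 # (N, K) basis of time-averaged covariates
    M = np.eye(N) - B @ np.linalg.pinv(B)     # annihilator of span(B)
    XtX, Xty = np.zeros((p, p)), np.zeros(p)
    for t in range(T):
        Xt, yt = M @ X[:, t, :], M @ Y[:, t]
        XtX += Xt.T @ Xt
        Xty += Xt.T @ yt
    return np.linalg.solve(XtX, Xty)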
In this manuscript, we study the problem of scalar-on-distribution regression, that is, settings where subject-specific distributions or densities (in practice, repeated measures from those distributions) are the covariates related to a scalar outcome via a regression model. We propose a direct regression for such distribution-valued covariates that circumvents estimating subject-specific densities and instead uses the observed repeated measures directly as covariates. The model is invariant to any transformation or ordering of the repeated measures. Endowing the regression function with a Gaussian Process prior, we obtain closed-form or conjugate Bayesian inference. Our method subsumes standard Bayesian non-parametric regression using Gaussian Processes as a special case. Theoretically, we show that the method can achieve an optimal estimation error bound. To our knowledge, this is the first theoretical study of Bayesian regression with distribution-valued covariates. Through simulation studies and the analysis of an activity count dataset, we demonstrate that our method performs better than approaches that require an intermediate density estimation step.
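A hedged sketch of one such order-invariant construction is a Gaussian Process regression whose covariate kernel compares two subjects' sets of repeated measures through averaged pairwise kernel evaluations (a mean-embedding-style kernel); the Gaussian base kernel and hyperparameters are illustrative assumptions rather than the paper's specification.

import numpy as np

def set_kernel(A, B, gamma=1.0):
    """Order-invariant kernel between two sets of scalar repeated measures:
    the average Gaussian kernel evaluation over all pairs."""
    d2 = (np.asarray(A)[:, None] - np.asarray(B)[None, :]) ** 2
    return np.exp(-gamma * d2).mean()

def gp_posterior_mean(train_sets, y, test_sets, noise=0.1, gamma=1.0):
    """Posterior mean of a GP regression with set-valued covariates."""
    K = np.array([[set_kernel(a, b, gamma) for b in train_sets] for a in train_sets])
    Ks = np.array([[set_kernel(a, b, gamma) for b in train_sets] for a in test_sets])
    alpha = np.linalg.solve(K + noise * np.eye(len(y)), np.asarray(y, dtype=float))
    return Ks @ alpha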
Making causal inferences from observational studies can be challenging when confounders are missing not at random. In such cases, identification of causal effects is generally not guaranteed. Motivated by a real example, we consider a treatment-independent missingness assumption under which we establish the identification of causal effects when confounders are missing not at random. We propose a weighted estimating equation (WEE) approach for estimating model parameters and introduce three estimators of the average causal effect, based on regression, propensity score weighting, and doubly robust methods. We evaluate the performance of these estimators through simulations and provide a real data analysis to illustrate the proposed method.
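For reference, the three standard estimator forms mentioned (outcome regression, propensity score weighting, and doubly robust/AIPW) are sketched below for fully observed data; the weighted-estimating-equation adjustment for confounders missing not at random is the paper's contribution and is not reproduced here, and the fitted nuisance functions are assumed inputs.

import numpy as np

def ate_estimators(A, Y, e_hat, mu1_hat, mu0_hat):
    """Estimates of the average causal effect E[Y(1)] - E[Y(0)] from a binary
    treatment A, outcome Y, fitted propensity scores e_hat = P(A=1|X), and
    fitted outcome regressions mu1_hat = E[Y|A=1,X], mu0_hat = E[Y|A=0,X]."""
    reg = np.mean(mu1_hat - mu0_hat)                             # outcome regression
    ipw = np.mean(A * Y / e_hat - (1 - A) * Y / (1 - e_hat))     # propensity weighting
    dr = np.mean(mu1_hat - mu0_hat
                 + A * (Y - mu1_hat) / e_hat
                 - (1 - A) * (Y - mu0_hat) / (1 - e_hat))        # doubly robust (AIPW)
    return {"regression": reg, "ipw": ipw, "doubly_robust": dr}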
In this paper, we establish novel deviation bounds for additive functionals of geometrically ergodic Markov chains, analogous to the Rosenthal- and Bernstein-type inequalities for sums of independent random variables. We pay special attention to the dependence of our bounds on the mixing time of the corresponding chain. Our proof technique is, as far as we know, new, and is based on the repeated application of the Poisson decomposition. We relate the constants appearing in our moment bounds to the constants from the martingale version of the Rosenthal inequality and show an explicit dependence on the parameters of the underlying Markov kernel.
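For calibration, the classical Bernstein inequality for independent variables, which is the i.i.d. benchmark such Markov-chain bounds parallel, reads (standard statement, not the paper's bound):
\[
\mathbb{P}\left(\Big|\sum_{i=1}^{n} X_i\Big| \ge t\right)
\le 2\exp\left(-\frac{t^{2}/2}{\sum_{i=1}^{n}\mathbb{E}[X_i^{2}] + Mt/3}\right),
\qquad \mathbb{E}[X_i]=0,\ |X_i|\le M \text{ a.s.}
\]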
$P$-values derived from continuously distributed test statistics are typically uniformly distributed on $(0,1)$ under least favorable parameter configurations (LFCs) in the null hypothesis. Conservativeness of a $p$-value $P$ (meaning that, under the null hypothesis, $P$ is stochastically larger than a random variable uniformly distributed on $(0,1)$) can occur if the test statistic from which $P$ is derived is discrete, or if the true parameter value under the null is not an LFC. To deal with both of these sources of conservativeness, we present two approaches utilizing randomized $p$-values, namely single-stage and two-stage randomization. We illustrate their effectiveness for testing a composite null hypothesis under a binomial model. We also give an example of how the proposed $p$-values can be used to test a composite null in group testing designs. In line with previous findings, the proposed randomized $p$-values are less conservative than non-randomized $p$-values under the null hypothesis, but are not stochastically smaller under the alternative. The problem of establishing the validity of randomized $p$-values is not trivial and has received attention in the literature. We show that our proposed randomized $p$-values are valid under various discrete statistical models in which the distribution of the corresponding test statistic belongs to an exponential family. We also investigate, as a function of the sample size, the behaviour of the power functions of tests based on the proposed randomized $p$-values. Simulations and a real data analysis are used to compare the different $p$-values considered.
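As a concrete instance, the single-stage randomized $p$-value for a one-sided binomial test takes the standard form $P_{p_0}(T > t) + U \cdot P_{p_0}(T = t)$ with $U$ uniform on $(0,1)$; the sketch below evaluates it at the boundary $p_0$ of the composite null, and the test direction is an illustrative choice.

import numpy as np
from scipy.stats import binom

def randomized_binomial_pvalue(x, n, p0, rng=np.random.default_rng()):
    """Single-stage randomized p-value for H0: p <= p0 vs H1: p > p0 based on
    X ~ Binomial(n, p): uniformly distributed at p = p0, valid for p <= p0."""
    u = rng.uniform()
    return binom.sf(x, n, p0) + u * binom.pmf(x, n, p0)   # P(X > x) + U * P(X = x)

# Example: 7 successes in 20 trials, testing p <= 0.25.
print(randomized_binomial_pvalue(7, 20, 0.25))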
The goal of radiation therapy for cancer is to deliver the prescribed radiation dose to the tumor while minimizing the dose to surrounding healthy tissues. To evaluate treatment plans, the dose distribution to healthy organs is commonly summarized by dose-volume histograms (DVHs). Normal tissue complication probability (NTCP) modelling has centered on making patient-level risk predictions from features extracted from the DVHs, but few studies have considered adopting a causal framework to evaluate the comparative effectiveness of alternative treatment plans. We propose causal estimands for NTCP based on deterministic and stochastic interventions, as well as estimators based on marginal structural models that parametrize the biologically necessary bivariable monotonicity relating dose and volume to toxicity risk. The properties of these estimators are studied through simulations, along with an illustration of their use for anal canal cancer patients treated with radiotherapy.
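For readers unfamiliar with the summary, the cumulative dose-volume histogram underlying such features is simply the fraction of an organ's volume receiving at least each dose level; a minimal sketch follows, with the dose grid and the equal-volume-voxel assumption as illustrative choices.

import numpy as np

def cumulative_dvh(voxel_doses, dose_grid):
    """Cumulative DVH: fraction of organ volume receiving at least dose d,
    for each d in dose_grid, assuming equal-volume voxels."""
    doses = np.asarray(voxel_doses, dtype=float)
    return np.array([(doses >= d).mean() for d in dose_grid])

# Example: DVH at 0, 10, ..., 60 Gy for simulated voxel doses.
rng = np.random.default_rng(1)
print(cumulative_dvh(rng.gamma(shape=5.0, scale=6.0, size=10_000), np.arange(0, 61, 10)))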
In this paper, we propose a method for estimating model parameters from small-angle scattering (SAS) data based on Bayesian inference. Conventional SAS data analyses involve manual parameter adjustment by analysts or optimization using gradient methods. These analysis processes tend to rely on heuristics and may end up in local solutions. Furthermore, it is difficult to evaluate the reliability of the results obtained by conventional analysis methods. Our method addresses these problems by estimating model parameters as probability distributions from SAS data within the framework of Bayesian inference. We evaluate the performance of our method through numerical experiments on artificial data from representative measurement target models. The results of the numerical experiments show that our method provides not only high accuracy and reliability of estimation, but also insight into the transition point of estimability with respect to the measurement time and the lower bound of the angular domain of the measured data.
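A hedged sketch of the general recipe, posterior sampling of model parameters given SAS data, is given below using a random-walk Metropolis step; the homogeneous-sphere form factor, Gaussian noise model, flat log-parameter priors, and step sizes are illustrative assumptions, not the paper's models or algorithm.

import numpy as np

def sphere_intensity(q, radius, scale, background):
    """Scattering intensity for a homogeneous sphere (standard form factor);
    assumes q > 0."""
    x = q * radius
    f = 3.0 * (np.sin(x) - x * np.cos(x)) / x**3
    return scale * f**2 + background

def metropolis_sas(q, I_obs, sigma, n_steps=20_000, step=0.02, rng=None):
    """Random-walk Metropolis sampling of (radius, scale, background), updated
    on the log scale, under a Gaussian noise model with flat priors on the
    log parameters. Returns the chain of parameter draws."""
    rng = rng or np.random.default_rng(0)
    theta = np.log(np.array([50.0, 1.0, 1e-3]))          # initial guess (log scale)
    def loglik(th):
        model = sphere_intensity(q, *np.exp(th))
        return -0.5 * np.sum(((I_obs - model) / sigma) ** 2)
    ll, samples = loglik(theta), []
    for _ in range(n_steps):
        prop = theta + step * rng.normal(size=3)
        ll_prop = loglik(prop)
        if np.log(rng.uniform()) < ll_prop - ll:          # Metropolis accept/reject
            theta, ll = prop, ll_prop
        samples.append(np.exp(theta))
    return np.array(samples)

# Illustrative use on synthetic data.
q = np.linspace(0.005, 0.3, 200)
I_true = sphere_intensity(q, 40.0, 1.0, 1e-3)
sigma = 0.05 * I_true
I_obs = I_true + sigma * np.random.default_rng(1).normal(size=q.size)
draws = metropolis_sas(q, I_obs, sigma)
print(draws[len(draws) // 2:].mean(axis=0))   # posterior means after burn-in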
This paper focuses on the expected difference in borrowers' repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence the estimation error can be substantial. We therefore propose an alternative approach to constructing the estimators so that this error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of the classical and the proposed estimators in estimating the causal quantities. The comparison is carried out across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approach to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.