The bootstrap is a popular data-driven method for quantifying statistical uncertainty, but for modern high-dimensional problems it can incur huge computational costs because resamples must be generated and models refit repeatedly. Recent work has shown that the resampling effort can be reduced dramatically, even down to one Monte Carlo replication, while still producing asymptotically valid confidence intervals. We derive finite-sample coverage error bounds for these ``cheap'' bootstrap confidence intervals that shed light on their behavior in large-scale problems where curbing the resampling effort is important. Our results show that the cheap bootstrap with a small number of resamples attains coverage comparable to that of traditional bootstraps using infinitely many resamples, even when the dimension grows nearly as fast as the sample size. We validate our theoretical results and compare the performance of the cheap bootstrap with that of other benchmarks in a range of experiments.
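To make the interval construction concrete, here is a minimal sketch of a cheap-bootstrap-style confidence interval that uses only $B$ resamples; the choice of statistic (the sample mean), $B=5$, and the $t$-pivot with $B$ degrees of freedom are illustrative assumptions rather than the paper's exact prescription.

```python
import numpy as np
from scipy import stats

def cheap_bootstrap_ci(data, statistic, B=5, alpha=0.05, rng=None):
    """Confidence interval from only B bootstrap resamples (illustrative form).

    Uses the pivot psi_hat +/- t_{B,1-alpha/2} * S_B, where S_B^2 is the
    average squared deviation of the B resample statistics from psi_hat.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    psi_hat = statistic(data)
    resample_stats = np.array([
        statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)
    ])
    s_b = np.sqrt(np.mean((resample_stats - psi_hat) ** 2))
    t_crit = stats.t.ppf(1 - alpha / 2, df=B)
    return psi_hat - t_crit * s_b, psi_hat + t_crit * s_b

# Example: 95% interval for the mean from only B = 5 resamples.
x = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=500)
print(cheap_bootstrap_ci(x, np.mean, B=5))
```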
Mainstream methods for clinical trial design do not yet use prior probabilities of clinical hypotheses, mainly out of concern that poor priors may lead to weak designs. To address this concern, we illustrate a conservative approach to trial design that ensures the frequentist operating characteristics of the primary trial outcome are stronger than the design prior. Compared with current approaches to Bayesian design, we focus on defining a sample size cost commensurate with the prior, to insure against the possibility of prior-data conflict. Our approach is ethical, in that it calls for quantifying the level of clinical equipoise at the design stage and requires the design to be capable of disturbing this initial equipoise by a pre-specified amount. Four examples are discussed, illustrating the design of phase II-III trials with binary or time-to-event endpoints. The resulting sample sizes are shown to be conducive to strong levels of overall evidence, whether positive or negative, increasing the conclusiveness of the design and the associated trial outcome. The levels of negative evidence provided by standard group sequential designs are found to be negligible, underscoring the importance of complementing traditional efficacy boundaries with futility rules.
A factor copula model is proposed in which the factors are either simulable or estimable from exogenous information. Point estimation and inference are based on a simulated method of moments (SMM) approach with non-overlapping simulation draws. Consistency and limiting normality of the estimator are established, and the validity of bootstrap standard errors is shown. In doing so, previous results from the literature are verified under low-level conditions imposed on the individual components of the factor structure. Monte Carlo evidence confirms the accuracy of the asymptotic theory in finite samples, and an empirical application illustrates the usefulness of the model for explaining the cross-sectional dependence between stock returns.
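As a rough illustration of the simulated method of moments idea, and not the paper's specific estimator, the sketch below fits the loading of a toy one-factor Gaussian copula by matching the average pairwise Spearman's rho of the data with that of simulated pseudo-observations; the moment choice, the Gaussian factor, and the grid search are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def avg_spearman(u):
    """Average pairwise Spearman's rho of a (T, N) matrix of pseudo-observations."""
    rho = stats.spearmanr(u)[0]
    n = u.shape[1]
    return (rho.sum() - n) / (n * (n - 1))

def simulate_factor_copula(loading, T, N, rng):
    """Pseudo-observations from a one-factor Gaussian copula with a common loading."""
    z = rng.standard_normal((T, 1))        # common factor
    eps = rng.standard_normal((T, N))      # idiosyncratic noise
    x = loading * z + np.sqrt(1.0 - loading ** 2) * eps
    return stats.norm.cdf(x)

def smm_fit(u_data, S=10, grid=np.linspace(0.05, 0.95, 19), seed=0):
    """Pick the loading whose simulated moment is closest to the sample moment."""
    rng = np.random.default_rng(seed)
    target = avg_spearman(u_data)
    T, N = u_data.shape
    best, best_err = None, np.inf
    for lam in grid:
        sim = np.mean([avg_spearman(simulate_factor_copula(lam, T, N, rng))
                       for _ in range(S)])
        err = (sim - target) ** 2
        if err < best_err:
            best, best_err = lam, err
    return best

# Demo: recover the loading from data generated by the same toy copula.
rng = np.random.default_rng(1)
u = simulate_factor_copula(0.6, T=1000, N=5, rng=rng)
print(smm_fit(u))
```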
This paper proposes a Kolmogorov-Smirnov type statistic and a Cram\'er-von Mises type statistic to test linearity in semi-functional partially linear regression models. Our test statistics are based on a residual marked empirical process indexed by a randomly projected functional covariate, which circumvents the ``curse of dimensionality'' brought by the functional covariate. The asymptotic properties of the proposed test statistics are established under the null, under fixed alternatives, and under a sequence of local alternatives converging to the null at the $n^{-1/2}$ rate. A straightforward wild bootstrap procedure is suggested to estimate the critical values required to carry out the tests in practical applications. Results from an extensive simulation study show that our tests perform reasonably well in finite samples. Finally, we apply our tests to the Tecator and AEMET datasets to check whether the assumption of linearity is supported by these datasets.
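The wild bootstrap step can be illustrated with a simplified sketch in which the functional covariate has already been reduced to a scalar index by a random projection and the test statistic is the Kolmogorov-Smirnov functional of the residual-marked process; the linear null model, Rademacher multipliers, and refitting scheme below are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def ks_marked_process(index, residuals):
    """KS statistic of the residual-marked empirical process along a scalar index."""
    order = np.argsort(index)
    cumsum = np.cumsum(residuals[order]) / np.sqrt(len(residuals))
    return np.max(np.abs(cumsum))

def wild_bootstrap_pvalue(index, y, x, B=500, rng=None):
    """Test linearity of E[y | x] using wild-bootstrap critical values.

    Fits the null linear model by OLS, then perturbs residuals with
    Rademacher multipliers and refits to mimic the null distribution.
    """
    rng = np.random.default_rng(rng)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    stat = ks_marked_process(index, resid)

    boot_stats = np.empty(B)
    for b in range(B):
        v = rng.choice([-1.0, 1.0], size=len(y))   # Rademacher multipliers
        y_star = X @ beta + resid * v
        beta_star = np.linalg.lstsq(X, y_star, rcond=None)[0]
        boot_stats[b] = ks_marked_process(index, y_star - X @ beta_star)
    return np.mean(boot_stats >= stat)

# Demo under the null: the p-value should be roughly uniform.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
print(wild_bootstrap_pvalue(index=x, y=y, x=x, B=200))
```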
This paper makes three contributions. First, it generalizes the Lindeberg\textendash Feller and Lyapunov central limit theorems to Hilbert spaces by way of $L^2$. Second, it generalizes these results to spaces in which sample failure and missingness can occur. Finally, it shows that satisfaction of the Lindeberg\textendash Feller and Lyapunov conditions in such spaces implies satisfaction of the conditions in the completely observed space, and how this guarantees the consistency of inferences drawn from the partial functional data. The latter two results are especially important given the increasing attention to statistical inference with partially observed functional data. This paper goes beyond previous research by providing simple boundedness conditions which guarantee that \textit{all} inferences, rather than some proper subset of them, are consistently estimated. This is shown primarily by aggregating conditional expectations with respect to the space of missingness patterns. This paper appears to be the first to apply this technique.
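For reference, the classical Lindeberg condition being generalized can be stated, for a triangular array of independent, mean-zero random elements $X_{n,i}$, $i=1,\dots,k_n$, of a Hilbert space with $s_n^2=\sum_{i=1}^{k_n}\mathbb{E}\|X_{n,i}\|^2$, as follows; the exact formulation used in the paper, in particular under missingness, may differ:
\[
\frac{1}{s_n^2}\sum_{i=1}^{k_n}\mathbb{E}\!\left[\|X_{n,i}\|^2\,\mathbf{1}\{\|X_{n,i}\|>\varepsilon s_n\}\right]\longrightarrow 0
\quad\text{for every }\varepsilon>0 .
\]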
We study the problem of learning unknown parameters in stochastic interacting particle systems with polynomial drift, interaction, and diffusion functions from the path of a single particle in the system. Our estimator is obtained by solving a linear system constructed by imposing appropriate conditions on the moments of the invariant distribution of the mean field limit and on the quadratic variation of the process. Our approach is easy to implement, as it only requires approximating the moments via the ergodic theorem and solving a low-dimensional linear system. Moreover, we prove that our estimator is asymptotically unbiased in the limit of infinite data and of infinitely many particles (the mean field limit). In addition, we present several numerical experiments that validate the theoretical analysis and show the effectiveness of our methodology for accurately inferring parameters in systems of interacting particles.
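The moment/quadratic-variation mechanics can be sketched on an assumed scalar toy model with polynomial drift and no interaction term, which is much simpler than the interacting-particle setting of the paper; the model, the test functions $x^2$ and $x^4$, and the Euler-Maruyama discretization below are illustrative assumptions.

```python
import numpy as np

# Toy model (assumed for illustration, interaction term dropped):
#   dX_t = -(theta1 * X_t + theta2 * X_t^3) dt + sqrt(2 * sigma) dW_t.
# sigma is read off the quadratic variation; theta1, theta2 come from the
# stationarity conditions E[L x^2] = 0 and E[L x^4] = 0 (L the generator),
# with expectations replaced by time averages via the ergodic theorem.

def simulate(theta1, theta2, sigma, T=500.0, dt=1e-3, rng=None):
    rng = np.random.default_rng(rng)
    n = int(T / dt)
    x = np.empty(n + 1)
    x[0] = 0.0
    noise = np.sqrt(2.0 * sigma * dt) * rng.standard_normal(n)
    for k in range(n):
        drift = -(theta1 * x[k] + theta2 * x[k] ** 3)
        x[k + 1] = x[k] + drift * dt + noise[k]
    return x, dt

def estimate(x, dt):
    T = dt * (len(x) - 1)
    sigma_hat = np.sum(np.diff(x) ** 2) / (2.0 * T)      # quadratic variation
    m2, m4, m6 = (np.mean(x ** p) for p in (2, 4, 6))    # time-averaged moments
    A = np.array([[m2, m4], [m4, m6]])                   # low-dimensional linear system
    b = np.array([sigma_hat, 3.0 * sigma_hat * m2])
    theta1_hat, theta2_hat = np.linalg.solve(A, b)
    return theta1_hat, theta2_hat, sigma_hat

x, dt = simulate(theta1=1.0, theta2=0.5, sigma=0.8, rng=1)
print(estimate(x, dt))
```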
This paper describes three methods for carrying out non-asymptotic inference on partially identified parameters that are solutions to a class of optimization problems. Applications in which such optimization problems arise include estimation under shape restrictions, estimation of models of discrete games, and estimation based on grouped data. The partially identified parameters are characterized by restrictions that involve the unknown population means of observed random variables in addition to structural parameters. Inference consists of finding confidence intervals for functions of the structural parameters. Our theory provides finite-sample lower bounds on the coverage probabilities of the confidence intervals under three sets of assumptions of increasing strength. With the moderate sample sizes found in most economics applications, the bounds become tighter as the assumptions strengthen. We discuss estimation of the population parameters on which the bounds depend and contrast our methods with alternative methods for obtaining confidence intervals for partially identified parameters. The results of Monte Carlo experiments and empirical examples illustrate the usefulness of our methods.
We study properties of confidence intervals (CIs) for the difference of two Bernoulli distributions' success parameters, $p_x - p_y$, where the goal is to obtain a CI of a given half-width while minimizing sampling costs in the presence of possibly different observation costs for the two distributions. Assuming that preliminary estimates of the success parameters are available, we propose three methods for constructing fixed-width CIs: (i) a two-stage sampling procedure, (ii) a sequential method that carries out sampling in batches, and (iii) an $\ell$-stage ``look-ahead'' procedure. We use Monte Carlo simulation to show that, under diverse success probability and observation cost scenarios, our proposed algorithms achieve significant cost savings over their baseline counterparts (up to 50\% for the two-stage procedure and up to 15\% for the sequential methods). Furthermore, across the battery of scenarios studied, our sequential-batches and $\ell$-stage ``look-ahead'' procedures approximately attain the nominal coverage while also meeting the desired width requirement. The sequential-batching method proves more efficient than the ``look-ahead'' method from a computational standpoint, with average running times at least an order of magnitude faster across all scenarios tested.
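A minimal sketch of a two-stage-style procedure is given below, assuming a Wald-interval approximation and a cost-optimal allocation derived from it; the pilot size, the allocation formula, and the interval form are illustrative assumptions and not necessarily the paper's exact algorithms.

```python
import numpy as np
from scipy import stats

def optimal_allocation(p_x, p_y, c_x, c_y, half_width, alpha=0.05):
    """Cost-optimal sample sizes for a fixed-width Wald CI for p_x - p_y.

    Minimizes c_x*n_x + c_y*n_y subject to
    z * sqrt(v_x/n_x + v_y/n_y) <= half_width, with v = p(1-p).
    """
    z = stats.norm.ppf(1 - alpha / 2)
    v_x, v_y = p_x * (1 - p_x), p_y * (1 - p_y)
    scale = (z / half_width) ** 2
    n_x = scale * (v_x + np.sqrt(v_x * v_y * c_y / c_x))
    n_y = scale * (v_y + np.sqrt(v_x * v_y * c_x / c_y))
    return int(np.ceil(n_x)), int(np.ceil(n_y))

def two_stage_ci(sample_x, sample_y, c_x, c_y, half_width, n_pilot=50, alpha=0.05):
    """Two-stage procedure: pilot estimates, then the cost-optimal second stage."""
    x0, y0 = sample_x(n_pilot), sample_y(n_pilot)
    n_x, n_y = optimal_allocation(np.mean(x0), np.mean(y0), c_x, c_y, half_width, alpha)
    x = np.concatenate([x0, sample_x(max(n_x - n_pilot, 0))])
    y = np.concatenate([y0, sample_y(max(n_y - n_pilot, 0))])
    diff = np.mean(x) - np.mean(y)
    se = np.sqrt(np.mean(x) * (1 - np.mean(x)) / len(x)
                 + np.mean(y) * (1 - np.mean(y)) / len(y))
    z = stats.norm.ppf(1 - alpha / 2)
    return diff - z * se, diff + z * se

# Demo with unequal observation costs.
rng = np.random.default_rng(0)
print(two_stage_ci(lambda n: rng.binomial(1, 0.3, n),
                   lambda n: rng.binomial(1, 0.2, n),
                   c_x=1.0, c_y=4.0, half_width=0.05))
```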
Models such as finite state automata are widely used to abstract the behavior of software systems by capturing the sequences of events observable during their execution. Nevertheless, such models are rarely available in practice and, when they are, quickly become outdated; moreover, manually building and maintaining models is costly and error-prone. As a result, a variety of model inference methods that automatically construct models from execution traces have been proposed to address these issues. However, performing a systematic and reliable accuracy assessment of inferred models remains an open problem. Even when a reference model is given, most existing model accuracy assessment methods may return misleading and biased results, mainly because they rely on statistical estimators over a finite number of randomly generated traces, which introduces avoidable estimation uncertainty and makes the results sensitive to the parameters of the random trace generation process. This paper addresses this problem by developing a systematic approach based on analytic combinatorics that minimizes bias and uncertainty in model accuracy assessment by replacing statistical estimation with deterministic accuracy measures. We experimentally demonstrate the consistency and applicability of our approach by assessing the accuracy of models inferred by state-of-the-art inference tools against reference models from established specification mining benchmarks.
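One way such deterministic quantities can be obtained, sketched here under assumptions that may differ from the paper's actual measures, is to count exactly how many traces of a given length a finite automaton accepts, by dynamic programming over its states rather than by sampling traces.

```python
def count_accepted(dfa, n):
    """Exact number of length-n words accepted by a complete DFA.

    dfa = (states, alphabet, delta, start, accepting), with
    delta[(state, symbol)] -> state.  The count is exact (big integers),
    so no random trace generation is involved.
    """
    states, alphabet, delta, start, accepting = dfa
    counts = {s: 0 for s in states}
    counts[start] = 1
    for _ in range(n):
        nxt = {s: 0 for s in states}
        for (s, a), t in delta.items():
            nxt[t] += counts[s]
        counts = nxt
    return sum(counts[s] for s in accepting)

# Toy DFA over {a, b} accepting words with an even number of b's.
states = ["even", "odd"]
delta = {("even", "a"): "even", ("even", "b"): "odd",
         ("odd", "a"): "odd", ("odd", "b"): "even"}
dfa = (states, {"a", "b"}, delta, "even", {"even"})
print(count_accepted(dfa, 10))   # 512 of the 1024 length-10 words
```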
We consider the constrained sampling problem in which the goal is to sample from a distribution $\pi(x)\propto e^{-f(x)}$ with $x$ constrained to a convex body $\mathcal{C}\subset \mathbb{R}^d$. Motivated by penalty methods from optimization, we propose penalized Langevin Dynamics (PLD) and penalized Hamiltonian Monte Carlo (PHMC), which convert the constrained sampling problem into an unconstrained one by introducing a penalty function for constraint violations. When $f$ is smooth and the gradient is available, we show $\tilde{\mathcal{O}}(d/\varepsilon^{10})$ iteration complexity for PLD to sample the target up to an $\varepsilon$-error, where the error is measured in the total variation distance and $\tilde{\mathcal{O}}(\cdot)$ hides logarithmic factors. For PHMC, we improve this result to $\tilde{\mathcal{O}}(\sqrt{d}/\varepsilon^{7})$ when the Hessian of $f$ is Lipschitz and the boundary of $\mathcal{C}$ is sufficiently smooth. To our knowledge, these are the first convergence rate results for Hamiltonian Monte Carlo methods in the constrained sampling setting that can handle non-convex $f$, and they provide guarantees with the best dimension dependence among existing methods with deterministic gradients. We then consider the setting where unbiased stochastic gradients are available, and propose PSGLD and PSGHMC, which can handle stochastic gradients without Metropolis-Hastings correction steps. When $f$ is strongly convex and smooth, we obtain iteration complexities of $\tilde{\mathcal{O}}(d/\varepsilon^{18})$ and $\tilde{\mathcal{O}}(d\sqrt{d}/\varepsilon^{39})$, respectively, in the 2-Wasserstein distance. For the more general case, when $f$ is smooth and non-convex, we also provide finite-time performance bounds and iteration complexity results. Finally, we test our algorithms on Bayesian LASSO regression and Bayesian constrained deep learning problems.
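A minimal sketch of a penalized-Langevin-style update is shown below for the special case of a Euclidean-ball constraint with a quadratic distance penalty; the constraint, penalty strength, and step size are illustrative assumptions rather than the paper's exact PLD specification.

```python
import numpy as np

def penalized_langevin(grad_f, d, radius=1.0, penalty=100.0, step=1e-3,
                       n_iter=50_000, rng=None):
    """Unadjusted Langevin dynamics targeting exp(-(f + penalty * dist^2 / 2))
    for the ball constraint ||x|| <= radius (a simple stand-in for a convex body).

    Constraint violations are discouraged by the quadratic distance penalty
    rather than by projection or rejection.
    """
    rng = np.random.default_rng(rng)
    x = np.zeros(d)
    samples = np.empty((n_iter, d))
    for k in range(n_iter):
        norm = np.linalg.norm(x)
        viol = max(norm - radius, 0.0)
        grad_pen = penalty * viol * (x / norm) if viol > 0 else np.zeros(d)
        x = x - step * (grad_f(x) + grad_pen) + np.sqrt(2 * step) * rng.standard_normal(d)
        samples[k] = x
    return samples

# Example: sample x ~ exp(-||x||^2 / 2) restricted (approximately) to ||x|| <= 1.
s = penalized_langevin(grad_f=lambda x: x, d=2, rng=0)
print(np.mean(np.linalg.norm(s, axis=1) <= 1.0))   # most mass lies inside the ball
```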
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
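The re-weighting described above can be sketched directly from the stated formula: per-class weights proportional to the inverse effective number of samples $(1-\beta^{n})/(1-\beta)$, here normalized to sum to the number of classes and applied to a softmax cross-entropy loss; the normalization and the particular loss are illustrative choices rather than the only ones considered in the paper.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights proportional to the inverse effective number of samples,
    E_n = (1 - beta^n) / (1 - beta), normalized to sum to the number of classes."""
    n = np.asarray(samples_per_class, dtype=float)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum() * len(n)

def class_balanced_cross_entropy(logits, labels, samples_per_class, beta=0.999):
    """Softmax cross-entropy re-weighted by the class-balanced weights."""
    w = class_balanced_weights(samples_per_class, beta)
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return np.mean(w[labels] * per_sample)

# Example: a long-tailed three-class problem.
print(class_balanced_weights([5000, 500, 50], beta=0.999))
```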