In the recent COVID-19 pandemic, a wide range of epidemiological modelling approaches have been used to predict the effective reproduction number, R(t), and other COVID-19 related measures such as the daily rate of exponential growth, r(t). These candidate models use different modelling approaches or differing assumptions about spatial or age mixing, and some capture genuine uncertainty in scientific understanding of disease dynamics. Combining estimates from multiple candidate models using appropriate statistical methodology is important to better understand the variation of these outcome measures and to help inform decision making. In this paper, we combine these estimates for specific UK nations and regions using random-effects meta-analysis techniques, utilising the restricted maximum likelihood (REML) method to estimate the heterogeneity variance parameter, and two approaches to calculate the confidence interval for the combined estimate: the standard Wald-type interval and the Knapp and Hartung (KNHA) method. As estimates in this setting are derived using model predictions, each with varying degrees of uncertainty, equal weighting is favoured over the more standard inverse-variance weighting in order to avoid potential up-weighting of models that provide estimates with lower levels of uncertainty but do not fully account for inherent uncertainties. Utilising these meta-analysis techniques has allowed statistically robust combined estimates to be calculated for key COVID-19 outcome measures. This in turn allows timely and informed decision making based on all of the available information.
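The equal-weighted random-effects combination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `reml_tau2` and `combine_equal_weights` are hypothetical, the heterogeneity variance is obtained with the usual REML fixed-point iteration truncated at zero, only the Wald-type interval is shown (the KNHA adjustment is omitted), and the input numbers in the usage note are made up.

```python
import math


def reml_tau2(y, v, tol=1e-10, max_iter=1000):
    """REML estimate of the heterogeneity variance via fixed-point iteration.

    y: per-model point estimates; v: their within-model variances.
    """
    tau2 = 0.0
    for _ in range(max_iter):
        w = [1.0 / (vi + tau2) for vi in v]          # inverse-variance weights
        sw = sum(w)
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sw2 = sum(wi * wi for wi in w)
        new = sum(wi * wi * ((yi - mu) ** 2 - vi)
                  for wi, yi, vi in zip(w, y, v)) / sw2 + 1.0 / sw
        new = max(0.0, new)                          # variance cannot be negative
        if abs(new - tau2) < tol:
            return new
        tau2 = new
    return tau2


def combine_equal_weights(y, se, z=1.96):
    """Equal-weighted combined estimate with a Wald-type interval.

    Returns (estimate, lower, upper, tau2). Each model gets weight 1/k, so
    Var(estimate) = sum(v_i + tau2) / k**2.
    """
    k = len(y)
    v = [s * s for s in se]
    tau2 = reml_tau2(y, v)
    mu = sum(y) / k
    var_mu = sum(vi + tau2 for vi in v) / k ** 2
    half = z * math.sqrt(var_mu)
    return mu, mu - half, mu + half, tau2
```

For example, `combine_equal_weights([1.0, 1.2, 0.9, 1.1], [0.05, 0.1, 0.08, 0.2])` returns the simple mean of the four estimates together with an interval whose width reflects both the reported standard errors and the estimated between-model heterogeneity. Note that the REML step still uses inverse-variance weights internally to estimate the heterogeneity; only the final pooling is equal-weighted.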
Motivated by the case fatality rate (CFR) of COVID-19, in this paper, we develop a fully parametric quantile regression model based on the generalized three-parameter beta (GB3) distribution. Beta regression models are primarily used to model rates and proportions. However, these models are usually specified in terms of a conditional mean. Therefore, they may be inadequate if the observed response variable follows an asymmetrical distribution, such as CFR data. In addition, beta regression models do not consider the effect of the covariates across the spectrum of the dependent variable, which is possible through the conditional quantile approach. In order to introduce the proposed GB3 regression model, we first reparameterize the GB3 distribution by inserting a quantile parameter and then we develop the new proposed quantile model. We also propose a simple interpretation of the predictor-response relationship in terms of percentage increases/decreases of the quantile. A Monte Carlo study is carried out for evaluating the performance of the maximum likelihood estimates and the choice of the link functions. Finally, a real COVID-19 dataset from Chile is analyzed and discussed to illustrate the proposed approach.
We introduce a minimalist outbreak forecasting model that combines data-driven parameter estimation with variational data assimilation. By focusing on the fundamental components of nonlinear disease transmission and representing data in a domain where model stochasticity simplifies into a process with independent increments, we design an approach that only requires four core parameters to be estimated. We illustrate this novel methodology on COVID-19 forecasts. Results include case count and deaths predictions for the US and all of its 50 states, the District of Columbia, and Puerto Rico. The method is computationally efficient and is not disease- or location-specific. It may therefore be applied to other outbreaks or other countries, provided case counts and/or deaths data are available.
Spatial process models popular in geostatistics often represent the observed data as the sum of a smooth underlying process and white noise. The variation in the white noise is attributed to measurement error, or micro-scale variability, and is called the "nugget". We formally establish results on the identifiability and consistency of the nugget in spatial models based upon the Gaussian process within the framework of in-fill asymptotics, i.e., the sample size increases within a sampling domain that is bounded. Our work extends results in fixed domain asymptotics for spatial models without the nugget. More specifically, we establish the identifiability of parameters in the Matérn covariance function and the consistency of their maximum likelihood estimators in the presence of discontinuities due to the nugget. We also present simulation studies to demonstrate the role of the identifiable quantities in spatial interpolation.
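The discontinuity that the nugget introduces at distance zero can be seen in a short sketch. It assumes, purely for illustration, a Matérn covariance with smoothness fixed at 3/2 (which has a closed form) and one-dimensional locations; the names `sigma2`, `rho`, and `nugget` are hypothetical, and the parameterization used in the work itself may differ.

```python
import math


def matern32(d, sigma2, rho):
    """Matern covariance with smoothness 3/2 at distance d >= 0."""
    a = math.sqrt(3.0) * d / rho
    return sigma2 * (1.0 + a) * math.exp(-a)


def cov_with_nugget(d, sigma2, rho, nugget):
    """Covariance of the observed process: smooth part plus white noise.

    The nugget contributes only at d == 0, so the function jumps at the origin.
    """
    return matern32(d, sigma2, rho) + (nugget if d == 0.0 else 0.0)


def cov_matrix(points, sigma2, rho, nugget):
    """Covariance matrix of observations at 1-D locations `points`.

    The nugget is added on the diagonal only (independent noise per observation).
    """
    n = len(points)
    return [
        [
            matern32(abs(points[i] - points[j]), sigma2, rho)
            + (nugget if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
```

With `sigma2=2.0` and `nugget=0.5`, the covariance is 2.5 at distance zero but approaches 2.0 as the distance shrinks towards zero, which is the jump that in-fill sampling has to resolve.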
Estimating the prevalence of a medical condition, or the proportion of the population in which it occurs, is a fundamental problem in healthcare and public health. Accurate estimates of the relative prevalence across groups -- capturing, for example, that a condition affects women more frequently than men -- facilitate effective and equitable health policy which prioritizes groups who are disproportionately affected by a condition. However, it is difficult to estimate relative prevalence when a medical condition is underreported. In this work, we provide a method for accurately estimating the relative prevalence of underreported medical conditions, building upon the positive unlabeled learning framework. We show that under the commonly made covariate shift assumption -- i.e., that the probability of having a disease conditional on symptoms remains constant across groups -- we can recover the relative prevalence, even without restrictive assumptions commonly made in positive unlabeled learning and even if it is impossible to recover the absolute prevalence. We provide a suite of experiments on synthetic and real health data that demonstrate our method's ability to recover the relative prevalence more accurately than do baselines, and the method's robustness to plausible violations of the covariate shift assumption.
We propose a framework for estimation and inference when the model may be misspecified. We rely on a local asymptotic approach where the degree of misspecification is indexed by the sample size. We construct estimators whose mean squared error is minimax in a neighborhood of the reference model, based on one-step adjustments. In addition, we provide confidence intervals that contain the true parameter under local misspecification. As a tool to interpret the degree of misspecification, we map it to the local power of a specification test of the reference model. Our approach allows for systematic sensitivity analysis when the parameter of interest may be partially or irregularly identified. As illustrations, we study three applications: an empirical analysis of the impact of conditional cash transfers in Mexico where misspecification stems from the presence of stigma effects of the program, a cross-sectional binary choice model where the error distribution is misspecified, and a dynamic panel data binary choice model where the number of time periods is small and the distribution of individual effects is misspecified.
We develop a post-selective Bayesian framework to jointly and consistently estimate parameters in group-sparse linear regression models. After selection with the Group LASSO (or generalized variants such as the overlapping, sparse, or standardized Group LASSO), uncertainty estimates for the selected parameters are unreliable in the absence of adjustments for selection bias. Existing post-selective approaches are limited to uncertainty estimation for (i) real-valued projections onto very specific selected subspaces for the group-sparse problem, and (ii) selection events categorized broadly as polyhedral events that are expressible as linear inequalities in the data variables. Our Bayesian methods address these gaps by deriving a likelihood adjustment factor, and an approximation thereof, that eliminates bias from selection. Experiments on simulated data and on data from the Human Connectome Project demonstrate that, at a very nominal price for this adjustment, our methods are effective for joint estimation of group-sparse parameters and their uncertainties post selection.
We present a means of formulating and solving the well-known structure-and-motion problem in computer vision with probabilistic graphical models (PGMs). We model the unknown camera poses and 3D feature coordinates as well as the observed 2D projections as Gaussian random variables, using sigma point parameterizations to effectively linearize the nonlinear relationships between these variables. Those variables involved in every projection are grouped into a cluster, and we connect the clusters in a cluster graph. Loopy belief propagation is performed over this graph, in an iterative re-initialization and estimation procedure, and we find that our approach shows promise both in simulation and on real-world data. The PGM is easily extendable to include additional parameters or constraints.
Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. We explore this tradeoff for an estimator based on a shallow NN by means of non-asymptotic error bounds, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. The bounds reveal the tension between the NN size and the number of samples, and make it possible to characterize scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators with a slightly different NN growth-rate are near minimax rate-optimal, achieving the parametric convergence rate up to logarithmic factors.
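For orientation, the kind of variational form that such neural estimators parametrize can be written out. The display below uses the standard convex-conjugate representation of an $\mathsf{f}$-divergence; the notation ($g$, $\mathcal{G}_k$, $X_i \sim P$, $Y_i \sim Q$, the conjugate $f^*$) is introduced here as an assumption, and the exact normalization used in the work itself may differ.

```latex
% Variational (convex-conjugate) form of an f-divergence
D_{\mathsf{f}}(P \,\|\, Q)
  = \sup_{g} \; \mathbb{E}_{P}\big[g(X)\big] - \mathbb{E}_{Q}\big[f^{*}(g(Y))\big],
\qquad
f^{*}(t) = \sup_{u} \{\, tu - f(u) \,\}.

% Neural estimator: restrict g to a class G_k of shallow NNs with k neurons
% and replace expectations by empirical means over n samples from each of P and Q
\hat{D}_{k,n}
  = \sup_{g \in \mathcal{G}_k} \;
    \frac{1}{n}\sum_{i=1}^{n} g(X_i) - \frac{1}{n}\sum_{i=1}^{n} f^{*}\big(g(Y_i)\big).

% Example: Kullback-Leibler, f(u) = u \log u, so f^{*}(t) = e^{t-1}
```

In this form the two error sources are visible separately: restricting the supremum to $\mathcal{G}_k$ gives the approximation error, which shrinks as $k$ grows, while replacing expectations by sample means gives the empirical estimation error, which grows with the complexity of $\mathcal{G}_k$ for fixed $n$.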
This paper studies the non-asymptotic merits of double $\ell_1$-regularization for heterogeneous overdispersed count data via negative binomial regressions. Under the restricted eigenvalue conditions, we prove, for the first time, oracle inequalities for the Lasso estimators of the two partial regression coefficients, using concentration inequalities of empirical processes. Furthermore, the consistency and convergence rates of the estimators, derived from the oracle inequalities, provide theoretical guarantees for further statistical inference. Finally, both simulations and a real data analysis demonstrate that the new methods are effective.
Within the vast body of statistical theory developed for binary classification, few meaningful results exist for imbalanced classification, in which data are dominated by samples from one of the two classes. Existing theory faces at least two main challenges. First, meaningful results must consider more complex performance measures than classification accuracy. To address this, we characterize a novel generalization of the Bayes-optimal classifier to any performance metric computed from the confusion matrix, and we use this to show how relative performance guarantees can be obtained in terms of the error of estimating the class probability function under uniform ($\mathcal{L}_\infty$) loss. Second, as we show, optimal classification performance depends on certain properties of class imbalance that have not previously been formalized. Specifically, we propose a novel sub-type of class imbalance, which we call Uniform Class Imbalance. We analyze how Uniform Class Imbalance influences optimal classifier performance and show that it necessitates different classifier behavior than other types of class imbalance. We further illustrate these two contributions in the case of $k$-nearest neighbor classification, for which we develop novel guarantees. Together, these results provide some of the first meaningful finite-sample statistical theory for imbalanced binary classification.
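Why accuracy-oriented rules can be a poor fit for imbalanced data can be illustrated with a small sketch. It assumes, as is known for the F1 score, that a good rule thresholds the class probability, and it simply searches for the threshold maximizing a metric computed from the confusion matrix. The function names and the toy data in the usage note are hypothetical; this is not the generalized Bayes-optimal classifier characterized in the work itself.

```python
def confusion(eta, labels, t):
    """Confusion-matrix counts (tp, fp, fn, tn) when predicting 1 iff eta >= t."""
    tp = fp = fn = tn = 0
    for p, y in zip(eta, labels):
        pred = 1 if p >= t else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1(tp, fp, fn, tn):
    """F1 score from confusion-matrix counts (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(eta, labels, metric=f1):
    """Threshold among the observed probabilities that maximizes `metric`."""
    candidates = sorted(set(eta))
    return max(candidates, key=lambda t: metric(*confusion(eta, labels, t)))
```

On the toy input `eta = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6]` with `labels = [0, 0, 0, 1, 0, 1]`, the F1-maximizing threshold is 0.3 rather than the accuracy-oriented 0.5: at 0.3 both positives are recovered at the cost of one false positive, whereas at 0.5 one of the two positives is missed.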