Where the response variable in a big data set is consistent with the variable of interest for small area estimation, the big data by itself can provide the estimates for small areas. These estimates are often subject to the coverage and measurement error bias inherited from the big data. However, if a probability survey of the same variable of interest is available, the survey data can be used as a training data set to develop an algorithm to impute for the data missed by the big data and to adjust for measurement errors. In this paper, we outline a methodology for such imputations based on a kNN algorithm calibrated to an asymptotically design-unbiased estimate of the national total, and we illustrate the use of a training data set to estimate the imputation bias and the fixed-asymptotic bootstrap to estimate the variance of the small area hybrid estimator. We illustrate the methodology of this paper using a public use data set and use it to compare the accuracy and precision of our hybrid estimator with the Fay-Herriot (FH) estimator. Finally, we also examine numerically the accuracy and precision of the FH estimator when the auxiliary variables used in the linking models are subject to under-coverage errors.
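To make the imputation step concrete, the following is a minimal sketch, assuming simulated data and hypothetical variable names, of training a kNN model on the probability survey and imputing the study variable for population units missed by the big data source; the calibration to an asymptotically design-unbiased estimate of the national total and the small-area aggregation described in the paper are not shown.

```python
# Illustrative sketch only; all data, names, and the neighbor count are assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_survey = rng.normal(size=(500, 3))           # auxiliary variables for survey units
y_survey = X_survey @ [1.0, -0.5, 2.0] + rng.normal(size=500)

y_big = rng.normal(loc=1.0, size=8000)         # responses observed in the big data
X_missed = rng.normal(size=(2000, 3))          # units not covered by the big data

knn = KNeighborsRegressor(n_neighbors=10).fit(X_survey, y_survey)
y_imputed = knn.predict(X_missed)              # impute the under-covered units

# A naive hybrid total: observed big-data values plus kNN imputations for the
# missed units (before any bias correction or calibration).
hybrid_total = y_big.sum() + y_imputed.sum()
```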
Neural point estimators are neural networks that map data to parameter point estimates. They are fast, likelihood-free and, due to their amortised nature, amenable to fast bootstrap-based uncertainty quantification. In this paper, we aim to raise statisticians' awareness of this relatively new inferential tool, and to facilitate its adoption by providing user-friendly open-source software. We also give attention to the ubiquitous problem of making inference from replicated data, which we address in the neural setting using permutation-invariant neural networks. Through extensive simulation studies we show that these neural point estimators can quickly and optimally (in a Bayes sense) estimate parameters in weakly-identified and highly-parameterised models. We demonstrate their applicability through an analysis of extreme sea-surface temperature in the Red Sea where, after training, we obtain parameter estimates and bootstrap-based confidence intervals from hundreds of spatial fields in a fraction of a second.
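As an illustration of the permutation-invariant construction mentioned above, here is a generic DeepSets-style sketch (not the authors' software; the architecture, dimensions, and pooling choice are assumptions): replicate-wise encodings are pooled with a symmetric function such as the mean before being mapped to a parameter estimate.

```python
# Illustrative permutation-invariant neural point estimator; not the paper's software.
import torch
import torch.nn as nn

class NeuralPointEstimator(nn.Module):
    def __init__(self, data_dim, param_dim, width=64):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(data_dim, width), nn.ReLU(),
                                 nn.Linear(width, width), nn.ReLU())
        self.phi = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                 nn.Linear(width, param_dim))

    def forward(self, replicates):                   # replicates: (n_replicates, data_dim)
        summary = self.psi(replicates).mean(dim=0)   # permutation-invariant pooling
        return self.phi(summary)

# Example: estimate a 2-parameter model from 30 replicates of 10-dimensional data.
estimator = NeuralPointEstimator(data_dim=10, param_dim=2)
theta_hat = estimator(torch.randn(30, 10))
# Training would minimize an empirical Bayes risk, e.g. the mean absolute error
# between network outputs and the parameters used to simulate the training data.
```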
Two-component mixture models have proved to be a powerful tool for modeling heterogeneity in several cluster analysis contexts. However, most methods based on these models assume a constant behavior for the mixture weights, which can be restrictive and unsuitable for some applications. In this paper, we relax this assumption and allow the mixture weights to vary according to the index (e.g., time), making the model adaptive to a broader range of data sets. We propose an efficient MCMC algorithm to jointly estimate both the component parameters and the dynamic weights from their posterior samples. We evaluate the method's performance by running Monte Carlo simulation studies under different scenarios for the dynamic weights. In addition, we apply the algorithm to a time series that records the level reached by a river in southern Brazil. The Taquari River is a water body whose frequent flood inundations have caused considerable damage to riverside communities. Implementing a dynamic mixture model allows us to properly describe the flood regimes for the areas most affected by these phenomena.
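For intuition, the following small simulation (not the authors' MCMC sampler; the logistic form of the weight path and all numerical values are assumptions) generates data from a two-component mixture whose weight varies with the time index.

```python
# Illustrative data-generating sketch for a dynamic two-component mixture.
import numpy as np

rng = np.random.default_rng(1)
T = 300
t = np.arange(T)
w_t = 1.0 / (1.0 + np.exp(-(t - T / 2) / 30))   # dynamic weight of component 2
z = rng.binomial(1, w_t)                        # latent component memberships
y = np.where(z == 0,
             rng.normal(0.0, 1.0, size=T),      # component 1 draws
             rng.normal(5.0, 2.0, size=T))      # component 2 draws
# An MCMC algorithm would jointly recover z, w_t, and the component parameters from y.
```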
Predictive power and generalizability of models depend on the quality of the features selected in the model. Machine learning (ML) models in banks consider a large number of features which are often correlated or dependent. Incorporating these features may hinder model stability, and prior feature screening can improve the long-term performance of the models. A Markov boundary (MB) of features is the minimal set of features such that, given the boundary, no other potential predictor affects the target, while maximal predictive accuracy is preserved. Identifying the Markov boundary is straightforward under assumptions of Gaussianity on the features and linear relationships between them. This paper outlines common problems associated with identifying the Markov boundary in structured data when relationships are non-linear and predictors are of mixed data type. We propose a multi-group forward-backward selection strategy that handles continuous features and addresses some of the issues with MB identification in a mixed data setup, and we demonstrate its capabilities on simulated and real datasets.
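The general flavour of a forward-backward Markov boundary search can be sketched as follows. This is a simplified, single-pass illustration in which a nested-linear-model F-test stands in for the non-linear, mixed-data conditional independence assessments that are the focus of the paper; the function names and significance level are assumptions.

```python
# Schematic forward-backward MB search; illustrative only.
import numpy as np
import statsmodels.api as sm

def ci_pvalue(y, X, candidate, selected):
    """p-value for 'candidate independent of y given selected' via a nested F-test."""
    base_exog = sm.add_constant(X[:, selected]) if selected else np.ones((len(y), 1))
    full_exog = sm.add_constant(X[:, selected + [candidate]])
    base = sm.OLS(y, base_exog).fit()
    full = sm.OLS(y, full_exog).fit()
    return full.compare_f_test(base)[1]

def markov_boundary(y, X, alpha=0.05):
    selected = []
    # Forward phase: add features that remain associated with y given the current set.
    for j in range(X.shape[1]):
        if ci_pvalue(y, X, j, selected) < alpha:
            selected.append(j)
    # Backward phase: drop features made redundant by later additions.
    for j in list(selected):
        rest = [k for k in selected if k != j]
        if ci_pvalue(y, X, j, rest) >= alpha:
            selected.remove(j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(size=300)
print(markov_boundary(y, X))                    # expected to recover features 0 and 1
```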
Understanding how and why certain communities bear a disproportionate burden of disease is challenging due to the scarcity of data on these communities. Surveys provide a useful avenue for accessing hard-to-reach populations, as many surveys specifically oversample understudied and vulnerable populations. When survey data is used for analysis, it is important to account for the complex survey design that gave rise to the data, in order to avoid biased conclusions. The field of Bayesian survey statistics aims to account for such survey design while leveraging the advantages of Bayesian models, which can flexibly handle sparsity through borrowing of information and provide a coherent inferential framework to easily obtain variances for complex models and data types. For these reasons, Bayesian survey methods seem uniquely well-poised for health disparities research, where heterogeneity and sparsity are frequent considerations. This review discusses three main approaches found in the Bayesian survey methodology literature: 1) multilevel regression and post-stratification, 2) weighted pseudolikelihood-based methods, and 3) synthetic population generation. We discuss advantages and disadvantages of each approach, examine recent applications and extensions, and consider how these approaches may be leveraged to improve research in population health equity.
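As a concrete illustration of approach 2), and only as a generic sketch rather than any specific method reviewed here, the weighted pseudolikelihood idea exponentiates each sampled unit's likelihood contribution by its (suitably normalized) survey weight, yielding a pseudo-posterior of the form $\pi(\theta \mid y) \propto p(\theta) \prod_{i \in s} p(y_i \mid \theta)^{w_i}$, where $s$ denotes the set of sampled units and $w_i$ their survey weights; the weighting is intended to make the observed sample mimic the target population.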
Gaussian graphical models are nowadays commonly applied to the comparison of groups sharing the same variables, by jointly learning their independence structures. We consider the case where there are exactly two dependent groups and the association structure is represented by a family of coloured Gaussian graphical models suited to deal with paired data problems. To learn the two dependent graphs, together with their across-graph association structure, we implement a fused graphical lasso penalty. We carry out a comprehensive analysis of this approach, with special attention to the role played by some relevant submodel classes. In this way, we provide a broad set of tools for the application of Gaussian graphical models to paired data problems. These include results useful for the specification of penalty values in order to obtain a path of lasso solutions, and an ADMM algorithm that solves the fused graphical lasso optimization problem. Finally, we present an application of our method to cancer genomics, where it is of interest to compare cancer cells with a control sample from histologically normal tissues adjacent to the tumor. All the methods described in this article are implemented in the $\texttt{R}$ package $\texttt{pdglasso}$, available at https://github.com/savranciati/pdglasso.
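For reference, and not as the paper's exact objective, a fused graphical lasso for two precision matrices $\Theta^{(1)}, \Theta^{(2)}$ with sample covariance matrices $S^{(1)}, S^{(2)}$ is commonly written as minimizing $\sum_{k=1}^{2}\big[\mathrm{tr}(S^{(k)}\Theta^{(k)}) - \log\det\Theta^{(k)}\big] + \lambda_1 \sum_{k}\sum_{i\neq j}|\theta^{(k)}_{ij}| + \lambda_2 \sum_{i,j}|\theta^{(1)}_{ij} - \theta^{(2)}_{ij}|$ over positive definite matrices, where the $\lambda_2$ term fuses corresponding entries across the two groups; the present paper adapts a penalty of this type to the coloured Gaussian graphical models suited to paired data, in which the two groups are dependent.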
Machine learning models often need to be robust to noisy input data. The effect of real-world noise (which is often random) on model predictions is captured by a model's local robustness, i.e., the consistency of model predictions in a local region around an input. However, the na\"ive approach to computing local robustness based on Monte-Carlo sampling is statistically inefficient, leading to prohibitive computational costs for large-scale applications. In this work, we develop the first analytical estimators to efficiently compute local robustness of multi-class discriminative models using local linear function approximation and the multivariate Normal CDF. Through the derivation of these estimators, we show how local robustness is connected to concepts such as randomized smoothing and softmax probability. We also confirm empirically that these estimators accurately and efficiently compute the local robustness of standard deep learning models. In addition, we demonstrate these estimators' usefulness for various tasks involving local robustness, such as measuring robustness bias and identifying examples that are vulnerable to noise perturbation in a dataset. By developing these analytical estimators, this work not only advances conceptual understanding of local robustness, but also makes its computation practical, enabling the use of local robustness in critical downstream applications.
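To make the idea of an analytical estimator concrete, here is a hedged sketch (with an assumed finite-difference Jacobian, illustrative function names, and isotropic Gaussian noise; not the paper's exact estimator): linearizing the logits around the input makes the class margins jointly Gaussian under Gaussian input noise, so the probability that the prediction is unchanged is a multivariate Normal CDF.

```python
# Illustrative local-robustness sketch via local linearization and the Gaussian CDF.
import numpy as np
from scipy.stats import multivariate_normal

def local_robustness(f, x, sigma, eps=1e-4):
    """Approximate P(argmax f(x + noise) == argmax f(x)) for noise ~ N(0, sigma^2 I)."""
    logits = f(x)
    c = int(np.argmax(logits))
    d, k = x.size, logits.size
    # Finite-difference Jacobian of the logits at x (one column per input coordinate).
    J = np.stack([(f(x + eps * np.eye(d)[i]) - logits) / eps for i in range(d)], axis=1)
    others = [j for j in range(k) if j != c]
    A = J[c] - J[others]                        # gradients of the class margins
    mu = logits[c] - logits[others]             # margins at x
    cov = sigma**2 * (A @ A.T)                  # covariance of the margin perturbations
    # P(all margins stay positive) is an orthant probability of a Gaussian.
    return multivariate_normal(mean=np.zeros(k - 1), cov=cov, allow_singular=True).cdf(mu)

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
f = lambda x: W @ x                             # toy linear 3-class "model"
print(local_robustness(f, np.array([2.0, 0.5]), sigma=0.5))
```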
In this paper, we propose the Minimum Regularized Covariance Trace (MRCT) estimator, a novel method for robust covariance estimation and functional outlier detection. The MRCT estimator employs a subset-based approach that prioritizes subsets exhibiting greater centrality based on a generalization of the Mahalanobis distance, resulting in a fast-MCD-type algorithm. Notably, the MRCT estimator handles high-dimensional data sets without the need for preprocessing or dimension reduction techniques, due to internal smoothing whose amount is determined by the regularization parameter $\alpha > 0$. The selection of the regularization parameter $\alpha$ is automated. The proposed method adapts seamlessly to sparsely observed data by working directly with the finite matrix of basis coefficients. An extensive simulation study demonstrates the efficacy of the MRCT estimator in terms of robust covariance estimation and automated outlier detection, emphasizing the balance between noise exclusion and signal preservation achieved through an appropriate selection of $\alpha$. The method converges quickly in practice and performs favorably when compared to other functional outlier detection methods.
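The subset-selection mechanism can be sketched, in a simplified form, as a fast-MCD-style concentration step that repeatedly keeps the $h$ observations most central under a regularized covariance; the regularization value, subset size, and fixed iteration count below are assumptions for illustration, not the paper's trace-based criterion operating on basis coefficients.

```python
# Simplified concentration-step sketch; illustrative only.
import numpy as np

def concentration_steps(X, h, alpha=0.1, n_iter=20, seed=0):
    """Iteratively keep the h most central rows of X under a regularized covariance."""
    n, p = X.shape
    subset = np.random.default_rng(seed).choice(n, size=h, replace=False)
    for _ in range(n_iter):
        mu = X[subset].mean(axis=0)
        S = np.cov(X[subset], rowvar=False) + alpha * np.eye(p)              # regularized covariance
        d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(S), X - mu)      # squared distances
        subset = np.argsort(d2)[:h]             # concentration: keep the most central points
    mu = X[subset].mean(axis=0)
    S = np.cov(X[subset], rowvar=False) + alpha * np.eye(p)
    return mu, S, subset
```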
Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and noise level $\sigma^2$, i.e., to quantify how much variation in the response variable can be explained by the covariates, versus how much of the variation is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given with no responses $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, while linear regression is used, no sparsity assumptions on the coefficients are made. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data, whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate the optimal improvement. In the second framework, such optimality no longer holds and we suggest two zero-estimators that improve the naive estimator, although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
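To convey the mechanism behind the zero-estimator improvement in its simplest form (this is a generic variance-reduction identity, not the thesis's exact construction): if $\hat{\theta}$ is an initial estimator and $Z$ is a zero-estimator, i.e., a statistic of the unlabeled data with $E[Z]=0$, then $\hat{\theta}_c = \hat{\theta} - cZ$ has the same expectation as $\hat{\theta}$ for every $c$, and the variance-minimizing choice $c^{*} = \mathrm{Cov}(\hat{\theta}, Z)/\mathrm{Var}(Z)$ gives $\mathrm{Var}(\hat{\theta}_{c^{*}}) = \mathrm{Var}(\hat{\theta})(1 - \rho^{2})$, where $\rho = \mathrm{Corr}(\hat{\theta}, Z)$. The gain is therefore largest when a zero-estimator highly correlated with the initial estimator can be built from the unlabeled covariates.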
Randomized controlled trials (RCTs) are a cornerstone of comparative effectiveness because they remove the confounding bias present in observational studies. However, RCTs are typically much smaller than observational studies because of financial and ethical considerations. Therefore, it is of great interest to be able to incorporate plentiful observational data into the analysis of smaller RCTs. Previous estimators developed for this purpose rely on unrealistic additional assumptions, without which the added data can bias the effect estimate. Recent work proposed an alternative method (prognostic adjustment) that imposes no additional assumptions and increases efficiency in the analysis of RCTs. The idea is to use the observational data to learn a prognostic model: a regression of the outcome onto the covariates. The predictions from this model, generated from the RCT subjects' baseline variables, are used as a covariate in a linear model. In this work, we extend this framework to the setting where inference is conducted with nonparametric efficient estimators in the trial analysis. Using simulations, we find that this approach provides greater power (i.e., smaller standard errors) than analyses without prognostic adjustment, especially when the trial is small. We also find that the method is robust to observed or unobserved shifts between the observational and trial populations and does not introduce bias. Lastly, we showcase this estimator by leveraging real-world historical data in the analysis of a randomized blood transfusion study of trauma patients.
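A hedged sketch of the prognostic-adjustment workflow follows: an outcome regression is fit on historical (observational) data, evaluated at the trial subjects' baseline covariates, and the resulting prognostic score is included as a covariate in the trial analysis. The data are simulated, the learner and the ANCOVA-style linear model are illustrative stand-ins, and the paper's focus is on nonparametric efficient estimators rather than the simple OLS fit shown here.

```python
# Illustrative prognostic-adjustment sketch with simulated data.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
# Historical observational data: covariates W_hist and outcome Y_hist.
W_hist = rng.normal(size=(2000, 5))
Y_hist = W_hist @ [1.0, 2.0, 0.0, 0.0, -1.0] + rng.normal(size=2000)
# Small trial: baseline covariates W, randomized treatment A, outcome Y.
W = rng.normal(size=(200, 5))
A = rng.binomial(1, 0.5, size=200)
Y = W @ [1.0, 2.0, 0.0, 0.0, -1.0] + 0.5 * A + rng.normal(size=200)

prognostic_model = GradientBoostingRegressor().fit(W_hist, Y_hist)
m = prognostic_model.predict(W)                 # prognostic score for trial subjects

design = sm.add_constant(np.column_stack([A, m]))
fit = sm.OLS(Y, design).fit()
print(fit.params[1], fit.bse[1])                # adjusted treatment effect and its SE
```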
The aim of this paper is to develop estimation and inference methods for the drift parameters of multivariate L\'evy-driven continuous-time autoregressive processes of order $p\in\mathbb{N}$. Starting from a continuous-time observation of the process, we develop consistent and asymptotically normal maximum likelihood estimators. We then relax the unrealistic assumption of continuous-time observation by considering natural discretizations based on a combination of Riemann-sum, finite-difference, and thresholding approximations. The resulting estimators are also proven to be consistent and asymptotically normal under a general set of conditions, allowing for both finite and infinite jump activity in the driving L\'evy process. When discretizing the estimators, allowing for irregularly spaced observations is of great practical importance. In this respect, CAR($p$) models are not just relevant for "true" continuous-time processes: a CAR($p$) specification provides a natural continuous-time interpolation for modeling irregularly spaced data, even if the observed process is inherently discrete. As a practically relevant application, we consider the setting where the multivariate observation is known to possess a graphical structure. We refer to such a process as GrCAR and discuss the corresponding drift estimators and their properties. The finite-sample behavior of all theoretical asymptotic results is assessed empirically by extensive simulation experiments.
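To indicate what the discretized estimators look like in the simplest case $p=1$ (a multivariate Ornstein-Uhlenbeck-type CAR(1) process), the sketch below approximates the continuous-time likelihood equations for the drift matrix with finite differences and Riemann sums, and discards increments larger than a threshold as a crude guard against jumps; the threshold rule and all numerical choices are simplifications, not the paper's exact conditions.

```python
# Illustrative discretized drift estimator for a CAR(1)-type process.
import numpy as np

def car1_drift_estimate(X, t, threshold=np.inf):
    """X: (n, d) observations at (possibly irregularly spaced) times t: (n,)."""
    dX = np.diff(X, axis=0)                     # finite-difference increments
    dt = np.diff(t)[:, None]
    keep = np.linalg.norm(dX, axis=1) <= threshold   # thresholding against large jumps
    Xk, dXk, dtk = X[:-1][keep], dX[keep], dt[keep]
    num = dXk.T @ Xk                            # approximates the integral of dX_t X_t^T
    den = (Xk * dtk).T @ Xk                     # Riemann sum for the integral of X_t X_t^T dt
    return num @ np.linalg.inv(den)             # estimate of the drift matrix
```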