We propose a Bayesian approach to estimating finite population means for small areas. The proposed methodology improves on traditional sample survey methods because, unlike those methods, it borrows strength from multiple data sources. Our approach is fundamentally different from existing small area Bayesian approaches to finite population sampling, which typically assume a hierarchical model for all units of the finite population. We assume such a model only for the units of the finite population for which the outcome variable is observed, because for these units the assumed model can be checked using existing statistical tools. Modeling unobserved units of the finite population is challenging because the assumed model cannot be checked in the absence of data on the outcome variable. To make reasonable modeling assumptions, we propose to form several cells for each small area using factors that potentially influence the outcome variable of interest. This strategy is expected to bring some degree of homogeneity within a given cell and also among cells from different small areas that are constructed from the same factor-level combination. Instead of modeling true probabilities for unobserved individual units, we assume that population means of cells with the same combination of factor levels are identical across small areas, and that the population mean of true probabilities for a cell is identical to the mean of true values for the observed units in that cell. We apply the proposed methodology to a real-life COVID-19 survey, linking information from multiple disparate data sources to estimate vaccine-hesitancy rates (proportions) for the 50 US states and Washington, D.C. (small areas). We also provide practical model-selection strategies that can be applied to a wider class of models under similar settings but for a diverse range of scientific problems.
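As a schematic of the two identifying assumptions (the notation here is ours, not the authors'): for a cell $c$ defined by a factor-level combination and a small area $a$,
\[
\bar{Y}_{a,c} \;=\; \mu_c \ \ \text{for all areas } a,
\qquad
\frac{1}{N_{a,c}}\sum_{i \in U_{a,c}} p_i \;=\; \frac{1}{n_{a,c}}\sum_{i \in s_{a,c}} p_i,
\]
where $U_{a,c}$ and $s_{a,c}$ denote the population and observed units of cell $c$ in area $a$, with sizes $N_{a,c}$ and $n_{a,c}$, and $p_i$ is the true probability for unit $i$.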
Topological data analysis (TDA) approaches are becoming increasingly popular for studying dependence patterns in multivariate time series data. In particular, various dependence patterns in brain networks may be linked to specific tasks and cognitive processes, and these patterns can be altered by neurological impairments such as epileptic seizures. Existing TDA approaches build graph filtrations from a notion of distance between data points that is symmetric by definition. For brain dependence networks, this is a major limitation that constrains practitioners to symmetric dependence measures, such as correlation or coherence. However, the brain dependence network can be very complex and may contain a directed flow of information from one brain region to another. Such dependence networks are usually captured by more advanced measures of dependence, such as partial directed coherence, a Granger-causality-based dependence measure. These measures result in a non-symmetric distance function, especially during epileptic seizures. In this paper we propose to address this limitation by decomposing the weighted connectivity network into its symmetric and anti-symmetric components via matrix decomposition and comparing the anti-symmetric component before and after the seizure. Our analysis of epileptic seizure EEG data shows promising results.
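The decomposition step itself is elementary; below is a minimal numpy sketch, in which the example matrices and the Frobenius-norm comparison are our illustration rather than the authors' exact pipeline:

```python
import numpy as np

def decompose(W):
    """Split a weighted connectivity matrix into its symmetric and
    anti-symmetric parts: W = (W + W.T)/2 + (W - W.T)/2."""
    return (W + W.T) / 2.0, (W - W.T) / 2.0

# Hypothetical pre- and post-seizure directed connectivity matrices
# (e.g., partial directed coherence estimates); the norm comparison
# below is illustrative, not necessarily the authors' statistic.
rng = np.random.default_rng(0)
W_pre = rng.random((10, 10))
W_post = rng.random((10, 10))
_, A_pre = decompose(W_pre)
_, A_post = decompose(W_post)
print(np.linalg.norm(A_post - A_pre, ord="fro"))
```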
Pini and Vantini (2017) introduced the interval-wise testing procedure, which performs local inference for functional data defined on an interval domain and outputs an adjusted p-value function that controls the type I error. We extend this idea to a general setting in which the domain is a Riemannian manifold. This requires new methodology, such as how to define adjustment sets on product manifolds and how to approximate the test statistic when the domain has non-zero curvature. We propose to use permutation tests for inference and apply the procedure in three settings: a simulation on a "chameleon-shaped" manifold and two applications related to climate change in which the manifolds are a complex subset of $S^2$ and $S^2 \times S^1$, respectively. We note the trade-off between type I and type II errors: enlarging the adjustment sets reduces the type I error but also results in smaller areas of significance. However, some areas remain significant even at maximal adjustment.
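Schematically, and in our own notation, the adjusted p-value function takes the form
\[
\tilde{p}(x) \;=\; \sup_{A \in \mathcal{A}:\, x \in A} p_A, \qquad x \in \mathcal{M},
\]
where $p_A$ is the permutation p-value of the test restricted to the adjustment set $A$ and $\mathcal{A}$ is the family of adjustment sets on the manifold $\mathcal{M}$; how $\mathcal{A}$ is constructed on product manifolds is part of the proposed methodology.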
Moment restrictions and their conditional counterparts emerge in many areas of machine learning and statistics, ranging from causal inference to reinforcement learning. Estimators for these tasks, generally referred to as methods of moments, include the prominent generalized method of moments (GMM), which has recently gained attention in causal inference. GMM is a special case of the broader family of empirical likelihood estimators, which are based on approximating a population distribution by minimizing a $\varphi$-divergence to an empirical distribution. However, the use of $\varphi$-divergences effectively limits the candidate distributions to reweightings of the data samples. We lift this long-standing limitation and provide a method of moments that goes beyond data reweighting. This is achieved by defining an empirical likelihood estimator based on the maximum mean discrepancy, which we term the kernel method of moments (KMM). We provide a variant of our estimator for conditional moment restrictions and show that it is asymptotically first-order optimal for such problems. Finally, we show that our method achieves competitive performance on several conditional moment restriction tasks.
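In schematic form (our notation), the estimator family can be written as
\[
\hat{\theta} \;=\; \arg\min_{\theta}\ \min_{Q:\ \mathbb{E}_{Q}[\psi(X;\theta)] = 0}\ D\!\left(Q,\hat{P}_n\right),
\]
where $\psi$ encodes the moment restrictions and $\hat{P}_n$ is the empirical distribution; taking $D$ to be a $\varphi$-divergence confines $Q$ to reweightings of the sample, whereas taking $D$ to be the (squared) maximum mean discrepancy yields the kernel method of moments and allows $Q$ to move beyond the support of the data.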
It is often of interest to assess whether a function-valued statistical parameter, such as a density function or a mean regression function, is equal to any function in a class of candidate null parameters. This can be framed as a statistical inference problem in which the target estimand is a scalar measure of dissimilarity between the true function-valued parameter and the closest function among all candidate null values. These estimands are typically defined to be zero when the null holds and positive otherwise. While there is well-established theory and methodology for performing efficient inference when one assumes a parametric model for the function-valued parameter, methods for inference in the nonparametric setting are limited. When the null holds, the target estimand lies at the boundary of the parameter space, and existing nonparametric estimators either have a non-standard limiting distribution or converge at a sub-optimal rate, making inference challenging. In this work, we propose a strategy for constructing nonparametric estimators with improved asymptotic performance. Notably, our estimators converge at the parametric rate at the boundary of the parameter space and also have a tractable null limiting distribution. As illustrations, we discuss how this framework can be applied to perform inference in nonparametric regression problems and to perform nonparametric assessment of stochastic dependence.
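One way to formalize the estimand (notation ours): writing $f_0$ for the true function-valued parameter and $\mathcal{F}_0$ for the class of candidate null functions,
\[
\psi_0 \;=\; \inf_{g \in \mathcal{F}_0} D\!\left(f_0,\, g\right),
\]
with $D$ a nonnegative dissimilarity measure, so that $\psi_0 = 0$ under the null and $\psi_0 > 0$ otherwise.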
Biomarkers are often measured in bulk to diagnose patients, monitor patient conditions, and research novel drug pathways. These measurements often suffer from detection limits, resulting in missing and untrustworthy values. Frequently, missing biomarkers are imputed so that downstream analysis can be conducted with modern statistical methods that cannot normally handle data subject to informative censoring. This work develops an empirical Bayes $g$-modeling method for imputing and denoising biomarker measurements. We establish superior estimation properties compared to popular methods in simulations and demonstrate the utility of the estimated biomarker measurements for downstream analysis.
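A schematic of the $g$-modeling setup (the likelihood, the detection-limit direction, and all symbols here are illustrative):
\[
\theta_i \sim g, \qquad X_i \mid \theta_i \sim p(\cdot \mid \theta_i), \qquad X_i \ \text{reported only if}\ X_i > d,
\]
where $g$ is estimated from the marginal distribution of the reported measurements, and imputed or denoised values are posterior summaries such as $\mathbb{E}_{\hat g}[\theta_i \mid \text{data}]$ under the estimated prior $\hat g$.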
A confidence sequence (CS) is a sequence of confidence intervals that is valid at arbitrary data-dependent stopping times. These are useful in applications such as A/B testing, multi-armed bandits, off-policy evaluation, and election auditing. We present three approaches to constructing a confidence sequence for the population mean, under the minimal assumption that only an upper bound $\sigma^2$ on the variance is known. While previous works rely on light-tail assumptions like boundedness or subGaussianity (under which all moments of a distribution exist), the confidence sequences in our work can handle data from a wide range of heavy-tailed distributions. The best among our three methods -- the Catoni-style confidence sequence -- performs remarkably well in practice, essentially matching state-of-the-art methods for $\sigma^2$-subGaussian data, and provably attains the $\sqrt{\log \log t/t}$ lower bound arising from the law of the iterated logarithm. Our findings have important implications for sequential experimentation with unbounded observations, since the $\sigma^2$-bounded-variance assumption is more realistic and easier to verify than $\sigma^2$-subGaussianity (which implies the former). We also extend our methods to data with infinite variance but a finite $p$-th central moment for some $1<p<2$.
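For context, a Catoni-style construction replaces the sample mean with an M-estimator built from a bounded influence function $\phi$ satisfying
\[
-\log\!\left(1 - x + \tfrac{x^2}{2}\right) \;\le\; \phi(x) \;\le\; \log\!\left(1 + x + \tfrac{x^2}{2}\right),
\]
which requires only a bound on the second moment rather than on higher moments; the resulting intervals shrink at the $\sqrt{\log\log t/t}$ rate mentioned above. The exact time-uniform construction is developed in the paper.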
Repeated measurements, in which random variables are observed repeatedly on the same subjects, are common in many fields. Such data have an underlying hierarchical structure, and it is of interest to learn the covariance/correlation at different levels. Most existing methods for sparse covariance/correlation matrix estimation assume independent samples; ignoring the underlying hierarchical structure and the within-subject correlation can lead to erroneous scientific conclusions. In this paper, we study sparse and positive-definite estimation of between-subject and within-subject covariance/correlation matrices for repeated measurements. Our estimators are solutions to convex optimization problems that can be solved efficiently. We establish estimation error rates for the proposed estimators and demonstrate their favorable performance through theoretical analysis and comprehensive simulation studies. We further apply our methods to construct between-subject and within-subject covariance graphs of clinical variables from hemodialysis patients.
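A standard way to formalize the two levels (our notation; the paper's model may differ in details): for subject $i$ and repeated measurement $j$,
\[
X_{ij} \;=\; \mu + u_i + \varepsilon_{ij}, \qquad \operatorname{Cov}(u_i) = \Sigma_B, \qquad \operatorname{Cov}(\varepsilon_{ij}) = \Sigma_W,
\]
with $u_i$ and $\varepsilon_{ij}$ independent, so that $\operatorname{Cov}(X_{ij}, X_{ij'}) = \Sigma_B$ for $j \neq j'$ and $\operatorname{Cov}(X_{ij}) = \Sigma_B + \Sigma_W$; sparsity and positive definiteness are then imposed on $\Sigma_B$ and $\Sigma_W$ (or the corresponding correlation matrices).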
Datasets in the real world are often complex and to some degree hierarchical, with groups and sub-groups of data sharing common characteristics at different levels of abstraction. Understanding and uncovering the hidden structure of these datasets is an important task with many practical applications. To address this challenge, we present a new and general method for building relational data trees by exploiting the learning dynamics of the Restricted Boltzmann Machine (RBM). Our method is based on the mean-field approach derived from the Plefka expansion and developed in the context of disordered systems, and it is designed to be easily interpretable. We tested our method on an artificially created hierarchical dataset and on three different real-world datasets (images of digits, mutations in the human genome, and a homologous family of proteins). The method is able to automatically identify the hierarchical structure of the data. This could be useful in the study of homologous protein sequences, where the relationships between proteins are critical for understanding their function and evolution.
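For reference, the model underlying the construction is the standard RBM, shown here for binary units (symbols are the usual ones, not necessarily the paper's):
\[
E(\mathbf{v}, \mathbf{h}) \;=\; -\,\mathbf{a}^{\top}\mathbf{v} \;-\; \mathbf{b}^{\top}\mathbf{h} \;-\; \mathbf{v}^{\top} W \mathbf{h},
\qquad
p(\mathbf{v}, \mathbf{h}) \;\propto\; e^{-E(\mathbf{v}, \mathbf{h})};
\]
the tree itself is built from the mean-field (Plefka) analysis of the learned model rather than from this energy directly.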
We study causal effect estimation from a mixture of observational and interventional data in a confounded linear regression model with multivariate treatments. We show that statistical efficiency, in terms of expected squared error, can be improved by combining estimators arising from the observational and the interventional setting. To this end, we derive methods based on matrix-weighted linear estimators and prove that our methods are unbiased in the infinite-sample limit. This is an important improvement over the pooled estimator that uses the union of interventional and observational data, whose bias vanishes only if the ratio of observational to interventional data tends to zero. Studies on synthetic data confirm our theoretical findings. In settings where confounding is substantial and the ratio of observational to interventional data is large, our estimators outperform a Stein-type estimator and various other baselines.
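Schematically (our notation), the combined estimators take the form
\[
\hat{\beta} \;=\; W\,\hat{\beta}_{\mathrm{int}} \;+\; (I - W)\,\hat{\beta}_{\mathrm{obs}},
\]
where $\hat{\beta}_{\mathrm{int}}$ and $\hat{\beta}_{\mathrm{obs}}$ are the interventional and observational least-squares estimators and $W$ is a matrix weight trading off the variance of the unbiased interventional estimator against the bias and variance of the confounded observational one; the choice of $W$ is what the paper derives.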
Ridges play a vital role in accurately approximating the underlying structure of manifolds. In this paper, we explore how ridges change when a concave nonlinear transformation is applied to the density function. By deriving the Hessian of the transformed density, we observe that the nonlinear transformation yields a rank-one modification of the Hessian matrix. Leveraging the variational properties of eigenvalue problems, we establish a partial-order inclusion relationship among the corresponding ridges. We also show, intuitively, that the transformation can lead to improved estimation of the tangent space via this rank-one modification of the Hessian. To validate our theory, we conduct extensive numerical experiments on synthetic and real-world datasets, demonstrating that the ridges obtained from our transformed approach approximate the underlying true manifold better than other manifold-fitting algorithms.
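The rank-one structure follows from the chain rule (a sketch in our notation): for a transformed density $q(x) = f\!\left(p(x)\right)$ with $f$ smooth and concave,
\[
\nabla q \;=\; f'(p)\,\nabla p,
\qquad
\nabla^2 q \;=\; f'(p)\,\nabla^2 p \;+\; f''(p)\,\nabla p\,\nabla p^{\top},
\]
so the Hessian of $q$ differs from the rescaled Hessian $f'(p)\,\nabla^2 p$ by the rank-one term $f''(p)\,\nabla p\,\nabla p^{\top}$, which is negative semidefinite since $f'' \le 0$.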