In the management of chronic conditions that lack universally effective treatments, adaptive treatment strategies (ATSs) have been growing in popularity because they offer a more individualized approach, and sequential multiple assignment randomized trials (SMARTs) have gained attention as the most suitable clinical trial design for formalizing the study of these strategies. While the number of SMARTs has increased in recent years, their design has remained limited to the frequentist setting, which may not fully or appropriately account for uncertainty in design parameters and hence may not yield appropriate sample size recommendations. Specifically, the standard frequentist formulae rely on several assumptions that can easily be misspecified. The Bayesian framework offers a straightforward path to alleviating some of these concerns. In this paper, we provide sample size calculations in a Bayesian setting that yield more realistic and robust estimates by accounting for uncertainty in the inputs through the `two priors' approach. Compared with the standard formulae, this methodology also relies on fewer assumptions, integrates pre-trial knowledge, and shifts the focus from the standardized effect size to the minimal detectable difference. The proposed methodology is evaluated in a thorough simulation study and is used to estimate the sample size for a full-scale SMART of an Internet-Based Adaptive Stress Management intervention, based on a pilot SMART conducted with cardiovascular disease patients from two Canadian provinces.
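To make the `two priors' idea concrete, the following sketch shows a generic simulation-based sample size calculation in which a design prior generates plausible true effects and an analysis prior is used when computing the posterior. It uses a simple conjugate normal model for a single two-group comparison rather than the paper's SMART-specific calculations; the priors, the success threshold, and the `assurance` helper below are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def assurance(n_per_arm, delta_min=0.3, sigma=1.0, n_sim=2000, threshold=0.95):
    """Monte Carlo 'two priors' calculation for a two-group comparison.

    A design prior generates plausible true effects; an analysis prior is
    used when fitting. All distributions and numbers are illustrative.
    """
    successes = 0
    for _ in range(n_sim):
        # Design prior: honest uncertainty about the true difference.
        delta_true = rng.normal(loc=0.5, scale=0.2)
        # Simulated estimate of the difference from a trial of this size.
        lik_var = 2.0 * sigma**2 / n_per_arm
        diff_hat = rng.normal(loc=delta_true, scale=np.sqrt(lik_var))
        # Analysis prior: weakly informative normal on the difference.
        prior_mean, prior_var = 0.0, 10.0
        post_var = 1.0 / (1.0 / prior_var + 1.0 / lik_var)
        post_mean = post_var * (prior_mean / prior_var + diff_hat / lik_var)
        # 'Success': high posterior probability that the difference exceeds
        # the minimal detectable difference.
        p_exceeds = 1.0 - norm.cdf(delta_min, loc=post_mean, scale=np.sqrt(post_var))
        successes += p_exceeds > threshold
    return successes / n_sim

# Increase the sample size until the assurance reaches a target, e.g. 0.8.
for n in (50, 100, 200, 400):
    print(n, assurance(n))
```

The smallest sample size whose assurance reaches the chosen target would then be taken as the recommendation.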
Many statistical analyses assume that the data points within a sample are exchangeable and that their features have some known dependency structure. Given a feature dependency structure, one can ask whether the observations are exchangeable, in which case we say that they are homogeneous. Homogeneity may be the end goal of a clustering algorithm or a justification for not clustering. Apart from approaches rooted in random matrix theory, few general methods provide statistical guarantees of exchangeability or homogeneity without labeled examples from distinct clusters. We propose a fast and flexible non-parametric hypothesis testing approach that takes as input a multivariate individual-by-feature dataset and user-specified feature dependency constraints, without labeled examples, and reports whether the individuals are exchangeable at a user-specified significance level. Our approach controls Type I error across realistic scenarios and handles data of arbitrary dimension. We perform an extensive simulation study to evaluate the efficacy of domain-agnostic tests of stratification and find that our approach compares favorably in various scenarios of interest. Finally, we apply our approach to post-clustering single-cell chromatin accessibility data and World Values Survey data, and show how it helps to identify drivers of heterogeneity and generate clusters of exchangeable individuals.
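As a rough illustration of what a label-free exchangeability test can look like, the sketch below is a generic permutation-based stand-in, not the proposed method: it assumes independence across features as the user-specified dependency constraint, uses the leading singular value of the standardized data as a heterogeneity statistic, and builds its null distribution by permuting each feature column independently.

```python
import numpy as np

def exchangeability_pvalue(X, n_perm=500, seed=0):
    """Generic permutation check of row exchangeability (illustrative only).

    Assumes, as the user-specified dependency constraint, that features are
    independent, so permuting each column independently preserves the null of
    exchangeable, homogeneous rows. The statistic is the leading singular
    value of the column-standardized data, which grows when rows split into
    latent clusters.
    """
    rng = np.random.default_rng(seed)
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    stat = np.linalg.svd(Z, compute_uv=False)[0]
    null = np.empty(n_perm)
    for b in range(n_perm):
        Zb = np.column_stack([rng.permutation(Z[:, j]) for j in range(Z.shape[1])])
        null[b] = np.linalg.svd(Zb, compute_uv=False)[0]
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

# Two latent groups with a mean shift should yield a small p-value.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)), rng.normal(0.8, 1.0, (50, 20))])
print(exchangeability_pvalue(X))
```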
This tutorial shows how to build, fit, and criticize disease transmission models in Stan, and should be useful to researchers interested in modeling the SARS-CoV-2 pandemic and other infectious diseases in a Bayesian framework. Bayesian modeling provides a principled way to quantify uncertainty and to incorporate both data and prior knowledge into the model estimates. Stan is an expressive probabilistic programming language that abstracts the inference and allows users to focus on the modeling. As a result, Stan code is readable and easily extensible, which makes the modeler's work more transparent. Furthermore, Stan's main inference engine, Hamiltonian Monte Carlo sampling, is amenable to diagnostics, which means the user can verify whether the obtained inference is reliable. In this tutorial, we demonstrate how to formulate, fit, and diagnose a compartmental transmission model in Stan, first with a simple Susceptible-Infected-Recovered (SIR) model, then with a more elaborate transmission model used during the SARS-CoV-2 pandemic. We also cover advanced topics that can further help practitioners fit sophisticated models; notably, how to use simulations to probe the model and priors, and computational techniques to scale up models based on ordinary differential equations.
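As a condensed illustration of the workflow described here, the sketch below specifies a minimal SIR model in Stan and fits it via cmdstanpy. The priors, the negative binomial observation model for daily prevalence, the simulated case counts, and the file name `sir.stan` are illustrative choices rather than the tutorial's exact code.

```python
# Requires CmdStan installed via cmdstanpy (e.g. cmdstanpy.install_cmdstan()).
import numpy as np
from cmdstanpy import CmdStanModel

stan_program = """
functions {
  vector sir(real t, vector y, real beta, real gamma, real N) {
    vector[3] dydt;
    real infection = beta * y[1] * y[2] / N;
    real recovery = gamma * y[2];
    dydt[1] = -infection;
    dydt[2] = infection - recovery;
    dydt[3] = recovery;
    return dydt;
  }
}
data {
  int<lower=1> n_days;
  real N;
  array[n_days] int cases;
  array[n_days] real ts;
  vector[3] y0;
}
parameters {
  real<lower=0> beta;
  real<lower=0> gamma;
  real<lower=0> phi_inv;
}
transformed parameters {
  array[n_days] vector[3] y = ode_rk45(sir, y0, 0.0, ts, beta, gamma, N);
}
model {
  beta ~ normal(2, 1);
  gamma ~ normal(0.4, 0.5);
  phi_inv ~ exponential(5);
  for (t in 1:n_days)
    cases[t] ~ neg_binomial_2(fmax(y[t, 2], 1e-6), 1 / phi_inv);
}
"""

with open("sir.stan", "w") as f:
    f.write(stan_program)

# Fake daily prevalence counts in a population of 1000, for illustration only.
rng = np.random.default_rng(0)
cases = rng.poisson(np.linspace(2, 40, 30)).astype(int).tolist()
data = {"n_days": 30, "N": 1000, "cases": cases,
        "ts": list(range(1, 31)), "y0": [999.0, 1.0, 0.0]}

fit = CmdStanModel(stan_file="sir.stan").sample(data=data, chains=4)
print(fit.summary())   # R-hat, effective sample sizes, posterior summaries
print(fit.diagnose())  # divergences, tree depth, E-BFMI diagnostics
```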
In this paper, we write the time-varying parameter (TVP) regression model involving K explanatory variables and T observations as a constant-coefficient regression model with KT explanatory variables. In contrast with much of the existing literature, which assumes that coefficients evolve according to a random walk, we introduce a hierarchical mixture model on the TVPs. The resulting model closely mimics a random coefficients specification that groups the TVPs into several regimes. These flexible mixtures allow for TVPs that feature a small, moderate or large number of structural breaks. We develop computationally efficient Bayesian econometric methods based on the singular value decomposition of the KT regressors. In artificial data, we find our methods to be accurate and, in terms of computation time, much faster than standard approaches. In an empirical exercise involving inflation forecasting with a large number of predictors, we find that our models forecast better than alternative approaches and document different patterns of parameter change than are found with approaches that assume random walk evolution of the parameters.
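The reformulation underlying the computational gains can be sketched in a few lines: stack the T sets of K time-varying coefficients into a single KT-dimensional vector, build the corresponding T x KT regressor matrix, and take its economy-size singular value decomposition, whose rank is at most T. The simulation below is purely illustrative and omits the hierarchical mixture prior.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 200, 5

# Simulate a TVP regression y_t = x_t' beta_t + eps_t with occasional breaks.
X = rng.normal(size=(T, K))
beta = np.cumsum(rng.normal(scale=0.1, size=(T, K)) * (rng.random((T, 1)) < 0.05), axis=0)
y = np.einsum("tk,tk->t", X, beta) + rng.normal(scale=0.5, size=T)

# Rewrite as a constant-coefficient regression y = Z theta + eps, where theta
# stacks all T*K coefficients and row t of Z holds x_t' in time t's block.
Z = np.zeros((T, T * K))
for t in range(T):
    Z[t, t * K:(t + 1) * K] = X[t]

# Economy-size SVD: Z = U diag(s) Vt has at most T nonzero singular values,
# so posterior computations can work in T dimensions rather than T*K.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (T, T), (T,), (T, T*K)
```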
We propose an extension of the N-mixture model that allows for the simultaneous estimation of the abundances of multiple species and of their inter-species correlations. We also propose further extensions to this multi-species N-mixture model: one permits us to examine data with an excess of zero counts, and another allows us to relax the closure assumption inherent in N-mixture models by incorporating an autoregressive (AR) term in the abundance. The inclusion of a multivariate normal distribution as the prior on the random effect in the abundance facilitates the estimation of a matrix of inter-species correlations. Each model is also fitted to avian point count data collected as part of the North American Breeding Bird Survey (NABBS) in 2010-2019. Results of simulation studies reveal that these models produce accurate estimates of abundance, inter-species correlations and detection probabilities at both small and large sample sizes, in scenarios with small, large and no zero inflation. Results of the model fitting to the NABBS data reveal an increase in the Bald Eagle population size in southeastern Alaska over the decade examined. Our novel multi-species N-mixture model accounts for full communities, allowing us to examine the abundances of every species present in a study area and, as these species do not exist in a vacuum, to estimate correlations between species' abundances. While previous multi-species abundance models have allowed for the estimation of abundance and detection probability, ours is the first to address the estimation of both positive and negative inter-species correlations, which allows us to begin to make inferences about the effect that these species' abundances have on one another. Our modelling approach provides a method of quantifying the strength of association between species' population sizes, and is of practical use to population and conservation ecologists.
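The data-generating structure behind such a model can be sketched as follows: a site-level multivariate normal random effect induces inter-species correlation in abundance, latent abundances are Poisson, and repeated counts are binomial with imperfect detection. The dimensions, correlation matrix, and detection probabilities below are illustrative values, not estimates from the NABBS analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_species, n_visits = 100, 3, 4

# Inter-species covariance in abundance (illustrative values).
Sigma = np.array([[1.0,  0.6, -0.3],
                  [0.6,  1.0,  0.1],
                  [-0.3, 0.1,  1.0]]) * 0.5
mu = np.log([10.0, 6.0, 3.0])          # species-level mean abundance (log scale)
p_detect = np.array([0.5, 0.4, 0.3])   # per-species detection probability

# Site-level multivariate-normal random effect links the species' abundances.
eps = rng.multivariate_normal(np.zeros(n_species), Sigma, size=n_sites)
lam = np.exp(mu + eps)                 # expected abundance per site and species
N = rng.poisson(lam)                   # latent true abundance
# Repeated binomial counts with imperfect detection across visits.
y = rng.binomial(N[:, :, None], p_detect[None, :, None],
                 size=(n_sites, n_species, n_visits))
print(y.shape)  # (sites, species, visits) count array fed to the N-mixture model
```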
Maximum Mean Discrepancy (MMD) has been widely used in machine learning and statistics to quantify the distance between two distributions in the $p$-dimensional Euclidean space. The asymptotic properties of the sample MMD are well understood when the dimension $p$ is fixed, using the theory of U-statistics. Motivated by the frequent use of the MMD test for data of moderate/high dimension, we investigate the behavior of the sample MMD in a high-dimensional environment and develop a new studentized test statistic. Specifically, we obtain central limit theorems for the studentized sample MMD as both the dimension $p$ and the sample sizes $n,m$ diverge to infinity. Our results hold for a wide range of kernels, including the popular Gaussian and Laplacian kernels, and also cover the energy distance as a special case. We further derive the explicit rate of convergence under mild assumptions, and our results suggest that the accuracy of the normal approximation can improve with dimensionality. Additionally, we provide a general theory for the power analysis under the alternative hypothesis and show that our proposed test can detect differences between two distributions in the moderately high-dimensional regime. Numerical simulations demonstrate the effectiveness of our proposed test statistic and of the normal approximation.
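For reference, the population MMD and the standard unbiased U-statistic estimator of its square, of which the studentized statistic studied here is a normalized version, are
\[
\mathrm{MMD}^2(P,Q) = \mathbb{E}\,k(X,X') - 2\,\mathbb{E}\,k(X,Y) + \mathbb{E}\,k(Y,Y'), \qquad X,X' \sim P,\ Y,Y' \sim Q,
\]
\[
\widehat{\mathrm{MMD}}^2_{n,m} = \frac{1}{n(n-1)}\sum_{i \neq j} k(X_i,X_j) + \frac{1}{m(m-1)}\sum_{i \neq j} k(Y_i,Y_j) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(X_i,Y_j),
\]
where $k$ is a positive definite kernel, such as the Gaussian or Laplacian kernel, and all samples are drawn independently.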
Understanding the factors that contribute to an increased likelihood of disease transmission between two individuals is important for infection control. Measures of genetic relatedness between the bacterial isolates of two individuals are often analyzed to determine their associations with these factors using simple correlation or regression analyses. However, these standard approaches ignore the potential for correlation in paired data of this type, which arises when the same individual appears in multiple paired outcomes. We develop two novel hierarchical Bayesian methods for properly analyzing paired genetic relatedness data in the form of patristic distances and transmission probabilities. Using individual-level spatially correlated random effect parameters, we account for multiple sources of correlation in the outcomes as well as other important features of their distribution. Through simulation, we show that the standard analyses drastically underestimate uncertainty in the associations when correlation is present in the data, leading to incorrect conclusions regarding the covariates of interest. Conversely, the newly developed methods perform well under various levels of correlation, as well as for uncorrelated data. All methods are applied to Mycobacterium tuberculosis data from the Republic of Moldova, where we identify factors associated with disease transmission and, through analysis of the random effect parameters, key individuals and areas with increased transmission activity. Model comparisons show the importance of the new methodology in this setting. The methods are implemented in the R package GenePair.
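To see where the correlation comes from, the sketch below simulates paired outcomes in which each individual contributes a shared random effect to every pair it belongs to; a naive pairwise regression that ignores this sharing is exactly the standard analysis whose uncertainty is understated. This is a generic illustration with spatial correlation omitted, not the models implemented in GenePair, and all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60                                              # individuals
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

# Individual-level random effects induce correlation among all pairs that
# share an individual.
theta = rng.normal(scale=0.5, size=n)
x = rng.binomial(1, 0.5, size=len(pairs))           # e.g. a shared-household indicator
beta0, beta1, sigma = 1.0, -0.8, 0.3
# Paired outcome, e.g. a (log) patristic distance between the two isolates.
y = np.array([beta0 + beta1 * x[k] + theta[i] + theta[j] + rng.normal(scale=sigma)
              for k, (i, j) in enumerate(pairs)])

# A regression of y on x that treats the len(pairs) outcomes as independent
# ignores this shared structure and understates uncertainty.
print(len(pairs), round(y.mean(), 3))
```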
The difficulty of specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, which optimizes a soft-robust objective that balances expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm that is robust to a distribution of reward hypotheses and can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse, and that it outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty rather than seeking to uniquely identify the demonstrator's reward function.
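The soft-robust idea can be illustrated with a small helper that blends the mean return over sampled reward hypotheses with the conditional value at risk (CVaR) of the worst-case hypotheses. The exact objective and its policy-gradient estimator in PG-BROIL may differ; `lam`, `alpha`, and the simulated returns below are illustrative.

```python
import numpy as np

def soft_robust_objective(returns_per_hypothesis, lam=0.5, alpha=0.95):
    """Blend of expected performance and risk over reward hypotheses.

    returns_per_hypothesis: the policy's expected return under each sampled
    reward hypothesis (e.g. drawn from a posterior given demonstrations).
    lam trades off mean performance against the alpha-level CVaR over the
    worst hypotheses. Illustrative form of a soft-robust criterion.
    """
    r = np.asarray(returns_per_hypothesis, dtype=float)
    expected = r.mean()
    # CVaR_alpha: average return over the worst (1 - alpha) fraction of hypotheses.
    var_threshold = np.quantile(r, 1.0 - alpha)
    cvar = r[r <= var_threshold].mean()
    return lam * expected + (1.0 - lam) * cvar

returns = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=1000)
print(soft_robust_objective(returns, lam=0.3))  # lower lam = more risk-averse
```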
Federated learning (FL) is an emerging, privacy-preserving machine learning paradigm that is drawing tremendous attention in both academia and industry. A unique characteristic of FL is heterogeneity, which resides in the varying hardware specifications and dynamic states of the participating devices. In principle, heterogeneity can exert a huge influence on the FL training process, e.g., by making a device unavailable for training or unable to upload its model updates. Unfortunately, these impacts have never been systematically studied and quantified in the existing FL literature. In this paper, we carry out the first empirical study to characterize the impacts of heterogeneity in FL. We collect large-scale data from 136k smartphones that faithfully reflect heterogeneity in real-world settings. We also build a heterogeneity-aware FL platform that complies with the standard FL protocol but takes heterogeneity into consideration. Based on the data and the platform, we conduct extensive experiments to compare the performance of state-of-the-art FL algorithms under heterogeneity-aware and heterogeneity-unaware settings. Results show that heterogeneity causes non-trivial performance degradation in FL, including up to a 9.2% accuracy drop, a 2.32x increase in training time, and undermined fairness. Furthermore, we analyze potential impact factors and find that device failure and participant bias are two key contributors to the performance degradation. Our study provides insightful implications for FL practitioners. On the one hand, our findings suggest that FL algorithm designers should account for heterogeneity during evaluation. On the other hand, our findings urge system providers to design specific mechanisms to mitigate the impacts of heterogeneity.
Because of continuous advances in mathematical programming, Mixed Integer Optimization has become competitive vis-a-vis popular regularization methods for selecting features in regression problems. The approach exhibits unquestionable foundational appeal and versatility, but also poses important challenges. We tackle these challenges, reducing the computational burden when tuning the sparsity bound (a parameter which is critical for effectiveness) and improving performance in the presence of feature collinearity and of signals that vary in nature and strength. Importantly, we render the approach efficient and effective in applications of realistic size and complexity, without resorting to relaxations or heuristics in the optimization, or abandoning rigorous cross-validation tuning. Computational viability and improved performance in subtler scenarios are achieved with a multi-pronged blueprint that leverages characteristics of the Mixed Integer Programming framework and employs whitening, a data pre-processing step.
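The whitening step can be illustrated on its own: a ZCA-style transform makes the sample covariance of the features approximately the identity, removing the collinearity that hampers subset selection. The sketch below is a minimal version of this pre-processing idea, not the paper's full blueprint.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """ZCA-whiten the columns of X so that their sample covariance is
    (approximately) the identity, reducing feature collinearity before
    subset selection."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)   # strongly collinear pair
Xw = whiten(X)
print(np.round(np.corrcoef(Xw, rowvar=False)[0, 1], 3))  # near 0 after whitening
```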
In this paper we introduce a covariance framework for the analysis of EEG and MEG data that takes into account observed temporal stationarity on small time scales and trial-to-trial variations. We formulate a model for the covariance matrix as a Kronecker product of three components that correspond to space, time and epochs/trials, and consider maximum likelihood estimation of the unknown parameter values. An iterative algorithm that finds approximations of the maximum likelihood estimates is proposed. We perform a simulation study to assess the performance of the estimator and to investigate the influence of different assumptions about the covariance factors on the estimated covariance matrix and on its components. In addition, we illustrate our method on real EEG and MEG data sets. The proposed covariance model is applicable in a variety of cases where spontaneous EEG or MEG acts as a source of noise and realistic noise covariance estimates are needed for accurate dipole localization, such as in evoked activity studies, or where the properties of spontaneous EEG or MEG are themselves the topic of interest, such as in combined EEG/fMRI experiments in which the correlation between EEG and fMRI signals is investigated.
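The structure of such a covariance model can be written down directly: with spatial, temporal and trial factors $\Sigma_s$, $\Sigma_t$ and $\Sigma_e$, the covariance of the vectorized data is their Kronecker product, so only the three small factors need to be estimated. The sketch below builds such a matrix for toy dimensions; the ordering of the factors depends on the vectorization convention, and the random factors are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    """Random symmetric positive-definite matrix standing in for a factor."""
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)

n_space, n_time, n_trials = 4, 6, 3                # toy sizes
S_space, S_time, S_trials = (random_spd(n_space),
                             random_spd(n_time),
                             random_spd(n_trials))

# Covariance of the vectorized trials-by-time-by-space data:
Sigma = np.kron(S_trials, np.kron(S_time, S_space))
print(Sigma.shape)  # (72, 72); only the three small factors are estimated,
                    # instead of 72*73/2 = 2628 free parameters for an
                    # unstructured covariance of the same size.
```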