Behaviour change lies at the heart of many observable collective phenomena such as the transmission and control of infectious diseases, adoption of public health policies, and migration of animals to new habitats. Representing the process of individual behaviour change in computer simulations of these phenomena remains an open challenge. Often, computational models use phenomenological implementations with limited support from behavioural data. Without a strong connection to observable quantities, such models have limited utility for simulating observed and counterfactual scenarios of emergent phenomena because they cannot be validated or calibrated. Here, we present a simple stochastic individual-based model of reversal learning that captures fundamental properties of individual behaviour change, namely, the capacity to learn based on accumulated reward signals, and the transient persistence of learned behaviour after rewards are removed or altered. The model has only two parameters, and we use approximate Bayesian computation to demonstrate that they are fully identifiable from empirical reversal learning time series data. Finally, we demonstrate how the model can be extended to account for the increased complexity of behavioural dynamics over longer time scales involving fluctuating stimuli. This work is a step towards the development and evaluation of fully identifiable individual-level behaviour change models that can function as validated submodels for complex simulations of collective behaviour change.
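For concreteness, the sketch below implements a generic two-parameter reversal-learning agent: a delta-rule value update with learning rate alpha and a softmax choice rule with sensitivity beta. This parameterization is an assumption chosen for illustration and is not necessarily the model described above, but it reproduces the two qualitative properties mentioned, namely learning from accumulated rewards and the transient persistence of the learned preference after the reward contingencies reverse.

\begin{verbatim}
# Minimal two-parameter reversal-learning agent (illustrative sketch only;
# the learning-rate/softmax parameterization is an assumption, not
# necessarily the paper's model).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.15, 4.0            # learning rate, choice sensitivity
values = np.zeros(2)               # estimated value of the two options
reward_prob = np.array([0.8, 0.2])
choices = []

for t in range(400):
    if t == 200:                   # reversal: reward contingencies swap
        reward_prob = reward_prob[::-1]
    p_choose_1 = 1.0 / (1.0 + np.exp(-beta * (values[1] - values[0])))
    choice = int(rng.random() < p_choose_1)
    reward = float(rng.random() < reward_prob[choice])
    values[choice] += alpha * (reward - values[choice])   # delta-rule update
    choices.append(choice)

# After the reversal the agent only gradually switches preference,
# i.e. the previously learned behaviour persists transiently.
print(np.mean(choices[150:200]), np.mean(choices[350:400]))
\end{verbatim}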
Data assimilation is a crucial technique that combines physical models with observational data to estimate state variables. Traditional assimilation algorithms often struggle with the strong nonlinearity of both the physical and the observational models. In this work, we propose a novel data-driven assimilation algorithm based on generative models to address these concerns. Our State-Observation Augmented Diffusion (SOAD) model is designed to handle nonlinear physical and observational models more effectively. We derive the marginal posterior associated with SOAD and prove that it matches the true posterior under mild assumptions, which gives it a theoretical advantage over previous score-based assimilation approaches. Experimental results also indicate that SOAD may offer improved accuracy over existing data-driven methods.
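For readers unfamiliar with score-based assimilation, the toy example below samples an assimilation posterior $p(x \mid y)$ with unadjusted Langevin dynamics for a linear-Gaussian state/observation pair, where the posterior score is available in closed form. The setup and all names are illustrative assumptions; SOAD itself replaces the closed-form score with a learned diffusion model in order to handle nonlinear physical and observational operators.

\begin{verbatim}
# Toy score-based assimilation: unadjusted Langevin sampling of p(x | y)
# for a linear-Gaussian pair (illustrative only; SOAD uses a learned
# diffusion model rather than this closed-form score).
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 4
H = rng.standard_normal((m, d))          # observation operator
sigma_obs = 0.3
x_true = rng.standard_normal(d)          # prior: x ~ N(0, I)
y = H @ x_true + sigma_obs * rng.standard_normal(m)

def posterior_score(x):
    # grad log p(x|y) = grad log p(x) + grad log p(y|x)
    return -x + H.T @ (y - H @ x) / sigma_obs**2

x, eps = np.zeros(d), 1e-3
samples = []
for k in range(20000):
    x = x + eps * posterior_score(x) + np.sqrt(2 * eps) * rng.standard_normal(d)
    if k > 5000:                         # discard burn-in
        samples.append(x.copy())

print("posterior mean estimate:", np.mean(samples, axis=0).round(2))
\end{verbatim}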
Traditional low-rank approximation is a powerful tool for compressing the huge data matrices that arise in simulations of partial differential equations (PDEs), but it suffers from high computational cost and requires several passes over the data. The compressed data may also lack interpretability, making it difficult to identify feature patterns in the original data. To address these issues, we present an online randomized algorithm that computes the interpolative decomposition (ID) of large-scale data matrices {\em in situ}. Compared to previous randomized IDs that use the QR decomposition to determine the column basis, we adopt a streaming ridge leverage score-based column subset selection algorithm that dynamically selects suitable basis columns from the data and thus avoids an extra pass over the data to compute the coefficient matrix of the ID. In particular, we adopt a single-pass error estimator based on the non-adaptive Hutch++ algorithm to provide real-time error estimates for determining the best coefficients. As a result, our approach needs only a single pass over the original data and is therefore suitable for large, high-dimensional matrices stored outside of core memory or generated in PDE simulations. We also present a strategy for improving the accuracy of the reconstructed data gradient, when desired, within the ID framework. We provide numerical experiments on turbulent channel flow and ignition simulations, and on the NSTX Gas Puff Image dataset, comparing our algorithm with the offline ID algorithm to demonstrate its utility in real-world applications.
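As a point of reference, the sketch below computes a randomized column ID of the form $A \approx A[:, J]\,X$ using a small row sketch and pivoted-QR column selection, with the coefficients solved from the sketch alone. This is the simpler baseline that the abstract contrasts against; the streaming ridge-leverage-score selection and the Hutch++ error estimator are not reproduced here, and the test matrix is an illustrative assumption.

\begin{verbatim}
# Baseline randomized column ID, A ~= A[:, J] @ X (illustrative sketch; the
# paper replaces the pivoted-QR selection below with streaming ridge
# leverage scores and adds a single-pass Hutch++ error estimator).
import numpy as np
from scipy.linalg import qr, lstsq

rng = np.random.default_rng(2)
A = rng.standard_normal((2000, 60)) @ rng.standard_normal((60, 300))  # rank 60
k, ell = 60, 70

S = rng.standard_normal((ell, A.shape[0])) @ A        # small row sketch of A
_, _, piv = qr(S, pivoting=True, mode='economic')     # greedy column selection
J = piv[:k]
X, *_ = lstsq(S[:, J], S)                             # coefficients from the sketch only

err = np.linalg.norm(A - A[:, J] @ X) / np.linalg.norm(A)
print(f"relative ID error: {err:.2e}")
\end{verbatim}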
Progress in neuroscience has provided unprecedented opportunities to advance our understanding of brain alterations and their correspondence to phenotypic profiles. With data collected from various imaging techniques, studies have integrated different types of information spanning brain structure, function, and metabolism. More recently, an emerging way to categorize imaging traits is through a metric hierarchy that includes localized node-level measurements and interactive network-level metrics. However, limited research has been conducted to integrate these hierarchies and achieve a better understanding of the underlying neurobiological mechanisms and communication. In this work, we address this gap by proposing a Bayesian regression model with both vector-variate and matrix-variate predictors. To characterize the interplay between the different predictive components, we propose a set of biologically plausible prior models centered on an innovative joint thresholded prior. This captures the coupling and grouping effects of signal patterns, as well as their spatial contiguity across brain anatomy. By developing a posterior inference procedure, we can identify node- and network-level neuromarkers that carry signal, quantify their uncertainty, and characterize their predictive mechanism for phenotypic outcomes. Through extensive simulations, we demonstrate that our proposed method substantially outperforms alternative approaches in both out-of-sample prediction and feature selection. By applying the model to study children's general mental abilities, we establish a powerful predictive mechanism based on the identified task contrast traits and resting-state sub-networks.
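To make the predictor structure concrete, the toy example below sets up a regression with a node-level vector predictor and a symmetric network-level matrix predictor per subject and fits a plain ridge baseline. It only illustrates the data layout; the joint thresholded prior and the Bayesian posterior inference described above are not reproduced, and all dimensions and names are assumptions.

\begin{verbatim}
# Data layout for regression with vector-variate (node-level) and
# matrix-variate (network-level) predictors; fitted with a plain ridge
# baseline purely to illustrate the structure (not the paper's Bayesian
# model with a joint thresholded prior).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n, p, q = 300, 20, 10                     # subjects, nodes, networks
V = rng.standard_normal((n, p))           # node-level vector predictors
M = rng.standard_normal((n, q, q))        # network-level matrix predictors
M = (M + M.transpose(0, 2, 1)) / 2        # symmetric connectivity matrices

beta_v = np.zeros(p); beta_v[:3] = 1.0    # sparse node effects
B = np.zeros((q, q)); B[0, 1] = B[1, 0] = 2.0
y = V @ beta_v + np.einsum('ijk,jk->i', M, B) + 0.5 * rng.standard_normal(n)

X = np.hstack([V, M.reshape(n, -1)])      # vectorize the matrix block
fit = Ridge(alpha=1.0).fit(X, y)
print("recovered node effects:", fit.coef_[:3].round(2))
\end{verbatim}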
When applying multivariate extreme value statistics to analyze tail risk in compound events defined by a multivariate random vector, one often assumes that all dimensions share the same extreme value index. While such an assumption can be tested using a Wald-type test, the performance of that test deteriorates as the dimensionality increases. This paper introduces a novel test of the equality of extreme value indices in a high-dimensional setting. We derive the asymptotic behavior of the test statistic and conduct simulation studies to evaluate its finite-sample performance. The proposed test significantly outperforms existing methods in high-dimensional settings. We apply the test to two datasets previously assumed to have identical extreme value indices across all dimensions.
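For orientation, the snippet below computes the standard Hill estimate of the extreme value index separately in each dimension, i.e. the quantities whose equality is under test. It is not the proposed high-dimensional test statistic, and the Pareto data-generating setup is an illustrative assumption.

\begin{verbatim}
# Per-dimension Hill estimates of the extreme value index -- the quantity
# whose equality across dimensions is being tested (standard estimator,
# not the proposed high-dimensional test).
import numpy as np

def hill_estimator(x, k):
    """Hill estimate of the extreme value index from the top k+1 order stats."""
    xs = np.sort(x)
    return np.mean(np.log(xs[-k:]) - np.log(xs[-k - 1]))

rng = np.random.default_rng(4)
n, d, k = 5000, 50, 200
alpha = 2.0                                       # true index is 1/alpha = 0.5
X = (1.0 / rng.random((n, d))) ** (1.0 / alpha)   # heavy-tailed Pareto margins

gammas = np.array([hill_estimator(X[:, j], k) for j in range(d)])
print("true index:", 1 / alpha,
      " estimates (min/max):", gammas.min().round(3), gammas.max().round(3))
\end{verbatim}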
Estimating risk factors for the incidence of a disease is crucial for understanding its etiology. For diseases caused by enteric pathogens, off-the-shelf statistical modeling approaches lack biological plausibility and ignore important sources of variability. We propose a new approach to estimating incidence risk factors built on established work in quantitative microbiological risk assessment. Except for risk factors that affect both the dose accrual rate and the within-host pathogen survival rate, our model's regression parameters are directly interpretable as the dose accrual rate ratio attributable to the risk factors under study. We also describe a method for leveraging information across multiple pathogens. The proposed methods are available as an R package at \url{//github.com/dksewell/ladie}. Our simulation study shows unacceptable coverage rates for generalized linear models, while the proposed approach maintains the nominal rate even when the model is misspecified. Finally, we demonstrate our approach by applying it to Nairobian infant data collected through the PATHOME study (\url{//reporter.nih.gov/project-details/10227256}), uncovering the impact of various environmental factors on infant enteric infections.
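The toy simulation below illustrates the dose-accrual logic under an assumed setup: doses arrive as a Poisson process whose rate depends log-linearly on a binary risk factor, and each dose independently establishes infection with a small within-host survival probability. It shows why, under these assumptions, the covariate coefficient is recoverable as a dose accrual rate ratio; it is not the estimator implemented in the ladie package.

\begin{verbatim}
# Toy dose-accrual simulation (not the ladie estimator): doses arrive as a
# Poisson process with rate exp(b0 + b1*x); each dose independently survives
# within the host with probability r.  Then P(infection) = 1 - exp(-r*rate*t),
# so exp(b1) is recovered as a dose accrual rate ratio.
import numpy as np

rng = np.random.default_rng(5)
n, t, r = 20000, 30.0, 0.05
x = rng.integers(0, 2, n)                 # binary risk factor
rate = np.exp(-3.0 + 0.7 * x)             # true dose accrual rate
doses = rng.poisson(rate * t)
infected = rng.binomial(doses, r) > 0     # any surviving dose -> infection

# -log(1 - attack rate) estimates r*rate*t within each exposure group
est = [-np.log(1 - infected[x == g].mean()) for g in (0, 1)]
print("estimated rate ratio:", round(est[1] / est[0], 2),
      " true exp(0.7):", round(np.exp(0.7), 2))
\end{verbatim}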
Many biological systems, such as cell aggregates, tissues, or bacterial colonies, behave as unconventional systems of particles that are strongly constrained by volume exclusion and shape interactions. Understanding how these constraints lead to macroscopic self-organized structures is a fundamental question in, e.g., developmental biology. To this end, various types of computational models have been developed: phase fields, cellular automata, vertex models, level-set methods, finite element simulations, etc. We introduce a new framework, based on optimal transport theory, to model particle systems with arbitrary dynamical shapes and deformability. Our method builds upon the pioneering work of Brenier on incompressible fluids and its recent applications to materials science. It lets us specify the shapes of individual cells and supports a wide range of interaction mechanisms, while automatically taking care of the volume exclusion constraint at an affordable numerical cost. We showcase the versatility of this approach by reproducing several classical systems in computational biology. Our Python code is freely available at: www.github.com/antoinediez/ICeShOT
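A minimal illustration of how optimal transport can enforce volume exclusion, under assumptions chosen for simplicity: pixels of a small grid are assigned to cell centres by an exact transport plan with prescribed per-cell volumes (computed here with the POT library), which yields a Laguerre-like partition in which cells cannot overlap. This toy is not the ICeShOT scheme, which additionally handles dynamics, deformability, and general shape costs.

\begin{verbatim}
# Minimal optimal-transport volume partition (toy example, not ICeShOT):
# grid pixels are assigned to cell centres by an exact OT plan with fixed
# per-cell volumes, which enforces volume exclusion between the "cells".
import numpy as np
import ot                                    # POT: Python Optimal Transport

n_cells, n_grid = 3, 40
gx, gy = np.meshgrid(np.linspace(0, 1, n_grid), np.linspace(0, 1, n_grid))
pixels = np.stack([gx.ravel(), gy.ravel()], axis=1)

centres = np.array([[0.3, 0.3], [0.4, 0.4], [0.7, 0.6]])   # overlapping cells
vols = np.array([0.4, 0.3, 0.3])                           # prescribed volumes
pix_w = np.full(len(pixels), 1.0 / len(pixels))

M = ot.dist(centres, pixels)                 # squared Euclidean cost
plan = ot.emd(vols, pix_w, M)                # exact transport plan
labels = plan.argmax(axis=0)                 # pixel -> cell assignment

areas = np.bincount(labels, minlength=n_cells) / len(pixels)
print("realized cell areas:", areas.round(2))   # close to the prescribed volumes
\end{verbatim}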
There has been a recent boom in the development of AI solutions to facilitate and enhance diagnostic procedures for established clinical tools. To assess the integrity of the developing nervous system, the Prechtl general movement assessment (GMA) is recognized for its clinical value in diagnosing neurological impairments in early infancy. GMA has increasingly been augmented through machine learning approaches intended to scale up its application, circumvent the costs of training human assessors, and further standardize the classification of spontaneous motor patterns. The available deep learning tools, all of which are based on single sensor modalities, nevertheless remain considerably inferior to well-trained human assessors. Moreover, these approaches are hardly comparable with one another, as all models are designed, trained, and evaluated on proprietary, siloed data sets. In this study we propose a sensor fusion approach for assessing fidgety movements (FMs), comparing three different sensor modalities (pressure, inertial, and visual sensors). Various combinations and two sensor fusion approaches (late and early fusion) for infant movement classification were tested to evaluate whether a multi-sensor system outperforms single-modality assessments. The performance of the three-sensor fusion (classification accuracy of 94.5\%) was significantly higher than that of any single modality evaluated, suggesting that sensor fusion is a promising avenue for the automated classification of infant motor patterns. The development of a robust sensor fusion system may significantly enhance AI-based early recognition of neurofunctions, ultimately facilitating the automated early detection of neurodevelopmental conditions.
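The generic example below contrasts early fusion (concatenating features before classification) with late fusion (averaging per-modality predicted probabilities) on synthetic stand-in data. The modality names are placeholders, and none of the study's actual features or classifiers are used.

\begin{verbatim}
# Generic early vs. late sensor fusion (synthetic stand-in data, not the
# study's pressure/inertial/video features or classifiers).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 600
y = rng.integers(0, 2, n)
modalities = {m: y[:, None] * 0.8 + rng.standard_normal((n, 10))
              for m in ("pressure", "inertial", "visual")}

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0)

# Early fusion: concatenate features, train one classifier.
X = np.hstack(list(modalities.values()))
early = LogisticRegression(max_iter=1000).fit(X[idx_tr], y[idx_tr])
print("early fusion acc:", early.score(X[idx_te], y[idx_te]).round(3))

# Late fusion: train one classifier per modality, average their probabilities.
probs = [LogisticRegression(max_iter=1000).fit(Xm[idx_tr], y[idx_tr])
         .predict_proba(Xm[idx_te])[:, 1] for Xm in modalities.values()]
late_pred = (np.mean(probs, axis=0) > 0.5).astype(int)
print("late fusion acc:", (late_pred == y[idx_te]).mean().round(3))
\end{verbatim}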
To obtain an optimal first-order convergence guarantee for stochastic optimization, it is necessary to use a recurrent data sampling algorithm that samples every data point with sufficient frequency. Most commonly used data sampling algorithms (e.g., i.i.d., MCMC, random reshuffling) are indeed recurrent under mild assumptions. In this work, we show that for a particular class of stochastic optimization algorithms, no property of the data sampling algorithm beyond recurrence (e.g., independence, exponential mixing, or reshuffling) is needed to guarantee an optimal rate of first-order convergence. Namely, using regularized versions of Minimization by Incremental Surrogate Optimization (MISO), we show that for non-convex and possibly non-smooth objective functions, the expected optimality gap converges at an optimal rate $O(n^{-1/2})$ under general recurrent sampling schemes. Furthermore, the implied constant depends explicitly on the `speed of recurrence', measured by the expected time to visit a given data point, either averaged (`target time') or maximized (`hitting time') over the current location. We demonstrate theoretically and empirically that convergence can be accelerated by selecting sampling algorithms that cover the data set most effectively. We discuss applications of our general framework to decentralized optimization and distributed non-negative matrix factorization.
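The sketch below runs a bare-bones MISO loop on a smooth least-squares objective, using a random-walk sampler over the data indices as an example of a sampling scheme that is recurrent but neither i.i.d. nor exponentially mixing. The regularized variants and the non-convex, non-smooth setting analysed above are not reproduced; the problem instance is an illustrative assumption.

\begin{verbatim}
# Bare-bones MISO on a least-squares objective with a recurrent (random-walk)
# index sampler -- an illustration of recurrence-only sampling, not the
# regularized algorithm analysed in the paper.
import numpy as np

rng = np.random.default_rng(7)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
L = np.max(np.sum(A**2, axis=1))            # per-sample smoothness constant

def grad_i(theta, i):                       # gradient of f_i = 0.5*(a_i.theta - b_i)^2
    return A[i] * (A[i] @ theta - b[i])

v = (A * b[:, None]) / L                    # v_i = z_i - grad f_i(z_i)/L, anchors z_i = 0
theta = v.mean(axis=0)                      # minimizer of the average surrogate
i = 0
for t in range(60000):                      # random walk on the index cycle: recurrent,
    i = (i + rng.choice((-1, 1))) % n       # but neither i.i.d. nor exponentially mixing
    v_new = theta - grad_i(theta, i) / L
    theta = theta + (v_new - v[i]) / n      # incremental update of the surrogate minimizer
    v[i] = v_new

print("distance to least-squares solution:",
      np.linalg.norm(theta - np.linalg.lstsq(A, b, rcond=None)[0]).round(4))
\end{verbatim}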
In toxicology, validating the concurrent control against historical control data (HCD) has become a requirement. This validation is usually done via historical control limits (HCL), which in practice are often displayed graphically in the manner of a Shewhart control chart. In many applications, HCL are applied to dichotomous data, e.g. the number of rats with a tumor vs. the number of rats without a tumor (carcinogenicity studies), or the number of cells with a micronucleus out of a total number of cells. Dichotomous HCD may be overdispersed and can be heavily right- (or left-) skewed, which is usually not taken into account in practical applications of HCL. To overcome this problem, we propose four different prediction intervals (two frequentist, two Bayesian) that can be applied to such data. Comprehensive Monte-Carlo simulations assessing the coverage probabilities of seven different methods for HCL calculation reveal that frequentist bootstrap-calibrated prediction intervals control the type-I error best. Heuristics traditionally used in control charts (e.g. the limits of Shewhart np-charts or the mean plus/minus 2 SD), as well as the historical range, fail to control a pre-specified coverage probability. The application of HCL is demonstrated on a real-life data set containing historical controls from long-term carcinogenicity studies run on behalf of the U.S. National Toxicology Program. The proposed frequentist prediction intervals are publicly available in the R package predint, whereas R code for the computation of the Bayesian prediction intervals is provided via GitHub.
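As a quick illustration of the problem (not the paper's simulation study), the snippet below checks the prediction coverage of the ``mean plus/minus 2 SD'' heuristic when dichotomous HCD are generated from an assumed overdispersed beta-binomial model; the calibrated intervals themselves are available in predint.

\begin{verbatim}
# Quick Monte-Carlo check (not the paper's simulation study): coverage of the
# "mean +/- 2 SD" heuristic for overdispersed dichotomous HCD, generated here
# from an assumed beta-binomial model.
import numpy as np

rng = np.random.default_rng(8)
n_sim, n_hist, n_animals = 10000, 10, 50
a, b = 2.0, 18.0                              # beta-binomial: mean 0.1, overdispersed

covered = 0
for _ in range(n_sim):
    p = rng.beta(a, b, n_hist + 1)
    counts = rng.binomial(n_animals, p)       # historical studies + one future study
    hist, future = counts[:-1], counts[-1]
    lower = hist.mean() - 2 * hist.std(ddof=1)
    upper = hist.mean() + 2 * hist.std(ddof=1)
    covered += (lower <= future <= upper)

# typically falls short of a nominal 95% target for skewed, overdispersed data
print("coverage of mean +/- 2 SD:", covered / n_sim)
\end{verbatim}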
Current approaches to identifying driving heterogeneity struggle to uncover fundamental patterns from the perspective of underlying driving behavior mechanisms. The concept of Action phases was proposed in our previous work to capture the diversity of driving characteristics with physical meaning. This study presents a novel framework that further interprets driving patterns by classifying Action phases in an unsupervised manner. In this framework, a Resampling and Downsampling Method (RDM) is first applied to standardize the length of Action phases. Then a clustering calibration procedure consisting of ``Feature Selection'', ``Clustering Analysis'', ``Difference/Similarity Evaluation'', and ``Action phases Re-extraction'' is applied iteratively until all differences among clusters and similarities within clusters reach pre-determined criteria. Applying the framework to real-world data revealed six driving patterns in the I80 dataset, labeled ``Catch up'', ``Keep away'', and ``Maintain distance'', each in both a ``Stable'' and an ``Unstable'' state. Notably, Unstable patterns outnumber Stable ones, and ``Maintain distance'' is the most common Stable pattern; these observations align with the dynamic nature of driving. Two patterns, ``Stable keep away'' and ``Unstable catch up'', are missing from the US101 dataset, which is in line with our expectations, as this dataset was previously shown to exhibit less heterogeneity. This demonstrates the potential of driving patterns for describing driving heterogeneity. The proposed framework promises advantages in addressing label scarcity in supervised learning and in enhancing tasks such as driving behavior modeling and driving trajectory prediction.
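The snippet below sketches the two generic ingredients of such a framework, length standardization by resampling and unsupervised clustering, on synthetic variable-length segments. The iterative calibration with difference/similarity criteria and the actual Action phase features are not reproduced; all names and dimensions are illustrative assumptions.

\begin{verbatim}
# Length standardization by interpolation plus k-means clustering -- a generic
# stand-in for the RDM and clustering steps (the iterative calibration with
# difference/similarity criteria is not reproduced here).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)

def resample_to_length(series, target_len=50):
    """Linearly interpolate a 1-D segment onto a fixed number of samples."""
    old = np.linspace(0, 1, len(series))
    new = np.linspace(0, 1, target_len)
    return np.interp(new, old, series)

# synthetic variable-length "Action phase" segments (e.g., speed profiles)
segments = [np.cumsum(rng.standard_normal(rng.integers(30, 120)))
            for _ in range(200)]
X = np.stack([resample_to_length(s) for s in segments])
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
print("segments per cluster:", np.bincount(labels))
\end{verbatim}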