Joint models (JM) for longitudinal and survival data have gained increasing interest and found applications in a wide range of clinical and biomedical settings. These models facilitate the understanding of the relationship between outcomes and enable individualized predictions. In many applications, more complex event processes arise, necessitating joint longitudinal and multistate models. However, their practical application can be hindered by computational challenges due to increased model complexity and large sample sizes. Motivated by a longitudinal multimorbidity analysis of large UK health records, we have developed a scalable Bayesian methodology for such joint multistate models that is capable of handling complex event processes and large datasets, with straightforward implementation. We propose two blockwise inference approaches for different inferential purposes based on different levels of decomposition of the multistate processes. These approaches leverage parallel computing, ease the specification of different models for different transitions, and model/variable selection can be performed within a Bayesian framework using Bayesian leave-one-out cross-validation. Using a simulation study, we show that the proposed approaches achieve satisfactory performance regarding posterior point and interval estimation, with notable gains in sampling efficiency compared to the standard estimation strategy. We illustrate our approaches using a large UK electronic health record dataset where we analysed the coevolution of routinely measured systolic blood pressure (SBP) and the progression of multimorbidity, defined as the combinations of three chronic conditions. Our analysis identified distinct association structures between SBP and different disease transitions.
Recurrent neural networks (RNNs) have yielded promising results for both recognizing objects in challenging conditions and modeling aspects of primate vision. However, the representational dynamics of recurrent computations remain poorly understood, especially in large-scale visual models. Here, we studied such dynamics in RNNs trained for object classification on MiniEcoset, a novel subset of ecoset. We report two main insights. First, upon inference, representations continued to evolve after correct classification, suggesting a lack of the notion of being ``done with classification''. Second, focusing on ``readout zones'' as a way to characterize the activation trajectories, we observe that misclassified representations exhibit activation patterns with lower L2 norm, and are positioned more peripherally in the readout zones. Such arrangements help the misclassified representations move into the correct zones as time progresses. Our findings generalize to networks with lateral and top-down connections, and include both additive and multiplicative interactions with the bottom-up sweep. The results therefore contribute to a general understanding of RNN dynamics in naturalistic tasks. We hope that the analysis framework will aid future investigations of other types of RNNs, including understanding of representational dynamics in primate vision.
Multiscale modeling of complex systems is crucial for understanding their intricacies. Data-driven multiscale modeling has emerged as a promising approach to tackle challenges associated with complex systems. On the other hand, self-similarity is prevalent in complex systems, hinting that large-scale complex systems can be modeled at a reduced cost. In this paper, we introduce a multiscale neural network framework that incorporates self-similarity as prior knowledge, facilitating the modeling of self-similar dynamical systems. For deterministic dynamics, our framework can discern whether the dynamics are self-similar. For uncertain dynamics, it can compare and determine which parameter set is closer to self-similarity. The framework allows us to extract scale-invariant kernels from the dynamics for modeling at any scale. Moreover, our method can identify the power law exponents in self-similar systems. Preliminary tests on the Ising model yielded critical exponents consistent with theoretical expectations, providing valuable insights for addressing critical phase transitions in non-equilibrium systems.
We present a simulation strategy for the real-time dynamics of quantum fields, inspired by reinforcement learning. It builds on the complex Langevin approach, which it amends with system specific prior information, a necessary prerequisite to overcome this exceptionally severe sign problem. The optimization process underlying our machine learning approach is made possible by deploying inherently stable solvers of the complex Langevin stochastic process and a novel optimality criterion derived from insight into so-called boundary terms. This conceptual and technical progress allows us to both significantly extend the range of real-time simulations in 1+1d scalar field theory beyond the state-of-the-art and to avoid discretization artifacts that plagued previous real-time field theory simulations. Limitations of and promising future directions are discussed.
The lock set method and the partial order method are two main approaches to guarantee that dynamic data race prediction remains efficient. There are many variations of these ideas. Common to all of them is the assumption that the events in a critical section belong to the same thread. We have evidence that critical sections in the wild do extend across thread boundaries even if the surrounding acquire and release events occur in the same thread. We introduce the novel concept of a cross-thread critical section to capture such situations, offer a theoretical comprehensive framework, and study their impact on state-of-the-art data race analyses. For sound partial order relations such as WCP, SDP, and DCtp, the occurrence of cross-thread critical sections negatively impacts their precision. For complete partial order relations such as WDP and PWR, cross-thread critical sections help to eliminate more false positives. The same (positive) impact applies to the lock set construction. Our experimental evaluation confirms that cross-thread critical sections arise in practice. For the complete relation PWR, we are able to reduce the number of false positives. The performance overhead incurred by tracking cross-thread critical sections slows down the analysis by 10\%-20\%, on average.
Suitable discretizations through tensor product formulas of popular multidimensional operators (diffusion--advection, for instance) lead to matrices with $d$-dimensional Kronecker sum structure. For evolutionary PDEs containing such operators and integrated in time with exponential integrators, it is of paramount importance to efficiently approximate actions of $\varphi$-functions of this kind of matrices. In this work, we show how to produce directional split approximations of third order with respect to the time step size. They conveniently employ tensor-matrix products (realized with highly performance level 3 BLAS) and that allow for the effective usage in practice of exponential integrators up to order three. The approach has been successfully tested against state-of-the-art techniques on two well-known physical models, namely FitzHugh--Nagumo and Schnakenberg.
Traditional static functional data analysis is facing new challenges due to streaming data, where data constantly flow in. A major challenge is that storing such an ever-increasing amount of data in memory is nearly impossible. In addition, existing inferential tools in online learning are mainly developed for finite-dimensional problems, while inference methods for functional data are focused on the batch learning setting. In this paper, we tackle these issues by developing functional stochastic gradient descent algorithms and proposing an online bootstrap resampling procedure to systematically study the inference problem for functional linear regression. In particular, the proposed estimation and inference procedures use only one pass over the data; thus they are easy to implement and suitable to the situation where data arrive in a streaming manner. Furthermore, we establish the convergence rate as well as the asymptotic distribution of the proposed estimator. Meanwhile, the proposed perturbed estimator from the bootstrap procedure is shown to enjoy the same theoretical properties, which provide the theoretical justification for our online inference tool. As far as we know, this is the first inference result on the functional linear regression model with streaming data. Simulation studies are conducted to investigate the finite-sample performance of the proposed procedure. An application is illustrated with the Beijing multi-site air-quality data.
One of the main challenges for interpreting black-box models is the ability to uniquely decompose square-integrable functions of non-mutually independent random inputs into a sum of functions of every possible subset of variables. However, dealing with dependencies among inputs can be complicated. We propose a novel framework to study this problem, linking three domains of mathematics: probability theory, functional analysis, and combinatorics. We show that, under two reasonable assumptions on the inputs (non-perfect functional dependence and non-degenerate stochastic dependence), it is always possible to decompose uniquely such a function. This ``canonical decomposition'' is relatively intuitive and unveils the linear nature of non-linear functions of non-linearly dependent inputs. In this framework, we effectively generalize the well-known Hoeffding decomposition, which can be seen as a particular case. Oblique projections of the black-box model allow for novel interpretability indices for evaluation and variance decomposition. Aside from their intuitive nature, the properties of these novel indices are studied and discussed. This result offers a path towards a more precise uncertainty quantification, which can benefit sensitivity analyses and interpretability studies, whenever the inputs are dependent. This decomposition is illustrated analytically, and the challenges to adopting these results in practice are discussed.
Private synthetic data sharing is preferred as it keeps the distribution and nuances of original data compared to summary statistics. The state-of-the-art methods adopt a select-measure-generate paradigm, but measuring large domain marginals still results in much error and allocating privacy budget iteratively is still difficult. To address these issues, our method employs a partition-based approach that effectively reduces errors and improves the quality of synthetic data, even with a limited privacy budget. Results from our experiments demonstrate the superiority of our method over existing approaches. The synthetic data produced using our approach exhibits improved quality and utility, making it a preferable choice for private synthetic data sharing.
Projection-based testing for mean trajectory differences in two groups of irregularly and sparsely observed functional data has garnered significant attention in the literature because it accommodates a wide spectrum of group differences and (non-stationary) covariance structures. This article presents the derivation of the theoretical power function and the introduction of a comprehensive power and sample size (PASS) calculation toolkit tailored to the projection-based testing method developed by Wang (2021). Our approach accommodates a wide spectrum of group difference scenarios and a broad class of covariance structures governing the underlying processes. Through extensive numerical simulation, we demonstrate the robustness of this testing method by showcasing that its statistical power remains nearly unaffected even when a certain percentage of observations are missing, rendering it 'missing-immune'. Furthermore, we illustrate the practical utility of this test through analysis of two randomized controlled trials of Parkinson's disease. To facilitate implementation, we provide a user-friendly R package fPASS, complete with a detailed vignette to guide users through its practical application. We anticipate that this article will significantly enhance the usability of this potent statistical tool across a range of biostatistical applications, with a particular focus on its relevance in the design of clinical trials.
With observational data alone, causal structure learning is a challenging problem. The task becomes easier when having access to data collected from perturbations of the underlying system, even when the nature of these is unknown. Existing methods either do not allow for the presence of latent variables or assume that these remain unperturbed. However, these assumptions are hard to justify if the nature of the perturbations is unknown. We provide results that enable scoring causal structures in the setting with additive, but unknown interventions. Specifically, we propose a maximum-likelihood estimator in a structural equation model that exploits system-wide invariances to output an equivalence class of causal structures from perturbation data. Furthermore, under certain structural assumptions on the population model, we provide a simple graphical characterization of all the DAGs in the interventional equivalence class. We illustrate the utility of our framework on synthetic data as well as real data involving California reservoirs and protein expressions. The software implementation is available as the Python package \emph{utlvce}.