The joint modeling of longitudinal and time-to-event outcomes has become a popular tool in follow-up studies. However, fitting Bayesian joint models to large datasets, such as patient registries, can require extended computing times. To speed up sampling, we divided a patient registry dataset into subsamples, analyzed them in parallel, and combined the resulting Markov chain Monte Carlo draws into a consensus distribution. We used a simulation study to investigate how different consensus strategies perform with joint models. In particular, we compared grouping all draws together with using equal- and precision-weighted averages. We considered scenarios reflecting different sample sizes, numbers of data splits, and processor characteristics. Parallelization of the sampling process substantially decreased the time required to run the model. We found that the weighted-average consensus distributions for large sample sizes were nearly identical to the target posterior distribution. The proposed algorithm has been made available in an R package for joint models, JMbayes2. This work was motivated by the clinical interest in investigating the association between ppFEV1, a commonly measured marker of lung function, and the risk of lung transplant or death, using data from the US Cystic Fibrosis Foundation Patient Registry (35,153 individuals with 372,366 years of cumulative follow-up). Splitting the registry into five subsamples resulted in an 85\% decrease in computing time, from 9.22 to 1.39 hours. Splitting the data and finding a consensus distribution by precision-weighted averaging proved to be a computationally efficient and robust approach to handling large datasets under the joint modeling framework.
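The consensus step described above can be illustrated with a minimal sketch for a single scalar parameter. It is not the JMbayes2 implementation: the function names, the toy Gaussian "draws" standing in for MCMC output, and the use of the inverse sample variance as the precision weight are all assumptions made for illustration.

```python
import random
import statistics


def consensus_draws(subposterior_draws, weights):
    """Combine draw t from every subsample into one consensus draw (weighted average)."""
    total = sum(weights)
    n_draws = min(len(d) for d in subposterior_draws)
    return [
        sum(w * d[t] for w, d in zip(weights, subposterior_draws)) / total
        for t in range(n_draws)
    ]


def precision_weights(subposterior_draws):
    """Assumed weighting: inverse of each subsample's sample variance."""
    return [1.0 / statistics.variance(d) for d in subposterior_draws]


# Toy stand-in for MCMC output from three subsamples (hypothetical values).
rng = random.Random(0)
draws = [
    [rng.gauss(mu, sd) for _ in range(2000)]
    for mu, sd in [(1.0, 0.2), (1.1, 0.4), (0.9, 0.3)]
]

pooled = [x for d in draws for x in d]                        # grouping all draws together
equal = consensus_draws(draws, [1.0] * len(draws))            # equal-weighted average
weighted = consensus_draws(draws, precision_weights(draws))   # precision-weighted average
```

In this toy setting the precision-weighted consensus gives more influence to subsamples whose draws are less variable, which is why it tends to be tighter than the equal-weighted one.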
In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LuPI) offers increased sample efficiency by allowing prediction models access to auxiliary information at training time that is unavailable when the models are used. In recent work, it was shown that for prediction in linear-Gaussian dynamical systems, a LuPI learner with access to intermediate time series data is never worse and often better in expectation than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case when this map is unknown. A suite of empirical results confirms the theoretical findings and shows the potential of using privileged time-series information in nonlinear prediction.
Embedded devices are specialised devices designed for one or only a few purposes. They are often part of a larger system, to which they are connected through wired or wireless links. Embedded devices that are connected to other computers or embedded systems through the Internet are called Internet of Things (IoT) devices. Because they are widely used and insufficiently protected, these devices are increasingly becoming the target of malware attacks. Manufacturers often cut corners to save costs, or misconfigure the devices during production. The resulting weaknesses include missing software updates, ports left open, and security defects by design. Although these devices may not be as powerful as a regular computer, their large number makes them suitable candidates for botnets. Some IoT devices can even endanger health, since even pacemakers are connected to the Internet. This means that, without sufficient defence, even targeted attacks against people are possible. The goal of this thesis project is to provide better security for these devices with the help of machine learning algorithms and reverse engineering tools. Specifically, I study the applicability of control-flow related data of executables for malware detection. I present a malware detection method with two phases. The first phase extracts control-flow related data using static binary analysis. The second phase classifies binary executables as either malicious or benign using a neural network model. I train the model using a dataset of malicious and benign ARM applications.
Recently, deep reinforcement learning (DRL) models have shown promising results in solving routing problems. However, most DRL solvers have been proposed for node routing problems, such as the Traveling Salesman Problem (TSP). Meanwhile, there has been limited research on applying neural methods to arc routing problems, such as the Chinese Postman Problem (CPP), since they often feature irregular and complex solution spaces compared to the TSP. To fill these gaps, this paper proposes a novel DRL framework to address the CPP with load-dependent costs (CPP-LC) (Corberan et al., 2018), which is a complex arc routing problem with load constraints. The novelty of our method is two-fold. First, we formulate the CPP-LC as a Markov Decision Process (MDP) sequential model. We then introduce an autoregressive model based on DRL, namely Arc-DRL, consisting of an encoder and a decoder, to address the CPP-LC effectively. Such a framework allows the DRL model to work efficiently and to scale to arc routing problems. Furthermore, we propose a new bio-inspired meta-heuristic solution based on an Evolutionary Algorithm (EA) for the CPP-LC. Extensive experiments show that Arc-DRL outperforms existing meta-heuristic methods such as Iterative Local Search (ILS) and Variable Neighborhood Search (VNS), proposed by Corberan et al. (2018), on large benchmark datasets for the CPP-LC in both solution quality and running time, while the EA gives the best solution quality at the cost of a much longer running time. We release our C++ implementations of metaheuristics such as EA, ILS and VNS, along with the code for data generation and our generated data, at //github.com/HySonLab/Chinese_Postman_Problem
Capturing the extremal behaviour of data often requires bespoke marginal and dependence models which are grounded in rigorous asymptotic theory, and hence provide reliable extrapolation into the upper tails of the data-generating distribution. We present a modern toolbox of four methodological frameworks, motivated by modern extreme value theory, that can be used to accurately estimate extreme exceedance probabilities or the corresponding level in either a univariate or multivariate setting. Our frameworks were used to facilitate the winning contribution of Team Yalla to the data competition organised for the 13th International Conference on Extreme Value Analysis (EVA2023). This competition comprised seven teams competing across four separate sub-challenges, with each requiring the modelling of data simulated from known, yet highly complex, statistical distributions, and extrapolation far beyond the range of the available samples in order to predict probabilities of extreme events. Data were constructed to be representative of real environmental data, sampled from the fantasy country of "Utopia".
Diffusion models have become a main paradigm for synthetic data generation in many subfields of modern machine learning, including computer vision, language modeling, and speech synthesis. In this paper, we leverage the power of diffusion models to generate synthetic tabular data. The heterogeneous features in tabular data have been a main obstacle in tabular data synthesis, and we tackle this problem by employing an auto-encoder architecture. Compared with state-of-the-art tabular synthesizers, the synthetic tables produced by our model show good statistical fidelity to the real data and perform well in downstream tasks for machine learning utility. We conducted experiments on 15 publicly available datasets. Notably, our model adeptly captures the correlations among features, which has been a long-standing challenge in tabular data synthesis. Our code is available at //github.com/UCLA-Trustworthy-AI-Lab/AutoDiffusion.
Temporal data obtained in settings where only one time point per trajectory can be observed are widely used in different research fields, yet remain insufficiently addressed from the statistical point of view. Such data often contain observations of a large number of entities, in which case it is of interest to identify a small number of representative behavior types. In this paper, we propose a new method that performs clustering simultaneously with alignment of temporal objects inferred from these data, providing insight into the relationships between the entities. A series of simulations confirms the ability of the proposed approach to leverage multiple properties of the complex data we target, such as accessible uncertainties, correlations and a small number of time points. We illustrate it on real data encoding cellular response to a high-energy radiation treatment, supported by the results of an enrichment analysis.
How do score-based generative models (SBMs) learn the data distribution supported on a low-dimensional manifold? We investigate the score model of a trained SBM through its linear approximations and subspaces spanned by local feature vectors. During diffusion, as the noise decreases, the local dimensionality increases and becomes more varied between different sample sequences. Importantly, we find that the learned vector field mixes samples by a non-conservative field within the manifold, although it denoises with normal projections as if there were an energy function in off-manifold directions. At each noise level, the subspace spanned by the local features overlaps with an effective density function. These observations suggest that SBMs can flexibly mix samples with the learned score field while carefully maintaining a manifold-like structure of the data distribution.
Compartmental models provide simple and efficient tools to analyze the relevant transmission processes during an outbreak, to produce short-term forecasts or transmission scenarios, and to assess the impact of vaccination campaigns. However, their calibration is not straightforward, since many factors contribute to the rapid change of the transmission dynamics during an epidemic. For example, there might be changes in individual awareness, the imposition of non-pharmacological interventions, and the emergence of new variants. As a consequence, model parameters such as the transmission rate are bound to change over time, making their assessment more challenging. Here, we propose to use Physics-Informed Neural Networks (PINNs) to track the temporal changes in the model parameters and provide an estimate of the model state variables. PINNs have recently gained attention in many engineering applications thanks to their ability to consider both the information from data (typically uncertain) and the governing equations of the system. The ability of PINNs to identify unknown model parameters makes them particularly suitable for solving ill-posed inverse problems, such as those arising in the application of epidemiological models. We develop a reduced-split approach for the implementation of PINNs to estimate the temporal changes in the state variables and transmission rate of an epidemic, based on the SIR model equations and data on infectious individuals. The main idea is to split the training, first fitting the epidemiological data and then minimizing the residual of the system equations. The proposed method is applied to five synthetic test cases and two real scenarios reproducing the first months of the COVID-19 pandemic in Italy. Our results show that the split implementation of PINNs outperforms the standard approach in terms of accuracy (up to one order of magnitude) and computational time (a speed-up of 20%).
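To make the two ingredients mentioned above concrete, the following sketch simulates an SIR model with a time-varying transmission rate and evaluates the residual of the infectious-compartment equation on a trajectory. It is only an illustration under stated assumptions, not the PINN method itself: the forward-Euler discretization, the step-change form of `beta`, the recovery rate, and all function names are hypothetical choices.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt, n_steps):
    """Forward-Euler integration of the SIR equations with transmission rate beta(t)."""
    n = s0 + i0 + r0
    s, i, r = s0, i0, r0
    traj = [(s, i, r)]
    for k in range(n_steps):
        new_inf = beta(k * dt) * s * i / n * dt   # S -> I flow over one step
        new_rec = gamma * i * dt                  # I -> R flow over one step
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        traj.append((s, i, r))
    return traj


def infectious_residual(traj, beta, gamma, dt):
    """Finite-difference residual of dI/dt = beta(t) S I / N - gamma I along a trajectory."""
    n = sum(traj[0])
    res = []
    for k in range(len(traj) - 1):
        s, i, _ = traj[k]
        di_dt = (traj[k + 1][1] - i) / dt
        res.append(di_dt - (beta(k * dt) * s * i / n - gamma * i))
    return res


# Hypothetical scenario: transmission drops after day 30 (e.g. an intervention).
beta = lambda t: 0.4 if t < 30 else 0.15
traj = simulate_sir(beta, gamma=0.1, s0=999.0, i0=1.0, r0=0.0, dt=0.1, n_steps=1000)
residual = infectious_residual(traj, beta, gamma=0.1, dt=0.1)
```

In a PINN, a residual of this kind would be evaluated on the network output and penalized in the loss, alongside the misfit to the observed data.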
The trace plot is seldom used in meta-analysis, yet it is a very informative plot. In this article we define and illustrate what the trace plot is, and discuss why it is important. The Bayesian version of the plot combines the posterior density of tau, the between-study standard deviation, and the shrunken estimates of the study effects as a function of tau. With a small or moderate number of studies, tau is not estimated with much precision, and parameter estimates and shrunken study effect estimates can vary widely depending on the correct value of tau. The trace plot allows visualization of the sensitivity to tau along with a plot that shows which values of tau are plausible and which are implausible. A comparable frequentist or empirical Bayes version provides similar results. The concepts are illustrated using examples in meta-analysis and meta-regression; implementation in R is facilitated in a Bayesian or frequentist framework using the bayesmeta and metafor packages, respectively.
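The curves shown in such a plot can be sketched for the standard normal-normal random-effects model, where the pooled mean and the shrunken study effects have closed forms conditional on tau. This is an illustrative sketch, not the bayesmeta or metafor implementation: the function name and the effect estimates and standard errors below are made-up values.

```python
def shrunken_estimates(y, se, tau):
    """Pooled mean and shrunken study effects conditional on tau (normal-normal model)."""
    w = [1.0 / (s ** 2 + tau ** 2) for s in se]            # inverse total variances
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)      # conditional pooled mean
    theta = [
        (s ** 2 * mu + tau ** 2 * yi) / (s ** 2 + tau ** 2)  # shrink y_i toward mu
        for yi, s in zip(y, se)
    ]
    return mu, theta


# Hypothetical study effect estimates and standard errors.
y = [0.10, 0.45, -0.20, 0.30]
se = [0.15, 0.25, 0.20, 0.10]

# One row per tau value: these are the traces that would be drawn against tau.
tau_grid = [0.0, 0.05, 0.1, 0.2, 0.4, 0.8]
traces = [shrunken_estimates(y, se, tau) for tau in tau_grid]
```

At tau = 0 every study effect collapses onto the pooled mean, and as tau grows each shrunken estimate moves back toward its own observed value, which is the sensitivity the plot is meant to display.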
In many applications, a stochastic system is studied using a model implicitly defined via a simulator. We develop a simulation-based parameter inference method for implicitly defined models. Our method differs from traditional likelihood-based inference in that it uses a metamodel for the distribution of a log-likelihood estimator. The metamodel is built on a local asymptotic normality (LAN) property satisfied by the simulation-based log-likelihood estimator under certain conditions. A hypothesis testing method is developed under the metamodel. Our method can enable accurate parameter estimation and uncertainty quantification where other Monte Carlo methods for parameter inference become highly inefficient due to large Monte Carlo variance. We demonstrate our method using numerical examples, including a mechanistic model for the population dynamics of infectious disease.