We present a Python implementation for RS-HDMR-GPR (Random Sampling High Dimensional Model Representation Gaussian Process Regression). The method builds representations of multivariate functions with lower-dimensional terms, either as an expansion over orders of coupling or using terms of only a given dimensionality. This facilitates, in particular, recovering functional dependence from sparse data. The code also allows for imputation of missing values of the variables and for a significant pruning of the useful number of HDMR terms. The code can also be used for estimating relative importance of different combinations of input variables, thereby adding an element of insight to a general machine learning method. The capabilities of this regression tool are demonstrated on test cases involving synthetic analytic functions, the potential energy surface of the water molecule, kinetic energy densities of materials (crystalline magnesium, aluminum, and silicon), and financial market data.
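The underlying representation is an expansion of the target function into low-dimensional component functions, $f(\mathbf{x}) \approx f_0 + \sum_i f_i(x_i) + \sum_{i<j} f_{ij}(x_i,x_j) + \dots$, with each component modelled by a Gaussian process. The following is only a minimal sketch of that idea, not the authors' RS-HDMR-GPR package: it fits a sum of one-dimensional Gaussian process terms to a synthetic three-dimensional function by simple backfitting with scikit-learn; all data, kernels and settings are illustrative.

# Minimal sketch (assumed, not the authors' code): HDMR-style sum of 1D GP terms.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2]   # synthetic target

terms = [GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-3)
         for _ in range(3)]
term_values = np.zeros((3, len(y)))                       # current fitted term values

for sweep in range(10):                                   # backfitting sweeps
    for d in range(3):
        partial_residual = y - term_values.sum(axis=0) + term_values[d]
        terms[d].fit(X[:, [d]], partial_residual)         # refit the d-th 1D term
        term_values[d] = terms[d].predict(X[:, [d]])

y_hat = term_values.sum(axis=0)
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))

Higher-order coupling terms (pairs, triples, ...) would be added in the same way, each as a Gaussian process over the corresponding subset of inputs.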
This paper proposes a general procedure to analyse high-dimensional spatio-temporal count data, with special emphasis on relative risk estimation in cancer epidemiology. Model fitting is carried out using integrated nested Laplace approximations over a partition of the spatio-temporal domain. This is a simple idea that works very well in this context, as the models are defined to borrow strength locally in space and time, providing reliable risk estimates. Parallel and distributed strategies are proposed to speed up computations in a setting where Bayesian model fitting is generally prohibitively time-consuming or even infeasible. We evaluate the whole procedure in a simulation study with a twofold objective: to estimate risks accurately and to detect extreme-risk areas while avoiding false positives/negatives. We show that our method outperforms classical global models. A real data analysis comparing the global models and the new procedure is also presented.
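As a schematic illustration of the divide-and-conquer strategy (not the authors' INLA-based implementation), the sketch below partitions simulated count data into blocks, fits a local model to each block in a separate process, and collects block-level relative-risk estimates; a crude ratio of observed to expected counts stands in for the local Bayesian spatio-temporal model, and all names are illustrative.

# Schematic sketch (assumed): fit local models over a partition, in parallel.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def fit_block(block):
    counts, expected = block              # observed and expected counts in one block
    return counts.sum() / expected.sum()  # crude local relative-risk estimate

def partition(counts, expected, n_blocks):
    return list(zip(np.array_split(counts, n_blocks),
                    np.array_split(expected, n_blocks)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expected = rng.uniform(5, 50, size=10_000)
    counts = rng.poisson(1.2 * expected)          # true relative risk 1.2
    blocks = partition(counts, expected, n_blocks=8)
    with ProcessPoolExecutor() as pool:           # fit all blocks in parallel
        local_risks = list(pool.map(fit_block, blocks))
    print(local_risks)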
Ensembles of networks arise in various fields where multiple independent networks are observed on the same set of nodes, for example, a collection of brain networks constructed on the same brain regions for different individuals. However, there are few models that describe both the variations and characteristics of networks in an ensemble at the same time. In this paper, we propose to model the ensemble of networks using a Dirichlet Process Mixture of Exponential Random Graph Models (DPM-ERGMs), which divides the ensemble into different clusters and models each cluster of networks using a separate Exponential Random Graph Model (ERGM). By employing a Dirichlet process mixture, the number of clusters can be determined automatically and adapted to the data provided. Moreover, in order to perform full Bayesian inference for DPM-ERGMs, we employ the intermediate importance sampling technique inside the Metropolis-within-slice sampling scheme, which addresses the problem of sampling from intractable ERGMs on an infinite sample space. We also demonstrate the performance of DPM-ERGMs on both simulated and real datasets.
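In generic notation (assumed here, not quoted from the paper), each network $G_m$ in the ensemble receives an ERGM parameter drawn from a Dirichlet process, and the ERGM likelihood involves an intractable normalising constant:
$$
P \sim \mathrm{DP}(\alpha, P_0), \qquad \theta_m \mid P \sim P, \qquad
\pi(G_m \mid \theta_m) = \frac{\exp\{\theta_m^{\top} s(G_m)\}}{Z(\theta_m)},
$$
where $s(\cdot)$ is a vector of network statistics. Networks sharing the same atom $\theta_m$ form a cluster, so the number of distinct atoms, and hence of clusters, is inferred from the data; the intractability of $Z(\theta)$ is what necessitates the importance-sampling step inside the slice sampler.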
The recently proposed statistical finite element (statFEM) approach synthesises measurement data with finite element models and allows for making predictions about the true system response. We provide a probabilistic error analysis for a prototypical statFEM setup based on a Gaussian process prior under the assumption that the noisy measurement data are generated by a deterministic true system response function that satisfies a second-order elliptic partial differential equation for an unknown true source term. In certain cases, properties such as the smoothness of the source term may be misspecified by the Gaussian process model. The error estimates we derive are for the expectation with respect to the measurement noise of the $L^2$-norm of the difference between the true system response and the mean of the statFEM posterior. The estimates imply polynomial rates of convergence in the numbers of measurement points and finite element basis functions and depend on the Sobolev smoothness of the true source term and the Gaussian process model. A numerical example for Poisson's equation is used to illustrate these theoretical results.
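In notation assumed here for concreteness (written for the Poisson case used in the paper's numerical example; the paper's own symbols may differ), the quantity analysed is the noise-averaged $L^2$ error between the true response $u^\dagger$, which solves the elliptic problem for the unknown true source $f^\dagger$, and the statFEM posterior mean $\bar{u}$ conditioned on the noisy point observations:
$$
e \;=\; \mathbb{E}_{\varepsilon}\big[\, \| u^\dagger - \bar{u} \|_{L^2(D)} \,\big],
\qquad -\Delta u^\dagger = f^\dagger \ \text{in } D, \quad u^\dagger = 0 \ \text{on } \partial D .
$$
The derived estimates bound $e$ by terms that decay polynomially in the number of measurement points and the number of finite element basis functions, with rates governed by the Sobolev smoothness of $f^\dagger$ and of the Gaussian process model.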
The software package $\texttt{mstate}$, used in combination with the package $\texttt{survival}$, provides not only a well-established multi-state survival analysis framework in R, but also one of the most complete ones, as it includes point and interval estimation of relative transition hazards, cumulative transition hazards and state occupation probabilities, under both clock-forward and clock-reset models; personalised estimates, i.e. estimates for an individual with specific covariate measurements, can also be obtained with $\texttt{mstate}$ by fitting a Cox regression model. The new R package $\texttt{ebmstate}$, which we present in the current paper, is an extension of $\texttt{mstate}$ and, to our knowledge, the first R package for multi-state model estimation that is both suitable for higher-dimensional data and complete in the sense just mentioned. Its extension of $\texttt{mstate}$ is threefold: it transforms the Cox model into a regularised, empirical Bayes model that performs significantly better with higher-dimensional data; it replaces the asymptotic confidence intervals meant for the low-dimensional setting with non-parametric bootstrap confidence intervals; and it introduces an analytical, Fourier transform-based estimator of state occupation probabilities for clock-reset models that is substantially faster than the corresponding simulation-based estimator in $\texttt{mstate}$. The present paper includes a detailed tutorial on how to use our package to estimate transition hazards and state occupation probabilities, as well as a simulation study showing how it improves on the performance of $\texttt{mstate}$.
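To see why a transform-based estimator is natural for clock-reset models, note that under a semi-Markov (clock-reset) assumption the state occupation probabilities satisfy renewal-type convolution equations; in generic notation (assumed here, not the package's),
$$
P_{hj}(t) \;=\; \delta_{hj}\, S_h(t) \;+\; \sum_{k} \int_0^{t} f_{hk}(u)\, P_{kj}(t-u)\, \mathrm{d}u,
$$
where $f_{hk}(u)$ is the density of an $h \to k$ transition at sojourn time $u$ and $S_h(t)$ is the probability of remaining in state $h$ for at least sojourn time $t$. Under a Fourier or Laplace transform the convolutions become products, which is what makes an analytical, transform-based computation fast compared with simulating sample paths.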
We introduce sparse representation methods, named stochastic pre-orthogonal adaptive Fourier decompositions 1 and 2 (SPOAFD1 and SPOAFD2), to solve the Dirichlet boundary value problem and the Cauchy initial value problem with random data. To solve the stochastic boundary value problems, the sparse representation is first applied to the random boundary data. Owing to the semigroup property of the Poisson and heat kernels, each term of the resulting series can be lifted to yield a solution of the Dirichlet and the Cauchy initial value problem, respectively. The sparse representation thus gives rise to analytic as well as numerical solutions of these problems with high efficiency.
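The lifting step can be stated concisely in generic notation (assumed here, not quoted from the paper). If the boundary data $f$ admits a sparse expansion in dictionary elements that are themselves Poisson kernels, the semigroup property $P_{s} * P_{t} = P_{s+t}$ turns each term of the expansion into a harmonic extension:
$$
f(x) \approx \sum_{k} c_k\, P_{t_k}(x - a_k)
\;\;\Longrightarrow\;\;
u(x, t) = \sum_{k} c_k\, P_{t + t_k}(x - a_k),
$$
so that $u$ is harmonic in the upper half-space and recovers the boundary data as $t \to 0^{+}$; the heat kernel plays the same role for the Cauchy initial value problem.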
This work is motivated by personalized digital twins based on observations and physical models for the treatment and prevention of hypertension. The models commonly used are simplifications of the real process, and the aim is to make inference about physically interpretable parameters. To account for model discrepancy, we propose to set up the estimation problem in a Bayesian calibration framework. This naturally solves the inverse problem while accounting for and quantifying the uncertainty in the model formulation, the parameter estimates and the predictions. We focus on the inverse problem, i.e. estimating the physical parameters given observations. The models we consider are the two- and three-parameter Windkessel models (WK2 and WK3). These models simulate the blood pressure waveform given the blood inflow and a set of physically interpretable calibration parameters. The third parameter in WK3 functions as a tuning parameter. The WK2 model offers physically interpretable parameters, and we therefore adopt it as the computer model in the Bayesian calibration formulation. In a synthetic simulation study, we simulate noisy data from the WK3 model. We estimate the model parameters both with conventional methods, i.e. least-squares optimization, and through the Bayesian calibration framework. We demonstrate that our formulation can reconstruct the blood pressure waveform of the more complex model and, most importantly, can learn the parameters in accordance with known mathematical connections between the two models. We also successfully apply this formulation to a real case study, with data obtained from a pilot randomized controlled trial. Our approach is successful in both the simulation study and the real case study.
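For reference, a minimal sketch of the two-element Windkessel model (not the authors' code; parameter values and the inflow waveform are illustrative): the pressure $P(t)$ satisfies $C\,\mathrm{d}P/\mathrm{d}t + P/R = Q(t)$ for inflow $Q(t)$, peripheral resistance $R$ and arterial compliance $C$, and can be simulated with a standard ODE solver.

# Minimal WK2 sketch (assumed parameter values, illustrative pulsatile inflow).
import numpy as np
from scipy.integrate import solve_ivp

R, C = 1.0, 1.3           # resistance [mmHg*s/mL] and compliance [mL/mmHg], illustrative
T = 0.8                   # assumed cardiac cycle length [s]

def inflow(t):
    # simple pulsatile inflow: half-sine during systole, zero in diastole
    tc = t % T
    return 400.0 * np.sin(np.pi * tc / 0.3) if tc < 0.3 else 0.0

def wk2(t, p):
    # two-element Windkessel: C dP/dt = Q(t) - P/R
    return (inflow(t) - p / R) / C

sol = solve_ivp(wk2, (0.0, 5 * T), y0=[80.0], max_step=1e-3)
print("pressure range over the simulation:", sol.y[0].min(), sol.y[0].max())

In a Bayesian calibration setting, a forward model of this kind would be wrapped in a likelihood for the observed waveform, with priors placed on $R$ and $C$.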
We investigate the problem of approximating the matrix function $f(A)$ by $r(A)$, with $f$ a Markov function, $r$ a rational interpolant of $f$, and $A$ a symmetric Toeplitz matrix. In a first step, we obtain a new upper bound for the relative interpolation error $1-r/f$ on the spectral interval of $A$. By minimizing this upper bound over all interpolation points, we obtain a new, simple and sharp a priori bound for the relative interpolation error. We then consider three different approaches to representing and computing the rational interpolant $r$. Theoretical and numerical evidence is given that each of these methods achieves high precision for a scalar argument, even in the presence of finite-precision arithmetic. We finally investigate the problem of efficiently evaluating $r(A)$, where it turns out that the relative error for a matrix argument is only small if we use a partial fraction decomposition for $r$ following Antoulas and Mayo. An important role is played by a new stopping criterion which automatically finds the degree of $r$ that leads to a small error, even in the presence of finite-precision arithmetic.
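For orientation, in the notation assumed here a Markov function has the form $f(z) = \int \mathrm{d}\mu(x)/(z-x)$ for a positive measure $\mu$, and writing a rational interpolant with simple poles in partial fraction form reduces the matrix evaluation to a sum of shifted Toeplitz solves:
$$
r(z) = \sum_{j=1}^{n} \frac{\omega_j}{z - \zeta_j}
\quad\Longrightarrow\quad
r(A)\, b = \sum_{j=1}^{n} \omega_j\, (A - \zeta_j I)^{-1} b .
$$
The paper's recommended decomposition follows the specific construction of Antoulas and Mayo; the generic form above is only meant to show why evaluating $r(A)b$ amounts to solving shifted linear systems with Toeplitz structure.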
Symbolic representations are a useful tool for the dimension reduction of temporal data, allowing for the efficient storage of and information retrieval from time series. They can also enhance the training of machine learning algorithms on time series data through noise reduction and reduced sensitivity to hyperparameters. The adaptive Brownian bridge-based aggregation (ABBA) method is one such effective and robust symbolic representation, demonstrated to accurately capture important trends and shapes in time series. However, in its current form the method struggles to process very large time series. Here we present a new variant of the ABBA method, called fABBA. This variant utilizes a new aggregation approach tailored to the piecewise representation of time series. By replacing the k-means clustering used in ABBA with a sorting-based aggregation technique, and thereby avoiding repeated sum-of-squares error computations, the computational complexity is significantly reduced. In contrast to the original method, the new approach does not require the number of time series symbols to be specified in advance. Through extensive tests we demonstrate that the new method significantly outperforms ABBA with a considerable reduction in runtime while also outperforming the popular SAX and 1d-SAX representations in terms of reconstruction accuracy. We further demonstrate that fABBA can compress other data types such as images.
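A minimal sketch of the compression stage that both ABBA and fABBA build on (not the package's API; tolerance and data are illustrative): the series is cut greedily into linear pieces, each stored as a (length, increment) pair, whenever the squared deviation from a straight line exceeds a tolerance. The symbolic stage would then assign symbols to these pieces, by k-means in ABBA and by sorting-based aggregation in fABBA.

# Simplified ABBA-style piecewise-linear compression (assumed, illustrative only).
import numpy as np

def compress(ts, tol=0.1):
    # greedily cut the series into (length, increment) pieces
    pieces, start = [], 0
    for end in range(1, len(ts)):
        length = end - start
        inc = ts[end] - ts[start]
        # squared error of the straight line from ts[start] to ts[end]
        line = ts[start] + inc * np.arange(length + 1) / length
        err = np.sum((ts[start:end + 1] - line) ** 2)
        if err > tol and length > 1:
            # close the current piece at end-1 and start a new one there
            pieces.append((length - 1, ts[end - 1] - ts[start]))
            start = end - 1
    pieces.append((len(ts) - 1 - start, ts[-1] - ts[start]))
    return pieces

rng = np.random.default_rng(0)
ts = np.sin(np.linspace(0, 6 * np.pi, 500)) + 0.01 * rng.standard_normal(500)
print(len(compress(ts, tol=0.05)), "pieces from", len(ts), "points")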
Molecular graph generation is a fundamental problem for drug discovery and has been attracting growing attention. The problem is challenging since it requires not only generating chemically valid molecular structures but also optimizing their chemical properties in the meantime. Inspired by the recent progress in deep generative models, in this paper we propose a flow-based autoregressive model for graph generation called GraphAF. GraphAF combines the advantages of both autoregressive and flow-based approaches and enjoys: (1) high model flexibility for data density estimation; (2) efficient parallel computation for training; (3) an iterative sampling process, which allows leveraging chemical domain knowledge for valency checking. Experimental results show that GraphAF is able to generate 68% chemically valid molecules even without chemical knowledge rules and 100% valid molecules with chemical rules. The training process of GraphAF is two times faster than the existing state-of-the-art approach GCPN. After fine-tuning the model for goal-directed property optimization with reinforcement learning, GraphAF achieves state-of-the-art performance on both chemical property optimization and constrained property optimization.
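In generic normalizing-flow notation (assumed here, not GraphAF's exact parameterisation), an autoregressive affine flow maps latent noise to the node and edge variables one step at a time and admits exact likelihoods through the change-of-variables formula:
$$
x_i = \mu_i(x_{<i}) + \sigma_i(x_{<i})\, z_i, \qquad
\log p(x) = \log p(z) - \sum_i \log \sigma_i(x_{<i}) .
$$
Sampling follows the autoregressive order, which is what allows a valency check after each generation step, while the likelihood of an observed graph can be evaluated for all steps in parallel during training.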
Robust estimation is much more challenging in high dimensions than it is in one dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility.
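A simplified sketch of a filtering-style robust mean estimator in the spirit of this line of work (not the authors' exact algorithm; thresholds and data are illustrative): as long as the empirical covariance has an unusually large top eigenvalue, remove the sample that deviates most along the top principal direction, then return the mean of what remains. The sketch assumes identity-covariance inliers.

# Simplified filtering sketch (assumed, not the paper's algorithm).
import numpy as np

def robust_mean(X, eps=0.1, slack=5.0):
    X = X.copy()
    mu = X.mean(axis=0)
    while len(X) > X.shape[1] + 1:
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        evals, evecs = np.linalg.eigh(cov)
        if evals[-1] <= 1.0 + slack * eps:           # covariance looks clean: stop
            break
        v = evecs[:, -1]                              # top principal direction
        scores = ((X - mu) @ v) ** 2                  # deviation along that direction
        X = np.delete(X, np.argmax(scores), axis=0)   # filter out the worst sample
    return mu

rng = np.random.default_rng(1)
clean = rng.standard_normal((900, 20))                # inliers: N(0, I)
corrupt = rng.standard_normal((100, 20)) + 5.0        # 10% shifted corruptions
X = np.vstack([clean, corrupt])
print("robust estimate error:", np.linalg.norm(robust_mean(X)))
print("naive mean error:     ", np.linalg.norm(X.mean(axis=0)))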