The problems of selecting partial correlation and causality graphs for count data are considered. A parameter-driven generalized linear model is used to describe the observed multivariate time series of counts. The partial correlation and causality graphs corresponding to this model describe the dependencies between the individual time series of the multivariate count data. In order to estimate these graphs with tunable sparsity, an appropriate likelihood function maximization is regularized with an l1-type constraint. A novel MCEM algorithm is proposed to iteratively solve this regularized MLE problem. Asymptotic convergence results are proved for the sequence generated by the proposed MCEM algorithm with l1-type regularization. The algorithm is first successfully tested on simulated data. Thereafter, it is applied to observed weekly dengue disease counts from each ward of Greater Mumbai city. The interdependence of the various wards in the proliferation of the disease is characterized by the edges of the inferred partial correlation graph, while the relative roles of the various wards as sources and sinks of dengue spread are quantified by the number and weights of the directed edges originating from and incident upon each ward. From these estimated graphs, it is observed that certain wards act as epicentres of dengue spread even though their own disease counts are relatively low.
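To make the optimization step concrete, the following is a minimal sketch, under an assumed simplified model, of the kind of l1-regularized M-step such an MCEM scheme requires: given Monte Carlo draws of a latent log-intensity term (taken here as already supplied by the E-step), the penalized Monte Carlo Poisson log-likelihood is maximized over the regression coefficients by proximal gradient ascent with soft-thresholding. The model, variable names, and tuning values are illustrative, not the paper's exact specification.

```python
# Minimal sketch (illustrative model, not the paper's exact specification):
# l1-penalized M-step of a Monte Carlo EM for a parameter-driven Poisson model
#   y_t ~ Poisson(exp(x_t' beta + alpha_t)),  alpha_t latent.
# The E-step is assumed to have supplied draws alpha_draws[m, t]; only the
# penalized M-step is shown.
import numpy as np

def m_step_l1(y, X, alpha_draws, lam, step=0.1, iters=2000):
    """Maximize (1/n) * average_m Poisson loglik(beta) - lam * ||beta||_1
    by proximal gradient ascent (soft-thresholding)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        eta = X @ beta
        mu_bar = np.exp(eta[None, :] + alpha_draws).mean(axis=0)   # average over E-step draws
        grad = X.T @ (y - mu_bar) / n                              # scaled Poisson score
        beta = beta + step * grad
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)  # prox of l1 penalty
    return beta

# Toy usage on simulated data.
rng = np.random.default_rng(0)
n, p, M = 300, 5, 50
X = rng.normal(size=(n, p))
beta_true = np.array([0.8, 0.0, -0.5, 0.0, 0.0])
alpha = 0.3 * rng.normal(size=n)                                    # latent effects
y = rng.poisson(np.exp(X @ beta_true + alpha))
alpha_draws = alpha[None, :] + 0.1 * rng.normal(size=(M, n))        # stand-in for E-step draws
print(m_step_l1(y, X, alpha_draws, lam=0.05))                       # sparse estimate of beta
```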
Traditional methods for inference in change point detection often rely on a large number of observed data points and can be inaccurate in non-asymptotic settings. With the rise of mobile health and digital phenotyping studies, in which patients are monitored through smartphones or other digital devices, change point detection is needed in non-asymptotic settings where it may be important to identify behavioral changes that occur just days before an adverse event such as relapse or suicide. Furthermore, analytically tractable and computationally efficient means of inference are necessary for the monitoring and online analysis of large-scale digital phenotyping cohorts. We extend the result on asymptotic tail probabilities of the likelihood ratio test to the multivariate change point detection setting, and demonstrate through simulation its inaccuracy when the number of observed data points is not large. We propose a non-asymptotic approach for inference on the likelihood ratio test, and compare the efficiency of this estimated p-value to the popular empirical p-value obtained through simulation of the null distribution. The accuracy and power of this approach relative to competing methods are demonstrated through simulation and through the detection of a change point in the behavior of a patient with schizophrenia in the week prior to relapse.
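For reference, here is a minimal sketch of a multivariate likelihood ratio scan for a single mean change under a known identity covariance, together with the empirical p-value obtained by simulating the null distribution. The Gaussian setup, sample sizes, and all names are illustrative assumptions rather than the paper's non-asymptotic procedure.

```python
import numpy as np

def lrt_scan(X):
    """Max over candidate change points of the Gaussian LRT statistic
    (known identity covariance):  (t*(T-t)/T) * ||mean(X[:t]) - mean(X[t:])||^2 ."""
    T = X.shape[0]
    stats = []
    for t in range(1, T):
        d = X[:t].mean(axis=0) - X[t:].mean(axis=0)
        stats.append(t * (T - t) / T * float(d @ d))
    return max(stats), int(np.argmax(stats)) + 1

def empirical_pvalue(X, n_sim=2000, rng=None):
    """p-value by simulating the null (i.i.d. standard normal, no change)."""
    rng = np.random.default_rng() if rng is None else rng
    T, p = X.shape
    obs, _ = lrt_scan(X)
    null = np.array([lrt_scan(rng.standard_normal((T, p)))[0] for _ in range(n_sim)])
    return (1 + np.sum(null >= obs)) / (1 + n_sim)

# Toy example: a 3-dimensional series of 30 points with a mean shift at t = 20.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))
X[20:] += 1.0
stat, tau = lrt_scan(X)
print(stat, tau, empirical_pvalue(X, rng=rng))
```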
In practice, the use of rounding is ubiquitous. Although researchers have studied the implications of rounding continuous random variables, rounding may be applied to functions of discrete random variables as well. For example, to draw inference on excess suicide deaths after a national emergency, authorities may provide an average of deaths, rounded to the nearest integer, for the periods before and after the emergency started. Suicide rates tend to be relatively low around the world, and such rounding may seriously affect inference on the change in the suicide rate. In this paper, we study the scenario in which an average rounded to the nearest integer is used as a proxy for a non-negative discrete random variable. Specifically, our interest is in drawing inference on a parameter from the pmf of Y when we observe U = n[Y/n] as a proxy for Y. The probability generating function of U, E(U), and Var(U) capture the effect of the coarsening of the support of Y. Moments and estimators of distribution parameters are also explored for some special cases. We show that under certain conditions there is little impact from rounding; however, we also find scenarios where rounding can significantly affect statistical inference, as demonstrated in three examples. The simple methods we propose are able to partially counter the effects of rounding error. While for some probability distributions it may be difficult to derive maximum likelihood estimators as a function of U, we provide a framework for obtaining an estimator numerically.
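A quick Monte Carlo illustration of the coarsening U = n[Y/n] (with [.] denoting rounding to the nearest integer) is sketched below for a Poisson total over n periods; the distribution and parameter values are arbitrary choices, not the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, reps = 7, 3.2, 200_000           # n periods averaged, per-period rate (illustrative)

Y = rng.poisson(n * mu, size=reps)      # unobserved total count over the n periods
U = n * np.rint(Y / n)                  # reported proxy: n times the rounded average Y/n

print("E(Y), E(U):    ", Y.mean(), U.mean())
print("Var(Y), Var(U):", Y.var(), U.var())
# A naive rate estimate that ignores the rounding is mean(U)/n; comparing it with
# mean(Y)/n shows how much the coarsened support of U (multiples of n) matters here.
print("naive rate estimates:", Y.mean() / n, U.mean() / n)
```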
The planted densest subgraph detection problem refers to the task of testing whether a given (random) graph contains a subgraph that is unusually dense. Specifically, we observe an undirected and unweighted graph on $n$ nodes. Under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi graph with edge probability (or, density) $q$. Under the alternative, there is a subgraph on $k$ vertices with edge probability $p>q$. The statistical as well as the computational barriers of this problem are well understood for a wide range of the edge parameters $p$ and $q$. In this paper, we consider a natural variant of the above problem, where one can only observe a small part of the graph using adaptive edge queries. For this model, we determine the number of queries necessary and sufficient for detecting the presence of the planted subgraph. Specifically, we show that any (possibly randomized) algorithm must make $\mathsf{Q} = \Omega(\frac{n^2}{k^2\chi^4(p||q)}\log^2n)$ adaptive queries (in expectation) to the adjacency matrix of the graph in order to detect the planted subgraph with probability more than $1/2$, where $\chi^2(p||q)$ is the chi-square distance. On the other hand, we devise a quasi-polynomial-time algorithm that detects the planted subgraph with high probability by making $\mathsf{Q} = O(\frac{n^2}{k^2\chi^4(p||q)}\log^2n)$ non-adaptive queries. We then propose a polynomial-time algorithm which is able to detect the planted subgraph using $\mathsf{Q} = O(\frac{n^3}{k^3\chi^2(p||q)}\log^3 n)$ queries. We conjecture that in the remaining regime, where $\frac{n^2}{k^2}\ll\mathsf{Q}\ll \frac{n^3}{k^3}$, no polynomial-time algorithms exist. Our results resolve two questions posed in \cite{racz2020finding}, where the special case of adaptive detection and recovery of a planted clique was considered.
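As a naive point of comparison (and not one of the algorithms proposed in the paper), the sketch below queries Q uniformly random entries of the adjacency matrix and thresholds the observed edge count between the null mean and the alternative's average density; the synthetic oracle and all parameter values are assumptions for illustration.

```python
import numpy as np

def naive_query_test(adj_query, n, k, p, q, Q, rng):
    """Query Q uniformly random pairs and threshold the edge count midway between
    the null mean Q*q and the alternative mean Q*(q + (k/n)**2 * (p - q))."""
    i = rng.integers(0, n, size=Q)
    j = rng.integers(0, n, size=Q)
    edges = sum(adj_query(a, b) for a, b in zip(i, j) if a != b)
    thresh = Q * (q + 0.5 * (k / n) ** 2 * (p - q))
    return edges > thresh

# Illustrative use with a synthetic query oracle (edges are generated on the fly,
# which is adequate for a one-shot test sketch).
rng = np.random.default_rng(0)
n, k, p, q, Q = 2000, 200, 0.5, 0.1, 200_000
planted = set(range(k))
def adj_query(a, b):
    prob = p if (a in planted and b in planted) else q
    return rng.random() < prob
print(naive_query_test(adj_query, n, k, p, q, Q, rng))
```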
Inferring the parameters of ordinary differential equations (ODEs) from noisy observations is an important problem in many scientific fields. Most existing parameter estimation methods that bypass numerical integration rely on basis functions or Gaussian processes to approximate the ODE solution and its derivatives. Because the ODE solution is sensitive to its derivatives, these methods can be hindered by estimation error, especially when only sparse time-course observations are available. We present a Bayesian collocation framework that operates on the integrated form of the ODEs and thereby also avoids the expensive use of numerical solvers. Our methodology is capable of handling general nonlinear ODE systems. We demonstrate the accuracy of the proposed method through a simulation study in which the estimated parameters and recovered system trajectories are compared with those of other recent methods. A real data example is also provided.
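A bare-bones sketch of working with the integrated form of an ODE, rather than its derivatives, is given below: the data are smoothed, the smoother is plugged into y(t) = y(0) + ∫ f(y(s), θ) ds via the trapezoid rule, and θ is fit by least squares. This is only meant to convey the idea of avoiding a numerical solver; the paper's method is Bayesian and collocation-based, and the logistic-growth example, smoothing level, and starting values here are assumptions.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.optimize import least_squares

# Toy ODE: logistic growth  dy/dt = r * y * (1 - y / K), observed with noise.
rng = np.random.default_rng(0)
r_true, K_true, y0 = 1.5, 10.0, 0.5
t = np.linspace(0, 5, 15)                              # sparse time grid
y_true = K_true / (1 + (K_true / y0 - 1) * np.exp(-r_true * t))   # closed-form solution
y_obs = y_true + 0.2 * rng.standard_normal(t.size)

yhat = UnivariateSpline(t, y_obs, s=0.5)(t)            # smoothed trajectory (no ODE solver)

def residuals(theta):
    r, K = theta
    f = r * yhat * (1 - yhat / K)                      # RHS evaluated on the smoother
    integral = np.concatenate(([0.0], np.cumsum(np.diff(t) * 0.5 * (f[1:] + f[:-1]))))
    return y_obs - (y_obs[0] + integral)               # integrated-form residuals

fit = least_squares(residuals, x0=[1.0, 8.0], bounds=([0.01, 0.1], [10, 100]))
print("estimated (r, K):", fit.x)
```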
Time series of counts are frequently analyzed using generalized integer-valued autoregressive models with conditional heteroskedasticity (INGARCH). These models employ response functions that map a vector of past observations and past conditional expectations to the conditional expectation of the present observation. In this paper, it is shown how INGARCH models can be combined with artificial neural network (ANN) response functions to obtain a class of nonlinear INGARCH models. The ANN framework allows many existing INGARCH models to be interpreted as degenerate versions of corresponding neural models. Details on maximum likelihood estimation, marginal effects, and confidence intervals are given. An empirical analysis of time series of bounded and unbounded counts reveals that the neural INGARCH models are able to outperform reasonable degenerate competitor models in terms of information loss.
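To make the model class concrete, here is a minimal sketch of a neural INGARCH(1,1) recursion λ_t = g(y_{t-1}, λ_{t-1}) with a Poisson conditional distribution and a small feed-forward network whose softplus output keeps the intensity positive. The architecture, initialization, and optimizer settings are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class NeuralINGARCH11(nn.Module):
    """lambda_t = softplus(MLP(y_{t-1}, lambda_{t-1})) with a Poisson likelihood."""
    def __init__(self, hidden=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.softplus = nn.Softplus()

    def intensities(self, y):
        lam = y.mean().reshape(1)                       # initialize with the sample mean
        out = []
        for t in range(1, len(y)):
            inp = torch.stack([y[t - 1], lam[0]]).unsqueeze(0)
            lam = self.softplus(self.net(inp)).squeeze(0)
            out.append(lam)
        return torch.cat(out)                           # lambda_2, ..., lambda_T

    def nll(self, y):
        lam = self.intensities(y)
        return torch.sum(lam - y[1:] * torch.log(lam))  # Poisson NLL up to a constant

# Usage: fit to an observed count series y (random here, purely for shape checking).
y = torch.poisson(torch.full((200,), 4.0))
model = NeuralINGARCH11()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(200):
    opt.zero_grad(); loss = model.nll(y); loss.backward(); opt.step()
print(float(model.nll(y)))
```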
A temporal graph is a graph in which each edge is assigned a time label. A node u of a temporal graph is connected to a node v if there exists a path from u to v with increasing edge time labels. We consider the problem of assigning time labels to the edges of a digraph in order to maximize the total reachability of the resulting temporal graph (that is, the number of pairs of nodes (u, v) such that u is connected to v). In particular, we prove that this problem is NP-hard. We then conjecture that the problem is approximable within a constant approximation ratio. This conjecture is a consequence of the following graph-theoretic conjecture: any strongly connected directed graph with n nodes admits an out-arborescence and an in-arborescence that are edge-disjoint, have the same root, and each span $\Omega(n)$ nodes.
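The reachability notion used here is straightforward to evaluate for a fixed labelling. The sketch below (illustrative names; strictly increasing labels assumed) counts the connected pairs by processing edges in order of their time labels and propagating earliest arrival times from each source.

```python
def temporal_reachability(n, labelled_edges):
    """Count ordered pairs (u, v), u != v, such that v is reachable from u along a
    path with strictly increasing edge time labels (an assumption; 'increasing'
    could also be read as non-decreasing).
    labelled_edges: list of (u, v, t) for a digraph on nodes 0..n-1."""
    edges = sorted(labelled_edges, key=lambda e: e[2])
    total = 0
    for s in range(n):
        arrival = {s: float("-inf")}          # earliest arrival time at each reached node
        for u, v, t in edges:                 # a single pass in label order suffices
            if u in arrival and arrival[u] < t and t < arrival.get(v, float("inf")):
                arrival[v] = t
        total += len(arrival) - 1             # exclude the source itself
    return total

# A directed 3-cycle labelled 1, 2, 3: every ordered pair except (2, 1) is connected.
print(temporal_reachability(3, [(0, 1, 1), (1, 2, 2), (2, 0, 3)]))   # prints 5
```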
In the present paper we treat the initial data, external force, viscosity coefficients, and heat conductivity coefficient of the compressible Navier--Stokes--Fourier system as random data. The Monte Carlo method, which is frequently used to approximate statistical moments, is combined with a suitable deterministic discretisation method in physical space and time. Under the assumption that the numerical densities and temperatures are bounded in probability, we prove the convergence of random finite volume solutions to a statistical strong solution by applying genuine stochastic compactness arguments. Further, we establish convergence and error estimates for the Monte Carlo estimators of the expectation and deviation. Several numerical results are presented to illustrate the theoretical findings.
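To illustrate only the Monte Carlo layer on something far simpler than the Navier--Stokes--Fourier system, the sketch below treats the diffusion coefficient of a 1D heat equation as random, solves each sample with a basic explicit finite volume scheme, and forms Monte Carlo estimators of the expectation and deviation of the solution field; the model problem, discretisation, and coefficient distribution are purely illustrative stand-ins.

```python
import numpy as np

def heat_fv(kappa, nx=50, T=0.05):
    """Explicit finite volume solve of u_t = kappa * u_xx on [0, 1] with
    zero-flux boundaries and a Gaussian bump initial condition."""
    dx = 1.0 / nx
    dt = 0.4 * dx**2 / kappa                 # respect the explicit-scheme stability limit
    nt = int(np.ceil(T / dt))
    dt = T / nt
    x = (np.arange(nx) + 0.5) * dx
    u = np.exp(-100.0 * (x - 0.5) ** 2)
    for _ in range(nt):
        flux = np.zeros(nx + 1)
        flux[1:-1] = kappa * (u[1:] - u[:-1]) / dx        # diffusive interface fluxes
        u = u + dt / dx * (flux[1:] - flux[:-1])
    return u

# Monte Carlo over a random diffusion coefficient (stand-in for random viscosity etc.).
rng = np.random.default_rng(0)
samples = np.array([heat_fv(kappa=rng.uniform(0.5, 1.5)) for _ in range(200)])
mean_est = samples.mean(axis=0)              # Monte Carlo estimator of the expectation E[u]
dev_est = samples.std(axis=0, ddof=1)        # Monte Carlo estimator of the deviation
print(mean_est.max(), dev_est.max())
```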
Zero-inflated continuous data appear ubiquitously in many fields: a substantial proportion of observations are exactly zero while the remainder are continuously distributed. Due to the mixed discrete-continuous structure of the distribution, statistical analysis is challenging, especially in the multivariate case. In this paper, we propose two copula-based density estimation models that can cope with multivariate correlation among zero-inflated continuous variables. To overcome the difficulty that the ties in zero-inflated data pose for copulas, we propose a new type of copula, the rectified Gaussian copula, and present efficient methods for parameter estimation and likelihood computation. Numerical experiments demonstrate the superiority of our proposals over conventional density estimation methods.
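The rectified-Gaussian idea can be previewed with a small simulation under assumed parameters (this is a sampling illustration, not the paper's estimation procedure): correlated latent Gaussians are clipped at zero, which produces variables with positive probability mass exactly at zero, continuous behavior elsewhere, and dependence inherited from the latent correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([-0.3, 0.2])                      # latent means control the zero-inflation rates
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])                  # latent correlation couples the two margins
Z = rng.multivariate_normal(mu, Sigma, size=100_000)
X = np.maximum(Z, 0.0)                          # rectification: mass piles up exactly at zero

print("P(X1 = 0), P(X2 = 0):", (X[:, 0] == 0).mean(), (X[:, 1] == 0).mean())
print("corr of the jointly positive parts:",
      np.corrcoef(X[(X[:, 0] > 0) & (X[:, 1] > 0)].T)[0, 1])
```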
Modeling multivariate time series has long attracted researchers from a diverse range of fields, including economics, finance, and traffic analysis. A basic assumption behind multivariate time series forecasting is that its variables depend on one another, yet existing methods fail to fully exploit the latent spatial dependencies between pairs of variables. In recent years, graph neural networks (GNNs) have shown high capability in handling relational dependencies, but GNNs require well-defined graph structures for information propagation and therefore cannot be applied directly to multivariate time series whose dependencies are not known in advance. In this paper, we propose a general graph neural network framework designed specifically for multivariate time series data. Our approach automatically extracts the uni-directed relations among variables through a graph learning module, into which external knowledge such as variable attributes can be easily integrated. A novel mix-hop propagation layer and a dilated inception layer are further proposed to capture the spatial and temporal dependencies within the time series. The graph learning, graph convolution, and temporal convolution modules are jointly learned in an end-to-end framework. Experimental results show that our proposed model outperforms state-of-the-art baseline methods on 3 of 4 benchmark datasets and achieves on-par performance with other approaches on two traffic datasets that provide extra structural information.
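As a hedged sketch of what an embedding-based, uni-directed graph learning module of this kind can look like (the parameterisation below follows a common pattern and may differ from the paper's exact layer), each variable receives two learnable embeddings, an antisymmetric score matrix is built from them so that at most one direction per pair survives, and each row is sparsified to its top-k neighbours.

```python
import torch
import torch.nn as nn

class GraphLearner(nn.Module):
    """Learn a sparse, uni-directed adjacency matrix over `num_nodes` variables."""
    def __init__(self, num_nodes, dim=16, k=4, alpha=3.0):
        super().__init__()
        self.emb1 = nn.Parameter(torch.randn(num_nodes, dim))
        self.emb2 = nn.Parameter(torch.randn(num_nodes, dim))
        self.k, self.alpha = k, alpha

    def forward(self):
        m1 = torch.tanh(self.alpha * self.emb1)
        m2 = torch.tanh(self.alpha * self.emb2)
        # Antisymmetric score: the ReLU keeps at most one direction per node pair.
        a = torch.relu(torch.tanh(self.alpha * (m1 @ m2.T - m2 @ m1.T)))
        # Keep only each node's k strongest outgoing connections.
        mask = torch.zeros_like(a)
        mask.scatter_(1, a.topk(self.k, dim=1).indices, 1.0)
        return a * mask

adj = GraphLearner(num_nodes=10)()
print(adj.shape, (adj > 0).sum(dim=1))   # at most k non-zeros per row
```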
The prevalence of networked sensors and actuators in real-world systems such as smart buildings, factories, power plants, and data centers generates substantial amounts of multivariate time series data for these systems. The rich sensor data can be continuously monitored for intrusion events through anomaly detection. However, conventional threshold-based anomaly detection methods are inadequate due to the dynamic complexities of these systems, while supervised machine learning methods are unable to exploit the large amounts of data due to the lack of labeled data. On the other hand, current unsupervised machine learning approaches have not fully exploited the spatial-temporal correlations and other dependencies amongst the multiple variables (sensors/actuators) in the system for detecting anomalies. In this work, we propose an unsupervised multivariate anomaly detection method based on Generative Adversarial Networks (GANs). Instead of treating each data stream independently, our proposed MAD-GAN framework considers the entire variable set concurrently to capture the latent interactions amongst the variables. We also fully exploit both the generator and the discriminator produced by the GAN, using a novel anomaly score called the DR-score to detect anomalies via discrimination and reconstruction. We have tested MAD-GAN on two recent datasets collected from real-world cyber-physical systems (CPS): the Secure Water Treatment (SWaT) and the Water Distribution (WADI) datasets. Our experimental results show that MAD-GAN is effective in reporting anomalies caused by various cyber-intrusions in these complex real-world systems.
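A minimal sketch of combining discrimination and reconstruction into a single anomaly score is given below; the convex combination, the reconstruction-by-latent-search step, and the stand-in networks are assumptions, and the exact DR-score used in the paper may differ.

```python
import torch

def dr_score(x, G, D, latent_dim=32, steps=200, lam=0.5, lr=0.05):
    """Anomaly score mixing reconstruction error (via latent-space search)
    and the discriminator output for a test window x."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):                       # invert the generator for this window
        opt.zero_grad()
        loss = torch.mean((G(z) - x) ** 2)
        loss.backward()
        opt.step()
    with torch.no_grad():
        recon = torch.mean((G(z) - x) ** 2)      # reconstruction-based evidence
        disc = 1.0 - torch.sigmoid(D(x)).mean()  # discrimination-based evidence ("looks fake")
    return float(lam * recon + (1.0 - lam) * disc)

# Usage sketch with stand-in networks (a trained MAD-GAN would use recurrent models).
G = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Tanh(), torch.nn.Linear(64, 10))
D = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
x = torch.randn(1, 10)                           # one test window of 10 sensor readings
print(dr_score(x, G, D))
```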