In the age of big data, data integration is a critical step, especially for understanding how diverse data types work together and how they work separately. Among data integration methods, Angle-Based Joint and Individual Variation Explained (AJIVE) is particularly attractive because it studies not only joint behavior but also individual behavior. Typically, scores indicate relationships between data objects, and the drivers of those relationships are determined by the loadings. A fundamental question is which loadings are statistically significant. A useful approach for assessing this is the jackstraw method. In this paper, we develop jackstraw for the loadings of the AJIVE data analysis. This provides statistical inference about the drivers in both joint and individual feature spaces.
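As a rough illustration of the jackstraw idea, the minimal sketch below applies it to ordinary PCA loadings as a stand-in for an AJIVE component; the function name `jackstraw_pvalues` and all tuning choices are illustrative assumptions, not the implementation developed in the paper.

```python
import numpy as np

def jackstraw_pvalues(X, n_perm=100, s=10, rank=0, rng=None):
    """Jackstraw-style p-values for the loadings of one principal
    component (a stand-in for an AJIVE component).

    X      : (n_samples, n_features) data matrix
    n_perm : number of permutation rounds
    s      : number of features permuted per round
    rank   : which component's loadings to test (0-indexed)
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape

    def loadings(M):
        # Right singular vectors give the feature loadings.
        _, _, vt = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
        return vt[rank]

    observed = np.abs(loadings(X))
    null = []
    for _ in range(n_perm):
        Xp = X.copy()
        idx = rng.choice(p, size=s, replace=False)
        for j in idx:                       # break the feature-score association
            Xp[:, j] = rng.permutation(Xp[:, j])
        null.extend(np.abs(loadings(Xp))[idx])
    null = np.array(null)

    # Empirical p-value: fraction of null loadings at least as extreme.
    pvals = (1 + (null[None, :] >= observed[:, None]).sum(axis=1)) / (1 + null.size)
    return pvals
```

Features with small p-values would then be flagged as significant drivers; the AJIVE version of the paper applies this logic within the joint and individual feature spaces.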
We provide four case studies that use Bayesian machinery to carry out inductive reasoning. Our main motivation lies in offering several instances where the Bayesian approach to data analysis is exploited at its best to perform complex tasks, such as description, testing, estimation, and prediction. This work is not meant to be either a reference text or a survey of Bayesian statistical inference. Our goal is simply to provide several examples that use Bayesian methodology to solve data-driven problems. The topics we cover here include problems in Bayesian nonparametrics, Bayesian analysis of time series, and Bayesian analysis of spatial data.
In this paper we introduce a novel Bayesian approach for linking multiple social networks in order to discover when the same real-world person holds different accounts across networks. In particular, we develop a latent model that allows us to jointly characterize the network and linkage structures, relying on both relational and profile data. In contrast to other existing approaches in the machine learning literature, our Bayesian implementation naturally provides uncertainty quantification via posterior probabilities for the linkage structure itself or any function of it. Our findings clearly suggest that our methodology can produce accurate point estimates of the linkage structure even in the absence of profile information, and, in an identity resolution setting, our results confirm that including relational data in the matching process improves the linkage accuracy. We illustrate our methodology using real data from popular social networks such as Twitter, Facebook, and YouTube.
Statistical machine learning models trained with stochastic gradient algorithms are increasingly being deployed in critical scientific applications. However, computing the stochastic gradient in several such applications is highly expensive or even impossible at times. In such cases, derivative-free or zeroth-order algorithms are used. An important question which has thus far not been addressed sufficiently in the statistical machine learning literature is that of equipping stochastic zeroth-order algorithms with practical yet rigorous inferential capabilities, so that we not only obtain point estimates or predictions but also quantify the associated uncertainty via confidence intervals or sets. Towards this end, in this work, we first establish a central limit theorem for the Polyak-Ruppert averaged stochastic zeroth-order gradient algorithm. We then provide online estimators of the asymptotic covariance matrix appearing in the central limit theorem, thereby providing a practical procedure for constructing asymptotically valid confidence sets (or intervals) for parameter estimation (or prediction) in the zeroth-order setting.
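As a minimal sketch of a Polyak-Ruppert averaged stochastic zeroth-order method, the snippet below uses a standard two-point Gaussian-smoothing gradient estimate on a noisy toy objective; the helper name `zo_sgd_polyak_ruppert`, the step-size schedule, and the smoothing parameter are illustrative assumptions, and the paper's online covariance estimator is not reproduced here.

```python
import numpy as np

def zo_sgd_polyak_ruppert(f, x0, n_iter=20000, mu=0.05,
                          step=lambda t: 0.3 / (t + 1) ** 0.75, rng=None):
    """Stochastic zeroth-order gradient descent with Polyak-Ruppert averaging.

    f(x, rng) should return a noisy evaluation of the objective at x.
    Returns the averaged iterate, whose fluctuations are asymptotically
    normal under conditions of the kind the abstract refers to.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for t in range(n_iter):
        u = rng.standard_normal(x.shape)                  # random perturbation direction
        g = (f(x + mu * u, rng) - f(x, rng)) / mu * u     # two-point gradient estimate
        x = x - step(t) * g
        avg += (x - avg) / (t + 1)                        # running Polyak-Ruppert average
    return avg

# Toy usage: noisy quadratic with minimizer at (1, -2).
target = np.array([1.0, -2.0])
noisy_quadratic = lambda x, rng: np.sum((x - target) ** 2) + 0.1 * rng.standard_normal()
print(zo_sgd_polyak_ruppert(noisy_quadratic, x0=np.zeros(2), rng=0))
```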
Dynamically typed languages such as JavaScript and Python have emerged as the most popular programming languages in use. Important benefits can accrue from including type annotations in dynamically typed programs. This approach to gradual typing is exemplified by the TypeScript programming system which allows programmers to specify partially typed programs, and then uses static analysis to infer the remaining types. However, in general, the effectiveness of static type inference is limited and depends on the complexity of the program's structure and the initial type annotations. As a result, there is a strong motivation for new approaches that can advance the state of the art in statically predicting types in dynamically typed programs, and that do so with acceptable performance for use in interactive programming environments. Previous work has demonstrated the promise of probabilistic type inference using deep learning. In this paper, we advance past work by introducing a range of graph neural network (GNN) models that operate on a novel type flow graph (TFG) representation. The TFG represents an input program's elements as graph nodes connected with syntax edges and data flow edges, and our GNN models are trained to predict the type labels in the TFG for a given input program. We study different design choices for our GNN models for the 100 most common types in our evaluation dataset, and show that our best two GNN configurations for accuracy achieve a top-1 accuracy of 87.76% and 86.89% respectively, outperforming the two most closely related deep learning type inference approaches from past work -- DeepTyper with a top-1 accuracy of 84.62% and LambdaNet with a top-1 accuracy of 79.45%. Further, the average inference throughputs of those two configurations are 353.8 and 1,303.9 files/second, compared to 186.7 files/second for DeepTyper and 1,050.3 files/second for LambdaNet.
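To make the graph representation concrete, the following toy snippet sketches how a tiny program fragment might be encoded as labeled nodes with syntax and data-flow edges; the actual TFG construction in the paper may use different node kinds and edge types, so this is only an illustrative assumption.

```python
# Toy encoding of `let x = a + b; f(x)` as a graph with node labels and
# two edge types (syntax vs. data flow), in the spirit of a type flow graph.
nodes = {
    0: {"kind": "Variable", "name": "a"},
    1: {"kind": "Variable", "name": "b"},
    2: {"kind": "BinaryOp", "op": "+"},
    3: {"kind": "Variable", "name": "x"},
    4: {"kind": "Call",     "callee": "f"},
}
edges = [
    (0, 2, "syntax"),     # `a` is a child of the `+` expression
    (1, 2, "syntax"),     # `b` is a child of the `+` expression
    (2, 3, "dataflow"),   # the value of `a + b` flows into `x`
    (3, 4, "dataflow"),   # `x` flows into the call `f(x)`
]
# A GNN would be trained to predict a type label (e.g. `number`)
# for each node of this labeled graph with typed edges.
```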
The analysis of left-truncated and right-censored data is very common in survival and reliability analysis. In lifetime studies, patients are often subject to left truncation in addition to right censoring. For example, in bone marrow transplant studies based on the International Bone Marrow Transplant Registry (IBMTR), patients who die while waiting for transplants are not reported to the IBMTR. In this paper, we develop novel U-statistics under left truncation and right censoring. We prove the $\sqrt{n}$-consistency of the proposed U-statistics and derive their asymptotic distribution using counting process techniques. As an application of the U-statistics, we develop a simple non-parametric test for the independence between time to failure and cause of failure in competing risks when the observations are subject to left truncation and right censoring. The finite sample performance of the proposed test is evaluated through a Monte Carlo simulation study. Finally, we illustrate our test procedure using lifetime data of transformers.
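For reference, the classical complete-data U-statistic of degree two that the paper generalizes to left-truncated and right-censored observations can be sketched as follows; the kernel and data are illustrative only, and the truncation/censoring adjustments developed in the paper are not shown.

```python
import itertools
import numpy as np

def u_statistic(sample, kernel):
    """Classical degree-2 U-statistic: the average of a symmetric kernel
    over all unordered pairs of observations (complete, uncensored data)."""
    pairs = itertools.combinations(sample, 2)
    return float(np.mean([kernel(x, y) for x, y in pairs]))

# Example kernel: sign agreement, the building block of Kendall-type
# independence tests between two components of each observation.
kendall_kernel = lambda a, b: np.sign((a[0] - b[0]) * (a[1] - b[1]))

rng = np.random.default_rng(0)
data = rng.standard_normal((50, 2))       # 50 bivariate observations
print(u_statistic(data, kendall_kernel))  # near 0 under independence
```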
Autoregressive (AR) models are used to represent time-varying random processes in which the output depends linearly on previous terms and on a stochastic term (the innovation). In the classical version, AR models are based on the normal distribution. However, this distribution does not allow for describing data with outliers and asymmetric behavior. In this paper, we study AR models with normal inverse Gaussian (NIG) innovations. The NIG distribution belongs to the class of semi-heavy-tailed distributions with a wide range of shapes and thus allows for describing real-life data with possible jumps. The expectation-maximization (EM) algorithm is used to estimate the parameters of the considered model. The efficacy of the estimation procedure is shown on simulated data. A comparative study is presented in which the classical estimation algorithms, namely the Yule-Walker and conditional least squares methods, are also incorporated alongside the EM method for model parameter estimation. The applications of the introduced model are demonstrated on real-life financial data.
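A minimal sketch of simulating an AR(1) process with NIG innovations using SciPy's `norminvgauss` distribution is given below; the parameter values are illustrative, and the EM estimation procedure developed in the paper is not reproduced here.

```python
import numpy as np
from scipy.stats import norminvgauss

def simulate_ar1_nig(n, phi=0.6, a=2.0, b=0.5, loc=0.0, scale=1.0, rng=None):
    """Simulate an AR(1) process whose innovations follow a
    normal inverse Gaussian (NIG) distribution.

    Uses SciPy's (a, b, loc, scale) parameterization; |b| < a is required.
    All parameter values here are illustrative only.
    """
    rng = np.random.default_rng(rng)
    eps = norminvgauss.rvs(a, b, loc=loc, scale=scale, size=n, random_state=rng)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]   # AR(1) recursion with NIG innovation
    return x

series = simulate_ar1_nig(1000, rng=0)   # heavy-tailed, possibly skewed sample path
```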
This paper investigates the problem of online statistical inference of model parameters in stochastic optimization problems via the Kiefer-Wolfowitz algorithm with random search directions. We first present the asymptotic distribution for the Polyak-Ruppert-averaging type Kiefer-Wolfowitz (AKW) estimators, whose asymptotic covariance matrices depend on the function-value query complexity and the distribution of search directions. The distributional result reflects the trade-off between statistical efficiency and function query complexity. We further analyze the choices of random search directions to minimize the asymptotic covariance matrix, and conclude that the optimal search direction depends on the optimality criteria with respect to different summary statistics of the Fisher information matrix. Based on the asymptotic distribution result, we conduct online statistical inference by providing two construction procedures of valid confidence intervals. We provide numerical experiments verifying our theoretical results and demonstrating the practical effectiveness of the procedures.
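A minimal sketch of an averaged Kiefer-Wolfowitz iteration with random search directions follows, using central finite differences along random unit vectors; the step-size and spacing schedules are illustrative assumptions rather than the tuned choices analyzed in the paper, and the confidence-interval constructions are not shown.

```python
import numpy as np

def akw_estimate(F, x0, n_iter=20000,
                 a=lambda t: 1.0 / (t + 1) ** 0.8,     # step-size schedule (illustrative)
                 c=lambda t: 1.0 / (t + 1) ** 0.25,    # finite-difference spacing (illustrative)
                 rng=None):
    """Averaged Kiefer-Wolfowitz (AKW) iteration with random search directions.

    F(x, rng) returns a noisy function value. Each step uses two
    function-value queries along a random unit direction; the returned
    estimate is the Polyak-Ruppert average of the iterates.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    d = x.size
    for t in range(n_iter):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)                                # random unit search direction
        diff = F(x + c(t) * v, rng) - F(x - c(t) * v, rng)    # two function-value queries
        x = x - a(t) * (diff / (2.0 * c(t))) * v              # Kiefer-Wolfowitz update
        avg += (x - avg) / (t + 1)                            # Polyak-Ruppert averaging
    return avg
```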
Community detection, a fundamental task for network analysis, aims to partition a network into multiple sub-structures to help reveal their latent functions. Community detection has been extensively studied in and broadly applied to many real-world network problems. Classical approaches to community detection typically utilize probabilistic graphical models and adopt a variety of prior knowledge to infer community structures. As the problems that network methods try to solve and the network data to be analyzed become increasingly more sophisticated, new approaches have also been proposed and developed, particularly those that utilize deep learning and convert networked data into low-dimensional representations. Despite all the recent advances, there is still a lack of insightful understanding of the theoretical and methodological underpinning of community detection, which will be critically important for the future development of the area of network analysis. In this paper, we develop and present a unified architecture of network community-finding methods to characterize the state of the art of the field of community detection. Specifically, we provide a comprehensive review of the existing community detection methods and introduce a new taxonomy that divides the existing methods into two categories, namely probabilistic graphical models and deep learning. We then discuss in detail the main idea behind each method in the two categories. Furthermore, to promote the future development of community detection, we release benchmark datasets from several problem domains and highlight their applications to various network analysis tasks. We conclude with discussions of the challenges of the field and suggestions of possible directions for future research.
Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount of data used in order to protect privacy. Based on a precise description of the goals and applications of data augmentation (C1) and a taxonomy of existing works (C2), this survey is concerned with data augmentation methods for textual classification and aims to provide a concise and comprehensive overview for researchers and practitioners (C3). Derived from the taxonomy, we divide more than 100 methods into 12 different groupings and provide state-of-the-art references indicating which methods are highly promising (C4). Finally, research perspectives that may constitute a building block for future work are given (C5).
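As a concrete example of transformation-based augmentation for text, the sketch below shows two simple token-level perturbations (random swap and random deletion); these are generic illustrations and do not correspond to any specific grouping in the survey's taxonomy.

```python
import random

def random_swap(text, n_swaps=1, rng=None):
    """Swap the positions of randomly chosen word pairs."""
    rng = rng or random.Random()
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def random_deletion(text, p=0.1, rng=None):
    """Drop each word independently with probability p."""
    rng = rng or random.Random()
    words = [w for w in text.split() if rng.random() > p]
    return " ".join(words) if words else text

rng = random.Random(0)
print(random_swap("data augmentation creates new training examples", rng=rng))
print(random_deletion("data augmentation creates new training examples", rng=rng))
```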
Over the past few years, we have seen fundamental breakthroughs in core problems in machine learning, largely driven by advances in deep neural networks. At the same time, the amount of data collected in a wide array of scientific domains is dramatically increasing in both size and complexity. Taken together, this suggests many exciting opportunities for deep learning applications in scientific settings. But a significant challenge to this is simply knowing where to start. The sheer breadth and diversity of different deep learning techniques makes it difficult to determine what scientific problems might be most amenable to these methods, or which specific combination of methods might offer the most promising first approach. In this survey, we focus on addressing this central issue, providing an overview of many widely used deep learning models, spanning visual, sequential, and graph-structured data, associated tasks, and different training methods, along with techniques for using deep learning with less data and for better interpreting these complex models, two central considerations for many scientific use cases. We also include overviews of the full design process, implementation tips, and links to a plethora of tutorials, research summaries, and open-sourced deep learning pipelines and pretrained models developed by the community. We hope that this survey will help accelerate the use of deep learning across different scientific domains.