This paper investigates Gaussian copula mixture models (GCMM), which are an extension of Gaussian mixture models (GMM) that incorporate copula concepts. The paper presents the mathematical definition of GCMM and explores the properties of its likelihood function. Additionally, the paper proposes extended Expectation Maximum algorithms to estimate parameters for the mixture of copulas. The marginal distributions corresponding to each component are estimated separately using nonparametric statistical methods. In the experiment, GCMM demonstrates improved goodness-of-fitting compared to GMM when using the same number of clusters. Furthermore, GCMM has the ability to leverage un-synchronized data across dimensions for more comprehensive data analysis.
Distribution data refers to a data set where each sample is represented as a probability distribution, a subject area receiving burgeoning interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, utilizing the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem's analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. We present an application of our methodology using weather data for illustration purposes.
The number of modes in a probability density function is representative of the model's complexity and can also be viewed as the number of existing subpopulations. Despite its relevance, little research has been devoted to its estimation. Focusing on the univariate setting, we propose a novel approach targeting prediction accuracy inspired by some overlooked aspects of the problem. We argue for the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view blending global and local density properties. Our method builds upon a combination of flexible kernel estimators and parsimonious compositional splines. Feature exploration, model selection and mode testing are implemented in the Bayesian inference paradigm, providing soft solutions and allowing to incorporate expert judgement in the process. The usefulness of our proposal is illustrated through a case study in sports analytics, showcasing multiple companion visualisation tools. A thorough simulation study demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, our method emerges as a top-tier alternative offering innovative solutions for analysts.
The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points. Yet, they can be inaccurate under high-dimensional noise, especially if the noise magnitude varies considerably across the data, e.g., under heteroskedasticity or outliers. In this work, we investigate a more robust alternative -- the doubly stochastic normalization of the Gaussian kernel. We consider a setting where points are sampled from an unknown density on a low-dimensional manifold embedded in high-dimensional space and corrupted by possibly strong, non-identically distributed, sub-Gaussian noise. We establish that the doubly stochastic affinity matrix and its scaling factors concentrate around certain population forms, and provide corresponding finite-sample probabilistic error bounds. We then utilize these results to develop several tools for robust inference under general high-dimensional noise. First, we derive a robust density estimator that reliably infers the underlying sampling density and can substantially outperform the standard kernel density estimator under heteroskedasticity and outliers. Second, we obtain estimators for the pointwise noise magnitudes, the pointwise signal magnitudes, and the pairwise Euclidean distances between clean data points. Lastly, we derive robust graph Laplacian normalizations that accurately approximate various manifold Laplacians, including the Laplace Beltrami operator, improving over traditional normalizations in noisy settings. We exemplify our results in simulations and on real single-cell RNA-sequencing data. For the latter, we show that in contrast to traditional methods, our approach is robust to variability in technical noise levels across cell types.
We provide a rigorous analysis of training by variational inference (VI) of Bayesian neural networks in the two-layer and infinite-width case. We consider a regression problem with a regularized evidence lower bound (ELBO) which is decomposed into the expected log-likelihood of the data and the Kullback-Leibler (KL) divergence between the a priori distribution and the variational posterior. With an appropriate weighting of the KL, we prove a law of large numbers for three different training schemes: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes by Backprop, and (iii) a new and computationally cheaper algorithm which we introduce as Minimal VI. An important result is that all methods converge to the same mean-field limit. Finally, we illustrate our results numerically and discuss the need for the derivation of a central limit theorem.
We introduce a sufficient graphical model by applying the recently developed nonlinear sufficient dimension reduction techniques to the evaluation of conditional independence. The graphical model is nonparametric in nature, as it does not make distributional assumptions such as the Gaussian or copula Gaussian assumptions. However, unlike a fully nonparametric graphical model, which relies on the high-dimensional kernel to characterize conditional independence, our graphical model is based on conditional independence given a set of sufficient predictors with a substantially reduced dimension. In this way we avoid the curse of dimensionality that comes with a high-dimensional kernel. We develop the population-level properties, convergence rate, and variable selection consistency of our estimate. By simulation comparisons and an analysis of the DREAM 4 Challenge data set, we demonstrate that our method outperforms the existing methods when the Gaussian or copula Gaussian assumptions are violated, and its performance remains excellent in the high-dimensional setting.
Data objects taking value in a general metric space have become increasingly common in modern data analysis. In this paper, we study two important statistical inference problems, namely, two-sample testing and change-point detection, for such non-Euclidean data under temporal dependence. Typical examples of non-Euclidean valued time series include yearly mortality distributions, time-varying networks, and covariance matrix time series. To accommodate unknown temporal dependence, we advance the self-normalization (SN) technique (Shao, 2010) to the inference of non-Euclidean time series, which is substantially different from the existing SN-based inference for functional time series that reside in Hilbert space (Zhang et al., 2011). Theoretically, we propose new regularity conditions that could be easier to check than those in the recent literature, and derive the limiting distributions of the proposed test statistics under both null and local alternatives. For change-point detection problem, we also derive the consistency for the change-point location estimator, and combine our proposed change-point test with wild binary segmentation to perform multiple change-point estimation. Numerical simulations demonstrate the effectiveness and robustness of our proposed tests compared with existing methods in the literature. Finally, we apply our tests to two-sample inference in mortality data and change-point detection in cryptocurrency data.
The logistic regression model is one of the most popular data generation model in noisy binary classification problems. In this work, we study the sample complexity of estimating the parameters of the logistic regression model up to a given $\ell_2$ error, in terms of the dimension and the inverse temperature, with standard normal covariates. The inverse temperature controls the signal-to-noise ratio of the data generation process. While both generalization bounds and asymptotic performance of the maximum-likelihood estimator for logistic regression are well-studied, the non-asymptotic sample complexity that shows the dependence on error and the inverse temperature for parameter estimation is absent from previous analyses. We show that the sample complexity curve has two change-points (or critical points) in terms of the inverse temperature, clearly separating the low, moderate, and high temperature regimes.
Approximating significance scans of searches for new particles in high-energy physics experiments as Gaussian fields is a well-established way to estimate the trials factors required to quantify global significances. We propose a novel, highly efficient method to estimate the covariance matrix of such a Gaussian field. The method is based on the linear approximation of statistical fluctuations of the signal amplitude. For one-dimensional searches the upper bound on the trials factor can then be calculated directly from the covariance matrix. For higher dimensions, the Gaussian process described by this covariance matrix may be sampled to calculate the trials factor directly. This method also serves as the theoretical basis for a recent study of the trials factor with an empirically constructed set of Asmiov-like background datasets. We illustrate the method with studies of a $H \rightarrow \gamma \gamma$ inspired model that was used in the empirical paper.
Seeking the equivalent entities among multi-source Knowledge Graphs (KGs) is the pivotal step to KGs integration, also known as \emph{entity alignment} (EA). However, most existing EA methods are inefficient and poor in scalability. A recent summary points out that some of them even require several days to deal with a dataset containing 200,000 nodes (DWY100K). We believe over-complex graph encoder and inefficient negative sampling strategy are the two main reasons. In this paper, we propose a novel KG encoder -- Dual Attention Matching Network (Dual-AMN), which not only models both intra-graph and cross-graph information smartly, but also greatly reduces computational complexity. Furthermore, we propose the Normalized Hard Sample Mining Loss to smoothly select hard negative samples with reduced loss shift. The experimental results on widely used public datasets indicate that our method achieves both high accuracy and high efficiency. On DWY100K, the whole running process of our method could be finished in 1,100 seconds, at least 10* faster than previous work. The performances of our method also outperform previous works across all datasets, where Hits@1 and MRR have been improved from 6% to 13%.
This paper focuses on the expected difference in borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects and hence the estimation error can be magnificent. As such, we propose another approach to construct the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of estimating the causal quantities between the classical estimators and the proposed estimators. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction of estimation error is strikingly substantial if the causal effects are accounted for correctly.