A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others were focused on how to integrate homogeneous (or only mildly heterogeneous) sets of data. More recently, as data are becoming more accessible, the question of whether data sets from different sources should be integrated is becoming more relevant. The current literature treats this as a question with only two answers: integrate or don't. Here we take a different approach, motivated by information-sharing principles coming from the shrinkage estimation literature. In particular, we deviate from the do/don't perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial parameter should be turned is shown to depend, for example, on the informativeness of the different data sources as measured by Fisher information. In the context of generalized linear models, this more nuanced data integration framework leads to relatively simple parameter estimates and valid tests/confidence intervals. Moreover, we demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation than binary data integration schemes.
Physics-informed neural networks (PINNs) are an extremely powerful paradigm used to solve equations encountered in scientific computing applications. An important part of the procedure is the minimization of the equation residual, which includes a time sampling when the equation is time-dependent. It was argued in the literature that the sampling need not be uniform but should overweight initial time instants, but no rigorous explanation was provided for this choice. In this paper we take some prototypical examples and, under standard hypotheses concerning the neural network convergence, we show that the optimal time sampling follows a truncated exponential distribution. In particular, we explain when the time sampling is best taken to be uniform and when it should not be. The findings are illustrated with numerical examples on a linear equation, Burgers' equation and the Lorenz system.
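The truncated exponential time sampling mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a density proportional to exp(-lam * t) on [0, T] and draws samples by inverting the cumulative distribution function. The function name `sample_truncated_exponential` and the rate parameter `lam` are hypothetical choices made for illustration; the abstract does not say how the rate should be chosen, so the values below are arbitrary.

```python
import numpy as np

def sample_truncated_exponential(n, T, lam, rng=None):
    """Draw n times in [0, T] with density proportional to exp(-lam * t).

    lam is a hypothetical rate parameter (not taken from the abstract);
    lam > 0 overweights early times, lam == 0 gives uniform sampling.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(n)                      # uniform draws in [0, 1)
    if abs(lam) < 1e-12:
        return T * u                       # uniform limit of the distribution
    mass = -np.expm1(-lam * T)             # equals 1 - exp(-lam * T)
    return -np.log1p(-u * mass) / lam      # inverse CDF of the truncated law

rng = np.random.default_rng(0)
t_uniform = sample_truncated_exponential(10_000, T=1.0, lam=0.0, rng=rng)
t_early = sample_truncated_exponential(10_000, T=1.0, lam=3.0, rng=rng)
# The second sample mean should be visibly smaller than the first.
print(t_uniform.mean(), t_early.mean())
```

In a PINN training loop, such samples would take the place of uniformly drawn collocation times when the residual loss is evaluated.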
Many problems in high-dimensional statistics appear to have a statistical-computational gap: a range of values of the signal-to-noise ratio where inference is information-theoretically possible, but (conjecturally) computationally intractable. A canonical such problem is Tensor PCA, where we observe a tensor $Y$ consisting of a rank-one signal plus Gaussian noise. Multiple lines of work suggest that Tensor PCA becomes computationally hard at a critical value of the signal's magnitude. In particular, below this transition, no low-degree polynomial algorithm can detect the signal with high probability; conversely, various spectral algorithms are known to succeed above this transition. We unify and extend this work by considering tensor networks, orthogonally invariant polynomials where multiple copies of $Y$ are "contracted" to produce scalars, vectors, matrices, or other tensors. We define a new set of objects, tensor cumulants, which provide an explicit, near-orthogonal basis for invariant polynomials of a given degree. This basis lets us unify and strengthen previous results on low-degree hardness, giving a combinatorial explanation of the hardness transition and of a continuum of subexponential-time algorithms that work below it, and proving tight lower bounds against low-degree polynomials for recovering rather than just detecting the signal. It also lets us analyze a new problem of distinguishing between different tensor ensembles, such as Wigner and Wishart tensors, establishing a sharp computational threshold and giving evidence of a new statistical-computational gap in the Central Limit Theorem for random tensors. Finally, we believe these cumulants are valuable mathematical objects in their own right: they generalize the free cumulants of free probability theory from matrices to tensors, and share many of their properties, including additivity under additive free convolution.
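To make the objects in the abstract above concrete, the following sketch builds an order-3 tensor consisting of a rank-one signal plus Gaussian noise and contracts copies of it into a scalar and a matrix. Several choices here are assumptions rather than the paper's setup: the noise is i.i.d. and not symmetrized, the signal vector has unit norm, and the normalization of the hypothetical parameter `snr` is arbitrary. The spectral estimate uses the standard tensor-unfolding idea as one example of a spectral algorithm; it is not claimed to be the method analyzed in the paper.

```python
import numpy as np

def spiked_tensor(n, snr, rng):
    """Return Y = snr * (v x v x v) + G with unit-norm v and i.i.d. N(0, 1) noise G."""
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    signal = snr * np.einsum("i,j,k->ijk", v, v, v)
    noise = rng.standard_normal((n, n, n))
    return signal + noise, v

def unfolding_estimate(Y):
    """Estimate v as the top left singular vector of the n x n^2 unfolding of Y."""
    n = Y.shape[0]
    M = Y.reshape(n, n * n)
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, 0]

rng = np.random.default_rng(0)
Y, v = spiked_tensor(n=30, snr=200.0, rng=rng)

# Two copies of Y contracted on all three indices give a scalar.
scalar_network = np.einsum("ijk,ijk->", Y, Y)
# Two copies contracted on two indices give an n x n matrix.
matrix_network = np.einsum("ijk,ljk->il", Y, Y)

v_hat = unfolding_estimate(Y)
print(abs(v_hat @ v))  # overlap with the planted vector, up to sign
```

The matrix-valued contraction equals the unfolding times its transpose, so its top eigenvector coincides (up to sign) with the unfolding estimate.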
Singular-value statistics (SVS) has recently been presented as a random matrix theory tool able to properly characterize non-Hermitian random matrix ensembles [PRX Quantum {\bf 4}, 040312 (2023)]. Here, we perform a numerical study of the SVS of the non-Hermitian adjacency matrices $\mathbf{A}$ of directed random graphs, where $\mathbf{A}$ are members of diluted real Ginibre ensembles. We consider two models of directed random graphs: Erd\H{o}s-R\'enyi graphs and random regular graphs. Specifically, we focus on the ratio $r$ between nearest-neighbor singular values and the minimum singular value $\lambda_{min}$. We show that $\langle r \rangle$ (where $\langle \cdot \rangle$ represents ensemble average) can effectively characterize the transition from mostly isolated vertices to almost complete graphs, while the probability density function of $\lambda_{min}$ can clearly distinguish between different graph models.
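A rough numerical sketch of the two quantities in the abstract above is given below for the directed Erd\H{o}s-R\'enyi case. It rests on assumptions the abstract does not spell out: each directed edge (self-loops included) is present independently with probability `p` and carries a standard Gaussian weight, and the ratio is taken in the common spacing-ratio form, the minimum over the maximum of two consecutive spacings between sorted singular values. The function names and the tolerance `tol` are hypothetical.

```python
import numpy as np

def diluted_ginibre_adjacency(n, p, rng):
    """Weighted adjacency matrix of a directed random graph (assumed model):
    each entry is nonzero with probability p and then drawn from N(0, 1)."""
    mask = rng.random((n, n)) < p
    weights = rng.standard_normal((n, n))
    return np.where(mask, weights, 0.0)

def mean_spacing_ratio(singular_values, tol=1e-12):
    """Average of min/max over consecutive spacings of the sorted singular values."""
    s = np.sort(singular_values)
    d = np.diff(s)
    lo = np.minimum(d[:-1], d[1:])
    hi = np.maximum(d[:-1], d[1:])
    valid = hi > tol                       # skip degenerate (repeated) values
    return float(np.mean(lo[valid] / hi[valid]))

rng = np.random.default_rng(0)
n, p, realizations = 200, 0.5, 20
ratios, min_singular_values = [], []
for _ in range(realizations):
    A = diluted_ginibre_adjacency(n, p, rng)
    s = np.linalg.svd(A, compute_uv=False)
    ratios.append(mean_spacing_ratio(s))
    min_singular_values.append(s.min())

print(np.mean(ratios))                     # ensemble average of r
print(np.mean(min_singular_values))        # average minimum singular value
```

Sweeping `p` from near 0 to 1 and recording the ensemble average would trace the transition described in the abstract; a random regular graph generator would have to be supplied separately.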
Two sequential estimators are proposed for the odds p/(1-p) and the log odds log(p/(1-p)), respectively, using independent Bernoulli random variables with parameter p as inputs. The estimators are unbiased, and guarantee that the variance of the estimation error divided by the true value of the odds, or the variance of the estimation error of the log odds, is less than a target value for any p in (0,1). The estimators are close to optimal in the sense of Wolfowitz's bound.
Our research delves into the balance between maintaining privacy and preserving statistical accuracy when dealing with multivariate data that is subject to \textit{componentwise local differential privacy} (CLDP). With CLDP, each component of the private data is made public through a separate privacy channel. This allows for varying levels of privacy protection for different components or for the privatization of each component by different entities, each with their own distinct privacy policies. We develop general techniques for establishing minimax bounds that shed light on the statistical cost of privacy in this context, as a function of the privacy levels $\alpha_1, \ldots, \alpha_d$ of the $d$ components. We demonstrate the versatility and efficiency of these techniques by presenting various statistical applications. Specifically, we examine nonparametric density and covariance estimation under CLDP, providing upper and lower bounds that match up to constant factors, as well as an associated data-driven adaptive procedure. Furthermore, we quantify the probability of extracting sensitive information from one component by exploiting the fact that, on another component which may be correlated with the first, a smaller degree of privacy protection is guaranteed.
A neural architecture with randomly initialized weights, in the infinite width limit, is equivalent to a Gaussian Random Field whose covariance function is the so-called Neural Network Gaussian Process kernel (NNGP). We prove that a reproducing kernel Hilbert space (RKHS) defined by the NNGP contains only functions that can be approximated by the architecture. To achieve a certain approximation error, the required number of neurons in each layer is determined by the RKHS norm of the target function. Moreover, the approximation can be constructed from a supervised dataset by a random multi-layer representation of an input vector, together with training of the last layer's weights. For a 2-layer NN and a domain equal to an $(n-1)$-dimensional sphere in ${\mathbb R}^n$, we compare the number of neurons required by Barron's theorem and by the multi-layer features construction. We show that if the eigenvalues of the integral operator of the NNGP decay slower than $k^{-n-\frac{2}{3}}$, where $k$ is the index of an eigenvalue, then our theorem guarantees a more succinct neural network approximation than Barron's theorem. We also run computational experiments to verify our theoretical findings. Our experiments show that realistic neural networks easily learn target functions even when neither theorem gives any guarantee.
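The construction described in the abstract above (a random multi-layer representation of the input, followed by training only the last layer's weights) can be sketched as follows for points on the unit sphere in ${\mathbb R}^3$. Everything specific in the sketch is an assumption made for illustration: ReLU activations, Gaussian weights scaled by sqrt(2 / fan_in), no biases, a ridge-regularized least-squares fit of the last layer, and the toy target function.

```python
import numpy as np

def draw_random_layers(dims, rng):
    """Random hidden-layer weights; the sqrt(2 / fan_in) scale is an assumption."""
    return [rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in)
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def features(X, layers):
    """Pass inputs through the fixed random ReLU layers."""
    H = X
    for W in layers:
        H = np.maximum(H @ W, 0.0)
    return H

def fit_last_layer(H, y, ridge=1e-3):
    """Train only the output weights by ridge-regularized least squares."""
    A = H.T @ H + ridge * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ y)

def sphere_points(m, n, rng):
    """Sample m points uniformly on the unit sphere in R^n."""
    X = rng.standard_normal((m, n))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def target(X):
    """Toy target function, chosen only for illustration."""
    return X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
X_train, X_test = sphere_points(500, 3, rng), sphere_points(500, 3, rng)
layers = draw_random_layers([3, 300], rng)   # one random hidden layer (2-layer NN)
w = fit_last_layer(features(X_train, layers), target(X_train))
test_mse = np.mean((features(X_test, layers) @ w - target(X_test)) ** 2)
print(test_mse)
```

Passing a longer list such as `[3, 300, 300]` to `draw_random_layers` gives a deeper random representation; the same fixed layers must be reused for training and evaluation.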
Graph powers are a well-studied concept in graph theory. Analogous to graph powers, Chandran et al. [3] introduced the concept of bipartite powers for bipartite graphs. In this paper, we show that some well-known classes of bipartite graphs, namely the interval bigraphs, proper interval bigraphs, and bigraphs of Ferrers dimension 2, are closed under the operation of taking bipartite powers. Finally, we define the strongly closed property for bipartite graphs under powers and show that the class of chordal bipartite graphs is strongly closed under powers.
The consistency of the maximum likelihood estimator for mixtures of elliptically-symmetric distributions as an estimator of its population version is shown, where the underlying distribution $P$ is nonparametric and does not necessarily belong to the class of mixtures on which the estimator is based. When $P$ is a mixture of sufficiently well separated but nonparametric distributions, it is shown that the components of the population version of the estimator correspond to the well separated components of $P$. This provides some theoretical justification for the use of such estimators for cluster analysis in the case that $P$ has well separated subpopulations, even if these subpopulations differ from what the mixture model assumes.
The concept of shift is often invoked to describe directional differences in statistical moments but has not yet been established as a property of individual distributions. In the present study, we define distributional shift (DS) as the concentration of frequencies towards the lowest discrete class and derive its measurement from the sum of cumulative frequencies. We use empirical datasets to demonstrate DS as an advantageous measure of ecological rarity and as a generalisable measure of poverty and scarcity. We then define relative distributional shift (RDS) as the difference in DS between distributions, yielding a uniquely signed (i.e., directional) measure. Using simulated random sampling, we show that RDS is closely related to measures of distance, divergence, intersection, and probabilistic scoring. We apply RDS to image analysis by demonstrating its performance in the detection of light events, changes in complex patterns, patterns within visual noise, and colour shifts. Altogether, DS is an intuitive statistical property that underpins a uniquely useful comparative measure.
Knowledge graphs (KGs) of real-world facts about entities and their relationships are useful resources for a variety of natural language processing tasks. However, because knowledge graphs are typically incomplete, it is useful to perform knowledge graph completion or link prediction, i.e., to predict whether a relationship not in the knowledge graph is likely to be true. This paper serves as a comprehensive survey of embedding models of entities and relationships for knowledge graph completion, summarizing up-to-date experimental results on standard benchmark datasets and pointing out potential future research directions.