It has become an increasingly common practice for scientists in modern science and engineering to collect samples of multiple network data in which a network serves as a basic data object. The increasing prevalence of multiple network data calls for developments of models and theory that can deal with inference problems for populations of networks. In this work, we propose a general procedure for hypothesis testing of networks and in particular, for differentiating distributions of two samples of networks. We consider a very general framework which allows us to perform tests on large and sparse networks. Our contribution is two-fold: (1) We propose a test statistics based on the singular value of a generalized Wigner matrix. The asymptotic null distribution of the statistics is shown to follow the Tracy--Widom distribution as the number of nodes tends to infinity. The test also yields asymptotic power guarantee with the power tending to one under the alternative; (2) The test procedure is adapted for change-point detection in dynamic networks which is proven to be consistent in detecting the change-points. In addition to theoretical guarantees, another appealing feature of this adapted procedure is that it provides a principled and simple method for selecting the threshold that is also allowed to vary with time. Extensive simulation studies and real data analyses demonstrate the superior performance of our procedure with competitors.
The mean field (MF) theory of multilayer neural networks centers around a particular infinite-width scaling, where the learning dynamics is closely tracked by the MF limit. A random fluctuation around this infinite-width limit is expected from a large-width expansion to the next order. This fluctuation has been studied only in shallow networks, where previous works employ heavily technical notions or additional formulation ideas amenable only to that case. Treatment of the multilayer case has been missing, with the chief difficulty in finding a formulation that captures the stochastic dependency across not only time but also depth. In this work, we initiate the study of the fluctuation in the case of multilayer networks, at any network depth. Leveraging on the neuronal embedding framework recently introduced by Nguyen and Pham, we systematically derive a system of dynamical equations, called the second-order MF limit, that captures the limiting fluctuation distribution. We demonstrate through the framework the complex interaction among neurons in this second-order MF limit, the stochasticity with cross-layer dependency and the nonlinear time evolution inherent in the limiting fluctuation. A limit theorem is proven to relate quantitatively this limit to the fluctuation of large-width networks. We apply the result to show a stability property of gradient descent MF training: in the large-width regime, along the training trajectory, it progressively biases towards a solution with "minimal fluctuation" (in fact, vanishing fluctuation) in the learned output function, even after the network has been initialized at or has converged (sufficiently fast) to a global optimum. This extends a similar phenomenon previously shown only for shallow networks with a squared loss in the ERM setting, to multilayer networks with a loss function that is not necessarily convex in a more general setting.
We consider Bayesian multiple hypothesis problem with independent and identically distributed observations. The classical, Sanov's theorem-based, analysis of the error probability allows one to characterize the best achievable error exponent. However, this analysis does not generalize to the case where the true distributions of the hypothesis are not exact or partially known via some nominal distributions. This problem has practical significance, because the nominal distributions may be quantized versions of the true distributions in a hardware implementation, or they may be estimates of the true distributions obtained from labeled training sequences as in statistical classification. In this paper, we develop a type-based analysis to investigate Bayesian multiple hypothesis testing problem. Our analysis allows one to explicitly calculate the error exponent of a given type and extends the classical analysis. As a generalization of the proposed method, we derive a robust test and obtain its error exponent for the case where the hypothesis distributions are not known but there exist nominal distribution that are close to true distributions in variational distance.
The error threshold of a one-parameter family of quantum channels is defined as the largest noise level such that the quantum capacity of the channel remains positive. This in turn guarantees the existence of a quantum error correction code for noise modeled by that channel. Discretizing the single-qubit errors leads to the important family of Pauli quantum channels; curiously, multipartite entangled states can increase the threshold of these channels beyond the so-called hashing bound, an effect termed superadditivity of coherent information. In this work, we divide the simplex of Pauli channels into one-parameter families and compute numerical lower bounds on their error thresholds. We find substantial increases of error thresholds relative to the hashing bound for large regions in the Pauli simplex corresponding to biased noise, which is a realistic noise model in promising quantum computing architectures. The error thresholds are computed on the family of graph states, a special type of stabilizer state. In order to determine the coherent information of a graph state, we devise an algorithm that exploits the symmetries of the underlying graph, resulting in a substantial computational speed-up. This algorithm uses tools from computational group theory and allows us to consider symmetric graph states on a large number of vertices. Our algorithm works particularly well for repetition codes and concatenated repetition codes (or cat codes), for which our results provide the first comprehensive study of superadditivity for arbitrary Pauli channels. In addition, we identify a novel family of quantum codes based on tree graphs. The error thresholds of these tree graph states outperform repetition and cat codes in large regions of the Pauli simplex, and hence form a new code family with desirable error correction properties.
Graph Convolutional Networks (GCNs) are known to suffer from performance degradation as the number of layers increases, which is usually attributed to over-smoothing. Despite the apparent consensus, we observe that there exists a discrepancy between the theoretical understanding of over-smoothing and the practical capabilities of GCNs. Specifically, we argue that over-smoothing does not necessarily happen in practice, a deeper model is provably expressive, can converge to global optimum with linear convergence rate, and achieve very high training accuracy as long as properly trained. Despite being capable of achieving high training accuracy, empirical results show that the deeper models generalize poorly on the testing stage and existing theoretical understanding of such behavior remains elusive. To achieve better understanding, we carefully analyze the generalization capability of GCNs, and show that the training strategies to achieve high training accuracy significantly deteriorate the generalization capability of GCNs. Motivated by these findings, we propose a decoupled structure for GCNs that detaches weight matrices from feature propagation to preserve the expressive power and ensure good generalization performance. We conduct empirical evaluations on various synthetic and real-world datasets to validate the correctness of our theory.
We consider the problem of testing for long-range dependence for time-varying coefficient regression models. The covariates and errors are assumed to be locally stationary, which allows complex temporal dynamics and heteroscedasticity. We develop KPSS, R/S, V/S, and K/S-type statistics based on the nonparametric residuals, and propose bootstrap approaches equipped with a difference-based long-run covariance matrix estimator for practical implementation. Under the null hypothesis, the local alternatives as well as the fixed alternatives, we derive the limiting distributions of the test statistics, establish the uniform consistency of the difference-based long-run covariance estimator, and justify the bootstrap algorithms theoretically. In particular, the exact local asymptotic power of our testing procedure enjoys the order $O( \log^{-1} n)$, the same as that of the classical KPSS test for long memory in strictly stationary series without covariates. We demonstrate the effectiveness of our tests by extensive simulation studies. The proposed tests are applied to a COVID-19 dataset in favor of long-range dependence in the cumulative confirmed series of COVID-19 in several countries, and to the Hong Kong circulatory and respiratory dataset, identifying a new type of 'spurious long memory'.
Graph Neural Networks (GNNs) have emerged as a flexible and powerful approach for learning over graphs. Despite this success, existing GNNs are constrained by their local message-passing architecture and are provably limited in their expressive power. In this work, we propose a new GNN architecture -- the Neural Tree. The neural tree architecture does not perform message passing on the input graph, but on a tree-structured graph, called the H-tree, that is constructed from the input graph. Nodes in the H-tree correspond to subgraphs in the input graph, and they are reorganized in a hierarchical manner such that the parent of a node in the H-tree always corresponds to a larger subgraph in the input graph. We show that the neural tree architecture can approximate any smooth probability distribution function over an undirected graph. We also prove that the number of parameters needed to achieve an $\epsilon$-approximation of the distribution function is exponential in the treewidth of the input graph, but linear in its size. We prove that any continuous $\mathcal{G}$-invariant/equivariant function can be approximated by a nonlinear combination of such probability distribution functions over $\mathcal{G}$. We apply the neural tree to semi-supervised node classification in 3D scene graphs, and show that these theoretical properties translate into significant gains in prediction accuracy, over the more traditional GNN architectures. We also show the applicability of the neural tree architecture to citation networks with large treewidth, by using a graph sub-sampling technique.
In the realm of deep learning, the Fisher information matrix (FIM) gives novel insights and useful tools to characterize the loss landscape, perform second-order optimization, and build geometric learning theories. The exact FIM is either unavailable in closed form or too expensive to compute. In practice, it is almost always estimated based on empirical samples. We investigate two such estimators based on two equivalent representations of the FIM -- both unbiased and consistent. Their estimation quality is naturally gauged by their variance given in closed form. We analyze how the parametric structure of a deep neural network can affect the variance. The meaning of this variance measure and its upper bounds are then discussed in the context of deep learning.
In data analysis problems where we are not able to rely on distributional assumptions, what types of inference guarantees can still be obtained? Many popular methods, such as holdout methods, cross-validation methods, and conformal prediction, are able to provide distribution-free guarantees for predictive inference, but the problem of providing inference for the underlying regression function (for example, inference on the conditional mean $\mathbb{E}[Y|X]$) is more challenging. In the setting where the features $X$ are continuously distributed, recent work has established that any confidence interval for $\mathbb{E}[Y|X]$ must have non-vanishing width, even as sample size tends to infinity. At the other extreme, if $X$ takes only a small number of possible values, then inference on $\mathbb{E}[Y|X]$ is trivial to achieve. In this work, we study the problem in settings in between these two extremes. We find that there are several distinct regimes in between the finite setting and the continuous setting, where vanishing-width confidence intervals are achievable if and only if the effective support size of the distribution of $X$ is smaller than the square of the sample size.
We study a variation of Bayesian M-ary hypothesis testing in which the test outputs a list of L candidates out of the M possible upon processing the observation. We study the minimum error probability of list hypothesis testing, where an error is defined as the event where the true hypothesis is not in the list output by the test. We derive two exact expressions of the minimum probability or error. The first is expressed as the error probability of a certain non-Bayesian binary hypothesis test, and is reminiscent of the meta-converse bound. The second, is expressed as the tail probability of the likelihood ratio between the two distributions involved in the aforementioned non-Bayesian binary hypothesis test.
We propose a Bayesian convolutional neural network built upon Bayes by Backprop and elaborate how this known method can serve as the fundamental construct of our novel, reliable variational inference method for convolutional neural networks. First, we show how Bayes by Backprop can be applied to convolutional layers where weights in filters have probability distributions instead of point-estimates; and second, how our proposed framework leads with various network architectures to performances comparable to convolutional neural networks with point-estimates weights. In the past, Bayes by Backprop has been successfully utilised in feedforward and recurrent neural networks, but not in convolutional ones. This work symbolises the extension of the group of Bayesian neural networks which encompasses all three aforementioned types of network architectures now.