Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know whether the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider multi-layer perceptron (MLP) and Transformer architectures trained on modular arithmetic tasks, where $\xi \cdot 100\%$ of the labels are corrupted (\emph{i.e.} some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels \emph{and} achieve $100\%$ generalization at the same time; (ii) the memorizing neurons can be identified and pruned, lowering the accuracy on corrupted data and improving the accuracy on uncorrupted data; (iii) regularization methods such as weight decay, dropout and BatchNorm force the network to ignore the corrupted data during optimization, and achieve $100\%$ accuracy on the uncorrupted dataset; and (iv) the effect of these regularization methods is (``mechanistically'') interpretable: weight decay and dropout force all the neurons to learn generalizing representations, while BatchNorm de-amplifies the output of memorizing neurons and amplifies the output of the generalizing ones. Finally, we show that in the presence of regularization, the training dynamics involves two consecutive stages: first, the network undergoes \emph{grokking} dynamics, reaching high train \emph{and} test accuracy; second, it unlearns the memorizing representations, and the train accuracy suddenly drops from $100\%$ to $100 (1-\xi)\%$.
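To make the setup concrete, the following is a minimal sketch (in Python, with illustrative function names and defaults that are not taken from the paper) of a modular-addition dataset in which a fraction $\xi$ of the training labels is corrupted:
\begin{verbatim}
import numpy as np

def make_modular_addition_dataset(p=97, xi=0.2, train_frac=0.5, seed=0):
    """Build (a, b) -> (a + b) mod p pairs and corrupt a fraction xi of the
    training labels with random incorrect residues. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    pairs = np.array([(a, b) for a in range(p) for b in range(p)])
    labels = (pairs[:, 0] + pairs[:, 1]) % p

    # random train/test split
    idx = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train_idx, test_idx = idx[:n_train], idx[n_train:]

    # corrupt xi * 100% of the training labels with a nonzero (hence wrong) shift
    train_labels = labels[train_idx].copy()
    corrupt_idx = rng.choice(n_train, size=int(xi * n_train), replace=False)
    offsets = rng.integers(1, p, size=len(corrupt_idx))
    train_labels[corrupt_idx] = (train_labels[corrupt_idx] + offsets) % p

    return (pairs[train_idx], train_labels), (pairs[test_idx], labels[test_idx])
\end{verbatim}
The test labels are left uncorrupted, so reaching $100\%$ test accuracy together with $100\%$ train accuracy means the network both generalizes and memorizes the corrupted labels, as in point (i) above.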
Generalization to new samples is a fundamental rationale for statistical modeling. For this purpose, model validation is particularly important, but recent work in survey inference has suggested that simple aggregation of individual prediction scores does not give a good measure of the score for population aggregate estimates. In this manuscript we explain why this occurs, propose two scoring metrics designed specifically for this problem, and demonstrate their use in three different ways. We show that these scoring metrics correctly order models when compared to the true score, although they do underestimate the magnitude of the score. We demonstrate these metrics on a problem in survey research, where multilevel regression and poststratification (MRP) has been used extensively to adjust convenience and low-response surveys to make population and subpopulation estimates.
In this work we consider the Allen--Cahn equation, a prototypical model problem in nonlinear dynamics that exhibits bifurcations corresponding to variations of a deterministic bifurcation parameter. Going beyond the state-of-the-art, we introduce a random coefficient function in the linear reaction part of the equation, thereby accounting for random, spatially-heterogeneous effects. Importantly, we assume a spatially constant, deterministic mean value of the random coefficient. We show that this mean value is in fact a bifurcation parameter in the Allen--Cahn equation with random coefficients. Moreover, we show that the bifurcation points and bifurcation curves become random objects. We consider two distinct modelling situations: (i) for a spatially homogeneous coefficient we derive analytical expressions for the distribution of the bifurcation points and show that the bifurcation curves are random shifts of a fixed reference curve; (ii) for a spatially heterogeneous coefficient we employ a generalized polynomial chaos expansion to approximate the statistical properties of the random bifurcation points and bifurcation curves. We present numerical examples in 1D physical space, where we combine the popular software package Continuation Core and Toolboxes (CoCo) for numerical continuation and the Sparse Grids Matlab Kit for the polynomial chaos expansion. Our exposition addresses both dynamical systems and uncertainty quantification, highlighting how analytical and numerical tools from both areas can be combined efficiently for the challenging uncertainty quantification analysis of bifurcations in random differential equations.
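For concreteness, one representative form of the model is
\begin{equation*}
  \partial_t u(t,x,\omega) \;=\; \varepsilon^2\,\partial_{xx} u(t,x,\omega)
  \;+\; a(x,\omega)\,u(t,x,\omega) \;-\; u(t,x,\omega)^3,
  \qquad
  \mathbb{E}\left[a(x,\omega)\right] \equiv \lambda,
\end{equation*}
where the spatially constant, deterministic mean $\lambda$ plays the role of the bifurcation parameter and the fluctuations of the random coefficient $a(x,\omega)$ render the bifurcation points and curves random. The precise equation and scaling used in the paper may differ, so this display should be read as an assumption consistent with the abstract rather than a quotation.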
Bayesian factor analysis is routinely used for dimensionality reduction in modeling of high-dimensional covariance matrices. Factor analytic decompositions express the covariance as a sum of a low rank and diagonal matrix. In practice, Gibbs sampling algorithms are typically used for posterior computation, alternating between updating the latent factors, loadings, and residual variances. In this article, we exploit a blessing of dimensionality to develop a provably accurate pseudo-posterior for the covariance matrix that bypasses the need for Gibbs or other variants of Markov chain Monte Carlo sampling. Our proposed Factor Analysis with BLEssing of dimensionality (FABLE) approach relies on a first-stage singular value decomposition (SVD) to estimate the latent factors, and then defines a jointly conjugate prior for the loadings and residual variances. The accuracy of the resulting pseudo-posterior for the covariance improves with increasing dimensionality. We show, both theoretically and through simulation experiments, that FABLE has excellent performance in high-dimensional covariance matrix estimation, including producing well-calibrated credible intervals. We also demonstrate the strength of our approach in terms of accurate inference and computational efficiency by applying it to a gene expression data set.
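The two-stage construction (SVD for the latent factors, then loadings and residual variances) can be sketched as follows; this Python snippet only illustrates the low-rank-plus-diagonal point estimate with hypothetical function names, and is not the FABLE pseudo-posterior itself:
\begin{verbatim}
import numpy as np

def fable_style_covariance(Y, k):
    """Sketch: (1) estimate k latent factors via SVD of the centered data
    matrix Y (n x p); (2) regress each variable on the factors to obtain
    loadings and residual variances; (3) assemble Cov ~ Lambda Lambda' + diag."""
    n, p = Y.shape
    Yc = Y - Y.mean(axis=0)                      # center columns
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    F = np.sqrt(n) * U[:, :k]                    # estimated factors (n x k), F'F/n = I
    Lambda = Yc.T @ F / n                        # least-squares loadings (p x k)
    resid = Yc - F @ Lambda.T
    sigma2 = (resid ** 2).mean(axis=0)           # residual variances
    return Lambda @ Lambda.T + np.diag(sigma2)   # low rank + diagonal
\end{verbatim}
The paper's contribution is the jointly conjugate pseudo-posterior built around this first-stage SVD, whose accuracy improves as the dimension grows.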
Bilevel optimization, with broad applications in machine learning, has an intricate hierarchical structure. Gradient-based methods have emerged as a common approach to large-scale bilevel problems. However, the computation of the hyper-gradient, which involves a Hessian inverse vector product, limits efficiency and is regarded as a bottleneck. To circumvent the inverse, we construct a sequence of low-dimensional approximate Krylov subspaces with the aid of the Lanczos process. As a result, the constructed subspace is able to dynamically and incrementally approximate the Hessian inverse vector product with less effort and thus leads to a favorable estimate of the hyper-gradient. Moreover, we propose a~provable subspace-based framework for bilevel problems where one central step is to solve a small-size tridiagonal linear system. To the best of our knowledge, this is the first time that subspace techniques are incorporated into bilevel optimization. This successful trial not only enjoys an $\mathcal{O}(\epsilon^{-1})$ convergence rate but also demonstrates efficiency on a synthetic problem and two deep learning tasks.
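As a rough illustration of the underlying idea, the sketch below approximates a Hessian inverse vector product from a few Lanczos iterations followed by a small tridiagonal solve; it is a generic textbook construction with hypothetical names, not the paper's algorithm:
\begin{verbatim}
import numpy as np

def lanczos_inverse_hvp(hvp, v, m=10):
    """Approximate H^{-1} v with m Lanczos steps: build an orthonormal basis Q
    of the Krylov subspace span{v, Hv, ..., H^{m-1} v}, form the small
    tridiagonal matrix T = Q' H Q, and solve the m x m system instead of
    inverting H. `hvp(x)` returns the Hessian-vector product H x."""
    n = v.shape[0]
    Q = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m)
    q, q_prev, b_prev = v / np.linalg.norm(v), np.zeros(n), 0.0
    for j in range(m):
        Q[:, j] = q
        w = hvp(q) - b_prev * q_prev
        alpha[j] = q @ w
        w = w - alpha[j] * q
        b_prev = np.linalg.norm(w)
        beta[j] = b_prev
        if b_prev < 1e-10:                       # early termination (invariant subspace)
            m = j + 1
            break
        q_prev, q = q, w / b_prev
    T = np.diag(alpha[:m]) + np.diag(beta[:m-1], 1) + np.diag(beta[:m-1], -1)
    rhs = np.zeros(m); rhs[0] = np.linalg.norm(v)    # Q' v = ||v|| e_1
    return Q[:, :m] @ np.linalg.solve(T, rhs)        # ~ H^{-1} v
\end{verbatim}
In a bilevel setting, an approximation of this kind would be applied to the lower-level Hessian when assembling the hyper-gradient; the incremental subspace construction and the convergence analysis are specific to the paper.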
Differential abundance analysis is a key component of microbiome studies. It focuses on the task of assessing the magnitude and statistical significance of differences in microbial abundances between conditions. While dozens of methods for differential abundance analysis exist, they have been reported to produce remarkably discordant results. Currently, there is no consensus on the preferred methods. While correctness of results in differential abundance analysis is an ambiguous concept that cannot be evaluated without employing simulated data, we argue that consistency of results across datasets should be considered an essential quality of a well-performing method. We compared the performance of 13 differential abundance analysis methods employing datasets from multiple (N = 54) taxonomic profiling studies based on 16S rRNA gene or shotgun sequencing. For each method, we examined how the results replicated between random partitions of each dataset and between datasets from independent studies. While certain methods showed good consistency, some widely used methods were observed to make a substantial number of conflicting findings. Overall, the highest consistency without unnecessary reduction in sensitivity was attained by analyzing total sum scaling (TSS) normalized counts with a non-parametric method (Wilcoxon test or ordinal regression model) or linear regression (MaAsLin2). Comparable performance was also attained by analyzing presence/absence of taxa with logistic regression. In conclusion, while numerous sophisticated methods for differential abundance analysis have been developed, elementary methods seem to provide more consistent results without unnecessarily compromising sensitivity. We therefore suggest that elementary methods should be preferred in microbial differential abundance analysis when replicability needs to be emphasized.
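A minimal sketch of the best-performing elementary strategy (TSS normalization followed by a per-taxon Wilcoxon rank-sum test, with a Benjamini--Hochberg correction added here for completeness) might look as follows; the function names and the correction step are illustrative choices, not the benchmark code used in the study:
\begin{verbatim}
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def tss_wilcoxon(counts, group):
    """counts: samples x taxa matrix of read counts; group: 0/1 condition labels.
    Returns per-taxon p-values, BH-adjusted q-values, and rejection flags."""
    rel = counts / counts.sum(axis=1, keepdims=True)   # total sum scaling
    g0, g1 = rel[group == 0], rel[group == 1]
    pvals = np.array([
        mannwhitneyu(g0[:, j], g1[:, j], alternative="two-sided").pvalue
        for j in range(rel.shape[1])
    ])
    reject, qvals, _, _ = multipletests(pvals, method="fdr_bh")
    return pvals, qvals, reject
\end{verbatim}
The presence/absence alternative mentioned above would instead regress taxon occurrence on the condition with logistic regression.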
Functional data analysis is an important research field in statistics that treats data as random functions drawn from some infinite-dimensional functional space, and functional principal component analysis (FPCA) based on eigen-decomposition plays a central role in data reduction and representation. After nearly three decades of research, a key problem remains unsolved, namely, the perturbation analysis of the covariance operator for a diverging number of eigencomponents obtained from noisy and discretely observed data. This is fundamental for studying models and methods based on FPCA, yet there has not been substantial progress since the result of Hall, M\"uller and Wang (2006) for a fixed number of eigenfunction estimates. In this work, we aim to establish a unified theory for this problem, obtaining upper bounds for eigenfunctions with diverging indices in both the $\mathcal{L}^2$ and supremum norms, and deriving the asymptotic distributions of eigenvalues for a wide range of sampling schemes. Our results provide insight into when the $\mathcal{L}^{2}$ bound of eigenfunction estimates with diverging indices is minimax optimal, as if the curves were fully observed, and reveal the transition of convergence rates from nonparametric to parametric regimes in connection to sparse or dense sampling. We also develop a double truncation technique to handle the uniform convergence of estimated covariance and eigenfunctions. The technical arguments in this work are useful for handling the perturbation series with noisy and discretely observed functional data and can be applied to models based on FPCA, including those involving inverse problems with FPCA as regularization, such as functional linear regression.
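For context, the eigen-decomposition underlying FPCA can be written, in standard notation not specific to this paper, as
\begin{equation*}
  C(s,t) \;=\; \operatorname{Cov}\{X(s),X(t)\} \;=\; \sum_{k \ge 1} \lambda_k\, \phi_k(s)\, \phi_k(t),
  \qquad
  X_i(t) \;=\; \mu(t) + \sum_{k \ge 1} \xi_{ik}\, \phi_k(t),
\end{equation*}
with the curves observed only at discrete points and with noise, $Y_{ij} = X_i(T_{ij}) + \varepsilon_{ij}$. The perturbation analysis above concerns the behavior of the estimated pairs $(\hat\lambda_k, \hat\phi_k)$ when the index $k$ is allowed to diverge with the sample size.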
The beneficial role of noise-injection in learning is a well-established concept in the field of artificial neural networks, suggesting that even biological systems might take advantage of similar mechanisms to optimize their performance. The training-with-noise algorithm proposed by Gardner and collaborators is an emblematic example of a noise-injection procedure in recurrent networks, which can be used to model biological neural systems. We show how adding structure to noisy training data can substantially improve the algorithm's performance, allowing the network to approach perfect retrieval of the memories and wide basins of attraction, even in the scenario of maximal injected noise. We also prove that the so-called Hebbian Unlearning rule coincides with the training-with-noise algorithm when noise is maximal and data are stable fixed points of the network dynamics.
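For reference, a minimal sketch of the classic Hebbian Unlearning procedure mentioned above is given below: couplings are built from Hebb's rule and then weakly ``unlearned'' at spurious fixed points reached from random initial states. The parameter values are illustrative, and the paper's training-with-noise setting is not reproduced here:
\begin{verbatim}
import numpy as np

def hebbian_weights(patterns):
    """Hebbian couplings J_ij = (1/N) sum_mu xi_i^mu xi_j^mu with zero diagonal.
    patterns: (P, N) array of +/-1 memories."""
    N = patterns.shape[1]
    J = patterns.T @ patterns / N
    np.fill_diagonal(J, 0.0)
    return J

def hebbian_unlearning(J, n_steps=1000, eps=0.01, max_relax=50, seed=0):
    """Repeatedly relax the network from a random state to a fixed point of the
    synchronous sign dynamics and subtract a small outer product of that state."""
    rng = np.random.default_rng(seed)
    N = J.shape[0]
    for _ in range(n_steps):
        s = rng.choice([-1.0, 1.0], size=N)
        for _ in range(max_relax):
            s_new = np.sign(J @ s)
            s_new[s_new == 0] = 1.0
            if np.array_equal(s_new, s):
                break
            s = s_new
        J = J - (eps / N) * np.outer(s, s)       # unlearn the reached attractor
        np.fill_diagonal(J, 0.0)
    return J
\end{verbatim}
The result stated above is that this rule emerges from training-with-noise in the maximal-noise limit, when the noisy data are stable fixed points of the network dynamics.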
We hypothesize that due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain in accuracy when the model has access to that modality in addition to another one. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since the conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
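Since the conditional utilization rate is just a difference of accuracies, it can be written in a few lines; the numbers below are made up purely for illustration:
\begin{verbatim}
def conditional_utilization_rate(acc_both, acc_other_only):
    """u(m1 | m2) = acc({m1, m2}) - acc({m2}): accuracy gained by giving the
    model modality m1 in addition to modality m2 (e.g. by unmasking its input)."""
    return acc_both - acc_other_only

# hypothetical accuracies for an RGB + depth model
u_rgb_given_depth = conditional_utilization_rate(acc_both=0.92, acc_other_only=0.71)
u_depth_given_rgb = conditional_utilization_rate(acc_both=0.92, acc_other_only=0.90)
# a large gap between the two rates indicates greedy reliance on one modality
\end{verbatim}
The practical difficulty noted above is that evaluating these accuracies repeatedly during training is expensive, which motivates the conditional learning speed as a cheap proxy.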
We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and address two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
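For context, a classical bound of the kind being improved controls the expected generalization gap through the mutual information between the training sample $S$ of size $n$ and the algorithm output $W$: for a $\sigma$-sub-Gaussian loss,
\begin{equation*}
  \bigl|\mathbb{E}\!\left[\operatorname{gen}(W,S)\right]\bigr| \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W;S)} .
\end{equation*}
This display is included only for orientation and is not the paper's new bound; the bounds proposed here instead measure the information carried by the predictions, which is what makes them meaningful for deterministic algorithms and easier to estimate.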
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic across learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving strong performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.
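As a schematic of how a variable-sized hyperedge can be scored with self-attention, consider the sketch below; it only illustrates the general idea of combining per-node ``static'' and attention-based ``dynamic'' embeddings into a tuple-wise probability, and is not the Hyper-SAGNN architecture itself:
\begin{verbatim}
import numpy as np

def score_hyperedge(X, Wq, Wk, Wv, Ws, w_out, b_out):
    """X: (k, d) features of the k nodes in a candidate hyperedge.
    Each node gets a 'dynamic' embedding by attending to the other nodes in the
    tuple and a 'static' embedding from its own features; per-node scores are
    averaged, so tuples of any size k >= 2 can be handled."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = Q @ K.T / np.sqrt(K.shape[1])
    np.fill_diagonal(A, -np.inf)                 # attend only to the *other* nodes
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    dynamic = np.tanh(A @ V)                     # (k, d') dynamic embeddings
    static = np.tanh(X @ Ws)                     # (k, d') static embeddings
    diff = (dynamic - static) ** 2               # position-wise squared difference
    logits = diff @ w_out + b_out                # per-node score
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs.mean()                          # probability that the tuple is a hyperedge
\end{verbatim}
Because the score is an average over per-node terms, the same function applies to tuples of arbitrary size, which is the variable-hyperedge-size property emphasized above.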