Unraveling the emergence of collective learning in systems of coupled artificial neural networks points to broader implications for machine learning, neuroscience, and society. Here we introduce a minimal model that condenses several recent decentralized algorithms by considering a competition between two terms: the local learning dynamics in the parameters of each neural network unit, and a diffusive coupling among units that tends to homogenize the parameters of the ensemble. We derive an effective theory for linear networks to show that the coarse-grained behavior of our system is equivalent to a deformed Ginzburg-Landau model with quenched disorder. This framework predicts depth-dependent disorder-order-disorder phase transitions in the parameters' solutions that reveal a depth-delayed onset of a collective learning phase and a low-rank microscopic learning path. We validate the theory in coupled ensembles of realistic neural networks trained on the MNIST dataset under privacy constraints. Interestingly, experiments confirm that individual networks -- trained on private data -- can fully generalize to unseen data classes when the collective learning phase emerges. Our work establishes the physics of collective learning and contributes to the mechanistic interpretability of deep learning in decentralized settings.
Laguerre spectral approximations play an important role in the development of efficient algorithms for problems in unbounded domains. In this paper, we present a comprehensive convergence rate analysis of Laguerre spectral approximations for analytic functions. By exploiting contour integral techniques from complex analysis, we prove that Laguerre projection and interpolation methods of degree $n$ converge at the root-exponential rate $O(\exp(-2\rho\sqrt{n}))$ with $\rho>0$ when the underlying function is analytic inside and on a parabola with focus at the origin and vertex at $z=-\rho^2$. As far as we know, this is the first rigorous proof of root-exponential convergence of Laguerre approximations for analytic functions. Several important applications of our analysis are also discussed, including Laguerre spectral differentiations, Gauss-Laguerre quadrature rules, the scaling factor and the Weeks method for the inversion of Laplace transform, and some sharp convergence rate estimates are derived. Numerical experiments are presented to verify the theoretical results.
Multiscale stochastic dynamical systems have been widely adopted to a variety of scientific and engineering problems due to their capability of depicting complex phenomena in many real world applications. This work is devoted to investigating the effective dynamics for slow-fast stochastic dynamical systems. Given observation data on a short-term period satisfying some unknown slow-fast stochastic systems, we propose a novel algorithm including a neural network called Auto-SDE to learn invariant slow manifold. Our approach captures the evolutionary nature of a series of time-dependent autoencoder neural networks with the loss constructed from a discretized stochastic differential equation. Our algorithm is also validated to be accurate, stable and effective through numerical experiments under various evaluation metrics.
We prove that a maintenance problem on frequency-constrained maintenance jobs with a hierarchical structure is integer-factorization hard. This result holds even on simple systems with just two components to maintain. As a corollary, we provide a first hardness result for Levi et al.'s modular maintenance scheduling problem (Naval Research Logistics 61, 472-488, 2014).
We present and analyze an algorithm designed for addressing vector-valued regression problems involving possibly infinite-dimensional input and output spaces. The algorithm is a randomized adaptation of reduced rank regression, a technique to optimally learn a low-rank vector-valued function (i.e. an operator) between sampled data via regularized empirical risk minimization with rank constraints. We propose Gaussian sketching techniques both for the primal and dual optimization objectives, yielding Randomized Reduced Rank Regression (R4) estimators that are efficient and accurate. For each of our R4 algorithms we prove that the resulting regularized empirical risk is, in expectation w.r.t. randomness of a sketch, arbitrarily close to the optimal value when hyper-parameteres are properly tuned. Numerical expreriments illustrate the tightness of our bounds and show advantages in two distinct scenarios: (i) solving a vector-valued regression problem using synthetic and large-scale neuroscience datasets, and (ii) regressing the Koopman operator of a nonlinear stochastic dynamical system.
We discuss a class of coupled system of nonlocal balance laws modeling multilane traffic, with the nonlocality present in both convective and source terms. The uniqueness and existence of the entropy solution is proven via doubling of the variables arguments and convergent finite volume approximations, respectively. The numerical approximations are proven to converge to the unique entropy solution of the system at the rate $\sqrt{\Delta t}$. The applicability of the proven theory to a general class of systems of nonlocal balance laws coupled strongly through the convective part and weakly through the source part, is also indicated. Numerical simulations illustrating the theory and the behavior of the entropy solution as the support of the kernel goes to zero(nonlocal to local limit), are shown.
Large machine learning models are revolutionary technologies of artificial intelligence whose bottlenecks include huge computational expenses, power, and time used both in the pre-training and fine-tuning process. In this work, we show that fault-tolerant quantum computing could possibly provide provably efficient resolutions for generic (stochastic) gradient descent algorithms, scaling as O(T^2 polylog(n)), where n is the size of the models and T is the number of iterations in the training, as long as the models are both sufficiently dissipative and sparse, with small learning rates. Based on earlier efficient quantum algorithms for dissipative differential equations, we find and prove that similar algorithms work for (stochastic) gradient descent, the primary algorithm for machine learning. In practice, we benchmark instances of large machine learning models from 7 million to 103 million parameters. We find that, in the context of sparse training, a quantum enhancement is possible at the early stage of learning after model pruning, motivating a sparse parameter download and re-upload scheme. Our work shows solidly that fault-tolerant quantum algorithms could potentially contribute to most state-of-the-art, large-scale machine-learning problems.
The statistical analysis of group studies in neuroscience is particularly challenging due to the complex spatio-temporal nature of the data, its multiple levels and the inter-individual variability in brain responses. In this respect, traditional ANOVA-based studies and linear mixed effects models typically provide only limited exploration of the dynamic of the group brain activity and variability of the individual responses potentially leading to overly simplistic conclusions and/or missing more intricate patterns. In this study we propose a novel method based on functional Principal Components Analysis and Bayesian model-based clustering to simultaneously assess group effects and individual deviations over the most important temporal features in the data. This method provides a thorough exploration of group differences and individual deviations in neuroscientific group studies without compromising on the spatio-temporal nature of the data. By means of a simulation study we demonstrate that the proposed model returns correct classification in different clustering scenarios under low and high of noise levels in the data. Finally we consider a case study using Electroencephalogram data recorded during an object recognition task where our approach provides new insights into the underlying brain mechanisms generating the data and their variability.
We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
Deep learning is usually described as an experiment-driven field under continuous criticizes of lacking theoretical foundations. This problem has been partially fixed by a large volume of literature which has so far not been well organized. This paper reviews and organizes the recent advances in deep learning theory. The literature is categorized in six groups: (1) complexity and capacity-based approaches for analyzing the generalizability of deep learning; (2) stochastic differential equations and their dynamic systems for modelling stochastic gradient descent and its variants, which characterize the optimization and generalization of deep learning, partially inspired by Bayesian inference; (3) the geometrical structures of the loss landscape that drives the trajectories of the dynamic systems; (4) the roles of over-parameterization of deep neural networks from both positive and negative perspectives; (5) theoretical foundations of several special structures in network architectures; and (6) the increasingly intensive concerns in ethics and security and their relationships with generalizability.
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.