One of the fundamental problems in machine learning is generalization. In neural network models with a large number of weights (parameters), many solutions can be found that fit the training data equally well. The key question is which solution best describes test data not in the training set. Here, we report the discovery of an exact duality (equivalence) between changes in activities in a given layer of neurons and changes in the weights that connect to the next layer of neurons in a densely connected layer in any feedforward neural network. The activity-weight (A-W) duality allows us to map variations in inputs (data) to variations of the corresponding dual weights. By using this mapping, we show that the generalization loss can be decomposed into a sum of contributions from different eigen-directions of the Hessian matrix of the loss function at the solution in weight space. The contribution from a given eigen-direction is the product of two geometric factors (determinants): the sharpness of the loss landscape and the standard deviation of the dual weights, which is found to scale with the weight norm of the solution. Our results provide a unified framework, which we use to reveal how different regularization schemes (weight decay, stochastic gradient descent with different batch sizes and learning rates, dropout), training data size, and labeling noise affect generalization performance by controlling either one or both of these two geometric determinants for generalization. These insights can be used to guide the development of algorithms for finding more generalizable solutions in overparametrized neural networks.
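The activity-weight equivalence described above can be illustrated for a single dense layer with pre-activations z = W a. The following is a minimal sketch under stated assumptions, not necessarily the construction used in the paper: it absorbs an activity perturbation `da` into an element-wise weight change `dW[i, j] = W[i, j] * da[j] / a[j]`, which requires every activity to be nonzero. The function name `dual_weight_change` and all variable names are hypothetical.

```python
import numpy as np

def dual_weight_change(W, a, da):
    """Return dW such that (W + dW) @ a equals W @ (a + da).

    Assumed element-wise construction: dW[i, j] = W[i, j] * da[j] / a[j].
    This is one possible choice and needs every a[j] to be nonzero.
    """
    a = np.asarray(a, dtype=float)
    da = np.asarray(da, dtype=float)
    if np.any(a == 0):
        raise ValueError("element-wise dual weights need nonzero activities")
    # The ratio da / a has shape (n_in,) and broadcasts over the rows of W.
    return W * (da / a)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # weights of a dense layer (4 outputs, 3 inputs)
a = rng.uniform(0.5, 1.5, size=3)    # activities of the previous layer (nonzero)
da = 0.1 * rng.normal(size=3)        # perturbation of the activities

dW = dual_weight_change(W, a, da)
lhs = W @ (a + da)                   # perturbed activities, original weights
rhs = (W + dW) @ a                   # original activities, dual weights
print(np.allclose(lhs, rhs))
```

The two pre-activation vectors agree because each term W[i, j] * da[j] is moved from the activity side to the weight side of the product.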
A method for sound field decomposition based on neural networks is proposed. The method comprises two stages: a sound field separation stage and a single-source localization stage. In the first stage, the sound pressure at the microphones produced by multiple sources is separated into the components excited by each individual sound source. In the second stage, the source location is obtained by regression from the sound pressure at the microphones produced by a single sound source. The estimated location is not affected by discretization because the second stage is designed as a regression rather than a classification. Datasets are generated by simulation using Green's function, and a neural network is trained for each frequency. Numerical experiments reveal that, compared with conventional methods, the proposed method can achieve higher source-localization accuracy and higher sound-field-reconstruction accuracy.
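The simulation step mentioned above can be sketched as follows. This is a hedged illustration, assuming point sources in free field and the three-dimensional free-field Green's function exp(-jkr) / (4 pi r) with an e^{+j omega t} time convention; the abstract does not specify the geometry or convention actually used. The function names, the speed of sound `c`, and the example positions are hypothetical.

```python
import numpy as np

def green_free_field(mic_pos, src_pos, k):
    """Free-field Green's function between every microphone and source.

    Assumed form: exp(-1j * k * r) / (4 * pi * r), where r is the distance.
    Returns an array of shape (n_mics, n_sources).
    """
    r = np.linalg.norm(mic_pos[:, None, :] - src_pos[None, :, :], axis=-1)
    return np.exp(-1j * k * r) / (4.0 * np.pi * r)

def simulate_pressure(mic_pos, src_pos, amplitudes, freq, c=343.0):
    """Return (mixture, per_source) complex pressures at a single frequency."""
    k = 2.0 * np.pi * freq / c                   # wavenumber
    G = green_free_field(mic_pos, src_pos, k)
    per_source = G * amplitudes[None, :]         # pressure due to each source alone
    mixture = per_source.sum(axis=1)             # superposition at the microphones
    return mixture, per_source

mic_pos = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
src_pos = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
amplitudes = np.array([1.0 + 0.0j, 0.5 + 0.0j])

mixture, per_source = simulate_pressure(mic_pos, src_pos, amplitudes, freq=500.0)
print(mixture.shape, per_source.shape)
```

In this sketch, `mixture` plays the role of the input to the separation stage and each column of `per_source` is a single-source pressure of the kind the localization stage would regress from.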
Science mapping is an important tool to gain insight into scientific fields, to identify emerging research trends, and to support science policy. Understanding the different ways in which different science mapping approaches capture the structure of scientific fields is critical. This paper presents a comparative analysis of two commonly used approaches, topic modeling (TM) and citation-based clustering (CC), to assess their respective strengths, weaknesses, and the characteristics of their results. We compare the two approaches using cluster-to-topic and topic-to-cluster mappings based on science maps of cardiovascular research (CVR) generated by TM and CC. Our findings reveal that relations between topics and clusters are generally weak, with limited overlap between topics and clusters. Only in a few exceptional cases do more than one-third of the documents in a topic belong to the same cluster, or vice versa. CC excels at identifying diseases and generating specialized clusters in Clinical Treatment & Surgical Procedures, while TM focuses on sub-techniques within diagnostic techniques, provides a general perspective on Clinical Treatment & Surgical Procedures, and identifies distinct topics related to practical guidelines. Our work enhances the understanding of science mapping approaches based on TM and CC and delivers practical guidance for scientometricians on how to apply these approaches effectively.
Neural networks are high-dimensional nonlinear dynamical systems that process information through the coordinated activity of many connected units. Understanding how biological and machine-learning networks function and learn requires knowledge of the structure of this coordinated activity, information contained, for example, in cross covariances between units. Self-consistent dynamical mean field theory (DMFT) has elucidated several features of random neural networks -- in particular, that they can generate chaotic activity -- however, a calculation of cross covariances using this approach has not been provided. Here, we calculate cross covariances self-consistently via a two-site cavity DMFT. We use this theory to probe spatiotemporal features of activity coordination in a classic random-network model with independent and identically distributed (i.i.d.) couplings, showing an extensive but fractionally low effective dimension of activity and a long population-level timescale. Our formulae apply to a wide range of single-unit dynamics and generalize to non-i.i.d. couplings. As an example of the latter, we analyze the case of partially symmetric couplings.
This paper presents a novel approach to construct regularizing operators for severely ill-posed Fredholm integral equations of the first kind by introducing parametrized discretization. The optimal values of discretization and regularization parameters are computed simultaneously by solving a minimization problem formulated based on a regularization parameter search criterion. The effectiveness of the proposed approach is demonstrated through examples of noisy Laplace transform inversions and the deconvolution of nuclear magnetic resonance relaxation data.
Reinforcement learning of real-world tasks is very data inefficient, and extensive simulation-based modelling has become the dominant approach for training systems. However, in human-robot interaction and many other real-world settings, there is no appropriate one-model-for-all due to differences in individual instances of the system (e.g. different people) or necessary oversimplifications in the simulation models. This leaves two options: (1) learning the individual system's dynamics approximately from data, which requires data-intensive training, or (2) using a complete digital twin of each instance, which may not be realisable in many cases. We introduce two methods, co-kriging adjustment (CKA) and ridge regression adjustment (RRA), as novel ways to combine the advantages of both options. Our adjustment methods are based on an auto-regressive AR1 co-kriging model that we integrate with GP priors. This yields a data- and simulation-efficient way of using simplistic simulation models (e.g., a simple two-link model) and rapidly adapting them to individual instances (e.g., the biomechanics of individual people). Using CKA and RRA, we obtain more accurate uncertainty quantification of the entire system's dynamics than pure GP-based and AR1 methods. We demonstrate the efficiency of co-kriging adjustment with an interpretable reinforcement learning control example, learning to control a biomechanical human arm using only a two-link arm simulation model (offline part) and CKA derived from a small amount of interaction data (on-the-fly online). Our method unlocks an efficient and uncertainty-aware way to implement reinforcement learning methods in complex real-world systems for which only imperfect simulation models exist.
We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From this analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
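Tail-averaging, as described above, can be sketched with TD(0) and linear function approximation: the usual TD update is run, and only the iterates after a burn-in index are averaged. This is a minimal illustration, assuming a constant step size and a tiny two-state Markov reward process chosen for this example; the step size, the chain, and the names `tail_averaged_td0` and `tail_start` are hypothetical and are not the specific choices analysed in the paper.

```python
import numpy as np

def tail_averaged_td0(P, r, Phi, gamma, step_size, n_steps, tail_start, seed=0):
    """Run TD(0) with linear features and average the iterates from tail_start on."""
    rng = np.random.default_rng(seed)
    n_states, dim = Phi.shape
    theta = np.zeros(dim)
    tail_sum = np.zeros(dim)
    s = 0
    for t in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])
        # TD error for the linear value estimate V(s) = Phi[s] @ theta
        td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + step_size * td_error * Phi[s]
        if t >= tail_start:
            tail_sum += theta          # accumulate only the tail iterates
        s = s_next
    return tail_sum / (n_steps - tail_start)

# Two-state chain with uniform transitions, reward 1 in state 0 and 0 in state 1.
P = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([1.0, 0.0])
Phi = np.eye(2)                        # tabular (one-hot) features
gamma = 0.5

# Exact value function for comparison: V = (I - gamma * P)^{-1} r
v_true = np.linalg.solve(np.eye(2) - gamma * P, r)

theta_bar = tail_averaged_td0(P, r, Phi, gamma, step_size=0.1,
                              n_steps=20000, tail_start=10000)
v_hat = Phi @ theta_bar
print(v_true, v_hat)
```

Discarding the first half of the iterates removes most of the dependence on the initial point `theta = 0`, which is the intuition behind the faster decay of the initial error compared with averaging all iterates.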
Developing an efficient computational scheme for high-dimensional Bayesian variable selection in generalised linear models and survival models has always been a challenging problem due to the absence of closed-form solutions for the marginal likelihood. The RJMCMC approach can be employed to sample the model and coefficients jointly, but effective design of the transdimensional jumps of RJMCMC can be challenging, making it hard to implement. Alternatively, the marginal likelihood can be derived using a data-augmentation scheme (e.g. Pólya-gamma data augmentation for logistic regression) or through other estimation methods. However, suitable data-augmentation schemes are not available for every generalised linear or survival model, and using estimation methods such as the Laplace approximation or the correlated pseudo-marginal method to derive the marginal likelihood within a locally informed proposal can be computationally expensive in "large n, large p" settings. In this paper, three main contributions are presented. Firstly, we present an extended Point-wise implementation of Adaptive Random Neighbourhood Informed proposal (PARNI) to efficiently sample models directly from the marginal posterior distribution in both generalised linear models and survival models. Secondly, in the light of the approximate Laplace approximation, we describe an efficient and accurate estimation method for the marginal likelihood which involves adaptive parameters. Thirdly, we describe a new method to adapt the algorithmic tuning parameters of the PARNI proposal by replacing the Rao-Blackwellised estimates with the combination of a warm-start estimate and an ergodic average. We present numerous numerical results from simulated data and eight high-dimensional gene fine-mapping data sets to showcase the efficiency of the novel PARNI proposal compared to the baseline add-delete-swap proposal.
This work proposes an adjacent-category autoregressive model for time series of ordinal variables. We apply this model to dendrochronological records to study the effect of climate on the intensity of spruce budworm defoliation during outbreaks at two sites in eastern Canada. The model's parameters are estimated using the maximum likelihood approach. We show that this estimator is consistent and asymptotically Gaussian. We also propose a portmanteau test for goodness-of-fit. Our study shows that the seasonal ranges of maximum daily temperatures in the spring and summer have a significant quadratic effect on defoliation. The study reveals that for both regions, a greater range of summer daily maximum temperatures is associated with lower levels of defoliation up to a threshold estimated at 22.7°C (95% CI: 0-39.7°C) in Témiscamingue and 21.8°C (95% CI: 0-54.2°C) in Matawinie. For Matawinie, a greater range in spring daily maximum temperatures increased defoliation, up to a threshold of 32.5°C (CI: 0-80.0°C). We also present a statistical test to compare the autoregressive parameter values between different fits of the model, which allows us to detect changes in the defoliation dynamics between the study sites in terms of their respective autoregression structures.
We study the machine learning task for models with operators mapping between the Wasserstein space of probability measures and a space of functions, as arise, for example, in mean-field games and control problems. Two classes of neural networks, based on bin density and on cylindrical approximation, are proposed to learn these so-called mean-field functions, and are theoretically supported by universal approximation theorems. We perform several numerical experiments for training these two mean-field neural networks, and show their accuracy and efficiency in terms of the generalization error with various test distributions. Finally, we present different algorithms relying on mean-field neural networks for solving time-dependent mean-field problems, and illustrate our results with numerical tests for the example of a semi-linear partial differential equation in the Wasserstein space of probability measures.
We hypothesize that, due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain in accuracy when the model has access to that modality in addition to another modality. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since the conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
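The accuracy gain described above can be written down directly for two modalities. The sketch below takes the abstract's wording literally and uses the absolute gain, accuracy with both modalities minus accuracy with the other modality alone; any normalisation used in the paper itself is not stated in the abstract and is therefore omitted. The function name and the accuracy values are hypothetical and purely illustrative.

```python
def conditional_utilization_rate(acc_both, acc_other_only):
    """Gain in accuracy from adding a modality on top of the other one.

    Assumption: the gain is taken as an absolute difference in accuracy.
    """
    return acc_both - acc_other_only

# Hypothetical accuracies for a two-modality model (illustrative values only).
acc_both = 0.90       # model sees modality 0 and modality 1
acc_m0_only = 0.85    # model sees only modality 0
acc_m1_only = 0.60    # model sees only modality 1

u_m1_given_m0 = conditional_utilization_rate(acc_both, acc_m0_only)
u_m0_given_m1 = conditional_utilization_rate(acc_both, acc_m1_only)
imbalance = u_m0_given_m1 - u_m1_given_m0

print(round(u_m1_given_m0, 2), round(u_m0_given_m1, 2), round(imbalance, 2))
```

With these illustrative numbers, adding modality 1 contributes little beyond modality 0, while adding modality 0 contributes a lot beyond modality 1, which is the kind of imbalance the abstract attributes to greedy learning.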