
Recent works demonstrated the existence of a double-descent phenomenon for the generalization error of neural networks, where highly overparameterized models escape overfitting and achieve good test performance, at odds with the standard bias-variance trade-off described by statistical learning theory. In the present work, we explore a link between this phenomenon and the increase in the complexity and sensitivity of the function represented by the network. In particular, we study the Boolean mean dimension (BMD), a metric developed in the context of Boolean function analysis. Focusing on a simple teacher-student setting for the random feature model, we derive a theoretical analysis based on the replica method that yields an interpretable expression for the BMD in the high-dimensional regime where the number of data points, the number of features, and the input size grow to infinity. We find that, as the degree of overparameterization of the network is increased, the BMD exhibits a pronounced peak at the interpolation threshold, coinciding with the generalization error peak, and then slowly decreases towards a low asymptotic value. The same phenomenology is then traced in numerical experiments with different model classes and training setups. Moreover, we find empirically that adversarially initialized models tend to show higher BMD values, and that models that are more robust to adversarial attacks exhibit a lower BMD.
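
The paper computes the BMD analytically via the replica method; as a rough empirical counterpart, the minimal Monte Carlo sketch below assumes the standard sensitivity-analysis definition of mean dimension (total influence of the function over ±1 inputs, normalized by its variance). The random-feature model and all parameter names here are illustrative placeholders, not the paper's exact setup.

```python
import numpy as np

def boolean_mean_dimension(f, d, n_samples=2000, seed=0):
    """Monte Carlo estimate of the Boolean mean dimension of f: {-1,+1}^d -> R.

    Assumes the standard sensitivity-analysis definition: total influence
    (mean squared effect of single-coordinate flips, divided by 4) normalized
    by the variance of f over the Boolean hypercube.
    """
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_samples, d))
    fx = f(X)                                    # f evaluated on random corners
    total_influence = 0.0
    for i in range(d):
        X_flip = X.copy()
        X_flip[:, i] *= -1.0                     # flip coordinate i
        total_influence += 0.25 * np.mean((fx - f(X_flip)) ** 2)
    return total_influence / fx.var()

# Example: a random-feature regressor (weights are placeholders, not trained).
d, p = 50, 200
F = np.random.randn(p, d) / np.sqrt(d)           # random features
w = np.random.randn(p) / np.sqrt(p)              # readout weights
model = lambda X: np.tanh(X @ F.T) @ w
print("estimated BMD:", boolean_mean_dimension(model, d))
```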

Related content

Neural Networks is the archival journal of the world's three oldest neural modeling societies: the International Neural Network Society (INNS), the European Neural Network Society (ENNS), and the Japanese Neural Network Society (JNNS). Neural Networks provides a forum for developing and nurturing an international community of scholars and practitioners interested in all aspects of neural networks and related approaches to computational intelligence. Neural Networks welcomes submissions of high-quality papers that contribute to the full range of neural network research, from behavioral and brain modeling and learning algorithms, through mathematical and computational analysis, to systems engineering and technological applications that make substantial use of neural network concepts and techniques. This unique and broad scope promotes the exchange of ideas between biological and technological research and helps foster the development of an interdisciplinary community interested in biologically inspired computational intelligence. Accordingly, the Neural Networks editorial board represents expertise in fields including psychology, neurobiology, computer science, engineering, mathematics, and physics. The journal publishes articles, letters, and reviews, as well as letters to the editor, editorials, current events, software surveys, and patent information. Articles appear in one of five sections: cognitive science, neuroscience, learning systems, mathematical and computational analysis, and engineering and applications. Official website:

As the development of formal proofs is a time-consuming task, it is important to devise ways of sharing already written proofs to prevent wasting time redoing them. One of the challenges in this domain is to translate proofs written in proof assistants based on impredicative logics to proof assistants based on predicative logics, whenever impredicativity is not used in an essential way. In this paper we present a transformation for sharing proofs with a core predicative system supporting prenex universe polymorphism (as in Agda). It consists in elaborating each term into a predicative, universe-polymorphic term that is as general as possible. The use of universe polymorphism is justified by the fact that mapping each universe to a fixed one in the target theory is not sufficient in most cases. During the elaboration, we need to solve unification problems in the equational theory of universe levels. To this end, we give a complete characterization of when a single equation admits a most general unifier. This characterization is then employed in a partial algorithm which uses a constraint-postponement strategy for trying to solve unification problems. The proposed translation is of course partial, but in practice it allows one to translate many proofs that do not use impredicativity in an essential way. Indeed, it was implemented in the tool Predicativize and then used to translate semi-automatically many non-trivial developments from Matita's library to Agda, including proofs of Bertrand's Postulate and Fermat's Little Theorem, which (as far as we know) were not previously available in Agda.
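
The abstract does not spell out the level language or give unification examples; as a hypothetical illustration (not taken from the paper), Agda-style universe levels are generated by zero, successor, maximum, and level variables, and the easiest kind of equation, with a variable alone on one side and not occurring on the other, trivially admits a most general unifier:

```latex
% Hypothetical illustration of the universe-level language and a trivially solvable equation.
\[
  \ell \;::=\; 0 \;\mid\; \ell + 1 \;\mid\; \ell \sqcup \ell \;\mid\; x
  \qquad \text{(zero, successor, maximum, level variables)}
\]
\[
  x \doteq y \sqcup 3
  \quad\leadsto\quad
  \{\, x \mapsto y \sqcup 3 \,\}
  \qquad \text{(variable on the left, not occurring on the right)}
\]
```

Equations outside this easy case are where the paper's characterization matters, since a most general unifier need not exist in the equational theory of levels.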

Current deep neural networks (DNNs) are overparameterized and use most of their neuronal connections during inference for each task. The human brain, however, developed specialized regions for different tasks and performs inference with a small fraction of its neuronal connections. We propose an iterative pruning strategy introducing a simple importance-score metric that deactivates unimportant connections, tackling overparameterization in DNNs and modulating the firing patterns. The aim is to find the smallest number of connections that is still capable of solving a given task with comparable accuracy, i.e. a simpler subnetwork. We achieve comparable performance for LeNet architectures on MNIST, and significantly higher parameter compression than state-of-the-art algorithms for VGG and ResNet architectures on CIFAR-10/100 and Tiny-ImageNet. Our approach also performs well for the two different optimizers considered -- Adam and SGD. The algorithm is not designed to minimize FLOPs when considering current hardware and software implementations, although it performs reasonably when compared to the state of the art.
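
The abstract does not give the importance-score formula; the sketch below uses a plain magnitude-based score as a stand-in and only illustrates the iterative deactivate-and-retrain loop it describes. Function and attribute names (prune_step, _masks, frac) are illustrative, not the authors' implementation.

```python
import torch

def prune_step(model, frac=0.2):
    """Deactivate the lowest-scoring fraction of the still-active weights.

    Stand-in importance score: |w| (the paper introduces its own metric).
    Active/inactive connections are tracked with per-parameter binary masks.
    """
    old_masks = getattr(model, "_masks", {})
    masks, scores = {}, []
    for name, p in model.named_parameters():
        if p.dim() > 1:                                   # prune weight matrices only
            mask = old_masks.get(name, torch.ones_like(p))
            masks[name] = mask
            scores.append(p.detach().abs()[mask.bool()])  # scores of active weights
    threshold = torch.cat(scores).quantile(frac)
    for name, p in model.named_parameters():
        if name in masks:
            masks[name] = masks[name] * (p.detach().abs() > threshold).float()
            p.data.mul_(masks[name])                      # deactivate connections
    model._masks = masks                                  # reused on the next iteration
    return masks
```

In an iterative scheme, prune_step would alternate with a few epochs of retraining, with the masks re-applied to the weights after every optimizer step so that deactivated connections stay at zero.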

We discuss the approximation capability of reservoir systems whose reservoir is a recurrent neural network (RNN). In our problem setting, a reservoir system approximates a set of functions by adjusting only its linear readout while the reservoir is kept fixed. We show what we call uniform strong universality of a family of RNN reservoir systems for a certain class of functions to be approximated. This means that, for any prescribed tolerance, we can construct a sufficiently large RNN reservoir system whose approximation error, for every function in the class, is bounded from above by that tolerance. Such RNN reservoir systems are constructed via parallel concatenation of RNN reservoirs.
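
As a concrete, much simpler illustration of a reservoir system with a fixed RNN reservoir and a trainable linear readout, the echo-state-style sketch below adjusts only the readout by ridge regression. The paper's construction via parallel concatenation of RNN reservoirs is not reproduced, and all hyperparameters are illustrative.

```python
import numpy as np

class RNNReservoir:
    """Fixed (untrained) tanh RNN reservoir with a trainable linear readout."""
    def __init__(self, n_in, n_res, rho=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.5, size=(n_res, n_in))
        W = rng.normal(size=(n_res, n_res))
        self.W = rho * W / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius
        self.w_out = None                                        # the only trained part

    def states(self, u):                      # u: (T, n_in) input sequence
        h, H = np.zeros(self.W.shape[0]), []
        for u_t in u:
            h = np.tanh(self.W @ h + self.W_in @ u_t)
            H.append(h)
        return np.stack(H)

    def fit(self, u, y, ridge=1e-6):          # adjust only the linear readout
        H = self.states(u)
        self.w_out = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ y)

    def predict(self, u):
        return self.states(u) @ self.w_out

# Example: fit a scalar target from a 1-D input sequence (shapes are illustrative).
res = RNNReservoir(n_in=1, n_res=100)
u = np.linspace(0, 1, 500).reshape(-1, 1)
res.fit(u, np.sin(8 * u[:, 0]))
```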

This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well-established that in this scenario $n=O(d \log d)$ samples are typically needed. However, we provide precise results concerning the pre-factors in high-dimensional contexts and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.

Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Lojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.
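
As a minimal sketch of the discretization the abstract refers to, the network below implements x_{k+1} = x_k + (1/L) * f(x_k; theta_k) with two-layer perceptron residuals, i.e. the explicit Euler scheme for a neural ODE on [0, 1] with step 1/L. The paper's initialization scheme, training procedure, and exact scaling are not reproduced; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualEulerNet(nn.Module):
    """Depth-L residual network x_{k+1} = x_k + (1/L) * f(x_k; theta_k).

    With smoothly varying theta_k, this is the explicit Euler discretization of
    the neural ODE dx/dt = f(x(t); theta(t)) on [0, 1]. Residuals are two-layer
    perceptrons, as in the family of networks the abstract mentions.
    """
    def __init__(self, dim, width, depth):
        super().__init__()
        self.depth = depth
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, width), nn.Tanh(), nn.Linear(width, dim))
            for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x) / self.depth     # Euler step of size 1/L
        return x

net = ResidualEulerNet(dim=4, width=64, depth=100)
print(net(torch.randn(8, 4)).shape)           # torch.Size([8, 4])
```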

Heuristic tools from statistical physics have been used in the past to locate the phase transitions and compute the optimal learning and generalization errors in the teacher-student scenario in multi-layer neural networks. In this contribution, we provide a rigorous justification of these approaches for a two-layer neural network model called the committee machine. We also introduce a version of the approximate message passing (AMP) algorithm for the committee machine that allows optimal learning to be performed in polynomial time for a large set of parameters. We find that there are regimes in which a low generalization error is information-theoretically achievable while the AMP algorithm fails to deliver it, strongly suggesting that no efficient algorithm exists for those cases, and unveiling a large computational gap.
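
For concreteness, a committee machine is commonly taken to be a two-layer network whose output is the sign of a vote over K hidden sign units with fixed second-layer weights; the sketch below generates teacher-student data under that convention. The AMP algorithm itself is not reproduced, and all sizes are illustrative.

```python
import numpy as np

def committee_machine(X, W):
    """Committee machine: sign of the sum of K hidden sign units.

    X: (n, d) inputs, W: (K, d) first-layer weights; second layer fixed to +1.
    """
    return np.sign(np.sign(X @ W.T).sum(axis=1))

# Teacher-student setup: a random teacher labels i.i.d. Gaussian inputs.
d, K, n = 200, 3, 1000
rng = np.random.default_rng(0)
W_teacher = rng.normal(size=(K, d))
X = rng.normal(size=(n, d))
y = committee_machine(X, W_teacher)           # labels for the student to learn
```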

We hypothesize that, due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate a model's dependence on each modality, we compute the gain in accuracy when the model has access to that modality in addition to another one. We refer to this gain as the conditional utilization rate. In our experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since the conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
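
A minimal sketch of the conditional utilization rate as described in the abstract: the accuracy gained by giving the model access to one modality in addition to another. How "access to a subset of modalities" is implemented (e.g., zeroing out the missing modality's input) is left to a user-supplied callable and is an assumption here, not necessarily the authors' protocol.

```python
def conditional_utilization_rate(evaluate, modalities=("m1", "m2")):
    """u(m_i | m_j) = acc({m_i, m_j}) - acc({m_j}): accuracy gained by adding m_i.

    `evaluate(available)` is assumed to return test accuracy when the model only
    has access to the modalities named in `available`.
    """
    m1, m2 = modalities
    acc_both = evaluate({m1, m2})
    return {
        f"u({m1}|{m2})": acc_both - evaluate({m2}),
        f"u({m2}|{m1})": acc_both - evaluate({m1}),
    }
```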

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.

When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.

Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.
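
As a heavily simplified sketch of the general idea (not the authors' Hyper-SAGNN architecture), the module below scores a variable-sized candidate hyperedge by contrasting a position-wise "static" embedding of each node with a "dynamic" embedding obtained from self-attention over the candidate's node set; all layer sizes and the exact readout are illustrative.

```python
import torch
import torch.nn as nn

class HyperedgeScorer(nn.Module):
    """Score a variable-sized candidate hyperedge from its node embeddings."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.static = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, nodes):                 # nodes: (k, dim), k = hyperedge size
        x = nodes.unsqueeze(0)                # batch of one candidate hyperedge
        static = self.static(x)               # position-wise "static" embedding
        dynamic, _ = self.attn(x, x, x)       # self-attention over the node set
        score = torch.sigmoid(self.out((dynamic - static) ** 2))
        return score.mean()                   # probability that the tuple is a hyperedge

scorer = HyperedgeScorer(dim=16)
print(scorer(torch.randn(5, 16)))             # score for a 5-node candidate
```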
