
It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parameterizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parameterization of interest (non-linear but regular networks) no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., a small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension $d$. Our main results characterize a hierarchical property, the "merged-staircase property", that is both necessary and nearly sufficient for learning in this setting. We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a low-dimensional latent space, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost-orthogonal functions.
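As a concrete illustration of the setting, here is a minimal sketch of a depth-2 network in the mean-field scaling trained by online SGD on a hypothetical staircase-type target depending on the first three of $d$ binary coordinates. The target, width, and step size are illustrative choices, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, lr, steps = 50, 512, 0.05, 20000

# Hypothetical merged-staircase-style target on the first 3 coordinates.
def target(x):
    return x[:, 0] + x[:, 0] * x[:, 1] + x[:, 0] * x[:, 1] * x[:, 2]

# Depth-2 network f(x) = (1/width) * sum_j a_j relu(w_j . x + b_j).
W = rng.normal(0, 1 / np.sqrt(d), (width, d))
b = rng.normal(0, 1, width)
a = rng.normal(0, 1, width)

for _ in range(steps):
    x = rng.choice([-1.0, 1.0], size=(1, d))   # one fresh sample per step (online SGD)
    y = target(x)
    pre = x @ W.T + b                          # pre-activations, shape (1, width)
    act = np.maximum(pre, 0.0)
    pred = act @ a / width                     # mean-field 1/width output scaling
    err = pred - y
    # Gradients of 0.5 * err^2; the lr * width factor is the standard
    # mean-field step size that keeps per-neuron movement O(1).
    ga = err * act[0] / width
    gpre = err * (a / width) * (pre[0] > 0)
    a -= lr * width * ga
    W -= lr * width * np.outer(gpre, x[0])
    b -= lr * width * gpre
```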

Related content

Neural Networks is the archival journal of the world's three oldest neural modeling societies: the International Neural Network Society (INNS), the European Neural Network Society (ENNS), and the Japanese Neural Network Society (JNNS). Neural Networks provides a forum for developing and nurturing an international community of scholars and practitioners interested in all aspects of neural networks and related approaches to computational intelligence. Neural Networks welcomes high-quality submissions that contribute to the full range of neural network research, from behavioral and brain modeling and learning algorithms, through mathematical and computational analysis, to systems engineering and technological applications that make substantial use of neural network concepts and techniques. This unique and broad scope promotes the exchange of ideas between biological and technological research and helps foster the development of an interdisciplinary community interested in biologically inspired computational intelligence. Accordingly, the fields of expertise represented on the Neural Networks editorial board include psychology, neurobiology, computer science, engineering, mathematics, and physics. The journal publishes articles, letters, and reviews, as well as letters to the editor, editorials, current events, software surveys, and patent information. Articles appear in one of five sections: Cognitive Science, Neuroscience, Learning Systems, Mathematical and Computational Analysis, and Engineering and Applications. Official website:

We study a new two-time-scale stochastic gradient method for solving optimization problems, where the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying Markov random processes parameterized by the underlying optimization variable. These time-varying samples make gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one scale is to estimate the true gradient from these samples, which is then used to update the estimate of the optimal solution. While these two iterates are implemented simultaneously, the former is updated "faster" (using bigger step sizes) than the latter (using smaller step sizes). Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, convexity, the Polyak-Łojasiewicz condition, and general non-convexity. We apply our framework to two problems in control and reinforcement learning. First, we look at the standard online actor-critic algorithm over finite state and action spaces and derive a convergence rate of $O(k^{-2/5})$, which recovers the best known rate derived specifically for this problem. Second, we study an online actor-critic algorithm for the linear-quadratic regulator and show that a convergence rate of $O(k^{-2/3})$ is achieved. This is the first time such a result is known in the literature. Finally, we support our theoretical analysis with numerical simulations where the convergence rates are visualized.
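The two-time-scale structure can be sketched as follows; the AR(1) noise is a stand-in for the paper's Markovian samples, and the quadratic objective and step-size exponents are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient x is observed only
# through correlated noise (a stand-in for the Markovian samples).
def noisy_grad(x, state):
    state = 0.9 * state + 0.1 * rng.normal(size=x.shape)  # AR(1) noise process
    return x + state, state

x = rng.normal(size=5)       # slow iterate: estimate of the optimal solution
g = np.zeros(5)              # fast iterate: running estimate of the true gradient
state = np.zeros(5)

for k in range(1, 5001):
    alpha = 1.0 / k**0.6     # fast (bigger) step size
    beta = 1.0 / k           # slow (smaller) step size
    sample, state = noisy_grad(x, state)
    g += alpha * (sample - g)   # fast scale: average out the biased samples
    x -= beta * g               # slow scale: descend along the averaged gradient

print(np.linalg.norm(x))     # should be close to 0, the minimizer
```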

Operator learning for complex nonlinear operators is increasingly common in modeling physical systems. However, training machine learning methods to learn such operators requires a large amount of expensive, high-fidelity data. In this work, we present a composite Deep Operator Network (DeepONet) that learns from two datasets with different levels of fidelity, in order to accurately learn complex operators when sufficient high-fidelity data is not available. Additionally, we demonstrate that the presence of low-fidelity data can improve the predictions of physics-informed learning with DeepONets.
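The composition idea can be sketched in a simplified scalar-regression form; the paper's composite DeepONet uses branch/trunk networks over function-valued inputs, whereas here `f_lo`, `f_hi`, and all hyper-parameters are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Abundant low-fidelity (cheap, systematically biased) data plus a few
# high-fidelity samples of the same underlying map.
f_hi = lambda u: torch.sin(8 * u)
f_lo = lambda u: 0.9 * torch.sin(8 * u + 0.3)

u_lo = torch.rand(2000, 1); y_lo = f_lo(u_lo)
u_hi = torch.rand(30, 1);   y_hi = f_hi(u_hi)     # scarce high-fidelity data

mlp = lambda i, o: nn.Sequential(nn.Linear(i, 64), nn.Tanh(), nn.Linear(64, o))
net_lo = mlp(1, 1)        # trained on the abundant low-fidelity set
net_corr = mlp(2, 1)      # maps (u, net_lo(u)) to a high-fidelity prediction

opt = torch.optim.Adam(net_lo.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    ((net_lo(u_lo) - y_lo) ** 2).mean().backward()
    opt.step()

opt = torch.optim.Adam(net_corr.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    pred = net_corr(torch.cat([u_hi, net_lo(u_hi).detach()], dim=1))
    ((pred - y_hi) ** 2).mean().backward()
    opt.step()
```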

We extend the Deep Galerkin Method (DGM) introduced in Sirignano and Spiliopoulos (2018) to solve a number of partial differential equations (PDEs) that arise in the context of optimal stochastic control and mean field games. First, we consider PDEs where the function is constrained to be positive and integrate to unity, as is the case with Fokker-Planck equations. Our approach involves reparameterizing the solution as the exponential of a neural network appropriately normalized to ensure both requirements are satisfied. This then gives rise to a nonlinear partial integro-differential equation (PIDE) where the integral appearing in the equation is handled by a novel application of importance sampling. Secondly, we tackle a number of Hamilton-Jacobi-Bellman (HJB) equations that appear in stochastic optimal control problems. The key contribution is that these equations are approached in their unsimplified primal form which includes an optimization problem as part of the equation. We extend the DGM algorithm to solve for the value function and the optimal control simultaneously by characterizing both as deep neural networks. Training the networks is performed by taking alternating stochastic gradient descent steps for the two functions, a technique inspired by policy improvement algorithms (PIA).
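A minimal sketch of the alternating scheme on a toy 1-D HJB equation (dynamics $dx = a\,dt$, running cost $x^2 + a^2$, true value $V(x) = x^2$); the network sizes, sampling domain, and step counts are illustrative, not the paper's setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = lambda: nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
V, A = mlp(), mlp()                      # value network and control network
opt_V = torch.optim.Adam(V.parameters(), lr=1e-3)
opt_A = torch.optim.Adam(A.parameters(), lr=1e-3)

def hamiltonian(x):
    """HJB residual 0 = x^2 + a^2 + V'(x) a for dynamics dx = a dt."""
    x = x.requires_grad_(True)
    vx = torch.autograd.grad(V(x).sum(), x, create_graph=True)[0]  # V'(x)
    a = A(x)
    return x**2 + a**2 + vx * a

for _ in range(5000):
    x = 4 * torch.rand(256, 1) - 2       # collocation points in [-2, 2]
    # Value step: drive the PDE residual to zero under the current control.
    opt_V.zero_grad(); (hamiltonian(x) ** 2).mean().backward(); opt_V.step()
    # Control step: minimize the Hamiltonian (the optimization inside the PDE).
    opt_A.zero_grad(); hamiltonian(x).mean().backward(); opt_A.step()
# Ideally A(x) approaches the optimal control -x and V'(x) approaches 2x.
```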

The success of large-scale models in recent years has increased the importance of statistical models with numerous parameters. Several studies have analyzed over-parameterized linear models with high-dimensional data that may not be sparse; however, existing results depend on the samples being independent. In this study, we analyze a linear regression model with dependent time series data in an over-parameterized setting. We consider an estimator based on interpolation and develop a theory for the excess risk of this estimator under multiple dependence types. This theory can treat infinite-dimensional data without sparsity and handle long-memory processes in a unified manner. Moreover, we bound the risk via the integrated covariance and the nondegeneracy of autocorrelation matrices. The results show that the convergence rate of the risk with short-memory processes is identical to that of the independent-data case, while long-memory processes slow the convergence rate. We also present several examples of specific dependent processes to which our setting applies.
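A minimal sketch of the interpolating estimator with dependent rows, where an AR(1) process stands in for a short-memory time series; the dimensions, coefficient, and ground-truth parameter are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500                          # over-parameterized: p >> n

# Dependent covariates: the design rows follow an AR(1) process over time.
phi = 0.7
X = np.zeros((n, p))
X[0] = rng.normal(size=p)
for t in range(1, n):
    X[t] = phi * X[t - 1] + np.sqrt(1 - phi**2) * rng.normal(size=p)

theta = np.zeros(p); theta[:10] = 1.0    # ground truth (sparsity is not required)
y = X @ theta + 0.1 * rng.normal(size=n)

# Minimum-norm interpolator: theta_hat = X^T (X X^T)^{-1} y fits the data exactly.
theta_hat = X.T @ np.linalg.solve(X @ X.T, y)
print(np.abs(X @ theta_hat - y).max())   # ~0: exact interpolation
X_test = rng.normal(size=(1000, p))
print(((X_test @ (theta_hat - theta)) ** 2).mean())  # excess-risk proxy
```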

Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy but only have access to a dataset generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample upper bound on the policy gradient estimation error, governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of the FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal via a matching Cramér-Rao lower bound. Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization, using either softmax tabular or ReLU policy networks. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques.
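A heavily simplified sketch of the fit-then-differentiate structure on a toy tabular MDP. For brevity it evaluates $Q^\pi$ from the true model rather than fitting a Bellman-complete value class from off-policy data, and it weights states uniformly; all names and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transitions, shape (nS, nA, nS)
R = rng.random((nS, nA))                        # reward table
theta = np.zeros((nS, nA))                      # softmax policy parameters

def pi(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def q_pi(theta):
    """Policy evaluation: solve the Bellman equation for Q^pi by iteration.
    FPG would instead fit this from the behavior-policy dataset."""
    p, Q = pi(theta), np.zeros((nS, nA))
    for _ in range(300):
        V = (p * Q).sum(axis=1)
        Q = R + gamma * P @ V
    return Q

# Plug the fitted Q into the softmax policy gradient (advantage form).
for _ in range(200):
    p, Q = pi(theta), q_pi(theta)
    V = (p * Q).sum(axis=1, keepdims=True)
    theta += 0.5 * p * (Q - V)

print((pi(theta) * q_pi(theta)).sum(axis=1))    # per-state value of learned policy
```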

The adaptive processing of structured data is a long-standing research topic in machine learning that investigates how to automatically learn a mapping from a structured input to outputs of various nature. Recently, there has been an increasing interest in the adaptive processing of graphs, which led to the development of different neural network-based methodologies. In this thesis, we take a different route and develop a Bayesian Deep Learning framework for graph learning. The dissertation begins with a review of the principles over which most of the methods in the field are built, followed by a study on graph classification reproducibility issues. We then proceed to bridge the basic ideas of deep learning for graphs with the Bayesian world, by building our deep architectures in an incremental fashion. This framework allows us to consider graphs with discrete and continuous edge features, producing unsupervised embeddings rich enough to reach the state of the art on several classification tasks. Our approach is also amenable to a Bayesian nonparametric extension that automates the choice of almost all of the model's hyper-parameters. Two real-world applications demonstrate the efficacy of deep learning for graphs. The first concerns the prediction of information-theoretic quantities for molecular simulations with supervised neural models. After that, we exploit our Bayesian models to solve a malware-classification task while being robust to intra-procedural code obfuscation techniques. We conclude the dissertation with an attempt to blend the best of the neural and Bayesian worlds together. The resulting hybrid model is able to predict multimodal distributions conditioned on input graphs, with the consequent ability to model stochasticity and uncertainty better than most works. Overall, we aim to provide a Bayesian perspective into the articulated research field of deep learning for graphs.

This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
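The criticality phenomenon can be checked directly by propagating a random signal through a deep tanh network at different weight variances; for tanh with zero bias variance, $C_W = 1$ is the critical initialization separating the vanishing and saturating phases (width and depth here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 1000, 50

def signal_norm(cw):
    """Mean-squared activation after `depth` layers of a random tanh network
    initialized with weight variance cw / width and zero biases."""
    x = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(0, np.sqrt(cw / width), (width, width))
        x = np.tanh(W @ x)
    return (x**2).mean()

# cw < 1: signals vanish; cw > 1: signals grow to a saturated fixed point;
# cw = 1 (criticality) keeps them at a slowly decaying, order-one scale.
for cw in [0.5, 1.0, 2.0]:
    print(cw, signal_norm(cw))
```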

We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and how specific aspects of an architecture affect the behavior of a learning task. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.

This paper focuses on the expected difference in a borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence their estimation error can be substantial. We therefore propose an alternative approach to constructing the estimators so that this error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare how well the classical and proposed estimators recover the causal quantities. The comparison is carried out across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, degrees of nonlinearity, and distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.
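To see why ignoring confounding hurts, the following simulation compares a naive difference in means against a simple regression adjustment. This is one standard correction shown on fully synthetic data; the paper's estimators are more general:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Simulated lending data with a confounder: creditworthiness c drives both
# the credit decision t (e.g., a limit increase) and the repayment y.
c = rng.normal(size=n)
t = (c + rng.normal(size=n) > 0).astype(float)   # lender favors good borrowers
y = 1.0 * t + 2.0 * c + rng.normal(size=n)       # true treatment effect = 1.0

naive = y[t == 1].mean() - y[t == 0].mean()      # classical estimator, biased upward

# Regression adjustment: include the confounder in the design matrix.
X = np.column_stack([np.ones(n), t, c])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"naive: {naive:.2f}, adjusted: {beta[1]:.2f}, truth: 1.00")
```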

Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph-structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
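The joint-learning idea can be sketched with a single-level relaxation that trains Bernoulli edge logits together with GCN weights; the paper instead solves a bilevel program over a discrete edge distribution, and the task, sizes, and noise-based relaxation below are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 20, 8, 2                       # nodes, features, classes

X = torch.randn(n, d)
y = (X[:, 0] > 0).long()                 # toy node labels

edge_logits = nn.Parameter(torch.zeros(n, n))   # Bernoulli parameters per edge
gcn_w = nn.Parameter(0.1 * torch.randn(d, k))   # one-layer GCN weights
opt = torch.optim.Adam([edge_logits, gcn_w], lr=1e-2)

for _ in range(500):
    # Relaxed edge sampling so gradients flow through the graph structure.
    A = torch.sigmoid(edge_logits + 0.1 * torch.randn(n, n))
    A = (A + A.T) / 2                            # keep the graph symmetric
    deg = A.sum(dim=1, keepdim=True) + 1e-6
    logits = (A / deg) @ X @ gcn_w               # mean aggregation + linear layer
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
```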
