
Despite the empirical success of deep learning, we still lack a theoretical understanding of why randomly initialized neural networks trained by first-order optimization methods can achieve zero training loss, even though the loss landscape is non-convex and non-smooth. Recently, several works have demystified this phenomenon in the over-parameterized regime. In this work, we make further progress in this area by considering a commonly used momentum algorithm: Nesterov's accelerated gradient method (NAG). We analyze the convergence of NAG for a two-layer fully connected neural network with ReLU activation. Specifically, we prove that the error of NAG converges to zero at a linear rate $1-\Theta(1/\sqrt{\kappa})$, where $\kappa > 1$ is determined by the initialization and the architecture of the neural network. Compared with the rate $1-\Theta(1/\kappa)$ of gradient descent, NAG achieves an acceleration. Our analysis also shows that NAG and the heavy-ball method achieve similar convergence rates.
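To make the analyzed iteration concrete, here is a minimal NumPy sketch of NAG on the hidden layer of a two-layer ReLU network with squared loss. Fixing the output layer at random signs and using a constant momentum value are illustrative assumptions standing in for the paper's exact, $\kappa$-dependent choices.

```python
import numpy as np

def nag_two_layer_relu(X, y, m=512, lr=1e-2, momentum=0.9, steps=1000, seed=0):
    """Train the hidden layer of f(x) = a^T relu(W x) / sqrt(m) with Nesterov's
    accelerated gradient on the squared loss. The output layer a is fixed at
    random signs, a common setup in this line of analysis; the constant
    momentum 0.9 is a placeholder for the paper's kappa-dependent choice."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))            # Gaussian random initialization
    a = rng.choice([-1.0, 1.0], size=m)        # fixed output weights
    V = W.copy()                               # look-ahead (extrapolated) point
    for _ in range(steps):
        pre = V @ X.T                          # pre-activations at look-ahead, (m, n)
        err = (a @ np.maximum(pre, 0.0)) / np.sqrt(m) - y
        # gradient of 0.5 * sum_i (f(x_i) - y_i)^2 w.r.t. W, evaluated at V
        G = ((a[:, None] * (pre > 0)) * err[None, :]) @ X / np.sqrt(m)
        W_new = V - lr * G                     # gradient step from the look-ahead point
        V = W_new + momentum * (W_new - W)     # Nesterov extrapolation
        W = W_new
    return W, a
```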

Related Content

Neural Networks is the archival journal of the world's three oldest neural modeling societies: the International Neural Network Society (INNS), the European Neural Network Society (ENNS), and the Japanese Neural Network Society (JNNS). Neural Networks provides a forum for developing and nurturing an international community of scholars and practitioners interested in all aspects of neural networks and related approaches to computational intelligence. The journal welcomes submissions of high-quality papers that contribute to the full range of neural network research, from behavioral and brain modeling and learning algorithms, through mathematical and computational analysis, to systems engineering and technological applications that make substantial use of neural network concepts and techniques. This uniquely broad scope facilitates the exchange of ideas between biological and technological research and helps foster the development of an interdisciplinary community interested in biologically inspired computational intelligence. Accordingly, the Neural Networks editorial board represents expertise in psychology, neurobiology, computer science, engineering, mathematics, and physics. The journal publishes articles, letters, and reviews, as well as letters to the editor, editorials, current events, software surveys, and patent information. Articles appear in one of five sections: cognitive science, neuroscience, learning systems, mathematics and computational analysis, and engineering and applications. Official website:

Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization algorithms are rapidly becoming a focal point. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In this paper, we introduce novel convergent decentralized adaptive gradient methods and rigorously incorporate adaptive gradient methods into decentralized training procedures. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed algorithmic framework and show that if a given adaptive gradient method converges, under some specific conditions, then its decentralized counterpart is also convergent. We illustrate the benefit of our generic decentralized framework on a prototype method, i.e., AMSGrad, both theoretically and numerically.
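As a rough illustration of the convert-to-decentralized idea, the sketch below applies one gossip (neighbor-averaging) round followed by a local AMSGrad-style adaptive step at each node. The function name, the synchronous update, and the constant step size are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def decentralized_amsgrad_step(params, grads, states, mix_W, lr=1e-3,
                               beta1=0.9, beta2=0.99, eps=1e-8):
    """One synchronous round of a gossip-style decentralized AMSGrad sketch.
    params: list of parameter vectors, one per node; grads: local stochastic
    gradients; states: per-node (m, v, v_hat) moment estimates;
    mix_W: doubly-stochastic mixing matrix encoding the communication graph."""
    stacked = np.stack(params)                        # (num_nodes, d)
    mixed = mix_W @ stacked                           # gossip step: average with neighbors
    new_params, new_states = [], []
    for i in range(len(params)):
        m, v, v_hat = states[i]
        m = beta1 * m + (1 - beta1) * grads[i]        # first-moment estimate
        v = beta2 * v + (1 - beta2) * grads[i] ** 2   # second-moment estimate
        v_hat = np.maximum(v_hat, v)                  # AMSGrad max-trick: non-increasing steps
        new_params.append(mixed[i] - lr * m / (np.sqrt(v_hat) + eps))
        new_states.append((m, v, v_hat))
    return new_params, new_states
```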

We study the problem of learning the structure of an optimal Bayesian network when additional constraints are posed on the network or on its moralized graph. More precisely, we consider the constraint that the network or its moralized graph is close, in terms of vertex or edge deletions, to a sparse graph class $\Pi$. For example, we show that an optimal network whose moralized graph has vertex deletion distance at most $k$ from a graph with maximum degree 1 can be computed in polynomial time when $k$ is constant. This extends previous work that gave an algorithm with such a running time for the vertex deletion distance to edgeless graphs [Korhonen & Parviainen, NIPS 2015]. We then show that further extensions or improvements are presumably impossible. For example, we show that learning optimal networks where the network or its moralized graph have maximum degree $2$ or connected components of size at most $c$, $c\ge 3$, is NP-hard. Finally, we show that learning an optimal network with at most $k$ edges in the moralized graph presumably has no $f(k)\cdot |I|^{O(1)}$-time algorithm and that, in contrast, an optimal network with at most $k$ arcs can be computed in $2^{O(k)}\cdot |I|^{O(1)}$ time, where $|I|$ is the total input size.

In recent years, physics-informed neural networks (PINNs) have been shown empirically to be a powerful tool for solving PDEs. However, a numerical analysis of PINNs is still missing. In this paper, we prove a convergence rate for PINNs on second-order elliptic equations with Dirichlet boundary conditions, by establishing upper bounds on the number of training samples and on the depth and width of the deep neural networks needed to achieve a desired accuracy. The error of PINNs is decomposed into an approximation error and a statistical error, where the approximation error is given in the $C^2$ norm with $\mathrm{ReLU}^{3}$ networks and the statistical error is estimated via Rademacher complexity. We derive a bound on the Rademacher complexity of the non-Lipschitz composition of the gradient norm with a $\mathrm{ReLU}^{3}$ network, which is of independent interest.
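The setting can be made concrete with a minimal PyTorch sketch of a PINN for a 1-D Poisson problem with Dirichlet boundary conditions, using the $\mathrm{ReLU}^3$ activation from the analysis (chosen because $\mathrm{ReLU}^3$ is twice continuously differentiable, so the PDE residual is well defined). The architecture, sample sizes, source term, and equal loss weighting are illustrative choices, not the paper's prescriptions.

```python
import torch

# Solve -u'' = f on (0,1) with u(0) = u(1) = 0; here f is manufactured so that
# the true solution is u*(x) = sin(pi x).
torch.manual_seed(0)
lin1, lin2 = torch.nn.Linear(1, 64), torch.nn.Linear(64, 1)

def u(x):
    return lin2(torch.relu(lin1(x)) ** 3)     # ReLU^3 activation, C^2-smooth

f = lambda x: (torch.pi ** 2) * torch.sin(torch.pi * x)

opt = torch.optim.Adam(list(lin1.parameters()) + list(lin2.parameters()), lr=1e-3)
for step in range(2000):
    x = torch.rand(128, 1, requires_grad=True)              # interior training samples
    ux = torch.autograd.grad(u(x).sum(), x, create_graph=True)[0]
    uxx = torch.autograd.grad(ux.sum(), x, create_graph=True)[0]
    interior = (-uxx - f(x)).pow(2).mean()                  # PDE residual term
    boundary = u(torch.tensor([[0.0], [1.0]])).pow(2).mean()  # Dirichlet penalty
    loss = interior + boundary
    opt.zero_grad(); loss.backward(); opt.step()
```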

Federated learning (FL) has emerged as a prominent distributed learning paradigm. FL entails some pressing needs for novel parameter estimation approaches with theoretical guarantees of convergence that are also communication efficient, differentially private, and Byzantine resilient in heterogeneous data distribution settings. Quantization-based SGD solvers have been widely adopted in FL, and the recently proposed SIGNSGD with majority vote shows a promising direction. However, no existing method enjoys all the aforementioned properties. In this paper, we propose an intuitively simple yet theoretically sound method based on SIGNSGD to bridge the gap. We present Stochastic-Sign SGD, which utilizes novel stochastic-sign-based gradient compressors that enable the aforementioned properties in a unified framework. We also present an error-feedback variant of the proposed Stochastic-Sign SGD, which further improves the learning performance in FL. We test the proposed method with extensive experiments using deep neural networks on the MNIST and CIFAR-10 datasets. The experimental results corroborate the effectiveness of the proposed method.
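A minimal sketch of the stochastic-sign idea: each coordinate is quantized to $\pm 1$ with a probability that makes the transmitted sign an unbiased (scaled) estimate of the gradient coordinate, and the server aggregates by majority vote. The magnitude bound `B` and the function names are illustrative assumptions.

```python
import numpy as np

def stochastic_sign(g, B, rng):
    """Stochastic-sign compressor: send +1 with probability (1 + g_i/B)/2,
    else -1, so E[sign_i] = g_i / B (unbiased up to the known scale B).
    Assumes |g_i| <= B; values outside the range are clipped."""
    p = 0.5 * (1.0 + np.clip(g / B, -1.0, 1.0))
    return np.where(rng.random(g.shape) < p, 1.0, -1.0)

def majority_vote(signs):
    """Server aggregation: coordinate-wise majority vote over one sign
    vector per worker (ties map to 0)."""
    return np.sign(np.sum(signs, axis=0))
```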

We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent and $\mathcal{O}(1/t)$ for normalized gradient descent. The momentum-based method is derived via the convex dual of the maximum-margin problem, specifically by applying Nesterov acceleration to this dual, which yields a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive non-uniform sampling via the dual variables.
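The contrast can be observed with a generic momentum stand-in (not the paper's dual-derived method): run heavy-ball momentum on the logistic loss over separable data and track the normalized margin, comparing against plain gradient descent by setting `beta = 0`.

```python
import numpy as np

def margin(w, X, y):
    """Normalized classification margin: min_i y_i <w, x_i> / ||w||."""
    return np.min(y * (X @ w)) / (np.linalg.norm(w) + 1e-12)

def momentum_logistic(X, y, lr=1.0, beta=0.9, steps=5000):
    """Heavy-ball momentum on the mean logistic loss over separable data;
    an illustrative stand-in used only to watch the margin grow faster
    with momentum than with plain gradient descent (beta = 0)."""
    n, d = X.shape
    w, v = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        z = np.clip(-y * (X @ w), -30.0, 30.0)    # clip for a stable sigmoid
        sig = 1.0 / (1.0 + np.exp(-z))
        g = (X.T @ (-y * sig)) / n                # gradient of mean log(1 + exp(-y x^T w))
        v = beta * v - lr * g                     # heavy-ball velocity
        w = w + v
    return w
```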

Graph Neural Networks (GNNs) have been studied from the lens of expressive power and generalization. However, their optimization properties are less well understood. We take the first step towards analyzing GNN training by studying the gradient dynamics of GNNs. First, we analyze linearized GNNs and prove that despite the non-convexity of training, convergence to a global minimum at a linear rate is guaranteed under mild assumptions that we validate on real-world graphs. Second, we study what may affect the GNNs' training speed. Our results show that the training of GNNs is implicitly accelerated by skip connections, more depth, and/or a good label distribution. Empirical results confirm that our theoretical results for linearized GNNs align with the training behavior of nonlinear GNNs. Our results provide the first theoretical support for the success of GNNs with skip connections in terms of optimization, and suggest that deep GNNs with skip connections would be promising in practice.
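For intuition, a linearized GNN in this sense drops the nonlinearity, so a forward pass with skip connections is just repeated propagation plus an identity term, as in the sketch below; the specific skip form and the square weight shapes are illustrative assumptions.

```python
import numpy as np

def linearized_gnn_forward(A_hat, X, weights):
    """Forward pass of a linearized GNN with skip connections:
    H_{l+1} = A_hat @ H_l @ W_l + H_l. A_hat is the normalized adjacency
    (with self-loops); each W_l must be square so the skip term type-checks.
    The additive identity term is one of the accelerating factors
    identified in the analysis above."""
    H = X
    for W in weights:
        H = A_hat @ H @ W + H   # propagation step + skip connection
    return H
```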

Normalization is known to help the optimization of deep neural networks. Curiously, different architectures require specialized normalization methods. In this paper, we study what normalization is effective for Graph Neural Networks (GNNs). First, we adapt and evaluate existing methods from other domains on GNNs. Faster convergence is achieved with InstanceNorm compared to BatchNorm and LayerNorm. We provide an explanation by showing that InstanceNorm serves as a preconditioner for GNNs, whereas this preconditioning effect is weaker with BatchNorm due to heavy batch noise in graph datasets. Second, we show that the shift operation in InstanceNorm degrades the expressiveness of GNNs on highly regular graphs. We address this issue by proposing GraphNorm with a learnable shift. Empirically, GNNs with GraphNorm converge faster than GNNs using other normalization methods. GraphNorm also improves the generalization of GNNs, achieving better performance on graph classification benchmarks.
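A minimal sketch of the GraphNorm computation with a learnable shift $\alpha$: features are normalized over the nodes of each graph, but only an $\alpha$-fraction of the mean is subtracted, so $\alpha = 1$ recovers InstanceNorm while a learned $\alpha < 1$ preserves mean information on highly regular graphs. The plain-NumPy, forward-only form and parameter shapes are illustrative.

```python
import numpy as np

class GraphNorm:
    """GraphNorm sketch: per-graph normalization with a learnable shift alpha.
    Parameters would be trained by backprop in practice; here they are plain
    arrays so the forward computation is easy to read."""
    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)    # scale
        self.beta = np.zeros(dim)    # bias
        self.alpha = np.ones(dim)    # learnable shift on the mean
        self.eps = eps

    def __call__(self, H):
        # H: (num_nodes, dim) node features of a single graph
        mu = H.mean(axis=0)
        shifted = H - self.alpha * mu                       # subtract alpha-fraction of the mean
        sigma = np.sqrt((shifted ** 2).mean(axis=0) + self.eps)
        return self.gamma * shifted / sigma + self.beta
```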

Sampling methods (e.g., node-wise, layer-wise, or subgraph sampling) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in the estimated stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, and that mitigating both types of variance is necessary for a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and that explicitly reduces the variance introduced by embedding approximation. We show, theoretically and empirically, that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and better generalization than existing methods.
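The backward-stage idea can be illustrated with generic importance sampling: choosing node probabilities proportional to (estimates of) per-node gradient norms reduces the variance of the inverse-propensity-weighted gradient estimator. This is a simplified stand-in for the paper's decoupled scheme, which additionally reduces the embedding approximation variance.

```python
import numpy as np

def adaptive_sampling_probs(grad_norm_estimates):
    """Sampling probabilities proportional to per-node gradient-norm estimates;
    for an estimator of the form 1{i sampled} * g_i / p_i, this choice
    minimizes variance among probability vectors."""
    p = grad_norm_estimates / grad_norm_estimates.sum()
    return np.maximum(p, 1e-12)

def unbiased_gradient_estimate(grads, probs, rng, batch_size):
    """Inverse-propensity-weighted mini-batch gradient: E[g_i / (n p_i)] over
    i ~ p equals the full-batch mean gradient, so the estimate is unbiased."""
    n = len(grads)
    idx = rng.choice(n, size=batch_size, replace=True, p=probs)
    return np.mean([grads[i] / (n * probs[i]) for i in idx], axis=0)
```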

We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning and pave the way to studying the optimization dynamics of training modern deep neural networks.

Policy gradient methods are widely used in reinforcement learning to search for better policies in a parameterized policy space. They perform gradient search in the policy space and are known to converge very slowly. Nesterov developed an accelerated gradient method for convex optimization problems, which has recently been extended to non-convex and stochastic optimization. We use Nesterov's acceleration for the policy gradient search in the well-known actor-critic algorithm and prove convergence using the ODE method. We tested this algorithm on a scheduling problem in which an incoming job is assigned to one of four queues based on the queue lengths. Experimental results show that the algorithm with Nesterov's acceleration performs significantly better than the one without. To the best of our knowledge, this is the first time Nesterov's acceleration has been used with an actor-critic algorithm.
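A sketch of the accelerated actor step under the usual Nesterov pattern: estimate the policy gradient at a look-ahead point, then take the momentum step. The callable `grad_at`, the constant step size, and the constant momentum are illustrative; the actual actor-critic analysis uses decreasing two-timescale step sizes.

```python
import numpy as np

def nesterov_actor_update(theta, v, grad_at, lr=1e-3, momentum=0.9):
    """One Nesterov-accelerated actor step. grad_at is a callable returning a
    critic-based policy-gradient estimate at a given parameter vector; the
    update performs gradient *ascent* on the expected return."""
    lookahead = theta + momentum * v        # extrapolated point
    g = grad_at(lookahead)                  # policy gradient at the look-ahead point
    v = momentum * v + lr * g               # accumulate velocity
    theta = theta + v
    return theta, v
```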
