
Despite the empirical success of deep learning, we still lack a theoretical understanding of why randomly initialized neural networks trained by first-order optimization methods can achieve zero training loss, even though the loss landscape is non-convex and non-smooth. Recently, several works have demystified this phenomenon in the over-parameterized regime. In this work, we make further progress in this area by considering a commonly used momentum algorithm: Nesterov's accelerated gradient method (NAG). We analyze the convergence of NAG for a two-layer fully connected neural network with ReLU activation. Specifically, we prove that the error of NAG converges to zero at a linear rate $1-\Theta(1/\sqrt{\kappa})$, where $\kappa > 1$ is determined by the initialization and the architecture of the neural network. Compared with the rate $1-\Theta(1/\kappa)$ of gradient descent, NAG achieves an acceleration. Our analysis also shows that NAG and the heavy-ball method achieve similar convergence rates.
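To make the analyzed iteration concrete, here is a minimal NumPy sketch of NAG on the hidden layer of a two-layer ReLU network with squared loss. Fixing the output layer at random signs and using a constant momentum value are illustrative assumptions standing in for the paper's exact, $\kappa$-dependent choices.

```python
import numpy as np

def nag_two_layer_relu(X, y, m=512, lr=1e-2, momentum=0.9, steps=1000, seed=0):
    """Train the hidden layer of f(x) = a^T relu(W x) / sqrt(m) with Nesterov's
    accelerated gradient on the squared loss. The output layer a is fixed at
    random signs, a common setup in this line of analysis; the constant
    momentum 0.9 is a placeholder for the paper's kappa-dependent choice."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))            # Gaussian random initialization
    a = rng.choice([-1.0, 1.0], size=m)        # fixed output weights
    V = W.copy()                               # look-ahead (extrapolated) point
    for _ in range(steps):
        pre = V @ X.T                          # pre-activations at look-ahead, (m, n)
        err = (a @ np.maximum(pre, 0.0)) / np.sqrt(m) - y
        # gradient of 0.5 * sum_i (f(x_i) - y_i)^2 w.r.t. W, evaluated at V
        G = ((a[:, None] * (pre > 0)) * err[None, :]) @ X / np.sqrt(m)
        W_new = V - lr * G                     # gradient step from the look-ahead point
        V = W_new + momentum * (W_new - W)     # Nesterov extrapolation
        W = W_new
    return W, a
```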

Related Content

Neural Networks is the archival journal of the world's three oldest neural modeling societies: the International Neural Network Society (INNS), the European Neural Network Society (ENNS), and the Japanese Neural Network Society (JNNS). Neural Networks provides a forum for developing and nurturing an international community of scholars and practitioners interested in all aspects of neural networks and related approaches to computational intelligence. The journal welcomes submissions of high-quality papers that contribute to the full range of neural network research, from behavioral and brain modeling and learning algorithms, through mathematical and computational analysis, to systems engineering and technological applications that make substantial use of neural network concepts and techniques. This uniquely broad scope facilitates the exchange of ideas between biological and technological research and helps foster the development of an interdisciplinary community interested in biologically inspired computational intelligence. Accordingly, the Neural Networks editorial board represents expertise in psychology, neurobiology, computer science, engineering, mathematics, and physics. The journal publishes articles, letters, and reviews, as well as letters to the editor, editorials, current events, software surveys, and patent information. Articles appear in one of five sections: cognitive science, neuroscience, learning systems, mathematics and computational analysis, and engineering and applications. Official website:

Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization algorithms are rapidly becoming a focal point. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In this paper, we introduce novel convergent decentralized adaptive gradient methods and rigorously incorporate adaptive gradient methods into decentralized training procedures. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed algorithmic framework and show that if a given adaptive gradient method converges, under some specific conditions, then its decentralized counterpart is also convergent. We illustrate the benefit of our generic decentralized framework on a prototype method, i.e., AMSGrad, both theoretically and numerically.
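As a rough illustration of the convert-to-decentralized idea, the sketch below applies one gossip (neighbor-averaging) round followed by a local AMSGrad-style adaptive step at each node. The function name, the synchronous update, and the constant step size are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def decentralized_amsgrad_step(params, grads, states, mix_W, lr=1e-3,
                               beta1=0.9, beta2=0.99, eps=1e-8):
    """One synchronous round of a gossip-style decentralized AMSGrad sketch.
    params: list of parameter vectors, one per node; grads: local stochastic
    gradients; states: per-node (m, v, v_hat) moment estimates;
    mix_W: doubly-stochastic mixing matrix encoding the communication graph."""
    stacked = np.stack(params)                        # (num_nodes, d)
    mixed = mix_W @ stacked                           # gossip step: average with neighbors
    new_params, new_states = [], []
    for i in range(len(params)):
        m, v, v_hat = states[i]
        m = beta1 * m + (1 - beta1) * grads[i]        # first-moment estimate
        v = beta2 * v + (1 - beta2) * grads[i] ** 2   # second-moment estimate
        v_hat = np.maximum(v_hat, v)                  # AMSGrad max-trick: non-increasing steps
        new_params.append(mixed[i] - lr * m / (np.sqrt(v_hat) + eps))
        new_states.append((m, v, v_hat))
    return new_params, new_states
```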

We study the problem of learning the structure of an optimal Bayesian network when additional constraints are posed on the network or on its moralized graph. More precisely, we consider the constraint that the network or its moralized graph is close, in terms of vertex or edge deletions, to a sparse graph class $\Pi$. For example, we show that an optimal network whose moralized graph has vertex deletion distance at most $k$ from a graph with maximum degree 1 can be computed in polynomial time when $k$ is constant. This extends previous work that gave an algorithm with such a running time for the vertex deletion distance to edgeless graphs [Korhonen & Parviainen, NIPS 2015]. We then show that further extensions or improvements are presumably impossible. For example, we show that learning optimal networks where the network or its moralized graph have maximum degree $2$ or connected components of size at most $c$, $c\ge 3$, is NP-hard. Finally, we show that learning an optimal network with at most $k$ edges in the moralized graph presumably has no $f(k)\cdot |I|^{O(1)}$-time algorithm and that, in contrast, an optimal network with at most $k$ arcs can be computed in $2^{O(k)}\cdot |I|^{O(1)}$ time, where $|I|$ is the total input size.

In recent years, physics-informed neural networks (PINNs) have been shown empirically to be a powerful tool for solving PDEs. However, a numerical analysis of PINNs is still missing. In this paper, we prove a convergence rate for PINNs on second-order elliptic equations with Dirichlet boundary conditions, by establishing upper bounds on the number of training samples and on the depth and width of the deep neural networks needed to achieve a desired accuracy. The error of PINNs is decomposed into an approximation error and a statistical error, where the approximation error is given in the $C^2$ norm with $\mathrm{ReLU}^{3}$ networks and the statistical error is estimated via Rademacher complexity. We derive a bound on the Rademacher complexity of the non-Lipschitz composition of the gradient norm with a $\mathrm{ReLU}^{3}$ network, which is of independent interest.
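The setting can be made concrete with a minimal PyTorch sketch of a PINN for a 1-D Poisson problem with Dirichlet boundary conditions, using the $\mathrm{ReLU}^3$ activation from the analysis (chosen because $\mathrm{ReLU}^3$ is twice continuously differentiable, so the PDE residual is well defined). The architecture, sample sizes, source term, and equal loss weighting are illustrative choices, not the paper's prescriptions.

```python
import torch

# Solve -u'' = f on (0,1) with u(0) = u(1) = 0; here f is manufactured so that
# the true solution is u*(x) = sin(pi x).
torch.manual_seed(0)
lin1, lin2 = torch.nn.Linear(1, 64), torch.nn.Linear(64, 1)

def u(x):
    return lin2(torch.relu(lin1(x)) ** 3)     # ReLU^3 activation, C^2-smooth

f = lambda x: (torch.pi ** 2) * torch.sin(torch.pi * x)

opt = torch.optim.Adam(list(lin1.parameters()) + list(lin2.parameters()), lr=1e-3)
for step in range(2000):
    x = torch.rand(128, 1, requires_grad=True)              # interior training samples
    ux = torch.autograd.grad(u(x).sum(), x, create_graph=True)[0]
    uxx = torch.autograd.grad(ux.sum(), x, create_graph=True)[0]
    interior = (-uxx - f(x)).pow(2).mean()                  # PDE residual term
    boundary = u(torch.tensor([[0.0], [1.0]])).pow(2).mean()  # Dirichlet penalty
    loss = interior + boundary
    opt.zero_grad(); loss.backward(); opt.step()
```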

Federated learning (FL) has emerged as a prominent distributed learning paradigm. FL entails some pressing needs for novel parameter estimation approaches with theoretical guarantees of convergence that are also communication efficient, differentially private, and Byzantine resilient in heterogeneous data distribution settings. Quantization-based SGD solvers have been widely adopted in FL, and the recently proposed SIGNSGD with majority vote shows a promising direction. However, no existing method enjoys all the aforementioned properties. In this paper, we propose an intuitively simple yet theoretically sound method based on SIGNSGD to bridge the gap. We present Stochastic-Sign SGD, which utilizes novel stochastic-sign-based gradient compressors that enable the aforementioned properties in a unified framework. We also present an error-feedback variant of the proposed Stochastic-Sign SGD, which further improves the learning performance in FL. We test the proposed method with extensive experiments using deep neural networks on the MNIST and CIFAR-10 datasets. The experimental results corroborate the effectiveness of the proposed method.
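A minimal sketch of the stochastic-sign idea: each coordinate is quantized to $\pm 1$ with a probability that makes the transmitted sign an unbiased (scaled) estimate of the gradient coordinate, and the server aggregates by majority vote. The magnitude bound `B` and the function names are illustrative assumptions.

```python
import numpy as np

def stochastic_sign(g, B, rng):
    """Stochastic-sign compressor: send +1 with probability (1 + g_i/B)/2,
    else -1, so E[sign_i] = g_i / B (unbiased up to the known scale B).
    Assumes |g_i| <= B; values outside the range are clipped."""
    p = 0.5 * (1.0 + np.clip(g / B, -1.0, 1.0))
    return np.where(rng.random(g.shape) < p, 1.0, -1.0)

def majority_vote(signs):
    """Server aggregation: coordinate-wise majority vote over one sign
    vector per worker (ties map to 0)."""
    return np.sign(np.sum(signs, axis=0))
```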

We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent and $\mathcal{O}(1/t)$ for normalized gradient descent. The momentum-based method is derived via the convex dual of the maximum-margin problem, specifically by applying Nesterov acceleration to this dual, which yields a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive non-uniform sampling via the dual variables.
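The contrast can be observed with a generic momentum stand-in (not the paper's dual-derived method): run heavy-ball momentum on the logistic loss over separable data and track the normalized margin, comparing against plain gradient descent by setting `beta = 0`.

```python
import numpy as np

def margin(w, X, y):
    """Normalized classification margin: min_i y_i <w, x_i> / ||w||."""
    return np.min(y * (X @ w)) / (np.linalg.norm(w) + 1e-12)

def momentum_logistic(X, y, lr=1.0, beta=0.9, steps=5000):
    """Heavy-ball momentum on the mean logistic loss over separable data;
    an illustrative stand-in used only to watch the margin grow faster
    with momentum than with plain gradient descent (beta = 0)."""
    n, d = X.shape
    w, v = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        z = np.clip(-y * (X @ w), -30.0, 30.0)    # clip for a stable sigmoid
        sig = 1.0 / (1.0 + np.exp(-z))
        g = (X.T @ (-y * sig)) / n                # gradient of mean log(1 + exp(-y x^T w))
        v = beta * v - lr * g                     # heavy-ball velocity
        w = w + v
    return w
```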

Graph Neural Networks (GNNs) have been studied from the lens of expressive power and generalization. However, their optimization properties are less well understood. We take the first step towards analyzing GNN training by studying the gradient dynamics of GNNs. First, we analyze linearized GNNs and prove that despite the non-convexity of training, convergence to a global minimum at a linear rate is guaranteed under mild assumptions that we validate on real-world graphs. Second, we study what may affect the GNNs' training speed. Our results show that the training of GNNs is implicitly accelerated by skip connections, more depth, and/or a good label distribution. Empirical results confirm that our theoretical results for linearized GNNs align with the training behavior of nonlinear GNNs. Our results provide the first theoretical support for the success of GNNs with skip connections in terms of optimization, and suggest that deep GNNs with skip connections would be promising in practice.
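For intuition, a linearized GNN in this sense drops the nonlinearity, so a forward pass with skip connections is just repeated propagation plus an identity term, as in the sketch below; the specific skip form and the square weight shapes are illustrative assumptions.

```python
import numpy as np

def linearized_gnn_forward(A_hat, X, weights):
    """Forward pass of a linearized GNN with skip connections:
    H_{l+1} = A_hat @ H_l @ W_l + H_l. A_hat is the normalized adjacency
    (with self-loops); each W_l must be square so the skip term type-checks.
    The additive identity term is one of the accelerating factors
    identified in the analysis above."""
    H = X
    for W in weights:
        H = A_hat @ H @ W + H   # propagation step + skip connection
    return H
```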

Normalization is known to help the optimization of deep neural networks. Curiously, different architectures require specialized normalization methods. In this paper, we study what normalization is effective for Graph Neural Networks (GNNs). First, we adapt and evaluate existing methods from other domains on GNNs. Faster convergence is achieved with InstanceNorm compared to BatchNorm and LayerNorm. We provide an explanation by showing that InstanceNorm serves as a preconditioner for GNNs, whereas this preconditioning effect is weaker with BatchNorm due to heavy batch noise in graph datasets. Second, we show that the shift operation in InstanceNorm degrades the expressiveness of GNNs on highly regular graphs. We address this issue by proposing GraphNorm with a learnable shift. Empirically, GNNs with GraphNorm converge faster than GNNs using other normalization methods. GraphNorm also improves the generalization of GNNs, achieving better performance on graph classification benchmarks.
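A minimal sketch of the GraphNorm computation with a learnable shift $\alpha$: features are normalized over the nodes of each graph, but only an $\alpha$-fraction of the mean is subtracted, so $\alpha = 1$ recovers InstanceNorm while a learned $\alpha < 1$ preserves mean information on highly regular graphs. The plain-NumPy, forward-only form and parameter shapes are illustrative.

```python
import numpy as np

class GraphNorm:
    """GraphNorm sketch: per-graph normalization with a learnable shift alpha.
    Parameters would be trained by backprop in practice; here they are plain
    arrays so the forward computation is easy to read."""
    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)    # scale
        self.beta = np.zeros(dim)    # bias
        self.alpha = np.ones(dim)    # learnable shift on the mean
        self.eps = eps

    def __call__(self, H):
        # H: (num_nodes, dim) node features of a single graph
        mu = H.mean(axis=0)
        shifted = H - self.alpha * mu                       # subtract alpha-fraction of the mean
        sigma = np.sqrt((shifted ** 2).mean(axis=0) + self.eps)
        return self.gamma * shifted / sigma + self.beta
```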

Sampling methods (e.g., node-wise, layer-wise, or subgraph sampling) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in the estimated stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, and that mitigating both types of variance is necessary for a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and that explicitly reduces the variance introduced by embedding approximation. We show, theoretically and empirically, that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and better generalization than existing methods.
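The backward-stage idea can be illustrated with generic importance sampling: choosing node probabilities proportional to (estimates of) per-node gradient norms reduces the variance of the inverse-propensity-weighted gradient estimator. This is a simplified stand-in for the paper's decoupled scheme, which additionally reduces the embedding approximation variance.

```python
import numpy as np

def adaptive_sampling_probs(grad_norm_estimates):
    """Sampling probabilities proportional to per-node gradient-norm estimates;
    for an estimator of the form 1{i sampled} * g_i / p_i, this choice
    minimizes variance among probability vectors."""
    p = grad_norm_estimates / grad_norm_estimates.sum()
    return np.maximum(p, 1e-12)

def unbiased_gradient_estimate(grads, probs, rng, batch_size):
    """Inverse-propensity-weighted mini-batch gradient: E[g_i / (n p_i)] over
    i ~ p equals the full-batch mean gradient, so the estimate is unbiased."""
    n = len(grads)
    idx = rng.choice(n, size=batch_size, replace=True, p=probs)
    return np.mean([grads[i] / (n * probs[i]) for i in idx], axis=0)
```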

We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning and pave the way to studying the optimization dynamics of training modern deep neural networks.

Policy gradient methods are widely used in reinforcement learning to search for better policies in a parameterized policy space. They perform gradient search in the policy space and are known to converge very slowly. Nesterov developed an accelerated gradient method for convex optimization problems, which has recently been extended to non-convex and stochastic optimization. We use Nesterov's acceleration for the policy gradient search in the well-known actor-critic algorithm and prove convergence using the ODE method. We tested this algorithm on a scheduling problem in which an incoming job is assigned to one of four queues based on the queue lengths. Experimental results show that the algorithm with Nesterov's acceleration performs significantly better than the one without. To the best of our knowledge, this is the first time Nesterov's acceleration has been used with an actor-critic algorithm.
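A sketch of the accelerated actor step under the usual Nesterov pattern: estimate the policy gradient at a look-ahead point, then take the momentum step. The callable `grad_at`, the constant step size, and the constant momentum are illustrative; the actual actor-critic analysis uses decreasing two-timescale step sizes.

```python
import numpy as np

def nesterov_actor_update(theta, v, grad_at, lr=1e-3, momentum=0.9):
    """One Nesterov-accelerated actor step. grad_at is a callable returning a
    critic-based policy-gradient estimate at a given parameter vector; the
    update performs gradient *ascent* on the expected return."""
    lookahead = theta + momentum * v        # extrapolated point
    g = grad_at(lookahead)                  # policy gradient at the look-ahead point
    v = momentum * v + lr * g               # accumulate velocity
    theta = theta + v
    return theta, v
```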
