销魂美女一区二区三区AV_中文熟妇亚洲视频观看_精品无码一区二区三区性色_在线观看亚洲专区_日本黄色片三级视频一区二区_日韩午夜在线观看完整版_麻豆精品视频在线观看

We study non-convex subgradient flows for training two-layer ReLU neural networks from a convex geometry and duality perspective. We characterize the implicit bias of unregularized non-convex gradient flow as convex regularization of an equivalent convex model. We then show that the limit points of non-convex subgradient flows can be identified via primal-dual correspondence in this convex optimization problem. Moreover, we derive a sufficient condition on the dual variables which ensures that the stationary points of the non-convex objective are the KKT points of the convex objective, thus proving convergence of non-convex gradient flows to the global optimum. For a class of regular training data distributions such as orthogonal separable data, we show that this sufficient condition holds. Therefore, non-convex gradient flows in fact converge to optimal solutions of a convex optimization problem. We present numerical results verifying the predictions of our theory for non-convex subgradient descent.

相關內容

Neural Networks

關注 1648

神經(jing)(jing)網(wang)絡(luo)(luo)（Neural Networks）是世界上(shang)三個(ge)(ge)最古老的(de)(de)(de)神經(jing)(jing)建(jian)模(mo)學(xue)(xue)(xue)會(hui)(hui)的(de)(de)(de)檔案期刊:國(guo)際神經(jing)(jing)網(wang)絡(luo)(luo)學(xue)(xue)(xue)會(hui)(hui)(INNS)、歐(ou)洲(zhou)神經(jing)(jing)網(wang)絡(luo)(luo)學(xue)(xue)(xue)會(hui)(hui)(ENNS)和(he)日本神經(jing)(jing)網(wang)絡(luo)(luo)學(xue)(xue)(xue)會(hui)(hui)(JNNS)。神經(jing)(jing)網(wang)絡(luo)(luo)提供(gong)了一個(ge)(ge)論(lun)壇，以(yi)發展(zhan)和(he)培育一個(ge)(ge)國(guo)際社(she)會(hui)(hui)的(de)(de)(de)學(xue)(xue)(xue)者(zhe)和(he)實踐者(zhe)感(gan)興趣的(de)(de)(de)所(suo)有方面(mian)的(de)(de)(de)神經(jing)(jing)網(wang)絡(luo)(luo)和(he)相關方法的(de)(de)(de)計(ji)(ji)算(suan)(suan)(suan)智(zhi)能。神經(jing)(jing)網(wang)絡(luo)(luo)歡(huan)迎高質(zhi)量論(lun)文(wen)的(de)(de)(de)提交(jiao)，有助于全面(mian)的(de)(de)(de)神經(jing)(jing)網(wang)絡(luo)(luo)研(yan)究，從(cong)行為(wei)和(he)大(da)腦(nao)建(jian)模(mo)，學(xue)(xue)(xue)習算(suan)(suan)(suan)法，通(tong)過(guo)數(shu)學(xue)(xue)(xue)和(he)計(ji)(ji)算(suan)(suan)(suan)分(fen)析，系統的(de)(de)(de)工程和(he)技術應用，大(da)量使(shi)用神經(jing)(jing)網(wang)絡(luo)(luo)的(de)(de)(de)概念(nian)和(he)技術。這一獨(du)特而廣泛的(de)(de)(de)范圍促進了生(sheng)物(wu)(wu)和(he)技術研(yan)究之(zhi)(zhi)間(jian)的(de)(de)(de)思想交(jiao)流，并(bing)有助于促進對(dui)生(sheng)物(wu)(wu)啟(qi)發的(de)(de)(de)計(ji)(ji)算(suan)(suan)(suan)智(zhi)能感(gan)興趣的(de)(de)(de)跨學(xue)(xue)(xue)科(ke)(ke)社(she)區的(de)(de)(de)發展(zhan)。因(yin)此，神經(jing)(jing)網(wang)絡(luo)(luo)編(bian)委會(hui)(hui)代表的(de)(de)(de)專(zhuan)家領域包括心理(li)學(xue)(xue)(xue)，神經(jing)(jing)生(sheng)物(wu)(wu)學(xue)(xue)(xue)，計(ji)(ji)算(suan)(suan)(suan)機科(ke)(ke)學(xue)(xue)(xue)，工程，數(shu)學(xue)(xue)(xue)，物(wu)(wu)理(li)。該雜志發表文(wen)章、信(xin)件(jian)和(he)評論(lun)以(yi)及給編(bian)輯的(de)(de)(de)信(xin)件(jian)、社(she)論(lun)、時事、軟件(jian)調查和(he)專(zhuan)利信(xin)息。文(wen)章發表在五個(ge)(ge)部分(fen)之(zhi)(zhi)一:認知科(ke)(ke)學(xue)(xue)(xue)，神經(jing)(jing)科(ke)(ke)學(xue)(xue)(xue)，學(xue)(xue)(xue)習系統，數(shu)學(xue)(xue)(xue)和(he)計(ji)(ji)算(suan)(suan)(suan)分(fen)析、工程和(he)應用。官網(wang)地址：

線性的 · 正交 · 奇異的 · 隨機梯度下降 · 變換 ·

2021 年 12 月 9 日

Solving A System Of Linear Equations By Randomized Orthogonal Projections

Alireza Entezari,Arunava Banerjee,Leila Kalantari

Randomization has shown catalyzing effects in linear algebra with promising perspectives for tackling computational challenges in large-scale problems. For solving a system of linear equations, we demonstrate the convergence of a broad class of algorithms that at each step pick a subset of $n$ equations at random and update the iterate with the orthogonal projection to the subspace those equations represent. We identify, in this context, a specific degree-$n$ polynomial that non-linearly transforms the singular values of the system towards equalization. This transformation to singular values and the corresponding condition number then characterizes the expected convergence rate of iterations. As a consequence, our results specify the convergence rate of the stochastic gradient descent algorithm, in terms of the mini-batch size $n$, when used for solving systems of linear equations.

線性回歸 · 線性的 · 正則化項 · 可理解性 · 拉索回歸 ·

2021 年 12 月 8 日

Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression

Lijia Zhou,Frederic Koehler,Danica J. Sutherland,Nathan Srebro

We study a localized notion of uniform convergence known as an "optimistic rate" (Panchenko 2002; Srebro et al. 2010) for linear regression with Gaussian data. Our refined analysis avoids the hidden constant and logarithmic factor in existing results, which are known to be crucial in high-dimensional settings, especially for understanding interpolation learning. As a special case, our analysis recovers the guarantee from Koehler et al. (2021), which tightly characterizes the population risk of low-norm interpolators under the benign overfitting conditions. Our optimistic rate bound, though, also analyzes predictors with arbitrary training error. This allows us to recover some classical statistical guarantees for ridge and LASSO regression under random designs, and helps us obtain a precise understanding of the excess risk of near-interpolators in the over-parameterized regime.

平方損失 · 方陣 · 可理解性 · Neural Networks · 泛化誤差 ·

2021 年 12 月 7 日

Understanding Square Loss in Training Overparametrized Neural Network Classifiers

Tianyang Hu,Jun Wang,Wenjia Wang,Zhenguo Li

Deep learning has achieved many breakthroughs in modern classification tasks. Numerous architectures have been proposed for different data structures but when it comes to the loss function, the cross-entropy loss is the predominant choice. Recently, several alternative losses have seen revived interests for deep classifiers. In particular, empirical evidence seems to promote square loss but a theoretical justification is still lacking. In this work, we contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks in the neural tangent kernel (NTK) regime. Interesting properties regarding the generalization error, robustness, and calibration error are revealed. We consider two cases, according to whether classes are separable or not. In the general non-separable case, fast convergence rate is established for both misclassification rate and calibration error. When classes are separable, the misclassification rate improves to be exponentially fast. Further, the resulting margin is proven to be lower bounded away from zero, providing theoretical guarantees for robustness. We expect our findings to hold beyond the NTK regime and translate to practical settings. To this end, we conduct extensive empirical studies on practical neural networks, demonstrating the effectiveness of square loss in both synthetic low-dimensional data and real image data. Comparing to cross-entropy, square loss has comparable generalization error but noticeable advantages in robustness and model calibration.

Better · 有偏 · 隨機梯度下降 · 線性的 · Networks ·

2021 年 12 月 7 日

Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

Scott Pesme,Loucas Pillaud-Vivien,Nicolas Flammarion

Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To fully complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances observed in practice of stochastic gradient descent over gradient descent.

權重衰減 · 批量規范化 · Weight · 規范化的 · 周期的 ·

2021 年 12 月 5 日

On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Ekaterina Lobacheva,Maxim Kodryan,Nadezhda Chirkova,Andrey Malinin,Dmitry Vetrov

from arxiv, Published in NeurIPS 2021. First two authors contributed equally

Training neural networks with batch normalization and weight decay has become a common practice in recent years. In this work, we show that their combined use may result in a surprising periodic behavior of optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but cause a new period of training. We rigorously investigate the mechanism underlying the discovered periodic behavior from both empirical and theoretical points of view and analyze the conditions in which it occurs in practice. We also demonstrate that periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.

可交換的 · Continuity · 泛函 · INTERACT · 圖 ·

2021 年 11 月 18 日

Gradient flows on graphons: existence, convergence, continuity equations

Sewoong Oh,Soumik Pal,Raghav Somani,Raghav Tripathi

from arxiv, 40 pages, 2 figures

Wasserstein gradient flows on probability measures have found a host of applications in various optimization problems. They typically arise as the continuum limit of exchangeable particle systems evolving by some mean-field interaction involving a gradient-type potential. However, in many problems, such as in multi-layer neural networks, the so-called particles are edge weights on large graphs whose nodes are exchangeable. Such large graphs are known to converge to continuum limits called graphons as their size grow to infinity. We show that the Euclidean gradient flow of a suitable function of the edge-weights converges to a novel continuum limit given by a curve on the space of graphons that can be appropriately described as a gradient flow or, more technically, a curve of maximal slope. Several natural functions on graphons, such as homomorphism functions and the scalar entropy, are covered by our set-up, and the examples have been worked out in detail.

Neural Networks · Networking · 圖形處理器 · 前饋 · 圖 ·

2021 年 2 月 21 日

How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks

Keyulu Xu,Mozhi Zhang,Jingling Li,Simon S. Du,Ken-ichi Kawarabayashi,Stefanie Jegelka

We study how neural networks trained by gradient descent extrapolate, i.e., what they learn outside the support of the training distribution. Previous works report mixed empirical results when extrapolating with neural networks: while feedforward neural networks, a.k.a. multilayer perceptrons (MLPs), do not extrapolate well in certain simple tasks, Graph Neural Networks (GNNs), a structured network with MLP modules, have shown some success in more complex tasks. Working towards a theoretical explanation, we identify conditions under which MLPs and GNNs extrapolate well. First, we quantify the observation that ReLU MLPs quickly converge to linear functions along any direction from the origin, which implies that ReLU MLPs do not extrapolate most nonlinear functions. But, they can provably learn a linear target function when the training distribution is sufficiently diverse. Second, in connection to analyzing the successes and limitations of GNNs, these results suggest a hypothesis for which we provide theoretical and empirical evidence: the success of GNNs in extrapolating algorithmic tasks to new data (e.g., larger graphs or edge weights) relies on encoding task-specific non-linearities in the architecture or features. Our theoretical analysis builds on a connection of over-parameterized networks to the neural tangent kernel. Empirically, our theory holds across different training settings.

最大平均偏差 · 優化器 · Performer · CASES · tuning ·

2018 年 1 月 30 日

Stable Distribution Alignment Using the Dual of the Adversarial Distance

Ben Usman,Kate Saenko,Brian Kulis

from arxiv, ICLR 2018 Conference Invite to Workshop

Methods that align distributions by minimizing an adversarial distance between them have recently achieved impressive results. However, these approaches are difficult to optimize with gradient descent and they often do not converge well without careful hyperparameter tuning and proper initialization. We investigate whether turning the adversarial min-max problem into an optimization problem by replacing the maximization part with its dual improves the quality of the resulting alignment and explore its connections to Maximum Mean Discrepancy. Our empirical results suggest that using the dual formulation for the restricted family of linear discriminators results in a more stable convergence to a desirable solution when compared with the performance of a primal min-max GAN-like objective and an MMD objective under the same restrictions. We test our hypothesis on the problem of aligning two synthetic point clouds on a plane and on a real-image domain adaptation problem on digits. In both cases, the dual formulation yields an iterative procedure that gives more stable and monotonic improvement over time.

Performer · 估計/估計量 · 經驗風險最小化 · 經驗風險 · 方差 ·

2017 年 12 月 14 日

Variance-based regularization with convex objectives

John Duchi,Hongseok Namkoong

We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

DeepWalk · 網絡嵌入 · node2vec · Networking · 分解的 ·

2017 年 12 月 12 日

Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec

Jiezhong Qiu,Yuxiao Dong,Hao Ma,Jian Li,Kuansan Wang,Jie Tang

from arxiv, 9 pages, published in WSDM 2018 proceedings

Since the invention of word2vec, the skip-gram model has significantly advanced the research of network embedding, such as the recent emergence of the DeepWalk, LINE, PTE, and node2vec approaches. In this work, we show that all of the aforementioned models with negative sampling can be unified into the matrix factorization framework with closed forms. Our analysis and proofs reveal that: (1) DeepWalk empirically produces a low-rank transformation of a network's normalized Laplacian matrix; (2) LINE, in theory, is a special case of DeepWalk when the size of vertices' context is set to one; (3) As an extension of LINE, PTE can be viewed as the joint factorization of multiple networks' Laplacians; (4) node2vec is factorizing a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk. We further provide the theoretical connections between skip-gram based network embedding algorithms and the theory of graph Laplacian. Finally, we present the NetMF method as well as its approximation algorithm for computing network embedding. Our method offers significant improvements over DeepWalk and LINE for conventional network mining tasks. This work lays the theoretical foundation for skip-gram based network embedding methods, leading to a better understanding of latent network representation learning.