from arxiv, Camera-ready version for the Thirty-eighth International Conference on Machine Learning (ICML 2021). 12 + 14 pages, 6 + 3 figures, 1 + 0 table. *First two authors contributed equally

In the vanishing learning rate regime, stochastic gradient descent (SGD) is now relatively well understood. In this work, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and discussing their implications. The main contributions of this work are to derive the stationary distribution for discrete-time SGD in a quadratic loss function with and without momentum; in particular, one implication of our result is that the fluctuation caused by discrete-time dynamics takes a distorted shape and is dramatically larger than a continuous-time theory could predict. Examples of applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of minibatch noise, the optimal Bayesian inference, the escape rate from a sharp minimum, and the stationary covariance of a few second-order methods including damped Newton's method, natural gradient descent, and Adam.
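The gap between discrete-time and continuous-time fluctuation can be illustrated in the simplest possible setting. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic loss with curvature `a` and additive gradient noise of constant variance `sigma2` (both names, and all function names, are illustrative). Under that assumption the update `w <- w - lr * (a * w + eps)` has stationary variance `lr * sigma2 / (a * (2 - lr * a))`, while the small-learning-rate Ornstein-Uhlenbeck approximation gives `lr * sigma2 / (2 * a)`.

```python
import numpy as np


def stationary_variance_discrete(lr, a, sigma2):
    """Stationary variance of w <- w - lr * (a * w + eps), with Var(eps) = sigma2.

    Obtained by solving V = (1 - lr * a)**2 * V + lr**2 * sigma2 for V.
    """
    if not 0.0 < lr * a < 2.0:
        raise ValueError("a stationary distribution requires 0 < lr * a < 2")
    return lr * sigma2 / (a * (2.0 - lr * a))


def stationary_variance_continuous(lr, a, sigma2):
    """Small-learning-rate (Ornstein-Uhlenbeck) approximation of the same variance."""
    return lr * sigma2 / (2.0 * a)


def simulate_variance(lr, a, sigma2, steps=200_000, burn_in=1_000, seed=0):
    """Empirical variance of the iterates after discarding a burn-in period."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(sigma2), size=steps)
    w = 0.0
    samples = np.empty(steps)
    for t in range(steps):
        w = w - lr * (a * w + noise[t])
        samples[t] = w
    return samples[burn_in:].var()
```

For example, with `lr = 1.5`, `a = 1.0`, and `sigma2 = 1.0`, the discrete-time formula gives 3.0 while the continuous-time approximation gives 0.75; the two agree only as `lr * a` approaches zero.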

Related content

Stochastic gradient descent

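Stochastic gradient descent draws m samples from the data-generating distribution and updates the parameters using the average of their gradients.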

SGD · Regularization term · Ridge regression · Generalization theory · Square matrix ·

August 10, 2021

The Benefits of Implicit Regularization from SGD in Least Squares Problems

Difan Zou,Jingfeng Wu,Vladimir Braverman,Quanquan Gu,Dean P. Foster,Sham M. Kakade

from arxiv, 39 pages, 1 figure

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance. Taken together, our results show that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.

Momentum · SGD · Stochastic gradient descent · Continuity · Linear ·

August 9, 2021

On the Hyperparameters in Stochastic Gradient Descent with Momentum

Bin Shi

Following the same routine as [SSJ20], we continue to present the theoretical analysis for stochastic gradient descent with momentum (SGD with momentum) in this paper. In contrast to plain SGD, for SGD with momentum we demonstrate that it is the two hyperparameters together, the learning rate and the momentum coefficient, that play the significant role in the linear rate of convergence in non-convex optimization. Our analysis is based on the use of a hyperparameters-dependent stochastic differential equation (hp-dependent SDE) that serves as a continuous surrogate for SGD with momentum. Similarly, we establish the linear convergence for the continuous-time formulation of SGD with momentum and obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Kramers-Fokker-Planck operator. By comparison, we demonstrate how the optimal linear rate of convergence and the final gap, which for SGD depend only on the learning rate, vary as the momentum coefficient increases from zero to one when momentum is introduced. Then, we propose a mathematical interpretation of why SGD with momentum converges faster and is more robust to the learning rate than standard SGD in practice. Finally, we show that Nesterov momentum in the presence of noise has no essential difference from standard momentum.
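For readers unfamiliar with the two hyperparameters discussed in this abstract, the following is a minimal sketch of the standard heavy-ball form of SGD with momentum. It is an illustration only, not the paper's analysis: the function name, the additive Gaussian noise model (`noise_std`), and the default values are all assumptions.

```python
import numpy as np


def sgd_momentum(grad, w0, lr=0.05, beta=0.9, steps=500, noise_std=0.0, seed=0):
    """Heavy-ball SGD: v <- beta * v - lr * g, then w <- w + v.

    `grad` returns the exact gradient; Gaussian noise with standard deviation
    `noise_std` is added to mimic a stochastic gradient (an assumption).
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w) + noise_std * rng.standard_normal(w.shape)
        v = beta * v - lr * g  # momentum coefficient beta, learning rate lr
        w = w + v
    return w
```

Setting `beta = 0` recovers plain SGD, so the same function can be used to compare the two on a toy objective such as `grad = lambda w: H @ w` for a positive definite matrix `H`.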

Underdetermined · Estimation/estimator · Jacobian matrix · Jacobian · Rank ·

August 9, 2021

A doubly relaxed minimal-norm Gauss-Newton method for underdetermined nonlinear least-squares problems

Federica Pes,Giuseppe Rodriguez

When a physical system is modeled by a nonlinear function, the unknown parameters can be estimated by fitting experimental observations by a least-squares approach. Newton's method and its variants are often used to solve problems of this type. In this paper, we are concerned with the computation of the minimal-norm solution of an underdetermined nonlinear least-squares problem. We present a Gauss-Newton type method, which relies on two relaxation parameters to ensure convergence, and which incorporates a procedure to dynamically estimate the two parameters, as well as the rank of the Jacobian matrix, along the iterations. Numerical results are presented.
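As background for the kind of iteration described here, the sketch below shows a plain relaxed Gauss-Newton loop in which each step is the minimum-norm least-squares solution of the linearized system. It is not the doubly relaxed method of the paper: it uses a single fixed relaxation parameter `relax`, does not estimate the rank of the Jacobian, and is not guaranteed to reach the minimal-norm solution of the nonlinear problem. All names are illustrative.

```python
import numpy as np


def gauss_newton_min_norm(residual, jacobian, x0, relax=1.0, tol=1e-10, max_iter=100):
    """Relaxed Gauss-Newton iteration with minimum-norm steps.

    residual(x) returns a 1-D array r(x); jacobian(x) returns its Jacobian J(x).
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        r = residual(x)
        if np.linalg.norm(r) < tol:
            break
        J = jacobian(x)
        # For a wide or rank-deficient J, lstsq returns the minimum-norm
        # least-squares solution of J @ step = -r.
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x = x + relax * step
    return x
```

A one-equation, two-unknown example is `residual = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0])` with `jacobian = lambda x: np.array([[2 * x[0], 2 * x[1]]])`, where every point on the unit circle is a solution.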

Fidelity · Optimizer · Linear · Identifiable · Square matrix ·

August 8, 2021

On the Convergence Rate of Projected Gradient Descent for a Back-Projection based Objective

Tom Tirer,Raja Giryes

from arxiv, Accepted to SIAM Journal on Imaging Sciences (SIIMS)

Ill-posed linear inverse problems appear in many scientific setups, and are typically addressed by solving optimization problems, which are composed of data fidelity and prior terms. Recently, several works have considered a back-projection (BP) based fidelity term as an alternative to the common least squares (LS), and demonstrated excellent results for popular inverse problems. These works have also empirically shown that using the BP term, rather than the LS term, requires fewer iterations of optimization algorithms. In this paper, we examine the convergence rate of the projected gradient descent (PGD) algorithm for the BP objective. Our analysis allows us to identify an inherent source for its faster convergence compared to using the LS objective, while making only mild assumptions. We also analyze the more general proximal gradient method under a relaxed contraction condition on the proximal mapping of the prior. This analysis further highlights the advantage of BP when the linear measurement operator is badly conditioned. Numerical experiments with both $\ell_1$-norm and GAN-based priors corroborate our theoretical results.

Stochastic gradient descent · SGD · Regularization term · Saturation · Linear ·

August 7, 2021

On the Saturation Phenomenon of Stochastic Gradient Descent for Linear Inverse Problems

Bangti Jin,Zehui Zhou,Jun Zou

from arxiv, to appear at SIAM/ASA J. Uncertainty Quantification, with error corrected

Stochastic gradient descent (SGD) is a promising method for solving large-scale inverse problems, due to its excellent scalability with respect to data size. The current mathematical theory in the lens of regularization theory predicts that SGD with a polynomially decaying stepsize schedule may suffer from an undesirable saturation phenomenon, i.e., the convergence rate does not further improve with the solution regularity index when it is beyond a certain range. In this work, we present a refined convergence rate analysis of SGD, and prove that saturation actually does not occur if the initial stepsize of the schedule is sufficiently small. Several numerical experiments are provided to complement the analysis.
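The setting in this abstract, SGD with a polynomially decaying stepsize applied to a linear system, can be sketched as follows. This is a toy illustration rather than the paper's algorithm or analysis: the stepsize form `eta0 * k ** (-alpha)`, the uniform row sampling, and all names and defaults are assumptions.

```python
import numpy as np


def sgd_linear_inverse(A, y, eta0=0.5, alpha=0.1, n_iter=20_000, seed=0):
    """SGD for A @ x = y with stepsize eta_k = eta0 * k ** (-alpha).

    Each iteration samples one row a_i uniformly and takes a gradient step
    on the single-row loss 0.5 * (a_i @ x - y_i) ** 2.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(1, n_iter + 1):
        i = rng.integers(n)
        eta = eta0 * k ** (-alpha)  # polynomially decaying stepsize
        x = x - eta * (A[i] @ x - y[i]) * A[i]
    return x
```

The initial stepsize `eta0` is the quantity whose size the abstract ties to the saturation phenomenon; the sketch exposes it as a parameter so that different values can be tried on synthetic data.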

Dataset augmentation · Continuity · Functional · INFORMS · Sample ·

August 6, 2021

Quantum Quantile Mechanics: Solving Stochastic Differential Equations for Generating Time-Series

Annie E. Paine,Vincent E. Elfving,Oleksandr Kyriienko

from arxiv, first draft, to be updated

We propose a quantum algorithm for sampling from a solution of stochastic differential equations (SDEs). Using differentiable quantum circuits (DQCs) with a feature map encoding of latent variables, we represent the quantile function for an underlying probability distribution and extract samples as DQC expectation values. Using quantile mechanics we propagate the system in time, thereby allowing for time-series generation. We test the method by simulating the Ornstein-Uhlenbeck process and sampling at times different from the initial point, as required in financial analysis and dataset augmentation. Additionally, we analyse continuous quantum generative adversarial networks (qGANs), and show that they represent quantile functions with a modified (reordered) shape that impedes their efficient time-propagation. Our results shed light on the connection between quantum quantile mechanics (QQM) and qGANs for SDE-based distributions, and point to the importance of differential constraints for model training, analogously with the recent success of physics informed neural networks.
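For reference, the Ornstein-Uhlenbeck process used as the test case can be sampled classically with the Euler-Maruyama scheme. The sketch below is a classical baseline only and has nothing to do with the quantum circuits of the paper. It assumes the parameterization dX = theta * (mu - X) dt + sigma dW; the function and parameter names are illustrative.

```python
import numpy as np


def simulate_ou(x0, theta, mu, sigma, dt, n_steps, n_paths, seed=0):
    """Euler-Maruyama paths of dX = theta * (mu - X) dt + sigma dW.

    Returns an array of shape (n_steps + 1, n_paths); row t holds X at time t * dt.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, float(x0))
    paths = np.empty((n_steps + 1, n_paths))
    paths[0] = x
    for t in range(1, n_steps + 1):
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Brownian increments
        x = x + theta * (mu - x) * dt + sigma * dw
        paths[t] = x
    return paths
```

At long times the samples should approach the stationary distribution of the process, which has mean `mu` and variance `sigma**2 / (2 * theta)`, up to the discretization error of the scheme.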

Estimation/estimator · Estimation error · MoDELS · Learning · Unbiased ·

December 17, 2020

The Causal Learning of Retail Delinquency

Yiyan Huang,Cheuk Hang Leung,Xing Yan,Qi Wu,Nanbo Peng,Dongdong Wang,Zhixiang Huang

from arxiv, This paper was accepted and will be published in the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

This paper focuses on the expected difference in borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects and hence the estimation error can be very large. As such, we propose another approach to construct the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of estimating the causal quantities between the classical estimators and the proposed estimators. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction of estimation error is strikingly substantial if the causal effects are accounted for correctly.

Stochastic gradient descent · Normalized · Batch Size · Optimizer · Width ·

May 9, 2019

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

Daniel S. Park,Jascha Sohl-Dickstein,Quoc V. Le,Samuel L. Smith

from arxiv, 17 pages, 3 tables, 17 figures; accepted to ICML 2019

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.

Stochastic gradient descent · ReLU · Optimizer · Networking · Rectified linear unit ·

November 21, 2018

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Difan Zou,Yuan Cao,Dongruo Zhou,Quanquan Gu

from arxiv, 47 pages

We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under a mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization of deep learning, and pave the way to study the optimization dynamics of training modern deep neural networks.

Coordinate descent · Optimizer · Performer · Learning · Online ·

July 16, 2018

Accelerated Randomized Coordinate Descent Algorithms for Stochastic Optimization and Online Learning

Akshita Bhandari,Chandramani Singh

from arxiv, 20 pages, 4 figures, 2 tables

We propose accelerated randomized coordinate descent algorithms for stochastic optimization and online learning. Our algorithms have significantly less per-iteration complexity than the known accelerated gradient algorithms. The proposed algorithms for online learning have better regret performance than the known randomized online coordinate descent algorithms. Furthermore, the proposed algorithms for stochastic optimization exhibit as good convergence rates as the best known randomized coordinate descent algorithms. We also show simulation results to demonstrate performance of the proposed algorithms.
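To make the per-iteration cost argument concrete, here is a minimal sketch of basic (non-accelerated) randomized coordinate descent on a strongly convex quadratic. It is not the accelerated algorithm proposed in the paper; the objective, the exact coordinate-wise minimization step, and all names are assumptions chosen for illustration.

```python
import numpy as np


def randomized_coordinate_descent(Q, b, n_iter=5_000, seed=0):
    """Minimise 0.5 * x @ Q @ x - b @ x for a symmetric positive definite Q.

    Each iteration picks one coordinate uniformly at random and minimises the
    objective exactly along it, touching a single row of Q.
    """
    rng = np.random.default_rng(seed)
    d = b.shape[0]
    x = np.zeros(d)
    for _ in range(n_iter):
        j = rng.integers(d)
        grad_j = Q[j] @ x - b[j]   # partial derivative along coordinate j
        x[j] -= grad_j / Q[j, j]   # exact minimisation along coordinate j
    return x
```

Each update costs one inner product with a row of `Q`, compared with a full matrix-vector product for a gradient step, which is the kind of per-iteration saving the abstract refers to.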