
Stochastic gradient descent (SGD) is a promising method for solving large-scale inverse problems, due to its excellent scalability with respect to data size. The existing mathematical theory, viewed through the lens of regularization theory, predicts that SGD with a polynomially decaying stepsize schedule may suffer from an undesirable saturation phenomenon: the convergence rate does not improve further with the regularity index of the solution once the index is beyond a certain range. In this work, we present a refined convergence rate analysis of SGD and prove that saturation does not in fact occur if the initial stepsize of the schedule is sufficiently small. Several numerical experiments are provided to complement the analysis.
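
A minimal sketch of the scheme under discussion, on a toy linear system of our own rather than the paper's setting (the problem below is well-conditioned, whereas the paper concerns the ill-posed case): SGD samples one row per step and uses the polynomially decaying stepsize $\eta_j = \eta_0 j^{-\alpha}$, with a small initial stepsize $\eta_0$ as the analysis suggests.

    import numpy as np

    # SGD for a linear system A x = y with stepsize eta_j = eta0 * j**(-alpha).
    # All names and constants are illustrative assumptions.
    rng = np.random.default_rng(0)
    n, d = 200, 50
    A = rng.standard_normal((n, d))
    x_true = rng.standard_normal(d)
    y = A @ x_true                           # consistent (noise-free) data

    x = np.zeros(d)
    eta0, alpha = 0.02, 0.5                  # small eta0, per the analysis above
    for j in range(1, 50_000):
        i = rng.integers(n)                  # sample one row uniformly
        grad = (A[i] @ x - y[i]) * A[i]      # gradient of 0.5 * (a_i^T x - y_i)^2
        x -= eta0 * j ** (-alpha) * grad

    print(np.linalg.norm(x - x_true))        # error shrinks as the schedule decays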

Related content

Stochastic gradient descent draws m samples according to the data-generating distribution and updates the parameters using the average of their gradients as an estimate of the true gradient.
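
In code, one pass of this update might look like the following sketch, where the least-squares model, batch size m, and stepsize are illustrative assumptions:

    import numpy as np

    # Minibatch SGD: draw m samples from the data-generating distribution and
    # update the parameters with the average of the per-sample gradients.
    rng = np.random.default_rng(1)
    d, m, eta = 10, 32, 0.1
    w_true = rng.standard_normal(d)
    w = np.zeros(d)

    for step in range(2000):
        X = rng.standard_normal((m, d))      # m fresh samples per step
        y = X @ w_true + 0.01 * rng.standard_normal(m)
        grad = X.T @ (X @ w - y) / m         # average gradient over the batch
        w -= eta * grad

    print(np.linalg.norm(w - w_true))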

Recent empirical advances show that training deep models with a large learning rate often improves generalization performance. However, theoretical justification for the benefits of large learning rates remains highly limited, owing to the challenges of analysis. In this paper, we consider using Gradient Descent (GD) with a large learning rate on a homogeneous matrix factorization problem, i.e., $\min_{X, Y} \|A - XY^\top\|_{\sf F}^2$. We prove a convergence theory for constant learning rates well beyond $2/L$, where $L$ is the largest eigenvalue of the Hessian at initialization. Moreover, we rigorously establish an implicit bias of GD induced by such a large learning rate, termed 'balancing': the magnitudes of $X$ and $Y$ at the limit of the GD iterations will be close even if their initialization is significantly unbalanced. Numerical experiments are provided to support our theory.
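
A minimal scalar sketch of the balancing effect, on a toy instance of our own rather than the paper's experiments: for $\min_{x,y} (a - xy)^2$ with $a = 1$, the Hessian's largest eigenvalue at the unbalanced initialization $(x_0, y_0) = (3, 0.01)$ is $L \approx 2x_0^2 = 18$, so $\eta = 0.2$ is well beyond $2/L \approx 0.11$; GD still converges, and the magnitudes of $x$ and $y$ equalize.

    # Scalar matrix factorization: min_{x,y} (a - x*y)^2.
    # Init, stepsize, and target value a are illustrative assumptions.
    a = 1.0
    x, y = 3.0, 0.01                  # heavily unbalanced initialization
    eta = 0.2                         # beyond 2/L ~ 0.11 at initialization
    for _ in range(200):
        r = a - x * y
        x, y = x + eta * 2 * r * y, y + eta * 2 * r * x   # simultaneous GD step
    print(x, y, x * y)                # |x| ~ |y| ~ 1: magnitudes have balanced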

In this paper, we consider asymptotic regularization with convex constraints for nonlinear ill-posed problems. The method allows the use of non-smooth penalty terms, including L1-like and total-variation-like penalty functionals, which are important for reconstructing special features of solutions such as sparsity and piecewise constancy. Under certain conditions we establish convergence properties of the method. Moreover, we propose Runge-Kutta-type methods to discretize the associated initial value problems, thereby constructing new iterative regularization methods.
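
To make the discretization step concrete, here is a sketch (a linear, well-conditioned toy case without constraints or non-smooth penalties) of the asymptotic-regularization flow $x'(t) = A^\top(y - A x(t))$ discretized by forward Euler, which recovers the Landweber iteration, and by a second-order Runge-Kutta (Heun) step; the data and step size are illustrative assumptions.

    import numpy as np

    # Discretize x'(t) = A^T (y - A x(t)) with Euler (Landweber) and Heun (RK2).
    rng = np.random.default_rng(3)
    m, d = 100, 40
    A = rng.standard_normal((m, d)) / np.sqrt(m)
    x_true = rng.standard_normal(d)
    y = A @ x_true

    def flow(x):                             # right-hand side of the flow
        return A.T @ (y - A @ x)

    x_euler, x_heun, h = np.zeros(d), np.zeros(d), 0.5
    for _ in range(500):
        x_euler = x_euler + h * flow(x_euler)        # forward Euler = Landweber
        k1 = flow(x_heun)                            # Heun (RK2) step
        k2 = flow(x_heun + h * k1)
        x_heun = x_heun + h / 2 * (k1 + k2)

    print(np.linalg.norm(x_euler - x_true), np.linalg.norm(x_heun - x_true))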

We develop a new method of online inference for a vector of parameters estimated by the Polyak-Ruppert averaging procedure of stochastic gradient descent (SGD) algorithms. We leverage insights from time series regression in econometrics and construct asymptotically pivotal statistics via random scaling. Our approach is fully operational with online data and is rigorously underpinned by a functional central limit theorem. The proposed inference method has two key advantages over existing methods. First, the test statistic is computed in an online fashion using only the SGD iterates, and the critical values can be obtained without any resampling, allowing for efficient implementation on massive online data. Second, there is no need to estimate the asymptotic variance, and our inference method is shown in simulation experiments with synthetic data to be robust to changes in the tuning parameters of the SGD algorithms.
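
The averaging itself is a one-line online update, as the following sketch shows for a streaming least-squares model (the model, stepsize schedule, and constants are illustrative assumptions; the random-scaling statistic additionally tracks functionals of the partial-sum path, which is omitted here):

    import numpy as np

    # SGD with an online Polyak-Ruppert average of the iterates.
    rng = np.random.default_rng(4)
    d = 5
    theta_true = rng.standard_normal(d)
    theta = np.zeros(d)
    theta_bar = np.zeros(d)                  # running average of the iterates

    for t in range(1, 100_000):
        x = rng.standard_normal(d)           # one streaming observation
        y = x @ theta_true + rng.standard_normal()
        grad = (x @ theta - y) * x           # stochastic gradient, squared loss
        theta -= 0.2 * t ** (-0.6) * grad    # slowly decaying stepsize
        theta_bar += (theta - theta_bar) / t # online mean, no iterate storage

    print(np.linalg.norm(theta_bar - theta_true))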

It has been proposed that classical filtering methods, like the Kalman filter and 3DVAR, can be used to solve linear statistical inverse problems. In the work of Iglesias, Lin, Lu, & Stuart (2017), error estimates were obtained for this approach. By optimally tuning a free parameter in the filters, the authors were able to show that the mean squared error can be minimized. In the present work, we prove that by (i) considering the problem in a weaker, weighted space and (ii) applying simple iterate averaging of the filter output, 3DVAR converges in mean square unconditionally on the choice of this parameter. Without iterate averaging, 3DVAR cannot achieve convergence by running additional iterations with a given, fixed choice of the parameter. We also establish that the Kalman filter's performance cannot be improved through iterate averaging. We illustrate our results with numerical experiments that suggest our convergence rates are sharp.
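
A small sketch of the iterate-averaged 3DVAR update for a linear observation model $y = Hu + \eta$ with repeated noisy observations; the fixed gain, noise level, and sizes are illustrative assumptions (the free parameter studied in the paper enters through the gain, and poorly observed directions converge slowly, which is what the weighted-space analysis addresses):

    import numpy as np

    # 3DVAR with a fixed gain K, plus iterate averaging of the filter output.
    rng = np.random.default_rng(5)
    d = 30
    H = rng.standard_normal((d, d)) / np.sqrt(d)
    u_true = rng.standard_normal(d)

    K = 0.1 * H.T                            # simple fixed gain (illustrative)
    m = np.zeros(d)
    m_bar = np.zeros(d)
    for k in range(1, 20_000):
        y = H @ u_true + 0.1 * rng.standard_normal(d)  # fresh noisy observation
        m = m + K @ (y - H @ m)                        # 3DVAR update
        m_bar += (m - m_bar) / k                       # iterate averaging

    print(np.linalg.norm(m - u_true), np.linalg.norm(m_bar - u_true))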

Finding the optimal configuration of parameters in ResNet is a nonconvex minimization problem, but first-order methods nevertheless find the global optimum in the overparameterized regime. We study this phenomenon with mean-field analysis, by translating the training process of ResNet to a gradient-flow partial differential equation (PDE) and examining the convergence properties of this limiting process. The activation function is assumed to be $2$-homogeneous or partially $1$-homogeneous; the regularized ReLU satisfies the latter condition. We show that if the ResNet is sufficiently large, with depth and width depending algebraically on the accuracy and confidence levels, first-order optimization methods can find global minimizers that fit the training data.

The Polyak-Lojasiewicz (PL) inequality is a sufficient condition for establishing linear convergence of gradient descent, even in non-convex settings. While several recent works use a PL-based analysis to establish linear convergence of stochastic gradient descent methods, the question remains as to whether a similar analysis can be conducted for more general optimization methods. In this work, we present a PL-based analysis for linear convergence of generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first-order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. Since the standard PL analysis cannot be extended naturally from GMD to stochastic GMD, we present a Taylor-series-based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, for functions that are locally PL*, our analysis implies the existence of an interpolating solution and the convergence of GMD to this solution.
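
As a concrete instance of GMD, the following sketch runs mirror descent with the entropy mirror map on the probability simplex (exponentiated gradient); the objective, target, and stepsize are illustrative assumptions:

    import numpy as np

    # Mirror descent with the entropy mirror: multiplicative (exponentiated)
    # gradient steps followed by renormalization onto the simplex.
    c = np.array([0.7, 0.2, 0.1])            # minimizer of f over the simplex

    def f_grad(w):                           # gradient of f(w) = 0.5*||w - c||^2
        return w - c

    w = np.ones(3) / 3                       # start at the simplex center
    eta = 0.5
    for _ in range(500):
        w = w * np.exp(-eta * f_grad(w))     # step in the dual (mirror) space
        w /= w.sum()                         # Bregman projection = renormalize
    print(w)                                 # converges to c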

Cluster visualization is an essential task for nonlinear dimensionality reduction as a data analysis tool. It is often believed that Student t-Distributed Stochastic Neighbor Embedding (t-SNE) can show clusters for well-clusterable data, with a smaller Kullback-Leibler divergence corresponding to a better quality, and this property has even been supported by a theoretical guarantee. However, we point out that this is not necessarily the case: t-SNE may leave clustering patterns hidden despite strong signals present in the data. Extensive empirical evidence is provided to support our claim. First, several real-world counter-examples are presented, where t-SNE fails even if the input neighborhoods are well clusterable. Tuning hyperparameters in t-SNE or using better optimization algorithms does not help solve this issue, because a better t-SNE learning objective can correspond to a worse cluster embedding. Second, we check the assumptions in the clustering guarantee of t-SNE and find that they are often violated for real-world data sets.

Blind super-resolution can be cast as a low-rank matrix recovery problem by exploiting the inherent simplicity of the signal. In this paper, we develop a simple yet efficient nonconvex method for this problem based on the low-rank structure of the vectorized Hankel matrix associated with the target matrix. Theoretical guarantees are established under conditions similar to those of convex approaches. Numerical experiments are also conducted to demonstrate its performance.
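
For intuition about the lifting, the following sketch constructs the vectorized Hankel matrix of a spectrally sparse toy signal $x_t = \sum_{k=1}^r h_k z_k^t$ and checks that its rank equals the number of frequencies $r$; the sizes and signal model are illustrative assumptions.

    import numpy as np

    def vectorized_hankel(X, n1):
        # Block-Hankel lift: block (i, j) is the column x_{i+j} of X.
        s, n = X.shape
        n2 = n - n1 + 1
        H = np.zeros((s * n1, n2), dtype=X.dtype)
        for i in range(n1):
            for j in range(n2):
                H[i * s:(i + 1) * s, j] = X[:, i + j]
        return H

    rng = np.random.default_rng(6)
    s, n, r = 4, 64, 3
    z = np.exp(2j * np.pi * rng.random(r))   # r frequencies on the unit circle
    Hk = rng.standard_normal((s, r))         # per-frequency channel vectors
    X = np.stack([(Hk * z ** t).sum(axis=1) for t in range(n)], axis=1)

    print(np.linalg.matrix_rank(vectorized_hankel(X, n1=32)))   # equals r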

We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that, for a broad family of loss functions and with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning and pave the way for studying the optimization dynamics of training modern deep neural networks.
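
The "small perturbation region" picture can be probed numerically. The following sketch, a toy of our own rather than the paper's setting, trains an over-parameterized one-hidden-layer ReLU network with GD (squared loss for simplicity, whereas the paper treats a family of classification losses) and tracks how far the weights move from their Gaussian initialization; the width, data, and stepsize are illustrative assumptions.

    import numpy as np

    # One-hidden-layer ReLU net trained by GD; measure drift from the init.
    rng = np.random.default_rng(8)
    n, d, width = 20, 5, 2000
    X = rng.standard_normal((n, d))
    y = np.sign(rng.standard_normal(n))              # +/-1 labels

    W0 = rng.standard_normal((width, d)) / np.sqrt(d)    # Gaussian init
    a = np.sign(rng.standard_normal(width)) / np.sqrt(width)
    W, eta = W0.copy(), 0.5
    for _ in range(500):
        H = np.maximum(W @ X.T, 0.0)                 # hidden activations
        err = a @ H - y                              # residuals, squared loss
        G = (a[:, None] * (W @ X.T > 0)) * err       # backprop through ReLU
        W -= eta * (G @ X) / n

    print(np.linalg.norm(W - W0) / np.linalg.norm(W0))  # relative drift stays small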

We propose accelerated randomized coordinate descent algorithms for stochastic optimization and online learning. Our algorithms have significantly lower per-iteration complexity than the known accelerated gradient algorithms. The proposed algorithms for online learning achieve better regret performance than the known randomized online coordinate descent algorithms, while the proposed algorithms for stochastic optimization exhibit convergence rates as good as those of the best known randomized coordinate descent algorithms. We also present simulation results to demonstrate the performance of the proposed algorithms.
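
For reference, the plain (non-accelerated) randomized coordinate descent baseline on a quadratic takes only a few lines; the accelerated variants proposed above add momentum on top of the same coordinate-wise updates. The problem data below are illustrative assumptions.

    import numpy as np

    # Randomized coordinate descent on f(w) = 0.5 * w^T Q w - b^T w:
    # update one uniformly sampled coordinate per iteration.
    rng = np.random.default_rng(7)
    d = 50
    M = rng.standard_normal((d, d))
    Q = M.T @ M / d + np.eye(d)              # positive definite
    b = rng.standard_normal(d)
    L = np.diag(Q)                           # coordinate-wise Lipschitz constants

    w = np.zeros(d)
    for _ in range(20_000):
        i = rng.integers(d)
        g_i = Q[i] @ w - b[i]                # i-th partial derivative
        w[i] -= g_i / L[i]                   # exact coordinate step (quadratic)
    print(np.linalg.norm(Q @ w - b))         # gradient norm near zero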
