We study the generalization properties of the popular stochastic optimization method known as stochastic gradient descent (SGD) for optimizing general non-convex loss functions. Our main contribution is providing upper bounds on the generalization error that depend on local statistics of the stochastic gradients evaluated along the path of iterates calculated by SGD. The key factors our bounds depend on are the variance of the gradients (with respect to the data distribution), the local smoothness of the objective function along the SGD path, and the sensitivity of the loss function to perturbations of the final output. Our key technical tool combines the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.
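As a rough illustration of the kind of local statistic such bounds depend on, the sketch below estimates the per-iterate variance of stochastic gradients along an SGD trajectory. The least-squares objective, data, and variance estimator are illustrative choices, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

def grad(w, idx):
    # Gradient of the squared loss, averaged over a mini-batch.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = rng.normal(size=d)
lr, batch = 0.01, 10
for t in range(100):
    # Empirical variance of per-example gradients at the current iterate
    # (trace of the gradient covariance), a local statistic along the path.
    per_ex = np.stack([grad(w, [i]) for i in range(n)])
    var_t = per_ex.var(axis=0).sum()
    idx = rng.choice(n, size=batch, replace=False)
    w -= lr * grad(w, idx)
    if t % 25 == 0:
        print(f"iter {t:3d}  gradient variance {var_t:.3f}")
```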
We study problems with stochastic uncertainty information on intervals, for which the precise value can be queried by paying a cost. The goal is to devise an adaptive decision tree that finds a correct solution to the problem under consideration while minimizing the expected total query cost. We show that, for the sorting problem, such a decision tree can be found in polynomial time. For the problem of finding the data item of minimum value, we have some evidence for hardness. This contradicts intuition, since the minimum problem is easier both in the online setting with adversarial inputs and in the offline verification setting. However, the stochastic assumption can be leveraged to beat both deterministic and randomized approximation lower bounds for the online setting.
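As a toy illustration of the query model (not the paper's decision-tree algorithm; the interval representation, costs, and query rule here are invented for exposition), the sketch below sorts uncertain items by querying exactly those items whose interval overlaps another's, after which the order is fully determined.

```python
import random

random.seed(1)
# Toy model: each item has an uncertainty interval [lo, hi] and a query cost;
# querying reveals the precise value inside the interval.
items = []
for i in range(6):
    lo = random.uniform(0, 10)
    items.append({"id": i, "lo": lo, "hi": lo + random.uniform(0, 3),
                  "value": None, "cost": random.randint(1, 5)})

def overlaps(a, b):
    return a["lo"] < b["hi"] and b["lo"] < a["hi"]

# Query every item whose interval overlaps another's. Afterwards the
# unqueried intervals are pairwise disjoint from all other intervals,
# so comparing revealed values and interval endpoints settles the order.
total_cost = 0
for a in items:
    if any(overlaps(a, b) for b in items if b is not a):
        a["value"] = random.uniform(a["lo"], a["hi"])
        total_cost += a["cost"]

key = lambda it: it["value"] if it["value"] is not None else it["lo"]
print("sorted ids:", [it["id"] for it in sorted(items, key=key)])
print("total query cost:", total_cost)
```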
In the field of radar target detection, the false alarm and detection probabilities have so far served as the universal indicators of detection performance, as in the Neyman-Pearson detector. In this paper, inspired by Shannon's information theory, a new system model that introduces the target existence state variable $v$ into a general radar system model is established for target detection in the presence of complex white Gaussian noise. The equivalent detection channel and the posterior probability distribution are derived from the a priori statistical characteristics of the noise, the target scattering, and the existence state. Detection performance is measured by the false alarm and detection probabilities and by the detection information, defined as the mutual information between the received signal and the existence state. We prove a false alarm theorem stating that the false alarm probability equals the prior probability of target existence if the observation interval is large enough; this theorem is the basis for comparing the performance of the proposed detector with that of the Neyman-Pearson detector. We then propose the sampling a posteriori probability detector, whose performance is measured by the empirical detection information. We further prove a target detection theorem stating that the detection information is the limit of detection performance: the detection information is achievable, and the empirical detection information of any detector is no greater than the detection information. Simulation results verify the false alarm and target detection theorems, and show that the sampling a posteriori probability detector is asymptotically optimal and outperforms other detectors. In addition, under the detection information criterion, the proposed detector is more favorable for detecting dim targets than other detectors.
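A minimal sketch of a sampling a posteriori probability detector, assuming a single-sample real Gaussian model with a known target amplitude (the signal model, amplitude, and prior below are illustrative, not the paper's setup): compute the posterior probability of target existence and sample the decision from it.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, amp, p1 = 1.0, 1.5, 0.1       # noise std, target amplitude, prior P(v=1)
n_trials = 100_000

v = rng.random(n_trials) < p1        # latent existence state
y = amp * v + sigma * rng.normal(size=n_trials)

# Posterior P(v=1 | y) under the two Gaussian likelihoods.
log_ratio = (amp * y - amp**2 / 2) / sigma**2 + np.log(p1 / (1 - p1))
posterior = 1.0 / (1.0 + np.exp(-log_ratio))

# Sampling detector: declare a target with probability equal to the posterior.
decision = rng.random(n_trials) < posterior

pfa = decision[~v].mean()            # empirical false alarm probability
pd = decision[v].mean()              # empirical detection probability
print(f"P_FA = {pfa:.3f} (prior P(v=1) = {p1}),  P_D = {pd:.3f}")
```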
In this paper, we characterize the noise of stochastic gradients and analyze the noise-induced dynamics that arise when deep neural networks are trained with gradient-based optimizers. Specifically, we first show that the stochastic gradient noise possesses finite variance, and therefore the classical Central Limit Theorem (CLT) applies; this indicates that the gradient noise is asymptotically Gaussian. This asymptotic result validates the widely accepted assumption of Gaussian noise. We clarify that the recently observed heavy tails in gradient noise may not be an intrinsic property, but rather a consequence of insufficient mini-batch size: the gradient noise, being a sum of a limited number of i.i.d. random variables, has not reached the asymptotic regime of the CLT and thus deviates from Gaussian. We quantitatively measure the goodness of the Gaussian approximation to the noise, which supports our conclusion. Second, we analyze the noise-induced dynamics of stochastic gradient descent using the Langevin equation, giving the momentum hyperparameter in the optimizer a physical interpretation. We then demonstrate the existence of a steady-state distribution of stochastic gradient descent and approximate this distribution at small learning rates.
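The batch-size effect can be probed with a simple experiment: draw many mini-batch gradients at a fixed parameter point and test one coordinate for normality as the batch size grows. The toy logistic model, heavy-tailed inputs, and choice of normality test below are illustrative assumptions, not the paper's protocol.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d = 5000, 20
X = rng.standard_t(df=3, size=(n, d))        # heavy-tailed inputs on purpose
w = rng.normal(size=d)
y = (X @ w + rng.normal(size=n) > 0).astype(float)

def minibatch_grad(batch):
    idx = rng.choice(n, size=batch, replace=False)
    p = 1 / (1 + np.exp(-X[idx] @ w))
    return X[idx].T @ (p - y[idx]) / batch   # logistic-loss gradient

for batch in (4, 32, 256):
    # Test one gradient coordinate for normality over many mini-batch draws;
    # larger batches should look more Gaussian, as the CLT predicts.
    samples = np.array([minibatch_grad(batch)[0] for _ in range(2000)])
    stat, pval = stats.normaltest(samples)
    print(f"batch {batch:4d}: normality test p-value = {pval:.3f}")
```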
This note examines the generalization behavior - as measured by the out-of-sample mean squared error (MSE) - of Linear Gaussian (with a fixed design matrix) and Linear Least Squares regression. In particular, we consider a well-specified model setting, i.e., we assume that there exists a `true' combination of model parameters within the chosen model form. While the statistical properties of Least Squares regression have been studied extensively over the past few decades - typically under {\bf less restrictive problem statements} than the present work - this note targets bounds that are {\bf non-asymptotic and more quantitative} than those in the literature. Further, the analytical formulae for the distributions and the bounds (on the MSE) are directly compared with numerical experiments. Derivations are presented in a self-contained and pedagogical manner, so that a reader with basic knowledge of probability and statistics can follow them.
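A quick numerical check of the kind the note advocates: simulate a well-specified linear model, fit least squares, and compare the empirical out-of-sample MSE against an analytic formula. The comparison below uses the classical random-design result $\sigma^2(1 + d/(n-d-1))$ for i.i.d. Gaussian design; the note's own fixed-design setting and bounds differ, so this is only a point of reference.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 10, 1.0
trials, mse = 2000, []

for _ in range(trials):
    # Well-specified model: y = X w* + noise, Gaussian design and noise.
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=d)
    y = X @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

    X_test = rng.normal(size=(n, d))
    y_test = X_test @ w_star + sigma * rng.normal(size=n)
    mse.append(np.mean((X_test @ w_hat - y_test) ** 2))

# Classical result for i.i.d. Gaussian design: E[MSE] = sigma^2 (1 + d/(n-d-1)).
print(f"empirical: {np.mean(mse):.4f}   analytic: {sigma**2 * (1 + d/(n-d-1)):.4f}")
```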
In this paper, we propose a unified convergence analysis for a class of generic shuffling-type gradient methods for solving finite-sum optimization problems. Our analysis works with any sampling-without-replacement strategy and covers many known variants such as randomized reshuffling, deterministic or randomized single permutation, and cyclic and incremental gradient schemes. We focus on two different settings: strongly convex and nonconvex problems, but also discuss the non-strongly convex case. Our main contribution consists of new non-asymptotic and asymptotic convergence rates for a wide class of shuffling-type gradient methods in both nonconvex and convex settings. We also study uniformly randomized shuffling variants under different learning rates and model assumptions. While our rate in the nonconvex case is new and significantly improves over existing works under standard assumptions, the rate in the strongly convex case matches the best known rates prior to this paper, up to a constant factor, without imposing a bounded gradient condition. Finally, we empirically illustrate our theoretical results on two numerical examples: nonconvex logistic regression and neural network training. As byproducts, our results suggest appropriate choices of diminishing learning rates for certain shuffling variants.
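A minimal sketch of the randomized-reshuffling variant on a least-squares problem (the problem instance, step-size schedule, and constants are illustrative choices, not the paper's tuned rates): each epoch draws a fresh permutation and makes one pass over the shuffled data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for epoch in range(50):
    lr = 0.1 / (epoch + 1)          # diminishing learning rate
    perm = rng.permutation(n)       # sampling without replacement each epoch
    for i in perm:                  # one pass over the shuffled data
        w -= lr * (X[i] @ w - y[i]) * X[i]
print("distance to w*:", np.linalg.norm(w - w_star))
```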
Five new algorithms are proposed to optimize the conditioning of structural matrices. Along with reducing the size and duration of analyses, minimizing analytical errors is a critical factor in the optimal computer analysis of skeletal structures. Matrices that are sparse (with a greater number of zeros), well structured, and well conditioned are advantageous for this purpose. As a result, an optimization problem with several objectives is addressed. This study seeks to minimize analytical errors, such as rounding errors, in skeletal structural flexibility matrices through the use of more consistent and appropriate mathematical methods. These errors become more pronounced in particular designs with ill-conditioned flexibility matrices; structures with widely varying stiffness are a frequent example. Owing to the presence of weak elements, the flexibility matrix acquires a large number of non-diagonal terms, resulting in analytical errors. In numerical analysis, the ill-conditioning of a matrix may be resolved by moving or substituting rows; this study examines the definition and execution of such modifications prior to forming the flexibility matrix. Simple topological and algebraic features are mostly utilized in this study to find fundamental cycle bases with particular characteristics. In conclusion, appropriately conditioned flexibility matrices are obtained, and analytical errors are reduced accordingly.
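To illustrate why row/column operations matter for conditioning (this is a generic numerical-analysis sketch using Jacobi diagonal equilibration, not any of the paper's five algorithms): a matrix whose "stiffness" scales span many orders of magnitude can have its condition number reduced dramatically by a simple symmetric scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
# Build an ill-conditioned "flexibility-like" matrix: a well-conditioned
# symmetric core B, scaled by member stiffnesses spanning six orders of
# magnitude (purely illustrative).
B = np.eye(6) + 0.1 * rng.normal(size=(6, 6))
B = (B + B.T) / 2
s = np.sqrt(10.0 ** rng.uniform(-3, 3, size=6))
A = B * np.outer(s, s)

# Jacobi (diagonal) equilibration: a classical row/column scaling that can
# dramatically improve conditioning before factorization.
dinv = 1.0 / np.sqrt(np.abs(np.diag(A)))
A_eq = A * np.outer(dinv, dinv)

print(f"cond(A)            = {np.linalg.cond(A):.3e}")
print(f"cond(equilibrated) = {np.linalg.cond(A_eq):.3e}")
```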
We establish improved uniform error bounds for the time-splitting methods for the long-time dynamics of the Schr\"odinger equation with small potential and the nonlinear Schr\"odinger equation (NLSE) with weak nonlinearity. For the Schr\"odinger equation with small potential characterized by a dimensionless parameter $\varepsilon \in (0, 1]$ representing the amplitude of the potential, we employ the unitary flow property of the (second-order) time-splitting Fourier pseudospectral (TSFP) method in $L^2$-norm to prove a uniform error bound at $C(T)(h^m +\tau^2)$ up to the long time $T_\varepsilon= T/\varepsilon$ for any $T>0$ and uniformly for $0<\varepsilon\le1$, where $h$ is the mesh size, $\tau$ is the time step, $m \ge 2$ depends on the regularity of the exact solution, and $C(T) =C_0+C_1T$ grows at most linearly with respect to $T$, with $C_0$ and $C_1$ two positive constants independent of $T$, $\varepsilon$, $h$ and $\tau$. Then, by introducing a new technique of {\sl regularity compensation oscillation} (RCO), in which the high frequency modes are controlled by regularity and the low frequency modes are analyzed by phase cancellation and the energy method, an improved uniform error bound at $O(h^{m-1} + \varepsilon \tau^2)$ is established in $H^1$-norm for the long-time dynamics up to time $O(1/\varepsilon)$ of the Schr\"odinger equation with $O(\varepsilon)$-potential with $m \geq 3$, uniformly for $\varepsilon\in(0,1]$. Moreover, the RCO technique is extended to prove an improved uniform error bound at $O(h^{m-1} + \varepsilon^2\tau^2)$ in $H^1$-norm for the long-time dynamics up to time $O(1/\varepsilon^2)$ of the cubic NLSE with $O(\varepsilon^2)$-nonlinearity strength, uniformly for $\varepsilon \in (0, 1]$. Extensions to the first-order and fourth-order time-splitting methods are discussed.
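A minimal sketch of one second-order (Strang) TSFP step for the cubic NLSE $i\,u_t = -u_{xx}/2 + \varepsilon^2 |u|^2 u$ on a periodic interval; the sign conventions, grid, and parameters are one common choice, shown only to exhibit the scheme's structure and its exact mass conservation (the unitary flow property used in the $L^2$ analysis).

```python
import numpy as np

L, N, tau, eps = 2 * np.pi, 256, 1e-3, 0.1
x = np.linspace(0, L, N, endpoint=False)
k = np.fft.fftfreq(N, d=L / N) * 2 * np.pi     # Fourier wavenumbers

def strang_step(u):
    u = u * np.exp(-1j * eps**2 * np.abs(u)**2 * tau / 2)  # half nonlinear step
    u = np.fft.ifft(np.exp(-1j * k**2 * tau / 2) * np.fft.fft(u))  # kinetic step
    u = u * np.exp(-1j * eps**2 * np.abs(u)**2 * tau / 2)  # half nonlinear step
    return u

u = np.exp(1j * x) / np.sqrt(L)                # a smooth initial state
mass0 = np.sum(np.abs(u)**2) * (L / N)
for _ in range(1000):
    u = strang_step(u)
# Both substeps are unitary, so the discrete mass drifts only at machine precision.
print("mass drift:", abs(np.sum(np.abs(u)**2) * (L / N) - mass0))
```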
We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
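The arithmetic behind the final observation can be made explicit with a sketch. Below, the noise scale is taken as $g = \epsilon N / B$ (a common definition from related work on SGD noise; the paper's width-normalized variant may differ), and the optimal $g$ and maximal stable learning rate are hypothetical numbers: an optimal noise scale growing with width, combined with a bounded learning rate, forces the consistent batch size down.

```python
# Hypothetical numbers, for illustration only.
N = 50_000            # training set size
lr_max = 0.4          # largest stable learning rate (bounded, per the abstract)
for width_mult in (1, 2, 4, 8):
    g_opt = 2.0 * width_mult       # optimal noise scale proportional to width
    B = lr_max * N / g_opt         # largest batch realizing g_opt at lr_max
    print(f"width x{width_mult}: largest consistent batch size ~ {B:,.0f}")
```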
We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning, and pave the way for studying the optimization dynamics of training modern deep neural networks.
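The "small perturbation region" phenomenon can be observed directly in a toy experiment (a one-hidden-layer ReLU net with a fixed output layer and random data; the widths, step size, and scalings below are illustrative, not the paper's regime): as the width grows, gradient descent fits the data while the weights barely move from their Gaussian initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr = 50, 5, 2000, 0.05
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
W = rng.normal(size=(m, d)) / np.sqrt(d)           # Gaussian initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed output layer
W0 = W.copy()

for t in range(200):
    H = X @ W.T                     # pre-activations
    out = np.maximum(H, 0) @ a      # network output
    err = out - y                   # squared-loss residual
    gW = ((err[:, None] * (H > 0)) * a).T @ X / n   # gradient w.r.t. W
    W -= lr * gW
    if t % 50 == 0:
        print(f"step {t:3d}: loss {np.mean(err**2)/2:.4f}, "
              f"||W - W0||_F = {np.linalg.norm(W - W0):.4f}")
```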
Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous-time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, such as the simplex, the time-discretization error can dominate when we are near the boundary of the space. We demonstrate that while current SGMCMC methods for the simplex perform well in certain cases, they struggle with sparse simplex spaces, i.e., when many of the components are close to zero. However, most popular large-scale applications of Bayesian inference on simplex spaces, such as network or topic models, are sparse. We argue that this poor performance is due to the biases of SGMCMC caused by the discretization error. To get around this, we propose the stochastic CIR process, which removes all discretization error, and we prove that samples from the stochastic CIR process are asymptotically unbiased. Use of the stochastic CIR process within an SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.
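The mechanism that removes discretization error is that the CIR diffusion $dx = a(b - x)\,dt + \sigma\sqrt{x}\,dW$ has a known transition law: over any step $h$, the next state is a scaled noncentral chi-squared variable, so it can be sampled exactly rather than Euler-discretized. The sketch below shows this exact simulation in isolation (parameters are illustrative; the embedding into the paper's stochastic-gradient sampler is not shown).

```python
import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(0)
# CIR process dx = a(b - x) dt + sigma * sqrt(x) dW, sampled exactly:
# no Euler step, hence no discretization error near the boundary at zero.
a, b, sigma, h = 2.0, 0.5, 1.0, 0.1
c = sigma**2 * (1 - np.exp(-a * h)) / (4 * a)
df = 4 * a * b / sigma**2

x = 0.01                                 # start near the boundary
xs = [x]
for _ in range(1000):
    nc = x * np.exp(-a * h) / c          # noncentrality given current state
    x = c * ncx2.rvs(df, nc, random_state=rng)
    xs.append(x)
print(f"sample mean {np.mean(xs):.3f} vs stationary mean b = {b}")
```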