We study the generalization properties of the popular stochastic optimization method known as stochastic gradient descent (SGD) for optimizing general non-convex loss functions. Our main contribution is providing upper bounds on the generalization error that depend on local statistics of the stochastic gradients evaluated along the path of iterates calculated by SGD. The key factors our bounds depend on are the variance of the gradients (with respect to the data distribution), the local smoothness of the objective function along the SGD path, and the sensitivity of the loss function to perturbations of the final output. Our key technical tool combines the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.
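As a rough illustration of the kind of local statistic such bounds depend on, the sketch below estimates the per-iterate variance of stochastic gradients along an SGD trajectory. The least-squares objective, data, and variance estimator are illustrative choices, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

def grad(w, idx):
    # Gradient of the squared loss, averaged over a mini-batch.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = rng.normal(size=d)
lr, batch = 0.01, 10
for t in range(100):
    # Empirical variance of per-example gradients at the current iterate
    # (trace of the gradient covariance), a local statistic along the path.
    per_ex = np.stack([grad(w, [i]) for i in range(n)])
    var_t = per_ex.var(axis=0).sum()
    idx = rng.choice(n, size=batch, replace=False)
    w -= lr * grad(w, idx)
    if t % 25 == 0:
        print(f"iter {t:3d}  gradient variance {var_t:.3f}")
```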
We study problems with stochastic uncertainty information on intervals, for which the precise value can be queried by paying a cost. The goal is to devise an adaptive decision tree that finds a correct solution to the problem under consideration while minimizing the expected total query cost. We show that, for the sorting problem, such a decision tree can be found in polynomial time. For the problem of finding the data item of minimum value, we have some evidence for hardness. This contradicts intuition, since the minimum problem is easier both in the online setting with adversarial inputs and in the offline verification setting. However, the stochastic assumption can be leveraged to beat both deterministic and randomized approximation lower bounds for the online setting.
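As a toy illustration of the query model (not the paper's decision-tree algorithm; the interval representation, costs, and query rule here are invented for exposition), the sketch below sorts uncertain items by querying exactly those items whose interval overlaps another's, after which the order is fully determined.

```python
import random

random.seed(1)
# Toy model: each item has an uncertainty interval [lo, hi] and a query cost;
# querying reveals the precise value inside the interval.
items = []
for i in range(6):
    lo = random.uniform(0, 10)
    items.append({"id": i, "lo": lo, "hi": lo + random.uniform(0, 3),
                  "value": None, "cost": random.randint(1, 5)})

def overlaps(a, b):
    return a["lo"] < b["hi"] and b["lo"] < a["hi"]

# Query every item whose interval overlaps another's. Afterwards the
# unqueried intervals are pairwise disjoint from all other intervals,
# so comparing revealed values and interval endpoints settles the order.
total_cost = 0
for a in items:
    if any(overlaps(a, b) for b in items if b is not a):
        a["value"] = random.uniform(a["lo"], a["hi"])
        total_cost += a["cost"]

key = lambda it: it["value"] if it["value"] is not None else it["lo"]
print("sorted ids:", [it["id"] for it in sorted(items, key=key)])
print("total query cost:", total_cost)
```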
In the field of radar target detection, the false alarm and detection probabilities have so far served as the universal indicators of detection performance, as in the Neyman-Pearson detector. In this paper, inspired by Shannon's information theory, a new system model that introduces the target existence state variable $v$ into a general radar system model is established for target detection in the presence of complex white Gaussian noise. The equivalent detection channel and the posterior probability distribution are derived from the a priori statistical characteristics of the noise, the target scattering, and the existence state. Detection performance is measured by the false alarm and detection probabilities and by the detection information, defined as the mutual information between the received signal and the existence state. We prove a false alarm theorem stating that the false alarm probability equals the prior probability of target existence if the observation interval is large enough; this theorem is the basis for comparing the performance of the proposed detector with that of the Neyman-Pearson detector. We then propose the sampling a posteriori probability detector, whose performance is measured by the empirical detection information. We further prove a target detection theorem stating that the detection information is the limit of detection performance: the detection information is achievable, and the empirical detection information of any detector is no greater than the detection information. Simulation results verify the false alarm and target detection theorems, and show that the sampling a posteriori probability detector is asymptotically optimal and outperforms other detectors. In addition, under the detection information criterion, the proposed detector is more favorable for detecting dim targets than other detectors.
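A minimal sketch of a sampling a posteriori probability detector, assuming a single-sample real Gaussian model with a known target amplitude (the signal model, amplitude, and prior below are illustrative, not the paper's setup): compute the posterior probability of target existence and sample the decision from it.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, amp, p1 = 1.0, 1.5, 0.1       # noise std, target amplitude, prior P(v=1)
n_trials = 100_000

v = rng.random(n_trials) < p1        # latent existence state
y = amp * v + sigma * rng.normal(size=n_trials)

# Posterior P(v=1 | y) under the two Gaussian likelihoods.
log_ratio = (amp * y - amp**2 / 2) / sigma**2 + np.log(p1 / (1 - p1))
posterior = 1.0 / (1.0 + np.exp(-log_ratio))

# Sampling detector: declare a target with probability equal to the posterior.
decision = rng.random(n_trials) < posterior

pfa = decision[~v].mean()            # empirical false alarm probability
pd = decision[v].mean()              # empirical detection probability
print(f"P_FA = {pfa:.3f} (prior P(v=1) = {p1}),  P_D = {pd:.3f}")
```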
In this paper, we characterize the noise of stochastic gradients and analyze the noise-induced dynamics that arise when deep neural networks are trained with gradient-based optimizers. Specifically, we first show that the stochastic gradient noise possesses finite variance, and therefore the classical Central Limit Theorem (CLT) applies; this indicates that the gradient noise is asymptotically Gaussian. This asymptotic result validates the widely accepted assumption of Gaussian noise. We clarify that the recently observed heavy tails in gradient noise may not be an intrinsic property, but rather a consequence of insufficient mini-batch size: the gradient noise, being a sum of a limited number of i.i.d. random variables, has not reached the asymptotic regime of the CLT and thus deviates from Gaussian. We quantitatively measure the goodness of the Gaussian approximation to the noise, which supports our conclusion. Second, we analyze the noise-induced dynamics of stochastic gradient descent using the Langevin equation, giving the momentum hyperparameter in the optimizer a physical interpretation. We then demonstrate the existence of a steady-state distribution of stochastic gradient descent and approximate this distribution at small learning rates.
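The batch-size effect can be probed with a simple experiment: draw many mini-batch gradients at a fixed parameter point and test one coordinate for normality as the batch size grows. The toy logistic model, heavy-tailed inputs, and choice of normality test below are illustrative assumptions, not the paper's protocol.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d = 5000, 20
X = rng.standard_t(df=3, size=(n, d))        # heavy-tailed inputs on purpose
w = rng.normal(size=d)
y = (X @ w + rng.normal(size=n) > 0).astype(float)

def minibatch_grad(batch):
    idx = rng.choice(n, size=batch, replace=False)
    p = 1 / (1 + np.exp(-X[idx] @ w))
    return X[idx].T @ (p - y[idx]) / batch   # logistic-loss gradient

for batch in (4, 32, 256):
    # Test one gradient coordinate for normality over many mini-batch draws;
    # larger batches should look more Gaussian, as the CLT predicts.
    samples = np.array([minibatch_grad(batch)[0] for _ in range(2000)])
    stat, pval = stats.normaltest(samples)
    print(f"batch {batch:4d}: normality test p-value = {pval:.3f}")
```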
This note examines the generalization behavior - as measured by the out-of-sample mean squared error (MSE) - of Linear Gaussian (with a fixed design matrix) and Linear Least Squares regression. In particular, we consider a well-specified model setting, i.e., we assume that there exists a `true' combination of model parameters within the chosen model form. While the statistical properties of Least Squares regression have been studied extensively over the past few decades - typically under {\bf less restrictive problem statements} than the present work - this note targets bounds that are {\bf non-asymptotic and more quantitative} than those in the literature. Further, the analytical formulae for the distributions and the bounds (on the MSE) are directly compared with numerical experiments. Derivations are presented in a self-contained and pedagogical manner, so that a reader with basic knowledge of probability and statistics can follow them.
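A quick numerical check of the kind the note advocates: simulate a well-specified linear model, fit least squares, and compare the empirical out-of-sample MSE against an analytic formula. The comparison below uses the classical random-design result $\sigma^2(1 + d/(n-d-1))$ for i.i.d. Gaussian design; the note's own fixed-design setting and bounds differ, so this is only a point of reference.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 10, 1.0
trials, mse = 2000, []

for _ in range(trials):
    # Well-specified model: y = X w* + noise, Gaussian design and noise.
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=d)
    y = X @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

    X_test = rng.normal(size=(n, d))
    y_test = X_test @ w_star + sigma * rng.normal(size=n)
    mse.append(np.mean((X_test @ w_hat - y_test) ** 2))

# Classical result for i.i.d. Gaussian design: E[MSE] = sigma^2 (1 + d/(n-d-1)).
print(f"empirical: {np.mean(mse):.4f}   analytic: {sigma**2 * (1 + d/(n-d-1)):.4f}")
```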
In this paper, we propose a unified convergence analysis for a class of generic shuffling-type gradient methods for solving finite-sum optimization problems. Our analysis works with any sampling-without-replacement strategy and covers many known variants such as randomized reshuffling, deterministic or randomized single permutation, and cyclic and incremental gradient schemes. We focus on two different settings: strongly convex and nonconvex problems, but also discuss the non-strongly convex case. Our main contribution consists of new non-asymptotic and asymptotic convergence rates for a wide class of shuffling-type gradient methods in both nonconvex and convex settings. We also study uniformly randomized shuffling variants under different learning rates and model assumptions. While our rate in the nonconvex case is new and significantly improves over existing works under standard assumptions, the rate in the strongly convex case matches the best known rates prior to this paper, up to a constant factor, without imposing a bounded gradient condition. Finally, we empirically illustrate our theoretical results on two numerical examples: nonconvex logistic regression and neural network training. As byproducts, our results suggest appropriate choices of diminishing learning rates for certain shuffling variants.
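A minimal sketch of the randomized-reshuffling variant on a least-squares problem (the problem instance, step-size schedule, and constants are illustrative choices, not the paper's tuned rates): each epoch draws a fresh permutation and makes one pass over the shuffled data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for epoch in range(50):
    lr = 0.1 / (epoch + 1)          # diminishing learning rate
    perm = rng.permutation(n)       # sampling without replacement each epoch
    for i in perm:                  # one pass over the shuffled data
        w -= lr * (X[i] @ w - y[i]) * X[i]
print("distance to w*:", np.linalg.norm(w - w_star))
```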
Five new algorithms are proposed to optimize the conditioning of structural matrices. Along with reducing the size and duration of analyses, minimizing analytical errors is a critical factor in the optimal computer analysis of skeletal structures. Matrices that are sparse (with a greater number of zeros), well structured, and well conditioned are advantageous for this purpose. As a result, an optimization problem with several objectives is addressed. This study seeks to minimize analytical errors, such as rounding errors, in skeletal structural flexibility matrices through the use of more consistent and appropriate mathematical methods. These errors become more pronounced in particular designs with ill-conditioned flexibility matrices; structures with widely varying stiffness are a frequent example. Owing to the presence of weak elements, the flexibility matrix acquires a large number of non-diagonal terms, resulting in analytical errors. In numerical analysis, the ill-conditioning of a matrix may be resolved by moving or substituting rows; this study examines the definition and execution of such modifications prior to forming the flexibility matrix. Simple topological and algebraic features are mostly utilized in this study to find fundamental cycle bases with particular characteristics. In conclusion, appropriately conditioned flexibility matrices are obtained, and analytical errors are reduced accordingly.
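To illustrate why row/column operations matter for conditioning (this is a generic numerical-analysis sketch using Jacobi diagonal equilibration, not any of the paper's five algorithms): a matrix whose "stiffness" scales span many orders of magnitude can have its condition number reduced dramatically by a simple symmetric scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
# Build an ill-conditioned "flexibility-like" matrix: a well-conditioned
# symmetric core B, scaled by member stiffnesses spanning six orders of
# magnitude (purely illustrative).
B = np.eye(6) + 0.1 * rng.normal(size=(6, 6))
B = (B + B.T) / 2
s = np.sqrt(10.0 ** rng.uniform(-3, 3, size=6))
A = B * np.outer(s, s)

# Jacobi (diagonal) equilibration: a classical row/column scaling that can
# dramatically improve conditioning before factorization.
dinv = 1.0 / np.sqrt(np.abs(np.diag(A)))
A_eq = A * np.outer(dinv, dinv)

print(f"cond(A)            = {np.linalg.cond(A):.3e}")
print(f"cond(equilibrated) = {np.linalg.cond(A_eq):.3e}")
```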
We establish improved uniform error bounds for the time-splitting methods for the long-time dynamics of the Schr\"odinger equation with small potential and the nonlinear Schr\"odinger equation (NLSE) with weak nonlinearity. For the Schr\"odinger equation with small potential characterized by a dimensionless parameter $\varepsilon \in (0, 1]$ representing the amplitude of the potential, we employ the unitary flow property of the (second-order) time-splitting Fourier pseudospectral (TSFP) method in $L^2$-norm to prove a uniform error bound at $C(T)(h^m +\tau^2)$ up to the long time $T_\varepsilon= T/\varepsilon$ for any $T>0$ and uniformly for $0<\varepsilon\le1$, where $h$ is the mesh size, $\tau$ is the time step, $m \ge 2$ depends on the regularity of the exact solution, and $C(T) =C_0+C_1T$ grows at most linearly with respect to $T$, with $C_0$ and $C_1$ two positive constants independent of $T$, $\varepsilon$, $h$ and $\tau$. Then, by introducing a new technique of {\sl regularity compensation oscillation} (RCO), in which the high frequency modes are controlled by regularity and the low frequency modes are analyzed by phase cancellation and the energy method, an improved uniform error bound at $O(h^{m-1} + \varepsilon \tau^2)$ is established in $H^1$-norm for the long-time dynamics up to time $O(1/\varepsilon)$ of the Schr\"odinger equation with $O(\varepsilon)$-potential with $m \geq 3$, uniformly for $\varepsilon\in(0,1]$. Moreover, the RCO technique is extended to prove an improved uniform error bound at $O(h^{m-1} + \varepsilon^2\tau^2)$ in $H^1$-norm for the long-time dynamics up to time $O(1/\varepsilon^2)$ of the cubic NLSE with $O(\varepsilon^2)$-nonlinearity strength, uniformly for $\varepsilon \in (0, 1]$. Extensions to the first-order and fourth-order time-splitting methods are discussed.
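A minimal sketch of one second-order (Strang) TSFP step for the cubic NLSE $i\,u_t = -u_{xx}/2 + \varepsilon^2 |u|^2 u$ on a periodic interval; the sign conventions, grid, and parameters are one common choice, shown only to exhibit the scheme's structure and its exact mass conservation (the unitary flow property used in the $L^2$ analysis).

```python
import numpy as np

L, N, tau, eps = 2 * np.pi, 256, 1e-3, 0.1
x = np.linspace(0, L, N, endpoint=False)
k = np.fft.fftfreq(N, d=L / N) * 2 * np.pi     # Fourier wavenumbers

def strang_step(u):
    u = u * np.exp(-1j * eps**2 * np.abs(u)**2 * tau / 2)  # half nonlinear step
    u = np.fft.ifft(np.exp(-1j * k**2 * tau / 2) * np.fft.fft(u))  # kinetic step
    u = u * np.exp(-1j * eps**2 * np.abs(u)**2 * tau / 2)  # half nonlinear step
    return u

u = np.exp(1j * x) / np.sqrt(L)                # a smooth initial state
mass0 = np.sum(np.abs(u)**2) * (L / N)
for _ in range(1000):
    u = strang_step(u)
# Both substeps are unitary, so the discrete mass drifts only at machine precision.
print("mass drift:", abs(np.sum(np.abs(u)**2) * (L / N) - mass0))
```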
We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
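The arithmetic behind the final observation can be made explicit with a sketch. Below, the noise scale is taken as $g = \epsilon N / B$ (a common definition from related work on SGD noise; the paper's width-normalized variant may differ), and the optimal $g$ and maximal stable learning rate are hypothetical numbers: an optimal noise scale growing with width, combined with a bounded learning rate, forces the consistent batch size down.

```python
# Hypothetical numbers, for illustration only.
N = 50_000            # training set size
lr_max = 0.4          # largest stable learning rate (bounded, per the abstract)
for width_mult in (1, 2, 4, 8):
    g_opt = 2.0 * width_mult       # optimal noise scale proportional to width
    B = lr_max * N / g_opt         # largest batch realizing g_opt at lr_max
    print(f"width x{width_mult}: largest consistent batch size ~ {B:,.0f}")
```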
We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning, and pave the way for studying the optimization dynamics of training modern deep neural networks.
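The "small perturbation region" phenomenon can be observed directly in a toy experiment (a one-hidden-layer ReLU net with a fixed output layer and random data; the widths, step size, and scalings below are illustrative, not the paper's regime): as the width grows, gradient descent fits the data while the weights barely move from their Gaussian initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr = 50, 5, 2000, 0.05
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
W = rng.normal(size=(m, d)) / np.sqrt(d)           # Gaussian initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed output layer
W0 = W.copy()

for t in range(200):
    H = X @ W.T                     # pre-activations
    out = np.maximum(H, 0) @ a      # network output
    err = out - y                   # squared-loss residual
    gW = ((err[:, None] * (H > 0)) * a).T @ X / n   # gradient w.r.t. W
    W -= lr * gW
    if t % 50 == 0:
        print(f"step {t:3d}: loss {np.mean(err**2)/2:.4f}, "
              f"||W - W0||_F = {np.linalg.norm(W - W0):.4f}")
```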
Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous-time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, such as the simplex, the time-discretization error can dominate when we are near the boundary of the space. We demonstrate that while current SGMCMC methods for the simplex perform well in certain cases, they struggle with sparse simplex spaces, i.e., when many of the components are close to zero. However, most popular large-scale applications of Bayesian inference on simplex spaces, such as network or topic models, are sparse. We argue that this poor performance is due to the biases of SGMCMC caused by the discretization error. To get around this, we propose the stochastic CIR process, which removes all discretization error, and we prove that samples from the stochastic CIR process are asymptotically unbiased. Use of the stochastic CIR process within an SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.
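The mechanism that removes discretization error is that the CIR diffusion $dx = a(b - x)\,dt + \sigma\sqrt{x}\,dW$ has a known transition law: over any step $h$, the next state is a scaled noncentral chi-squared variable, so it can be sampled exactly rather than Euler-discretized. The sketch below shows this exact simulation in isolation (parameters are illustrative; the embedding into the paper's stochastic-gradient sampler is not shown).

```python
import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(0)
# CIR process dx = a(b - x) dt + sigma * sqrt(x) dW, sampled exactly:
# no Euler step, hence no discretization error near the boundary at zero.
a, b, sigma, h = 2.0, 0.5, 1.0, 0.1
c = sigma**2 * (1 - np.exp(-a * h)) / (4 * a)
df = 4 * a * b / sigma**2

x = 0.01                                 # start near the boundary
xs = [x]
for _ in range(1000):
    nc = x * np.exp(-a * h) / c          # noncentrality given current state
    x = c * ncx2.rvs(df, nc, random_state=rng)
    xs.append(x)
print(f"sample mean {np.mean(xs):.3f} vs stationary mean b = {b}")
```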