亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Stochastic approximation (SA) and stochastic gradient descent (SGD) algorithms are work-horses for modern machine learning algorithms. Their constant stepsize variants are preferred in practice due to fast convergence behavior. However, constant step stochastic iterative algorithms do not converge asymptotically to the optimal solution, but instead have a stationary distribution, which in general cannot be analytically characterized. In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution, in the limit when the constant stepsize goes to zero. Specifically, we consider the following three settings: (1) SGD algorithms with smooth and strongly convex objective, (2) linear SA algorithms involving a Hurwitz matrix, and (3) nonlinear SA algorithms involving a contractive operator. When the iterate is scaled by $1/\sqrt{\alpha}$, where $\alpha$ is the constant stepsize, we show that the limiting scaled stationary distribution is a solution of an integral equation. Under a uniqueness assumption (which can be removed in certain settings) on this equation, we further characterize the limiting distribution as a Gaussian distribution whose covariance matrix is the unique solution of a suitable Lyapunov equation. For SA algorithms beyond these cases, our numerical experiments suggest that unlike central limit theorem type results: (1) the scaling factor need not be $1/\sqrt{\alpha}$, and (2) the limiting distribution need not be Gaussian. Based on the numerical study, we come up with a formula to determine the right scaling factor, and make insightful connection to the Euler-Maruyama discretization scheme for approximating stochastic differential equations.

相關內容

Motivated by the learned iterative soft thresholding algorithm (LISTA), we introduce a general class of neural networks suitable for sparse reconstruction from few linear measurements. By allowing a wide range of degrees of weight-sharing between the layers, we enable a unified analysis for very different neural network types, ranging from recurrent ones to networks more similar to standard feedforward neural networks. Based on training samples, via empirical risk minimization we aim at learning the optimal network parameters and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks, that also take into account the thresholding parameters. We obtain estimates of the sample complexity that essentially depend only linearly on the number of parameters and on the depth. We apply our main result to obtain specific generalization bounds for several practical examples, including different algorithms for (implicit) dictionary learning, and convolutional neural networks.

We consider the deformation of a geological structure with non-intersecting faults that can be represented by a layered system of viscoelastic bodies satisfying rate- and state-depending friction conditions along the common interfaces. We derive a mathematical model that contains classical Dieterich- and Ruina-type friction as special cases and accounts for possibly large tangential displacements. Semi-discretization in time by a Newmark scheme leads to a coupled system of non-smooth, convex minimization problems for rate and state to be solved in each time step. Additional spatial discretization by a mortar method and piecewise constant finite elements allows for the decoupling of rate and state by a fixed point iteration and efficient algebraic solution of the rate problem by truncated non-smooth Newton methods. Numerical experiments with a spring slider and a layered multiscale system illustrate the behavior of our model as well as the efficiency and reliability of the numerical solver.

Partial Differential Equations (PDEs) describe several problems relevant to many fields of applied sciences, and their discrete counterparts typically involve the solution of sparse linear systems. In this context, we focus on the analysis of the computational aspects related to the solution of large and sparse linear systems with HPC solvers, by considering the performances of direct and iterative solvers in terms of computational efficiency, scalability, and numerical accuracy. Our aim is to identify the main criteria to support application-domain specialists in the selection of the most suitable solvers, according to the application requirements and available resources. To this end, we discuss how the numerical solver is affected by the regular/irregular discretisation of the input domain, the discretisation of the input PDE with piecewise linear or polynomial basis functions, which generally result in a higher/lower sparsity of the coefficient matrix, and the choice of different initial conditions, which are associated with linear systems with multiple right-hand side terms. Finally, our analysis is independent of the characteristics of the underlying computational architectures, and provides a methodological approach that can be applied to different classes of PDEs or with approximation problems.

We develop an essentially optimal finite element approach for solving ergodic stochastic two-scale elliptic equations whose two-scale coefficient may depend also on the slow variable. We solve the limiting stochastic two-scale homogenized equation obtained from the stochastic two-scale convergence in the mean (A. Bourgeat, A. Mikelic and S. Wright, J. reine angew. Math, Vol. 456, 1994), whose solution comprises of the solution to the homogenized equation and the corrector, by truncating the infinite domain of the fast variable and using the sparse tensor product finite elements. We show that the convergence rate in terms of the truncation level is equivalent to that for solving the cell problems in the same truncated domain. Solving this equation, we obtain the solution to the homogenized equation and the corrector at the same time, using only a number of degrees of freedom that is essentially equivalent to that required for solving one cell problem. Optimal complexity is obtained when the corrector possesses sufficient regularity with respect to both the fast and the slow variables. Although the regularity norm of the corrector depends on the size of the truncated domain, we show that the convergence rate of the approximation for the solution to the homogenized equation is independent of the size of the truncated domain. With the availability of an analytic corrector, we construct a numerical corrector for the solution of the original stochastic two-scale equation from the finite element solution to the truncated stochastic two-scale homogenized equation. Numerical examples of quasi-periodic two-scale equations, and a stochastic two-scale equation of the checker board type, whose coefficient is discontinuous, confirm the theoretical results.

We establish how the coefficients of a sparse polynomial system influence the sum (or the trace) of its zeros. As an application, we develop numerical tests for verifying whether a set of solutions to a sparse system is complete. These algorithms extend the classical trace test in numerical algebraic geometry. Our results rely on both the analysis of the structure of sparse resultants as well as an extension of Esterov's results on monodromy groups of sparse systems.

This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.

Influence maximization is the task of selecting a small number of seed nodes in a social network to maximize the spread of the influence from these seeds, and it has been widely investigated in the past two decades. In the canonical setting, the whole social network as well as its diffusion parameters is given as input. In this paper, we consider the more realistic sampling setting where the network is unknown and we only have a set of passively observed cascades that record the set of activated nodes at each diffusion step. We study the task of influence maximization from these cascade samples (IMS), and present constant approximation algorithms for this task under mild conditions on the seed set distribution. To achieve the optimization goal, we also provide a novel solution to the network inference problem, that is, learning diffusion parameters and the network structure from the cascade data. Comparing with prior solutions, our network inference algorithm requires weaker assumptions and does not rely on maximum-likelihood estimation and convex programming. Our IMS algorithms enhance the learning-and-then-optimization approach by allowing a constant approximation ratio even when the diffusion parameters are hard to learn, and we do not need any assumption related to the network structure or diffusion parameters.

Optimal transport distances have found many applications in machine learning for their capacity to compare non-parametric probability distributions. Yet their algorithmic complexity generally prevents their direct use on large scale datasets. Among the possible strategies to alleviate this issue, practitioners can rely on computing estimates of these distances over subsets of data, {\em i.e.} minibatches. While computationally appealing, we highlight in this paper some limits of this strategy, arguing it can lead to undesirable smoothing effects. As an alternative, we suggest that the same minibatch strategy coupled with unbalanced optimal transport can yield more robust behavior. We discuss the associated theoretical properties, such as unbiased estimators, existence of gradients and concentration bounds. Our experimental study shows that in challenging problems associated to domain adaptation, the use of unbalanced optimal transport leads to significantly better results, competing with or surpassing recent baselines.

We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interactions with the environments, making RL truly practical in many real-world applications. This problem is still not fully understood, for which two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors of out-of-distribution state-actions which leads to divergence of value functions. Second, meta-RL requires efficient and robust task inference learned jointly with control policy. In this work, we enforce behavior regularization on learned policy as a general approach to offline RL, combined with a deterministic context encoder for efficient task inference. We propose a novel negative-power distance metric on bounded context embedding space, whose gradients propagation is detached from the Bellman backup. We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches involving meta-RL and distance metric learning. To the best of our knowledge, our method is the first model-free and end-to-end OMRL algorithm, which is computationally efficient and demonstrated to outperform prior algorithms on several meta-RL benchmarks.

When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.

北京阿比特科技有限公司