
Finding the optimal configuration of parameters in ResNet is a nonconvex minimization problem, but first-order methods nevertheless find the global optimum in the overparameterized regime. We study this phenomenon with mean-field analysis, by translating the training process of ResNet to a gradient-flow partial differential equation (PDE) and examining the convergence properties of this limiting process. The activation function is assumed to be $2$-homogeneous or partially $1$-homogeneous; the regularized ReLU satisfies the latter condition. We show that if the ResNet is sufficiently large, with depth and width depending algebraically on the accuracy and confidence levels, first-order optimization methods can find global minimizers that fit the training data.
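
For concreteness (an illustration, not the paper's code): the squared ReLU $\sigma(x) = \max(x, 0)^2$ is a standard example of a $2$-homogeneous activation, and the sketch below numerically checks the defining property $\sigma(cx) = c^2\,\sigma(x)$ for $c > 0$.

```python
# Minimal sketch of the 2-homogeneity condition on the activation.
# The squared ReLU is an illustrative choice, not taken from the paper.
import numpy as np

def squared_relu(x):
    """2-homogeneous activation: sigma(c*x) = c^2 * sigma(x) for c > 0."""
    return np.maximum(x, 0.0) ** 2

x = np.random.randn(5)
c = 3.0
assert np.allclose(squared_relu(c * x), c**2 * squared_relu(x))
```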

Related content

We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\kappa$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $\tilde{O} \left(\exp \left( \frac{-T}{\kappa} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowing $\sigma^2$. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer, but its rate slows down in proportion to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $\tilde{O} \left(\exp \left( \frac{-T}{\sqrt{\kappa}} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowledge of $\sigma^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. We empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
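
A minimal sketch of SGD with exponentially decreasing step-sizes on a least-squares problem, assuming knowledge of the smoothness constant as in the first result above. The decay $\alpha = (1/T)^{1/T}$ (so that $\eta_T \approx \eta_0 / T$) is a common instantiation of such schedules; the remaining constants are illustrative, not the paper's.

```python
# SGD with exponentially decreasing step-sizes eta_k = eta_0 * alpha^k
# on a toy least-squares problem; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 200, 10, 1000
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

L = np.linalg.eigvalsh(A.T @ A / n).max()   # smoothness constant
eta0 = 1.0 / L
alpha = (1.0 / T) ** (1.0 / T)              # decay so that eta_T ~ eta0 / T

w = np.zeros(d)
for k in range(T):
    i = rng.integers(n)                     # sample one data point
    grad = (A[i] @ w - b[i]) * A[i]         # stochastic gradient
    w -= eta0 * alpha**k * grad
```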

We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report three properties that appear to be new in the policy gradient literature: (1) HPMD exhibits global linear convergence of the value optimality gap, and local superlinear convergence of the policy to the set of optimal policies with order $\gamma^{-2}$. The superlinear convergence of the policy takes effect after no more than $\mathcal{O}(\log(1/\Delta^*))$ iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function; (2) HPMD also exhibits last-iterate convergence of the policy, with the limiting policy corresponding exactly to the optimal policy with maximal entropy for every state. No regularization is added to the optimization objective, so this second observation arises solely as an algorithmic property of the homotopic policy gradient method; (3) for the stochastic HPMD method, we further demonstrate a better than $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| / \epsilon^2)$ sample complexity for small optimality gap $\epsilon$, assuming a generative model for policy evaluation.
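
For reference, the mirror-descent core of such methods is the multiplicative-weights update $\pi_{k+1}(\cdot|s) \propto \pi_k(\cdot|s)\,\exp(\eta_k Q^{\pi_k}(s,\cdot))$. The sketch below implements only this step on a tabular MDP; the homotopy schedule on $\eta_k$ that defines HPMD is omitted.

```python
# One exact policy mirror descent step on a tabular MDP:
# pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(eta * Q(s, a)).
# This is the mirror-descent core only, not the full HPMD method.
import numpy as np

def pmd_step(pi, Q, eta):
    """pi: (S, A) policy with positive entries, Q: (S, A) values."""
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```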

The arboricity of a graph is the minimum number of forests required to cover all its edges. In this paper, we examine arboricity from a game-theoretic perspective and investigate cost-sharing in the minimum forest cover problem. We introduce the arboricity game as a cooperative cost game defined on a graph. The players are edges, and the cost of each coalition is the arboricity of the subgraph induced by the coalition. We study properties of the core and propose an efficient algorithm for computing the nucleolus when the core is not empty. To compute the nucleolus in the core, we introduce the prime partition, which is built on the densest subgraph lattice. The prime partition decomposes the edge set of a graph into a partially ordered set defined from minimal densest minors and their invariant precedence relation. Moreover, edges in the same part of the prime partition always receive the same value in any core allocation. Consequently, when the core is not empty, the prime partition significantly reduces the number of variables and constraints required in the linear programs of Maschler's scheme and allows us to compute the nucleolus in polynomial time. In addition, the prime partition provides a graph decomposition analogous to the celebrated core decomposition and the density-friendly decomposition, which may be of independent interest.
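
For intuition, by the Nash-Williams formula the arboricity equals $\max_H \lceil |E(H)| / (|V(H)| - 1) \rceil$ over subgraphs $H$ with at least two vertices. The brute-force sketch below evaluates this on tiny graphs; it is exponential in $|V|$ and purely illustrative, unlike the polynomial-time machinery developed in the paper.

```python
# Brute-force Nash-Williams arboricity for tiny graphs; illustrative only.
from itertools import combinations
from math import ceil

def arboricity(vertices, edges):
    best = 1
    for r in range(2, len(vertices) + 1):
        for S in combinations(vertices, r):
            s = set(S)
            m = sum(1 for u, v in edges if u in s and v in s)
            best = max(best, ceil(m / (r - 1)))
    return best

# K4 has 6 edges on 4 vertices, so its arboricity is ceil(6/3) = 2.
print(arboricity(range(4), [(i, j) for i in range(4) for j in range(i + 1, 4)]))
```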

In this paper we provide a rigorous convergence analysis for the renowned Particle Swarm Optimization method using tools from stochastic calculus and the analysis of partial differential equations. Based on a time-continuous formulation of the particle dynamics as a system of stochastic differential equations, we establish the convergence to a global minimizer in two steps. First, we prove the consensus formation of the dynamics by analyzing the time-evolution of the variance of the particle distribution. Subsequently, we show that this consensus is close to a global minimizer by employing the asymptotic Laplace principle and a tractability condition on the energy landscape of the objective function. Our results allow for the use of memory mechanisms, and hold for a rich class of objectives provided certain conditions of well-preparation of the hyperparameters and the initial datum are satisfied. To demonstrate the applicability of the method we propose an efficient and parallelizable implementation, which is tested in particular on a competitive and well-understood high-dimensional benchmark problem in machine learning.
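
For reference, a compact implementation of vanilla PSO (without the memory mechanisms mentioned above); the inertia and acceleration coefficients are common textbook defaults, not the well-preparation conditions derived in the paper.

```python
# Vanilla PSO sketch on a toy objective; hyperparameters are illustrative.
import numpy as np

def pso(f, dim=10, n_particles=50, iters=500, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), np.apply_along_axis(f, 1, x)
    g = pbest[pbest_val.argmin()].copy()       # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.apply_along_axis(f, 1, x)
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()
    return g

print(pso(lambda z: np.sum(z**2)))             # global minimizer: the origin
```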

We study the problem of group testing with non-identical, independent priors. So far, the pooling strategies that have been proposed in the literature take the following approach: a hand-crafted test design along with a decoding strategy is proposed, and guarantees are provided on how many tests are sufficient in order to identify all infections in a population. In this paper, we take a different, yet perhaps more practical, approach: we fix the decoder and the number of tests, and we ask, given these, what is the best test design one could use? We explore this question for the Definite Non-Defectives (DND) decoder. We formulate a (non-convex) optimization problem, where the objective function is the expected number of errors for a particular design. We find approximate solutions via gradient descent, which we further optimize with informed initialization. We illustrate through simulations that our method can achieve significant performance improvement over traditional approaches.
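
For context, the DND decoder itself is simple: any item that appears in at least one negative test is definitely non-defective, and every remaining item is declared defective. A sketch with a random Bernoulli design standing in for the optimized designs studied in the paper:

```python
# Definite Non-Defectives (DND) decoding with a random Bernoulli design;
# the design optimization from the paper is not shown here.
import numpy as np

def dnd_decode(design, outcomes):
    """design: (tests, items) 0/1 matrix; outcomes: (tests,) test results."""
    cleared = design[outcomes == 0].sum(axis=0) > 0   # in some negative test
    return ~cleared                                   # declared defective

rng = np.random.default_rng(1)
n, t = 20, 12
x = rng.random(n) < 0.1                    # true infection status
X = rng.random((t, n)) < 0.2               # random Bernoulli test design
y = (X.astype(int) @ x.astype(int)) > 0    # noiseless OR-channel outcomes
print((dnd_decode(X, y) & ~x).sum(), "false positives")
```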

We study the overparametrization bounds required for the global convergence of the stochastic gradient descent algorithm for a class of one hidden layer feed-forward neural networks, considering most of the activation functions used in practice, including ReLU. We improve the existing state-of-the-art results in terms of the required hidden layer width. We introduce a new proof technique combining nonlinear analysis with properties of random initializations of the network. First, we establish the global convergence of continuous solutions of the differential inclusion being a nonsmooth analogue of the gradient flow for the MSE loss. Second, we provide a technical result (working also for general approximators) relating solutions of the aforementioned differential inclusion to the (discrete) stochastic gradient descent sequences, hence establishing linear convergence towards zero loss for the stochastic gradient descent iterations.
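
A minimal sketch of the setting: SGD on the MSE loss of a one-hidden-layer ReLU network with a randomly initialized first layer. The width $m$, scaling, and step size below are illustrative choices; the paper's overparametrization bounds specify how large $m$ must be for the guarantees to hold.

```python
# SGD on the MSE loss of a one-hidden-layer ReLU network; sizes illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr = 100, 5, 512, 0.1
X = rng.standard_normal((n, d))
y = np.sin(X.sum(axis=1))

W = rng.standard_normal((m, d)) / np.sqrt(d)   # random first-layer init
a = rng.choice([-1.0, 1.0], m) / np.sqrt(m)    # fixed second layer

for step in range(5000):
    i = rng.integers(n)
    h = np.maximum(W @ X[i], 0.0)              # hidden activations
    err = a @ h - y[i]
    # (sub)gradient of 0.5 * err^2 w.r.t. W; ReLU is nonsmooth at 0,
    # matching the differential-inclusion viewpoint above
    W -= lr * err * np.outer(a * (h > 0), X[i])
```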

This study introduces a novel computational framework for Robust Topology Optimization (RTO) considering imprecise random field parameters. Unlike the worst-case approach, the present method provides upper and lower bounds for the mean and standard deviation of compliance as well as the optimized topological layouts of a structure for various scenarios. In the proposed approach, the imprecise random field variables are determined utilizing parameterized p-boxes with different confidence intervals. The Karhunen-Lo\`eve (K-L) expansion is extended to provide a spectral description of the imprecise random field. The linear superposition method in conjunction with a linear combination of orthogonal functions is employed to obtain explicit mathematical expressions for the first and second order statistical moments of the structural compliance. Then, an interval sensitivity analysis is carried out, applying the Orthogonal Similarity Transformation (OST) method, with the bounds of each intermediate variable searched efficiently at every iteration using a Combinatorial Approach (CA). Finally, the validity, accuracy, and applicability of the work are rigorously checked by comparing the outputs of the proposed approach with those obtained using particle swarm optimization (PSO) and Quasi-Monte-Carlo Simulation (QMCS). Three different numerical examples with imprecise random field loads are presented to show the effectiveness and feasibility of the study.
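
As background for the K-L step, here is a sketch of a truncated Karhunen-Lo\`eve expansion of a 1-D random field with an exponential covariance kernel on a uniform grid; the fixed kernel parameters stand in for the imprecise (p-box) parameters treated in the paper.

```python
# Truncated K-L expansion of a 1-D random field, discretized on a grid;
# kernel parameters are fixed here, unlike the paper's imprecise setting.
import numpy as np

rng = np.random.default_rng(0)
n, corr_len, sigma, k = 200, 0.3, 1.0, 10
t = np.linspace(0.0, 1.0, n)
C = sigma**2 * np.exp(-np.abs(t[:, None] - t[None, :]) / corr_len)

vals, vecs = np.linalg.eigh(C / n)        # discretized covariance operator
idx = np.argsort(vals)[::-1][:k]          # keep the k largest K-L modes
lam, phi = vals[idx], vecs[:, idx]

xi = rng.standard_normal(k)               # independent standard normals
field = phi @ (np.sqrt(lam) * xi)         # one realization of the field
```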

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
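
For orientation, a common definition of the (un-normalized) noise scale in this line of work is $g = \eta N / B$ (learning rate times training-set size over batch size). The helper below uses that form as an assumption; the paper's normalized variant additionally depends on the initialization conditions and parameterization, which this sketch omits.

```python
# Noise scale g = lr * N / B, a common definition in this line of work;
# the paper's normalized variant also folds in the initialization scale.
def noise_scale(lr: float, train_set_size: int, batch_size: int) -> float:
    return lr * train_set_size / batch_size

# Doubling the batch size while doubling the learning rate keeps g fixed,
# the kind of trade-off the hyper-parameter search above sweeps over.
g1 = noise_scale(0.1, 50_000, 128)
g2 = noise_scale(0.2, 50_000, 256)
assert abs(g1 - g2) < 1e-9
```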

We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning, and pave the way to study the optimization dynamics of training modern deep neural networks.
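
The perturbation-region argument is easy to probe empirically: the sketch below runs full-batch gradient descent on a two-layer ReLU classifier with the logistic loss and tracks the relative distance of the weights from their Gaussian initialization. The model size and step size are illustrative, and the paper of course treats deep networks rather than this two-layer stand-in.

```python
# Track how far gradient descent moves the weights from Gaussian init;
# two-layer ReLU classifier with logistic loss, sizes illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 1024
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], n)

W0 = rng.standard_normal((m, d))               # Gaussian initialization
a = rng.choice([-1.0, 1.0], m) / np.sqrt(m)    # fixed second layer
W = W0.copy()

for step in range(200):
    H = np.maximum(X @ W.T, 0.0)               # (n, m) hidden layer
    p = 1.0 / (1.0 + np.exp(y * (H @ a)))      # logistic-loss residuals
    G = ((H > 0) * a * (-y * p)[:, None]).T @ X / n
    W -= 1.0 * G

print(np.linalg.norm(W - W0) / np.linalg.norm(W0))  # small for large width m
```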

Existing multi-agent reinforcement learning methods are typically limited to a small number of agents. As the number of agents grows large, learning becomes intractable due to the curse of dimensionality and the exponential growth of agent interactions. In this paper, we present Mean Field Reinforcement Learning, where the interactions within the population of agents are approximated by those between a single agent and the average effect of the overall population or neighboring agents; the interplay between the two entities is mutually reinforcing: the learning of the individual agent's optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution to a Nash equilibrium. Experiments on Gaussian squeeze, the Ising model, and battle games justify the learning effectiveness of our mean field approaches. In addition, we report the first result of solving the Ising model via model-free reinforcement learning methods.
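
A sketch of the mean-field Q-learning update under the stated approximation: each agent's Q-function conditions on the (discretized) mean action $\bar{a}$ of its neighbors, and the next-state value is taken under a Boltzmann policy. The indexing scheme, the discretization of $\bar{a}$, and all constants below are illustrative assumptions, not the paper's exact formulation.

```python
# One TD update of a mean-field Q-function Q[state, action, mean_action_bin];
# shapes, the mean-action discretization, and constants are illustrative.
import numpy as np

def mfq_update(Q, s, a, a_bar, r, s_next, a_bar_next,
               alpha=0.1, gamma=0.95, beta=1.0):
    q_next = Q[s_next, :, a_bar_next]
    # Boltzmann policy over the agent's own actions, given the mean action
    pi = np.exp(beta * (q_next - q_next.max()))
    pi /= pi.sum()
    v_next = pi @ q_next                       # mean-field value of s_next
    Q[s, a, a_bar] += alpha * (r + gamma * v_next - Q[s, a, a_bar])
    return Q
```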
