
The use of deep neural networks has been highly successful in reinforcement learning and control, although few theoretical guarantees for deep learning exist for these problems. There are two main challenges for deriving performance guarantees: a) control has state information and thus is inherently online and b) deep networks are non-convex predictors for which online learning cannot provide provable guarantees in general. Building on the linearization technique for overparameterized neural networks, we derive provable regret bounds for efficient online learning with deep neural networks. Specifically, we show that over any sequence of convex loss functions, any low-regret algorithm can be adapted to optimize the parameters of a neural network such that it competes with the best net in hindsight. As an application of these results in the online setting, we obtain provable bounds for online episodic control with deep neural network controllers.
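
To make the linearization step concrete, the sketch below (our illustration, not the paper's algorithm) runs online gradient descent on a two-layer ReLU network that is linearized around its random initialization, so each round's loss is convex in the weights and standard online-learning regret bounds apply; the architecture, step size, and data stream are all assumed for illustration.

```python
# Minimal sketch (ours, not the paper's algorithm): online gradient descent on a
# two-layer ReLU network linearized around its random initialization,
#   f(W; x) ~= f(W0; x) + <grad_W f(W0; x), W - W0>,
# so each round's loss is convex in W and standard OCO regret bounds apply.
import numpy as np

rng = np.random.default_rng(0)
d, width = 5, 256                                 # input dimension and hidden width (assumed)

W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(width, d))
a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)   # fixed output layer

def grad_at_init(x):
    """Gradient of the network output w.r.t. the hidden weights, evaluated at W0."""
    act = (W0 @ x > 0).astype(float)              # ReLU'(W0 x)
    return np.outer(a * act, x)                   # shape (width, d)

W = W0.copy()
for t in range(1, 1001):                          # stream of examples (x_t, y_t)
    x = rng.normal(size=d); x /= np.linalg.norm(x)
    y = np.sin(x[0])                              # arbitrary bounded target
    g = grad_at_init(x)                           # linearization features
    pred = np.sum(a * np.maximum(W0 @ x, 0.0)) + np.sum(g * (W - W0))
    grad = (pred - y) * g                         # gradient of 0.5*(pred - y)^2 in W
    W -= 0.5 / np.sqrt(t) * grad                  # online gradient step, eta_t ~ 1/sqrt(t)
```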

Related Content

Neural Networks is the archival journal of the world's three oldest neural modeling societies: the International Neural Network Society (INNS), the European Neural Network Society (ENNS), and the Japanese Neural Network Society (JNNS). Neural Networks provides a forum for developing and nurturing an international community of scholars and practitioners interested in all aspects of neural networks and related approaches to computational intelligence. Neural Networks welcomes submissions of high-quality papers that contribute to the full range of neural network research, from behavioral and brain modeling and learning algorithms, through mathematical and computational analysis, to systems engineering and technological applications that make substantial use of neural network concepts and techniques. This unique and broad scope promotes the exchange of ideas between biological and technological research and helps foster the development of an interdisciplinary community interested in biologically inspired computational intelligence. Accordingly, the editorial board of Neural Networks represents expertise in fields including psychology, neurobiology, computer science, engineering, mathematics, and physics. The journal publishes articles, letters, and reviews, as well as letters to the editor, editorials, current events, software surveys, and patent information. Articles are published in one of five sections: cognitive science, neuroscience, learning systems, mathematics and computational analysis, and engineering and applications. Official website:

In many areas, such as the physical sciences, life sciences, and finance, control approaches are used to achieve a desired goal in complex dynamical systems governed by differential equations. In this work we formulate the problem of controlling stochastic partial differential equations (SPDE) as a reinforcement learning problem. We present a learning-based, distributed control approach for online control of a system of SPDEs with a high-dimensional state-action space using the deep deterministic policy gradient method. We test the performance of our method on the problem of controlling the stochastic Burgers' equation, which describes turbulent fluid flow in an infinitely large domain.
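
For context, the snippet below sketches only the environment side of such a problem: one explicit finite-difference step of a controlled stochastic Burgers' equation with a distributed control term. The discretization, noise scaling, and boundary conditions are our assumptions, not the paper's setup, and the DDPG training loop itself is omitted.

```python
# Environment-only sketch (discretization and noise scaling are our assumptions):
# one explicit Euler step of the controlled stochastic Burgers' equation
#   u_t + u u_x = nu u_xx + a(x, t) + sigma xi(x, t),
# where a is the distributed control an RL agent would choose.
import numpy as np

def burgers_step(u, a, nu=0.01, dt=1e-3, dx=1e-2, sigma=0.1, rng=np.random.default_rng()):
    """Advance the state u by one time step with periodic boundary conditions."""
    u_x = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)            # central difference
    u_xx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    noise = sigma * rng.normal(size=u.shape) / np.sqrt(dt)       # white-in-time forcing
    return u + dt * (-u * u_x + nu * u_xx + a + noise)

# The (omitted) DDPG loop would map the state u to a control field a with an actor
# network and train actor and critic against, e.g., a tracking or dissipation cost.
u = np.sin(2 * np.pi * np.linspace(0.0, 1.0, 100, endpoint=False))
u = burgers_step(u, a=np.zeros_like(u))
```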

A commonly used heuristic in RL is experience replay (e.g.~\citet{lin1993reinforcement, mnih2015human}), in which a learner stores and re-uses past trajectories as if they were sampled online. In this work, we initiate a rigorous study of this heuristic in the setting of tabular Q-learning. We provide a convergence rate guarantee, and discuss how it compares to the convergence of Q-learning depending on important parameters such as the frequency and number of replay iterations. We also provide theoretical evidence showing when we might expect this heuristic to strictly improve performance, by introducing and analyzing a simple class of MDPs. Finally, we provide some experiments to support our theoretical findings.
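
To fix ideas, here is a minimal sketch of tabular Q-learning with experience replay; the replay schedule (a batch of replayed updates every few environment steps), the epsilon-greedy exploration, and the `env` interface are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of tabular Q-learning with experience replay; the replay schedule,
# exploration rule, and `env` interface (reset() -> s, step(a) -> (s', r, done))
# are illustrative assumptions, not the paper's exact protocol.
import random
from collections import deque

def q_learning_with_replay(env, n_states, n_actions, episodes=500, alpha=0.1,
                           gamma=0.99, eps=0.1, replay_every=10, replay_iters=50):
    Q = [[0.0] * n_actions for _ in range(n_states)]
    buffer = deque(maxlen=10_000)                 # stored transitions

    def update(s, a, r, s2, done):                # standard Q-learning backup
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])

    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:             # epsilon-greedy action choice
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = env.step(a)
            buffer.append((s, a, r, s2, done))
            update(s, a, r, s2, done)             # online update from the fresh sample
            step += 1
            if step % replay_every == 0:          # periodic replay phase
                for _ in range(replay_iters):
                    update(*random.choice(buffer))   # re-use a stored transition
            s = s2
    return Q
```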

A natural goal when designing online learning algorithms for non-stationary environments is to bound the regret of the algorithm in terms of the temporal variation of the input sequence. Intuitively, when the variation is small, it should be easier for the algorithm to achieve low regret, since past observations are predictive of future inputs. Such data-dependent "pathlength" regret bounds have recently been obtained for a wide variety of online learning problems, including online convex optimization (OCO) and bandits. We obtain the first pathlength regret bounds for online control and estimation (e.g. Kalman filtering) in linear dynamical systems. The key idea in our derivation is to reduce pathlength-optimal filtering and control to certain variational problems in robust estimation and control; these reductions may be of independent interest. Numerical simulations confirm that our pathlength-optimal algorithms outperform traditional $H_2$ and $H_{\infty}$ algorithms when the environment varies over time.
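
For reference, a common way to formalize such a bound is shown below; the notation is ours and may differ from the paper's.

```latex
% Our notation, not necessarily the paper's. For a disturbance (or target) sequence
% w_1, \dots, w_T, define the pathlength as its temporal variation
\[
  \mathrm{PL}(w_{1:T}) \;=\; \sum_{t=2}^{T} \lVert w_t - w_{t-1} \rVert .
\]
% A pathlength (data-dependent) regret bound then has the form
\[
  \mathrm{Regret}_T \;\le\; C \bigl( 1 + \mathrm{PL}(w_{1:T}) \bigr)
\]
% for some constant C, so the guarantee tightens automatically whenever the
% environment varies slowly, and degrades gracefully when it does not.
```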

The principle of optimism in the face of uncertainty is prevalent throughout sequential decision making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaling to deep RL. We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP. This formulation achieves a competitive regret bound: $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
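
As a rough sketch of the noise-augmented planning idea (our paraphrase of the general recipe, not the paper's implementation, and using a discounted rather than finite-horizon formulation for brevity), the agent perturbs its learned model with Gaussian noise and plans greedily in the perturbed MDP:

```python
# Sketch of planning in a noise-augmented MDP (our paraphrase, not the paper's code):
# perturb the learned reward model with Gaussian noise scaled by visit counts, then
# run value iteration so that the resulting policy is optimistic with reasonable probability.
import numpy as np

def plan_in_noise_augmented_mdp(P_hat, R_hat, n_visits, gamma=0.99,
                                noise_scale=1.0, iters=200, rng=np.random.default_rng()):
    """P_hat: (S, A, S) estimated transitions; R_hat: (S, A) estimated rewards;
    n_visits: (S, A) visit counts used to shrink the noise as data accumulates."""
    S, A = R_hat.shape
    R_noisy = R_hat + noise_scale * rng.normal(size=(S, A)) / np.sqrt(np.maximum(n_visits, 1))
    Q = np.zeros((S, A))
    for _ in range(iters):                        # value iteration in the augmented MDP
        V = Q.max(axis=1)
        Q = R_noisy + gamma * P_hat @ V
    return Q.argmax(axis=1)                       # greedy policy w.r.t. the optimistic Q
```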

Policy learning is a quickly growing area. As robots and computers increasingly control day-to-day life, their error rates need to be minimized and controlled. Many policy learning methods exist, along with provable error rates that accompany them. We prove a regret bound and convergence guarantee for the Deep Epsilon Greedy method, which chooses actions using a neural network's predictions. In experiments with the real-world dataset MNIST, we construct a nonlinear reinforcement learning problem and observe that, under both high and low noise, some methods converge and others do not, in agreement with our convergence proof.
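
A minimal sketch of such a Deep Epsilon Greedy loop is given below; synthetic contexts stand in for MNIST images, one small `MLPRegressor` per arm stands in for the paper's network, and the reward structure is an illustrative assumption.

```python
# Minimal Deep-Epsilon-Greedy-style loop (ours): a per-arm neural reward predictor,
# epsilon-greedy action selection, and incremental updates of the chosen arm only.
# Synthetic data stands in for MNIST; rewards and features are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_actions, d = 10, 64
nets = [MLPRegressor(hidden_layer_sizes=(32,), warm_start=True)
        for _ in range(n_actions)]                # one reward predictor per arm
for net in nets:
    net.partial_fit(rng.normal(size=(1, d)), [0.0])   # initialize each network

eps = 0.05
for t in range(2000):
    x = rng.normal(size=(1, d))                   # context (an image in the real task)
    label = rng.integers(n_actions)               # hidden correct action
    preds = [net.predict(x)[0] for net in nets]   # network predictions per arm
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(preds))
    reward = 1.0 if a == label else 0.0           # bandit feedback for the chosen arm
    nets[a].partial_fit(x, [reward])              # update only the chosen arm's net
```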

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
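
A toy numerical illustration of the implicit-regularization point in the linear regime (ours, not the survey's): with more parameters than samples, gradient descent on the squared loss started from zero converges to the minimum-norm interpolant $X^+ y$; whether that interpolation is benign then depends on the covariate spectrum, as discussed above.

```python
# Toy check (ours): in the overparameterized linear regime, gradient descent started
# from zero interpolates the data and converges to the minimum-norm solution pinv(X) @ y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                                    # overparameterized: d >> n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

theta_mn = np.linalg.pinv(X) @ y                  # minimum-norm interpolating solution

theta_gd = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2              # step size below 1 / ||X||_2^2
for _ in range(2000):
    theta_gd -= lr * X.T @ (X @ theta_gd - y)     # full-batch gradient descent

print(np.max(np.abs(X @ theta_gd - y)))           # ~0: the data are fit exactly
print(np.linalg.norm(theta_gd - theta_mn))        # ~0: GD finds the min-norm interpolant
```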

In this monograph, I introduce the basic concepts of Online Learning through a modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are dealt with through convex surrogate losses and through randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. These notes do not require prior knowledge of convex analysis and all the required mathematical tools are rigorously explained. Moreover, all the proofs have been carefully chosen to be as simple and as short as possible.
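
As a small illustration of the template the monograph builds on, the sketch below instantiates Online Mirror Descent with the Euclidean regularizer (i.e. projected online gradient descent) on a ball; the domain, losses, and step-size schedule are illustrative choices, not taken from the text.

```python
# Projected online gradient descent, i.e. Online Mirror Descent with psi(x) = ||x||^2 / 2.
# The ball domain, linear losses, and eta_t ~ 1/sqrt(t) schedule are illustrative choices.
import numpy as np

def online_gradient_descent(grad_fns, d=2, radius=1.0, eta0=0.5):
    """grad_fns[t](x) returns a subgradient of the t-th convex loss at x."""
    x = np.zeros(d)
    iterates = []
    for t, grad in enumerate(grad_fns, start=1):
        iterates.append(x.copy())
        x = x - (eta0 / np.sqrt(t)) * grad(x)     # mirror/gradient step
        norm = np.linalg.norm(x)
        if norm > radius:                         # Euclidean projection onto the ball
            x *= radius / norm
    # Regret vs. the best fixed point in the ball is O(sqrt(T)) for convex,
    # Lipschitz losses with this step-size schedule.
    return iterates

# Example: linear losses <z_t, x> with adversarially chosen z_t.
rng = np.random.default_rng(0)
zs = [rng.normal(size=2) for _ in range(100)]
_ = online_gradient_descent([(lambda x, z=z: z) for z in zs])
```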

When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.
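
A quick numerical check of the initialization point (our toy experiment, not from the survey): with He-scaled Gaussian weights, activation magnitudes stay roughly constant across the layers of a deep ReLU network, while a poorly scaled initialization makes them vanish or explode.

```python
# Toy check (ours) of why careful initialization matters: track activation norms
# through a deep random ReLU network under He scaling vs. a badly scaled init.
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256
x = rng.normal(size=width)

def final_activation_norm(scale):
    h = x.copy()
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale
        h = np.maximum(W @ h, 0.0)                # ReLU layer
    return np.linalg.norm(h)

print(final_activation_norm(np.sqrt(2.0 / width)))   # He init: same order as ||x||
print(final_activation_norm(0.05))                   # too small: activations vanish
```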

We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
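
A rough tabular sketch of the recursion described above is given below; the exact bonus term and the handling of the temperature schedule are simplified assumptions, not the paper's derivation.

```python
# Rough tabular K-learning sketch (ours): add a count-based bonus to the reward and
# solve a soft (log-sum-exp) Bellman recursion at temperature tau; the resulting
# Boltzmann policy uses tau as both temperature and risk-seeking parameter.
# The bonus form here is a simplification of the paper's.
import numpy as np

def k_learning_backup(P, R, counts, H, tau=1.0):
    """P: (S, A, S) transitions, R: (S, A) rewards, counts: (S, A) visit counts."""
    S, A = R.shape
    K = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):
        bonus = tau / np.maximum(counts, 1)                   # optimism bonus added to rewards
        V_next = tau * np.log(np.exp(K[h + 1] / tau).sum(axis=1))   # soft value of next step
        K[h] = R + bonus + P @ V_next
    policy = np.exp(K[0] / tau)                               # Boltzmann policy at temperature tau
    return K, policy / policy.sum(axis=1, keepdims=True)
```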

This work considers the problem of provably optimal reinforcement learning for episodic finite-horizon MDPs, i.e. how an agent learns to maximize its long-term reward in an uncertain environment. The main contribution is a novel algorithm, Variance-reduced Upper Confidence Q-learning (vUCQ), which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where $T$ is the number of time steps the agent acts in the MDP, $S$ is the number of states, $A$ is the number of actions, and $H$ is the (episodic) horizon time. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve $\epsilon$-average regret for any constant $\epsilon$ is $O(SA)$, which is far fewer samples than are required to learn any non-trivial estimate of the transition model (the transition model is specified by $O(S^2A)$ parameters). This sub-linearity is largely the motivation for algorithms such as $Q$-learning and other "model free" approaches. The vUCQ algorithm also enjoys minimax-optimal regret in the long run, matching the $\Omega(\sqrt{HSAT})$ lower bound. vUCQ is a successive-refinement method in which the algorithm reduces the variance in the $Q$-value estimates and couples this estimation scheme with an upper-confidence-based algorithm. Technically, the coupling of these two techniques is what leads the algorithm to enjoy both the sub-linear regret property and asymptotically optimal regret.
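
For orientation, the update below sketches the generic upper-confidence Q-learning template that vUCQ refines; the variance-reduction component is omitted and the bonus constants are illustrative, so this is not the vUCQ update itself.

```python
# Generic upper-confidence Q-learning update (ours, for orientation only): optimistic
# bonus plus a count-dependent learning rate; vUCQ's variance reduction is omitted.
import numpy as np

def ucb_q_update(Q, N, h, s, a, r, s_next, H, c=1.0):
    """One optimistic update for step h of an episode; Q and N have shape (H, S, A)."""
    N[h, s, a] += 1
    n = N[h, s, a]
    alpha = (H + 1) / (H + n)                          # learning rate decaying in the count
    bonus = c * np.sqrt(H**3 * np.log(2) / n)          # upper-confidence bonus (illustrative constants)
    V_next = 0.0 if h == H - 1 else min(H, Q[h + 1, s_next].max())
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V_next + bonus)
    return Q, N
```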
