
ARM · Bandit/Slot machine · Sample · Weight · Objective function ·

November 24, 2021

Policy Choice and Best Arm Identification: Asymptotic Analysis of Exploration Sampling

Kaito Ariu, Masahiro Kato, Junpei Komiyama, Kenichiro McAlinn, Chao Qin

from arxiv, Submitted to Econometrica

We consider the "policy choice" problem -- otherwise known as best arm identification in the bandit literature -- proposed by Kasy and Sautmann (2021) for adaptive experimental design. Theorem 1 of Kasy and Sautmann (2021) provides three asymptotic results that give theoretical guarantees for exploration sampling developed for this setting. We first show that the proof of Theorem 1 (1) has technical issues, and the proof and statement of Theorem 1 (2) are incorrect. We then show, through a counterexample, that Theorem 1 (3) is false. For the former two, we correct the statements and provide rigorous proofs. For Theorem 1 (3), we propose an alternative objective function, which we call posterior weighted policy regret, and derive the asymptotic optimality of exploration sampling.
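The exploration sampling rule discussed in this abstract can be sketched as follows. This is a minimal illustration, assuming the assignment shares of Kasy and Sautmann (2021) take the form $q_d \propto p_d (1 - p_d)$, where $p_d$ is the posterior probability that arm $d$ is optimal. The Bernoulli outcomes, the uniform Beta(1, 1) priors, the Monte Carlo estimate of $p_d$, the fallback for a degenerate posterior, and all function names are assumptions made for this example only.

```python
import random


def posterior_optimal_probs(successes, failures, draws=20000, seed=0):
    """Monte Carlo estimate of p_d = P(arm d has the highest mean | data).

    Assumes independent Beta(1 + successes, 1 + failures) posteriors,
    i.e. Bernoulli outcomes with uniform priors (an illustrative choice).
    """
    rng = random.Random(seed)
    k = len(successes)
    wins = [0] * k
    for _ in range(draws):
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        wins[max(range(k), key=samples.__getitem__)] += 1
    return [w / draws for w in wins]


def exploration_sampling_shares(p):
    """Map p_d to shares q_d = p_d (1 - p_d) / sum_j p_j (1 - p_j)."""
    weights = [pd * (1.0 - pd) for pd in p]
    total = sum(weights)
    if total == 0.0:
        # Degenerate posterior (one arm optimal with probability 1).
        # Falling back to p is an assumption of this sketch.
        return list(p)
    return [w / total for w in weights]
```

With two arms the shares are always equal, since $p_1(1-p_1) = p_2(1-p_2)$ when $p_1 + p_2 = 1$; with more arms the rule moves assignment away from an arm that already looks very likely to be best.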

Related content

ARM

Arm Holdings, also known as ARM, is a multinational semiconductor design and software company headquartered in Cambridge, England. Its main product is the design of ARM-architecture processors, which it licenses to customers as intellectual property; it also provides software development tools.

Sample complexity · Optimizer · Regularization · Sample · Value function ·

January 27, 2022

Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

Yan Li, Tuo Zhao, Guanghui Lan

We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report three properties that seem to be new in the literature of policy gradient methods: (1) The policy first converges linearly, then superlinearly with order $\gamma^{-2}$ to the set of optimal policies, after $\mathcal{O}(\log(1/\Delta^*))$ number of iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function; (2) HPMD also exhibits last-iterate convergence, with the limiting policy corresponding exactly to the optimal policy with the maximal entropy for every state. No regularization is added to the optimization objective and hence the second observation arises solely as an algorithmic property of the homotopic policy gradient method. (3) For the stochastic HPMD method, we further demonstrate a better than $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| / \epsilon^2)$ sample complexity for small optimality gap $\epsilon$, when assuming a generative model for policy evaluation.

Performer · Value function · Estimation/Estimator · Optimizer · MoDELS ·

January 27, 2022

COMBO: Conservative Offline Model-Based Policy Optimization

Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, Chelsea Finn

from arxiv, NeurIPS 2021

Model-based algorithms, which learn a dynamics model from logged experience and perform some sort of pessimistic planning under the learned model, have emerged as a promising paradigm for offline reinforcement learning (offline RL). However, practical variants of such model-based algorithms rely on explicit uncertainty quantification for incorporating pessimism. Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model. This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation. We theoretically show that our method optimizes a lower bound on the true policy value, that this bound is tighter than that of prior methods, and that our approach satisfies a policy improvement guarantee in the offline setting. Through experiments, we find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods on widely studied offline RL benchmarks, including image-based tasks.

CC · Sample · Discretization · Divergence · Mutually independent ·

January 26, 2022

Complexity of zigzag sampling algorithm for strongly log-concave distributions

Jianfeng Lu, Lihan Wang

from arxiv, We added a new discussion that the warmness assumption of our main theorem can be achieved using LMC

We study the computational complexity of zigzag sampling algorithm for strongly log-concave distributions. The zigzag process has the advantage of not requiring time discretization for implementation, and that each proposed bouncing event requires only one evaluation of partial derivative of the potential, while its convergence rate is dimension independent. Using these properties, we prove that the zigzag sampling algorithm achieves $\varepsilon$ error in chi-square divergence with a computational cost equivalent to $O\bigl(\kappa^2 d^\frac{1}{2}(\log\frac{1}{\varepsilon})^{\frac{3}{2}}\bigr)$ gradient evaluations in the regime $\kappa \ll \frac{d}{\log d}$ under a warm start assumption, where $\kappa$ is the condition number and $d$ is the dimension.
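To make the "no time discretization" property concrete, here is a minimal sketch of a one-dimensional zigzag process targeting a standard Gaussian, i.e. the potential $U(x) = x^2/2$. The target, the function name, and the choice to stop after a fixed number of events are assumptions of this example; the paper treats general strongly log-concave targets in $d$ dimensions. In this special case the switching rate along a segment is $\max(0, vx + s)$, so the next event time can be drawn exactly by inverting the integrated rate, using one evaluation of $U'(x) = x$ per event.

```python
import math
import random


def zigzag_gaussian_1d(n_events, seed=0):
    """Simulate a 1D zigzag process for U(x) = x^2 / 2.

    Returns the time averages of x and x^2 along the piecewise-linear
    trajectory, which estimate the target's first and second moments.
    """
    rng = random.Random(seed)
    x, v = 0.0, 1.0
    total_t = int_x = int_x2 = 0.0
    for _ in range(n_events):
        a = v * x                      # v * U'(x): the rate at s = 0 is max(0, a)
        e = rng.expovariate(1.0)       # Exp(1) threshold for the integrated rate
        # Solve  integral_0^tau max(0, a + s) ds = e  for tau, in closed form.
        tau = -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * e)
        # Exact integrals of x(s) and x(s)^2 over the linear segment.
        int_x += x * tau + v * tau ** 2 / 2.0
        int_x2 += x * x * tau + x * v * tau ** 2 + tau ** 3 / 3.0
        total_t += tau
        x += v * tau                   # move to the event location
        v = -v                         # bounce: flip the velocity
    return int_x / total_t, int_x2 / total_t
```

For a long enough run, the two returned averages should be close to 0 and 1, the mean and second moment of the standard Gaussian.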

Graph · Scenario · Threshold · Tractable · Minimum point ·

January 26, 2022

The Harmless Set Problem

Ajinkya Gaikwad, Soumen Maity

Given a graph $G = (V,E)$, a threshold function $t : V \rightarrow \mathbb{N}$ and an integer $k$, we study the Harmless Set problem, where the goal is to find a subset of vertices $S \subseteq V$ of size at least $k$ such that every vertex $v\in V$ has fewer than $t(v)$ neighbors in $S$. We enhance our understanding of the problem from the viewpoint of parameterized complexity. Our focus lies on parameters that measure the structural properties of the input instance. We show that the problem is W[1]-hard parameterized by a wide range of fairly restrictive structural parameters such as the feedback vertex set number, pathwidth, treedepth, and even the size of a minimum vertex deletion set into graphs of pathwidth and treedepth at most three. On dense graphs, we show that the problem is W[1]-hard parameterized by cluster vertex deletion number. We also show that the Harmless Set problem with majority thresholds is W[1]-hard when parameterized by the treewidth of the input graph. We prove that the Harmless Set problem can be solved in polynomial time on graphs with bounded cliquewidth. On the positive side, we obtain fixed-parameter algorithms for the problem with respect to neighbourhood diversity, twin cover and vertex integrity of the input graph. We show that the problem parameterized by the solution size is fixed-parameter tractable on planar graphs. We thereby resolve two open questions stated in C. Bazgan and M. Chopin (2014) concerning the complexity of Harmless Set parameterized by the treewidth of the input graph and on planar graphs with respect to the solution size.
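The problem definition in this abstract can be made concrete with a small checker. The sketch below is illustrative only: the adjacency-set representation, the function names, and the exponential brute-force search are assumptions of this example and have nothing to do with the algorithms developed in the paper.

```python
from itertools import combinations


def is_harmless(adj, threshold, subset):
    """Return True if every vertex v has fewer than threshold[v] neighbors in subset.

    adj maps each vertex to the set of its neighbors; threshold maps each
    vertex v to t(v).
    """
    s = set(subset)
    return all(len(adj[v] & s) < threshold[v] for v in adj)


def find_harmless_set(adj, threshold, k):
    """Brute-force search for a harmless set of size at least k.

    Tries the largest candidate sizes first and returns the first harmless
    set found, or None if no harmless set of size >= k exists.
    Exponential time; only meant for tiny instances.
    """
    vertices = sorted(adj)
    for size in range(len(vertices), k - 1, -1):
        for candidate in combinations(vertices, size):
            if is_harmless(adj, threshold, candidate):
                return set(candidate)
    return None
```

For example, on the path 0 - 1 - 2 with $t(v) = 2$ for every vertex, the full vertex set is not harmless (vertex 1 would have two neighbors in $S$), but $\{0, 1\}$ is.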

Estimation/Estimator · MoDELS · Mean squared error · Bootstrap · Robustness ·

January 25, 2022

A Nested Error Regression Model with High Dimensional Parameter for Small Area Estimation

Partha Lahiri, Nicola Salvati

In this paper we propose a flexible nested error regression small area model with high dimensional parameter that incorporates heterogeneity in regression coefficients and variance components. We develop a new robust small area specific estimating equations method that allows appropriate pooling of a large number of areas in estimating small area specific model parameters. We propose a parametric bootstrap and jackknife method to estimate not only the mean squared errors but also other commonly used uncertainty measures such as standard errors and coefficients of variation. We conduct both model-based and design-based simulation experiments and real-life data analysis to evaluate the proposed methodology.

Sample complexity · ARM · Optimizer · Extensibility · Sample ·

January 25, 2022

Almost Optimal Variance-Constrained Best Arm Identification

Yunlong Hou, Vincent Y. F. Tan, Zixin Zhong

from arxiv, 44 pages, 15 figures

We design and analyze VA-LUCB, a parameter-free algorithm, for identifying the best arm under the fixed-confidence setup and under a stringent constraint that the variance of the chosen arm is strictly smaller than a given threshold. An upper bound on VA-LUCB's sample complexity is shown to be characterized by a fundamental variance-aware hardness quantity $H_{VA}$. By proving a lower bound, we show that sample complexity of VA-LUCB is optimal up to a factor logarithmic in $H_{VA}$. Extensive experiments corroborate the dependence of the sample complexity on the various terms in $H_{VA}$. By comparing VA-LUCB's empirical performance to a close competitor RiskAverse-UCB-BAI by David et al. (2018), our experiments suggest that VA-LUCB has the lowest sample complexity for this class of risk-constrained best arm identification problems, especially for the riskiest instances.

Variance · PG · Reducible · Performer · Estimation/Estimator ·

August 20, 2021

Settling the Variance of Multi-Agent Policy Gradients

Jakub Grudzien Kuba, Muning Wen, Yaodong Yang, Linghui Meng, Shangding Gu, Haifeng Zhang, David Henry Mguni, Jun Wang

Policy gradient (PG) methods are popular reinforcement learning (RL) methods where a baseline is often applied to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of gradient estimates increases rapidly with the number of agents. In this paper, we offer a rigorous analysis of MAPG methods by, firstly, quantifying the contributions of the number of agents and agents' explorations to the variance of MAPG estimators. Based on this analysis, we derive the optimal baseline (OB) that achieves the minimal variance. In comparison to the OB, we measure the excess variance of existing MARL algorithms such as vanilla MAPG and COMA. Considering using deep neural networks, we also propose a surrogate version of OB, which can be seamlessly plugged into any existing PG methods in MARL. On benchmarks of Multi-Agent MuJoCo and StarCraft challenges, our OB technique effectively stabilises training and improves the performance of multi-agent PPO and COMA algorithms by a significant margin.
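The role of a baseline in reducing gradient variance can be illustrated in the simplest possible case. The sketch below is a single-agent, single-state softmax policy, not the multi-agent optimal baseline (OB) derived in the paper; the setting, the function names, and the closed-form baseline used here (the classical minimum-variance baseline for the score-function estimator) are assumptions of this example. All moments are computed exactly by summing over actions, so no sampling is involved.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score(pi, a):
    """Gradient of log pi(a) with respect to the logits: e_a - pi."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]


def pg_moments(pi, rewards, baseline):
    """Exact mean and total variance (trace of the covariance) of the
    estimator g(a) = (r_a - baseline) * grad log pi(a), with a ~ pi."""
    mean = [0.0] * len(pi)
    second = 0.0
    for a, p in enumerate(pi):
        g = [(rewards[a] - baseline) * s for s in score(pi, a)]
        mean = [m + p * gi for m, gi in zip(mean, g)]
        second += p * sum(gi * gi for gi in g)
    return mean, second - sum(m * m for m in mean)


def min_variance_baseline(pi, rewards):
    """Baseline minimizing the total variance above:
    b* = E[r_a * w_a] / E[w_a], with w_a = ||grad log pi(a)||^2."""
    w = [sum(s * s for s in score(pi, a)) for a in range(len(pi))]
    num = sum(p * wa * r for p, wa, r in zip(pi, w, rewards))
    den = sum(p * wa for p, wa in zip(pi, w))
    return num / den
```

Because $\mathbb{E}_{a \sim \pi}[\nabla \log \pi(a)] = 0$, the mean of the estimator does not depend on the baseline, while its variance is a quadratic function of the baseline with minimizer $b^*$.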

Optimizer · Value iteration · CASES · Estimation/Estimator · Path ·

April 22, 2021

Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric

We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$ which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.
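For readers unfamiliar with the SSP objective, the value iteration scheme mentioned in this abstract can be sketched for a fully known model. This is plain value iteration, not EB-SSP: the skewed empirical transitions and the exploration bonus of the paper are not implemented, and the dictionary-based model format, function name, and stopping rule are assumptions of this example.

```python
def ssp_value_iteration(costs, transitions, goal, tol=1e-10, max_iters=100_000):
    """Value iteration for a stochastic shortest path problem with a known model.

    costs[s][a] is the cost of action a in non-goal state s, and
    transitions[s][a] maps next states to probabilities. The goal state
    is absorbing with zero cost, so its value is fixed at 0.
    """
    values = {s: 0.0 for s in costs}
    values[goal] = 0.0
    for _ in range(max_iters):
        delta = 0.0
        for s in costs:
            # Bellman update: V(s) = min_a [ c(s, a) + sum_s' P(s' | s, a) V(s') ]
            best = min(
                costs[s][a]
                + sum(p * values[nxt] for nxt, p in transitions[s][a].items())
                for a in costs[s]
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    return values
```

As a usage example, take one non-goal state with a cheap action (cost 1, reaches the goal with probability 0.5) and an expensive one (cost 3, reaches the goal surely). The cheap action satisfies $V = 1 + 0.5V$, so the optimal expected cost is 2.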

Optimizer · Reinforcement learning · Learned · state-of-the-art · SimPLe ·

July 25, 2018

Variational Bayesian Reinforcement Learning with Regret Bounds

Brendan O'Donoghue

We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the 'temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
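The Boltzmann exploration policy mentioned in this abstract can be sketched as a softmax over the K-values of one state. The sketch assumes the K-values are already given; how K-learning computes them (the reward bonus and the Bellman equation) is not shown, and the function name and argument layout are assumptions of this example.

```python
import math


def boltzmann_policy(k_values, temperature):
    """Action probabilities proportional to exp(K(s, a) / temperature).

    k_values holds hypothetical K-values for the actions of a single
    state; temperature plays the role of the risk-seeking parameter.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    m = max(k_values)  # subtract the maximum for numerical stability
    exps = [math.exp((k - m) / temperature) for k in k_values]
    z = sum(exps)
    return [e / z for e in exps]
```

Lowering the temperature concentrates the policy on the action with the largest K-value, while a high temperature makes it closer to uniform.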

Performer · Estimation/Estimator · Empirical risk minimization · Empirical risk · Variance ·

December 14, 2017

Variance-based regularization with convex objectives

John Duchi, Hongseok Namkoong

We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.


Related topics

Bandit/Slot machine

Objective function

