We consider a combinatorial multi-armed bandit problem with a maximum value reward function under maximum value and index feedback. This is a new feedback structure that lies between the commonly studied semi-bandit and full-bandit feedback structures. We propose an algorithm and provide a regret bound for problem instances with stochastic arm outcomes according to arbitrary distributions with finite supports. The regret analysis rests on considering an extended set of arms, associated with values and probabilities of arm outcomes, and applying a smoothness condition. Our algorithm achieves an $O((k/\Delta)\log(T))$ distribution-dependent and an $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $k$ is the number of arms selected in each round, $\Delta$ is a distribution-dependent reward gap, and $T$ is the time horizon. Perhaps surprisingly, the regret bound is comparable to the previously known bound under the more informative semi-bandit feedback. We demonstrate the effectiveness of our algorithm through experimental results.
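To make the feedback structure concrete, here is a minimal sketch of what the learner observes in a single round. It is not the algorithm proposed in the paper; it only simulates the observation model. The function names, the encoding of each arm's distribution as a `(values, probabilities)` pair, and the tie-breaking rule (smallest arm id) are assumptions made for illustration.

```python
import random


def max_value_index_feedback(outcomes):
    """Return (max value, arm attaining it) for the selected arms.

    `outcomes` maps arm id -> realized outcome in this round.
    Ties are broken by the smallest arm id (an assumption of this sketch).
    """
    best_arm = min(outcomes, key=lambda arm: (-outcomes[arm], arm))
    return outcomes[best_arm], best_arm


def play_round(selected_arms, distributions, rng):
    """Draw one outcome per selected arm and return only the max and its index.

    `distributions[arm]` is a (values, probabilities) pair with finite support.
    The individual outcomes of the other arms stay hidden from the learner.
    """
    outcomes = {}
    for arm in selected_arms:
        values, probs = distributions[arm]
        outcomes[arm] = rng.choices(values, weights=probs, k=1)[0]
    return max_value_index_feedback(outcomes)


if __name__ == "__main__":
    rng = random.Random(0)
    distributions = {
        0: ([0, 1], [0.5, 0.5]),
        1: ([0, 2], [0.8, 0.2]),
        2: ([0, 3], [0.9, 0.1]),
    }
    value, index = play_round([0, 2], distributions, rng)
    print(f"observed max value {value} from arm {index}")
```

Under semi-bandit feedback the learner would see every entry of `outcomes`; under full-bandit feedback it would see only the maximum value. Returning the pair `(value, index)` sits between the two.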

Related content

Bandit / Slot machine

Bandit/Slot machine · Constraint · Optimizer · Online · CASE

July 13, 2023

Online Convex Optimization with Stochastic Constraints: Zero Constraint Violation and Bandit Feedback

Yeongjong Kim,Dabeen Lee

from arxiv, We found a paper that has already obtained the results of the submission

This paper studies online convex optimization with stochastic constraints. We propose a variant of the drift-plus-penalty algorithm that guarantees $O(\sqrt{T})$ expected regret and zero constraint violation, after a fixed number of iterations, which improves the vanilla drift-plus-penalty method with $O(\sqrt{T})$ constraint violation. Our algorithm is oblivious to the length of the time horizon $T$, in contrast to the vanilla drift-plus-penalty method. This is based on our novel drift lemma that provides time-varying bounds on the virtual queue drift and, as a result, leads to time-varying bounds on the expected virtual queue length. Moreover, we extend our framework to stochastic-constrained online convex optimization under two-point bandit feedback. We show that by adapting our algorithmic framework to the bandit feedback setting, we may still achieve $O(\sqrt{T})$ expected regret and zero constraint violation, improving upon the previous work for the case of identical constraint functions. Numerical results demonstrate our theoretical results.

Estimation/Estimator · Factored · Biased · Variance · Mutually independent

July 13, 2023

Leveraging Factored Action Spaces for Off-Policy Evaluation

Aaman Rebello,Shengpu Tang,Jenna Wiens,Sonali Parbhoo

from arxiv, Main paper: 8 pages, 7 figures. Appendix: 30 pages, 17 figures. Accepted at ICML 2023 Workshop on Counterfactuals in Minds and Machines, Honolulu, Hawaii, USA. Camera ready version

Off-policy evaluation (OPE) aims to estimate the benefit of following a counterfactual sequence of actions, given data collected from executed sequences. However, existing OPE estimators often exhibit high bias and high variance in problems involving large, combinatorial action spaces. We investigate how to mitigate this issue using factored action spaces i.e. expressing each action as a combination of independent sub-actions from smaller action spaces. This approach facilitates a finer-grained analysis of how actions differ in their effects. In this work, we propose a new family of "decomposed" importance sampling (IS) estimators based on factored action spaces. Given certain assumptions on the underlying problem structure, we prove that the decomposed IS estimators have less variance than their original non-decomposed versions, while preserving the property of zero bias. Through simulations, we empirically verify our theoretical results, probing the validity of various assumptions. Provided with a technique that can derive the action space factorisation for a given problem, our work shows that OPE can be improved "for free" by utilising this inherent problem structure.

Estimation/Estimator · MoDELS · Functional · Example · Generative method

July 13, 2023

Higher Order Estimating Equations for High-dimensional Models

James Robins,Lingling Li,Rajarshi Mukherjee,Eric Tchetgen Tchetgen,Aad van der Vaart

We introduce a new method of estimation of parameters in semiparametric and nonparametric models. The method is based on estimating equations that are $U$-statistics in the observations. The $U$-statistics are based on higher order influence functions that extend ordinary linear influence functions of the parameter of interest, and represent higher derivatives of this parameter. For parameters for which the representation cannot be perfect the method leads to a bias-variance trade-off, and results in estimators that converge at a slower than $\sqrt n$-rate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at $\sqrt n$-rate, but we also consider efficient $\sqrt n$-estimation using novel nonlinear estimators. The general approach is applied in detail to the example of estimating a mean response when the response is not always observed.

Inverse reinforcement learning · Learning · Analysis · Reinforcement learning · Principle

July 13, 2023

On the Effective Horizon of Inverse Reinforcement Learning

Yiqing Xu,Finale Doshi-Velez,David Hsu

from arxiv, 9 pages, under review

Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning over a given time horizon to compute an approximately optimal policy for a hypothesized reward function and then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of reward estimate and the computational efficiency of IRL algorithms. Interestingly, an effective time horizon shorter than the ground-truth value often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of an induced policy class and mitigates overfitting with limited data. This analysis leads to a principled choice of the effective horizon for IRL. It also prompts us to reexamine the classic IRL formulation: it is more natural to learn jointly the reward and the effective horizon together rather than the reward alone with a given horizon. Our experimental results confirm the theoretical analysis.

Reducible · Agent · Learning · Guidance · Knowledge

July 12, 2023

Probabilistic Counterexample Guidance for Safer Reinforcement Learning (Extended Version)

Xiaotong Ji,Antonio Filieri

from arxiv, Accepted and Evaluated by the 20th International Conference on Quantitative Evaluation of Systems 2023

Safe exploration aims at addressing the limitations of Reinforcement Learning (RL) in safety-critical scenarios, where failures during trial-and-error learning may incur high costs. Several methods exist to incorporate external knowledge or to use proximal sensor data to limit the exploration of unsafe states. However, reducing exploration risks in unknown environments, where an agent must discover safety threats during exploration, remains challenging. In this paper, we target the problem of safe exploration by guiding the training with counterexamples of the safety requirement. Our method abstracts both continuous and discrete state-space systems into compact abstract models representing the safety-relevant knowledge acquired by the agent during exploration. We then exploit probabilistic counterexample generation to construct minimal simulation submodels eliciting safety requirement violations, where the agent can efficiently train offline to refine its policy towards minimising the risk of safety violations during the subsequent online exploration. We demonstrate our method's effectiveness in reducing safety violations during online exploration in preliminary experiments by an average of 40.3% compared with QL and DQN standard algorithms and 29.1% compared with previous related work, while achieving comparable cumulative rewards with respect to unrestricted exploration and alternative approaches.

Weight · Undirected · Approximation · Exact · Graph

July 12, 2023

On the cut-query complexity of approximating max-cut

Orestis Plevrakis,Seyoon Ragavan,S. Matthew Weinberg

We consider the problem of query-efficient global max-cut on a weighted undirected graph in the value oracle model examined by [RSW18]. This model arises as a natural special case of submodular function maximization: on query $S \subseteq V$, the oracle returns the total weight of the cut between $S$ and $V \backslash S$. For most constants $c \in (0,1]$, we nail down the query complexity of achieving a $c$-approximation, for both deterministic and randomized algorithms (up to logarithmic factors). Analogously to general submodular function maximization in the same model, we observe a phase transition at $c = 1/2$: we design a deterministic algorithm for global $c$-approximate max-cut in $O(\log n)$ queries for any $c < 1/2$, and show that any randomized algorithm requires $\tilde{\Omega}(n)$ queries to find a $c$-approximate max-cut for any $c > 1/2$. Additionally, we show that any deterministic algorithm requires $\Omega(n^2)$ queries to find an exact max-cut (enough to learn the entire graph), and develop a $\tilde{O}(n)$-query randomized $c$-approximation for any $c < 1$. Our approach provides two technical contributions that may be of independent interest. One is a query-efficient sparsifier for undirected weighted graphs (prior work of [RSW18] holds only for unweighted graphs). Another is an extension of the cut dimension to rule out approximation (prior work of [GPRW20] introducing the cut dimension only rules out exact solutions).
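The value oracle model described above can be sketched in a few lines. This is an illustration of the query interface only, not of the paper's algorithms. The names `make_cut_oracle` and `edge_weight`, and the encoding of the graph as a dictionary from vertex pairs to weights, are assumptions of this sketch.

```python
def make_cut_oracle(weights):
    """Build a cut-value oracle for a weighted undirected graph.

    `weights` maps an edge (u, v) to its weight; each undirected edge
    should appear once. The returned function answers a query S with the
    total weight of edges that have exactly one endpoint in S.
    """
    def cut_value(S):
        S = set(S)
        return sum(w for (u, v), w in weights.items() if (u in S) != (v in S))

    return cut_value


def edge_weight(cut_value, u, v):
    """Recover the weight of edge {u, v} from three cut queries.

    Uses the identity cut({u}) + cut({v}) - cut({u, v}) = 2 * w(u, v).
    """
    return (cut_value({u}) + cut_value({v}) - cut_value({u, v})) / 2


if __name__ == "__main__":
    triangle = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 4.0}
    oracle = make_cut_oracle(triangle)
    print(oracle({0}))                # weight of edges leaving vertex 0
    print(edge_weight(oracle, 0, 1))  # weight of edge {0, 1}
```

The `edge_weight` helper shows why on the order of $n^2$ queries are enough to learn the entire graph: every pair of vertices can be resolved with a constant number of queries.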

Markov · Optimizer · Principle · Scenario · Learning

July 12, 2023

On Bellman's principle of optimality and Reinforcement learning for safety-constrained Markov decision process

Rahul Misra,Rafał Wisniewski,Carsten Skovmose Kallesøe

We study optimality for the safety-constrained Markov decision process which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process (with finite states and finite actions) where the goal of the decision maker is to reach a target set while avoiding an unsafe set(s) with certain probabilistic guarantees. Therefore the underlying Markov chain for any control policy will be multichain since by definition there exists a target set and an unsafe set. The decision maker also has to be optimal (with respect to a cost function) while navigating to the target set. This gives rise to a multi-objective optimization problem. We highlight the fact that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure (as shown by the counterexample due to Haviv). We resolve the counterexample by formulating the aforementioned multi-objective optimization problem as a zero-sum game and thereafter construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the reinforcement learning problem for the same and construct a modified $Q$-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian and corresponding error bounds.

Functional · Linear · Round · FAST · CASE

July 12, 2023

Self-adjusting Population Sizes for the $(1,\lambda)$-EA on Monotone Functions

Marc Kaufmann,Maxime Larcher,Johannes Lengler,Xun Zou

We study the $(1,\lambda)$-EA with mutation rate $c/n$ for $c\le 1$, where the population size is adaptively controlled with the $(1:s+1)$-success rule. Recently, Hevia Fajardo and Sudholt have shown that this setup with $c=1$ is efficient on OneMax for $s<1$, but inefficient if $s \ge 18$. Surprisingly, the hardest part is not close to the optimum, but rather at linear distance. We show that this behavior is not specific to OneMax. If $s$ is small, then the algorithm is efficient on all monotone functions, and if $s$ is large, then it needs superpolynomial time on all monotone functions. In the former case, for $c<1$ we show an $O(n)$ upper bound for the number of generations and $O(n\log n)$ for the number of function evaluations, and for $c=1$ we show $O(n\log n)$ generations and $O(n^2\log\log n)$ evaluations. We also show formally that optimization is always fast, regardless of $s$, if the algorithm starts in proximity of the optimum. All results also hold in a dynamic environment where the fitness function changes in each generation.
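The setup in this abstract can be sketched as follows, assuming the commonly described form of the $(1:s+1)$-success rule: after a successful generation (the best offspring is strictly fitter than the parent) the population size is divided by an update factor $F$, and otherwise it is multiplied by $F^{1/s}$. The value `F=1.5`, the function names, the rounding of $\lambda$, and the generation cap are assumptions of this sketch rather than details taken from the paper.

```python
import random


def onemax(x):
    """Number of one-bits in the bit string x."""
    return sum(x)


def self_adjusting_one_comma_lambda_ea(n, s, c=1.0, F=1.5,
                                       max_generations=10_000, seed=0):
    """Run a (1,lambda)-EA on OneMax with a self-adjusting population size.

    Returns (generations used, fitness evaluations of offspring, final parent).
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    lam = 1.0
    evaluations = 0
    for generation in range(max_generations):
        if onemax(x) == n:
            return generation, evaluations, x
        # Create round(lam) offspring, flipping each bit with probability c/n.
        offspring = [
            [bit ^ 1 if rng.random() < c / n else bit for bit in x]
            for _ in range(max(1, round(lam)))
        ]
        evaluations += len(offspring)
        best = max(offspring, key=onemax)
        if onemax(best) > onemax(x):
            lam = max(1.0, lam / F)      # success: shrink the population
        else:
            lam = lam * F ** (1.0 / s)   # failure: grow the population
        x = best  # comma selection: the parent is always replaced
    return max_generations, evaluations, x


if __name__ == "__main__":
    gens, evals, final = self_adjusting_one_comma_lambda_ea(n=50, s=0.5)
    print(gens, evals, onemax(final))
```

Because selection is non-elitist, the parent can lose fitness whenever all offspring are worse. The success rule counteracts this by growing $\lambda$ after failures, and $s$ controls how fast it grows.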

Prophet · Optimizer · Estimation/Estimator · Linear · Threshold

July 12, 2023

Reward Selection with Noisy Observations

Kamyar Azizzadenesheli,Trung Dang,Aranyak Mehta,Alexandros Psomas,Qian Zhang

We study a fundamental problem in optimization under uncertainty. There are $n$ boxes; each box $i$ contains a hidden reward $x_i$. Rewards are drawn i.i.d. from an unknown distribution $\mathcal{D}$. For each box $i$, we see $y_i$, an unbiased estimate of its reward, which is drawn from a Normal distribution with known standard deviation $\sigma_i$ (and an unknown mean $x_i$). Our task is to select a single box, with the goal of maximizing our reward. This problem captures a wide range of applications, e.g. ad auctions, where the hidden reward is the click-through rate of an ad. Previous work in this model [BKMR12] proves that the naive policy, which selects the box with the largest estimate $y_i$, is suboptimal, and suggests a linear policy, which selects the box $i$ with the largest $y_i - c \cdot \sigma_i$, for some $c > 0$. However, no formal guarantees are given about the performance of either policy (e.g., whether their expected reward is within some factor of the optimal policy's reward). In this work, we prove that both the naive policy and the linear policy are arbitrarily bad compared to the optimal policy, even when $\mathcal{D}$ is well-behaved, e.g. has monotone hazard rate (MHR), and even under a "small tail" condition, which requires that not too many boxes have arbitrarily large noise. On the flip side, we propose a simple threshold policy that gives a constant approximation to the reward of a prophet (who knows the realized values $x_1, \dots, x_n$) under the same "small tail" condition. We prove that when this condition is not satisfied, even an optimal clairvoyant policy (that knows $\mathcal{D}$) cannot get a constant approximation to the prophet, even for MHR distributions, implying that our threshold policy is optimal against the prophet benchmark, up to constants.
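The two baseline policies named in this abstract are simple enough to write down directly. The sketch below covers only the naive and linear selection rules as stated; the threshold policy is not specified in the abstract and is therefore not shown. The function names and the example numbers are illustrative assumptions.

```python
def naive_policy(y):
    """Select the box with the largest noisy estimate y_i."""
    return max(range(len(y)), key=lambda i: y[i])


def linear_policy(y, sigma, c):
    """Select the box with the largest penalized estimate y_i - c * sigma_i."""
    return max(range(len(y)), key=lambda i: y[i] - c * sigma[i])


if __name__ == "__main__":
    y = [5.0, 4.0]      # observed estimates (made-up numbers)
    sigma = [3.0, 0.5]  # known noise standard deviations (made-up numbers)
    print(naive_policy(y))                  # index chosen by the naive rule
    print(linear_policy(y, sigma, c=1.0))   # index chosen by the linear rule
```

In this example the naive rule picks box 0 because its estimate is largest, while the linear rule with $c = 1$ picks box 1 because box 0's estimate is much noisier.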

Subspace · Layer · Linear · Neuron · Networking

July 11, 2023

A neuron-wise subspace correction method for the finite neuron method

Jongho Park,Jinchao Xu,Xiaofeng Xu

from arxiv, 23 pages, 6 figures

In this paper, we propose a novel algorithm called Neuron-wise Parallel Subspace Correction Method (NPSC) for the finite neuron method that approximates numerical solutions of partial differential equations (PDEs) using neural network functions. Despite extremely extensive research activities in applying neural networks for numerical PDEs, there is still a serious lack of effective training algorithms that can achieve adequate accuracy, even for one-dimensional problems. Based on recent results on the spectral properties of linear layers and landscape analysis for single neuron problems, we develop a special type of subspace correction method that optimizes the linear layer and each neuron in the nonlinear layer separately. An optimal preconditioner that resolves the ill-conditioning of the linear layer is presented for one-dimensional problems, so that the linear layer is trained in a uniform number of iterations with respect to the number of neurons. In each single neuron problem, a good local minimum that avoids flat energy regions is found by a superlinearly convergent algorithm. Numerical experiments on function approximation problems and PDEs demonstrate better performance of the proposed method than other gradient-based methods.