
We study a general Markov game with metric switching costs: in each round, the player adaptively chooses one of several Markov chains to advance, with the objective of minimizing the expected cost for at least $k$ chains to reach their target states. If the player decides to play a different chain, an additional switching cost is incurred. The special case with no switching cost was solved optimally by Dumitriu, Tetali, and Winkler~\cite{DTW03} via a variant of the celebrated Gittins index for the classical multi-armed bandit (MAB) problem with Markovian rewards \cite{Git74,Git79}. However, for the Markovian multi-armed bandit with a nontrivial switching cost, even when the switching cost is a constant, the classic paper by Banks and Sundaram \cite{BS94} showed that no index strategy can be optimal. In this paper, we complement their result and show that there is a simple index strategy achieving a constant approximation factor when the switching cost is constant and $k=1$. To the best of our knowledge, this is the first index strategy that achieves a constant approximation factor for a general Markovian MAB variant with switching costs. For general metric switching costs, we propose a more involved constant-factor approximation algorithm via a nontrivial reduction to the stochastic $k$-TSP problem, in which each Markov chain is approximated by a random variable. Our analysis makes extensive use of various interesting properties of the Gittins index.
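To make the index-with-hysteresis flavour concrete, here is a minimal Python sketch (an illustrative toy, not the algorithm or the index of the paper): each chain is assigned a placeholder priority, and the player keeps playing the current chain unless some other live chain's priority exceeds the current one's by more than the switching cost. The GeometricChain class and the priority function are hypothetical stand-ins.

import random

class GeometricChain:
    """Toy chain for illustration: reaches its target at each play with probability p."""
    def __init__(self, p, rng):
        self.p, self.rng, self.finished = p, rng, False
    def done(self):
        return self.finished
    def step(self):
        if self.rng.random() < self.p:
            self.finished = True

def play_with_switching(chains, priority, switch_cost, k=1, max_steps=100_000):
    """Hysteresis rule (illustrative): stay on the current chain unless another
    live chain's priority beats it by more than the switching cost."""
    current, total_cost, finished = None, 0.0, 0
    for _ in range(max_steps):
        if finished >= k:
            break
        live = [c for c in chains if not c.done()]
        best = max(live, key=priority)
        if current is None or current.done() or priority(best) > priority(current) + switch_cost:
            if current is not None and best is not current:
                total_cost += switch_cost      # pay to move to a different chain
            current = best
        current.step()                         # advance the chosen chain one step
        total_cost += 1                        # unit cost per play
        if current.done():
            finished += 1
    return total_cost

rng = random.Random(0)
chains = [GeometricChain(p, rng) for p in (0.05, 0.2, 0.5)]
print(play_with_switching(chains, priority=lambda c: c.p, switch_cost=3, k=1))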

Related content

A Markov chain, named after Andrei Markov (A. A. Markov, 1856-1922), is a discrete-event stochastic process with the Markov property: given the present knowledge or information, the past (the history of states before the present) is irrelevant for predicting the future (the states after the present). At each step of a Markov chain, the system may move from one state to another, or remain in its current state, according to a probability distribution. A change of state is called a transition, and the probabilities associated with the possible state changes are called transition probabilities. A random walk is an example of a Markov chain: the state at each step is a vertex of a graph, and at each step the walk moves to one of the neighbouring vertices, each with the same probability, regardless of the path taken so far.
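For instance, a random walk on a small graph can be simulated in a few lines of Python (a toy illustration of the Markov property: the next vertex depends only on the current one).

import random

def random_walk(adjacency, start, steps, seed=0):
    """Simple random walk on an undirected graph given as an adjacency dict.
    The next state depends only on the current vertex (the Markov property),
    not on the path taken so far; each neighbour is chosen with equal probability."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(steps):
        state = rng.choice(adjacency[state])
        path.append(state)
    return path

# Random walk on a 4-cycle: from each vertex, both neighbours have probability 1/2.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(random_walk(cycle, start=0, steps=10))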

With the gradual advancement of distributed control of multiagent systems, event-triggered control protocols have received significant research attention, especially for designing controllers for nonlinear multiagent systems. Compared to other widely used control schemes, event-triggered control of nonlinear systems can significantly improve resource utilization in real-life scenarios, for example by using and controlling the intelligent control input of each agent. It is worth mentioning that a group of interconnected agents has a network communication topology over which state feedback information is transmitted across the networked links. This exchange of information among the group of agents ensures that each agent cooperatively reaches a consensus agreement. The cooperative protocol for distributed control of nonlinear multiagent systems also ensures proper information flow between agents, irrespective of communication delays, environmental variability, and switching of the communication topology, via the event-triggered control protocol. Consequently, this paper investigates event-triggered control for nonlinear multiagent systems via steady-state performance. The steady-state performance of a nonlinear closed-loop system captures the stabilization, output regulation, and output synchronization problems of the nonlinear system; we also discuss how a suitable control protocol uses it to achieve consensus in a multiagent system. Based on the steady-state conditions of the nonlinear system, the consensus agreement among the agents is realized.
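For orientation, a commonly used (and here purely illustrative, not the condition of this paper) event-triggered consensus scheme holds each agent's control input constant between events and triggers the next event when the local measurement error grows too large relative to the local disagreement:
\[
u_i(t) = -K \sum_{j \in \mathcal{N}_i} \bigl( \hat{x}_i(t) - \hat{x}_j(t) \bigr),
\qquad
\hat{x}_i(t) = x_i\bigl(t^i_\ell\bigr) \ \ \text{for } t \in [t^i_\ell, t^i_{\ell+1}),
\]
\[
t^i_{\ell+1} = \inf\Bigl\{ t > t^i_\ell \;:\; \bigl\| x_i(t) - \hat{x}_i(t) \bigr\| \ \ge\ \sigma \Bigl\| \sum_{j \in \mathcal{N}_i} \bigl( \hat{x}_i(t) - \hat{x}_j(t) \bigr) \Bigr\| \Bigr\},
\]
where $\mathcal{N}_i$ is the neighbour set of agent $i$ in the communication graph and $\sigma \in (0,1)$ tunes how aggressively communication is saved; the same template underlies many nonlinear and switching-topology variants.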

We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.
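For context, the uniform-PAC criterion (in the sense of Dann, Lattimore, and Brunskill; the precise parameters inside the polynomial are schematic here) requires that, with probability at least $1-\delta$, the algorithm is more than $\varepsilon$-suboptimal in only a bounded number of episodes for every accuracy level $\varepsilon$ simultaneously:
\[
\Pr\Bigl[\, \forall \varepsilon > 0 :\; \bigl|\{\, k \in \mathbb{N} \;:\; V^*(s^k_1) - V^{\pi_k}(s^k_1) > \varepsilon \,\}\bigr| \;\le\; \mathrm{poly}\bigl(\tfrac{1}{\varepsilon}, \log\tfrac{1}{\delta}, d, H\bigr) \,\Bigr] \;\ge\; 1 - \delta,
\]
which simultaneously yields a PAC sample-complexity bound and a high-probability regret bound, hence the claim above.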

Consider $n$ iid real-valued random vectors of size $k$ having iid coordinates with a general distribution function $F$. A vector is a maximum if and only if there is no other vector in the sample which weakly dominates it in all coordinates. Let $p_{k,n}$ be the probability that the first vector is a maximum. The main result of the present paper is that if $k\equiv k_n$ is growing at a slower (faster) rate than a certain factor of $\log(n)$, then $p_{k,n} \rightarrow 0$ (resp. $p_{k,n}\rightarrow1$) as $n\to\infty$. Furthermore, the factor is fully characterized as a functional of $F$. We also study the effect of $F$ on $p_{k,n}$, showing that while $p_{k,n}$ may be highly affected by the choice of $F$, the phase transition is the same for all distribution functions up to a constant factor.
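As a quick empirical illustration of $p_{k,n}$ (a hypothetical simulation, not taken from the paper), one can estimate the probability by Monte Carlo, say with $F$ the uniform distribution on $[0,1]$; by the last claim above, the location of the phase transition is not sensitive to this choice beyond a constant factor.

import random

def is_maximum(vecs, i=0):
    """True iff no other vector in the sample weakly dominates vecs[i] in all coordinates."""
    v = vecs[i]
    return not any(all(w[c] >= v[c] for c in range(len(v)))
                   for j, w in enumerate(vecs) if j != i)

def estimate_p(k, n, trials=2000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        vecs = [[rng.random() for _ in range(k)] for _ in range(n)]
        hits += is_maximum(vecs)
    return hits / trials

# p_{k,n} climbs from near 0 to near 1 as k grows relative to log(n).
for k in (1, 3, 10, 30):
    print(k, estimate_p(k, n=200))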

A high-order finite element method is proposed to solve the nonlinear convection-diffusion equation on a time-varying domain whose boundary is implicitly driven by the solution of the equation. The method is semi-implicit in the sense that the boundary is traced explicitly with a high-order surface-tracking algorithm, while the convection-diffusion equation is solved implicitly with high-order backward differentiation formulas and fictitious-domain finite element methods. Through two numerical experiments on severely deforming domains, we show that optimal convergence orders are obtained in the energy norm for the third-order and fourth-order methods.
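For reference, the standard backward differentiation formulas of orders two and three (quoted here in their generic form for $u'(t)=f(t,u)$, not tied to the specific discretization of the paper) are
\[
\frac{3u^{n+1} - 4u^{n} + u^{n-1}}{2\,\Delta t} = f\bigl(t^{n+1}, u^{n+1}\bigr),
\qquad
\frac{11u^{n+1} - 18u^{n} + 9u^{n-1} - 2u^{n-2}}{6\,\Delta t} = f\bigl(t^{n+1}, u^{n+1}\bigr),
\]
both of which evaluate the right-hand side implicitly at the new time level, consistent with the semi-implicit treatment described above (explicit boundary tracking, implicit convection-diffusion solve).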

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating distances between data points and only a subset of the cluster centres. Our contribution is substantially more efficient than k-means, as it does not require an all-to-all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as those of the exact formulation. However, our approach is considerably more efficient at extracting these clusters than the state-of-the-art. We compare our approximation with exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including the number of operations to convergence, and the stability of the results.
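A minimal sketch of the underlying idea (the candidate subset below is chosen at random purely for illustration; the actual selection rule of the method is not reproduced here): each data point is compared against a small subset of the cluster centres instead of all of them.

import numpy as np

def assign_with_candidates(X, centres, n_candidates=8, seed=0):
    """Assign each point to its nearest centre among a small candidate subset,
    avoiding the all-to-all distance computation of exact k-means."""
    rng = np.random.default_rng(seed)
    k = len(centres)
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        cand = rng.choice(k, size=min(n_candidates, k), replace=False)
        dists = np.linalg.norm(centres[cand] - x, axis=1)   # distances to candidates only
        labels[i] = cand[np.argmin(dists)]
    return labels

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 32))
centres = X[rng.choice(len(X), size=100, replace=False)]    # crude initialisation
print(np.bincount(assign_with_candidates(X, centres), minlength=100)[:10])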

This paper is devoted to a new first-order Taylor-like formula whose remainder is strongly reduced in comparison with the usual one that appears in the classical Taylor formula. To derive this new formula, we introduce a linear combination of the first derivatives of the function, computed at $n+1$ equally spaced points between the two points where the function is to be evaluated. We then show that an optimal choice of the weights in the linear combination minimizes the corresponding remainder. Finally, we analyze the Lagrange $P_1$-interpolation error estimate and the trapezoidal quadrature error to assess the gain in accuracy obtained with this new Taylor-like formula.
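To illustrate the kind of improvement at stake (using composite-trapezoid weights as one simple choice; these are not claimed to be the optimal weights derived in the paper), compare the classical expansion $f(b)\approx f(a)+(b-a)f'(a)$ with $f(b)\approx f(a)+(b-a)\sum_{i=0}^{n} w_i\, f'\!\bigl(a+i\tfrac{b-a}{n}\bigr)$:

import math

def taylor_first_order(f, df, a, b):
    return f(a) + (b - a) * df(a)

def averaged_derivative(f, df, a, b, n):
    """Weights 1/(2n), 1/n, ..., 1/n, 1/(2n): the composite trapezoid rule
    applied to f(b) - f(a) = integral of f' over [a, b]."""
    h = (b - a) / n
    weights = [0.5 / n] + [1.0 / n] * (n - 1) + [0.5 / n]
    nodes = [a + i * h for i in range(n + 1)]
    return f(a) + (b - a) * sum(w * df(x) for w, x in zip(weights, nodes))

f, df, a, b = math.exp, math.exp, 0.0, 0.5
exact = f(b)
print("classical Taylor error:", abs(taylor_first_order(f, df, a, b) - exact))
for n in (1, 2, 4, 8):
    print(f"averaged n={n}: error {abs(averaged_derivative(f, df, a, b, n) - exact):.2e}")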

Meta-reinforcement learning (meta-RL) aims to learn from multiple training tasks the ability to adapt efficiently to unseen test tasks. Despite the success, existing meta-RL algorithms are known to be sensitive to the task distribution shift. When the test task distribution is different from the training task distribution, the performance may degrade significantly. To address this issue, this paper proposes Model-based Adversarial Meta-Reinforcement Learning (AdMRL), where we aim to minimize the worst-case sub-optimality gap -- the difference between the optimal return and the return that the algorithm achieves after adaptation -- across all tasks in a family of tasks, with a model-based approach. We propose a minimax objective and optimize it by alternating between learning the dynamics model on a fixed task and finding the adversarial task for the current model -- the task for which the policy induced by the model is maximally suboptimal. Assuming the family of tasks is parameterized, we derive a formula for the gradient of the suboptimality with respect to the task parameters via the implicit function theorem, and show how the gradient estimator can be efficiently implemented by the conjugate gradient method and a novel use of the REINFORCE estimator. We evaluate our approach on several continuous control benchmarks and demonstrate its efficacy in the worst-case performance over all tasks, the generalization power to out-of-distribution tasks, and in training and test time sample efficiency, over existing state-of-the-art meta-RL algorithms.
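In symbols (with notation introduced here only for illustration: $\psi$ ranges over the task family $\Psi$, $V^{\pi}_{\psi}$ is the return of policy $\pi$ on task $\psi$, and $\pi_{\widehat M,\psi}$ is the policy obtained by adapting the learned dynamics model $\widehat M$ to task $\psi$), the worst-case sub-optimality objective reads roughly
\[
\min_{\widehat M}\; \max_{\psi \in \Psi}\; \Bigl( \max_{\pi} V^{\pi}_{\psi} \;-\; V^{\pi_{\widehat M,\psi}}_{\psi} \Bigr),
\]
with the alternation described above solving the outer minimization on a fixed task and the inner maximization by gradient ascent over the task parameters $\psi$.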

A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.
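Schematically (the proximal inner objective and the resulting Jacobian are recalled here from the implicit-MAML construction; regard the exact form as a sketch), the inner loop solves a regularized problem around the meta-parameters $\theta$, and implicit differentiation of its optimality condition gives the meta-gradient without unrolling:
\[
\phi_i^*(\theta) \;=\; \arg\min_{\phi}\ \hat{\mathcal L}_i(\phi) + \frac{\lambda}{2}\bigl\|\phi - \theta\bigr\|^2,
\qquad
\frac{d\phi_i^*}{d\theta} \;=\; \Bigl( I + \tfrac{1}{\lambda}\,\nabla^2_{\phi}\hat{\mathcal L}_i\bigl(\phi_i^*\bigr) \Bigr)^{-1},
\]
so the meta-gradient of the test loss at $\phi_i^*$ depends only on the inner solution, and the inverse is applied approximately (e.g. with conjugate gradient), which is what keeps the memory footprint close to that of a single inner-loop gradient.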

We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
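One schematic way to write the construction described above (not necessarily the exact equations of the paper): with risk-seeking/temperature parameter $\tau$ and an exploration bonus $b_t(s,a)$ added to the reward, the K-values satisfy a soft (log-sum-exp) Bellman equation and induce the Boltzmann policy mentioned in the abstract,
\[
K_t(s,a) \;=\; r(s,a) + b_t(s,a) + \sum_{s'} P\bigl(s' \mid s,a\bigr)\,\tau \log \sum_{a'} \exp\!\bigl(K_{t+1}(s',a')/\tau\bigr),
\qquad
\pi_t(a \mid s) \;\propto\; \exp\!\bigl(K_t(s,a)/\tau\bigr).
\]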

We propose a new method of estimation in topic models that is not a variation on the existing simplex-finding algorithms and that estimates the number of topics $K$ from the observed data. We derive new finite-sample minimax lower bounds for the estimation of the topic matrix $A$, as well as new upper bounds for our proposed estimator. We describe the scenarios in which our estimator is minimax adaptive. Our finite-sample analysis is valid for any number of documents ($n$), individual document length ($N_i$), dictionary size ($p$), and number of topics ($K$), and both $p$ and $K$ are allowed to increase with $n$, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, even though we start with the computational and theoretical disadvantage of not knowing the correct number of topics $K$, while providing the competing methods with its correct value in our simulations.
