
Continuous games have compact strategy sets and continuous utility functions. Such games can have a highly complicated structure of Nash equilibria, and algorithms and numerical methods for equilibrium computation are known only for particular classes of continuous games, such as two-person polynomial games or games with pure equilibria. This contribution focuses on computing and approximating a mixed-strategy equilibrium for the whole class of multiplayer general-sum continuous games. We vastly extend the scope of applicability of the double oracle algorithm, which was initially designed and proved to converge only for two-person zero-sum games. Specifically, we propose an iterative strategy-generation technique that splits the original problem into a master problem, in which only a finite subset of strategies is considered, and a subproblem, in which an oracle finds the best response of each player. This simple method is guaranteed to recover an approximate equilibrium in finitely many iterations. Further, we argue that the Wasserstein distance (the earth mover's distance) is the right metric on the space of mixed strategies for our purposes. Our main result is the convergence of this algorithm, in the Wasserstein distance, to an equilibrium of the original continuous game. Numerical experiments show the performance of our method on several examples of games appearing in the literature.
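To make the master-problem/subproblem split concrete, here is a minimal Python sketch of the double oracle loop for the simplest special case: a two-player zero-sum game on [0, 1] x [0, 1]. The payoff function `u`, the grid-based best-response oracle, and all parameter names are illustrative placeholders; the method described above covers multiplayer general-sum games and requires genuine best-response oracles over the continuous strategy spaces.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    """Row (maximizing) player's mixed equilibrium strategy and the value of
    the finite zero-sum game with payoff matrix A."""
    m, n = A.shape
    c = np.zeros(m + 1); c[-1] = -1.0              # maximize v  <=>  minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])      # v <= (A^T x)_j for every column j
    A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0  # x is a probability vector
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:m], res.x[-1]

def double_oracle(u, x0=0.5, y0=0.5, eps=1e-3, max_iter=50):
    """Double oracle on [0, 1] x [0, 1]; u(x, y) is the payoff to the maximizer."""
    grid = np.linspace(0.0, 1.0, 401)              # crude stand-in for a continuous oracle
    X, Y = [x0], [y0]
    it = 0
    while True:
        A = np.array([[u(x, y) for y in Y] for x in X])   # master problem: restricted game
        p, v = solve_matrix_game(A)                       # maximizer's mix and value
        q, _ = solve_matrix_game(-A.T)                    # minimizer's mix
        payoff_x = [sum(qj * u(x, y) for qj, y in zip(q, Y)) for x in grid]
        payoff_y = [sum(pi * u(x, y) for pi, x in zip(p, X)) for y in grid]
        it += 1
        if it >= max_iter or max(max(payoff_x) - v, v - min(payoff_y)) < eps:
            return list(zip(X, p)), list(zip(Y, q)), v    # support points with weights
        X.append(float(grid[int(np.argmax(payoff_x))]))   # subproblem: add best responses
        Y.append(float(grid[int(np.argmin(payoff_y))]))
```

For instance, `double_oracle(lambda x, y: (x - y) ** 2)` should approximate the classical equilibrium of that game, in which the maximizer mixes uniformly over {0, 1} and the minimizer plays 1/2.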

Related content

A new feature that lets iOS 8 and OS X Yosemite hand off work between devices seamlessly. > Apple products have always been designed to work together beautifully. But now they may really surprise you. With iOS 8 and OS X Yosemite, you’ll be able to do more wonderful things than ever before.

Source:

We construct a space-time parallel method for solving parabolic partial differential equations by coupling the Parareal algorithm in time with overlapping domain decomposition in space. Reformulating the original Parareal algorithm as a variational method and implementing a finite element discretization in space enables an adjoint-based a posteriori error analysis to be performed. Through an appropriate choice of adjoint problems and residuals the error analysis distinguishes between errors arising due to the temporal and spatial discretizations, as well as between the errors arising due to incomplete Parareal iterations and incomplete iterations of the domain decomposition solver. We first develop an error analysis for the Parareal method applied to parabolic partial differential equations, and then refine this analysis to the case where the associated spatial problems are solved using overlapping domain decomposition. These constitute our Time Parallel Algorithm (TPA) and Space-Time Parallel Algorithm (STPA) respectively. Numerical experiments demonstrate the accuracy of the estimator for both algorithms and the interactions between distinct components of the error.
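The Parareal update itself is compact, and a bare-bones version helps fix the structure that the error decomposition above is built around. The sketch below applies plain forward-Euler coarse and fine propagators to a generic ODE system u' = f(t, u) and runs serially; the fine propagations inside the iteration loop are the part that would be distributed across time slices. It is not the variational finite-element reformulation, the adjoint-based estimator, or the overlapping domain decomposition solver described above, and the function and parameter names are illustrative.

```python
import numpy as np

def parareal(f, u0, t0, T, n_slices, coarse_steps, fine_steps, n_iter):
    """Basic Parareal for u' = f(t, u) with forward-Euler coarse/fine propagators."""
    def euler(u, t_start, t_end, steps):
        dt = (t_end - t_start) / steps
        t = t_start
        for _ in range(steps):
            u = u + dt * f(t, u)
            t += dt
        return u

    ts = np.linspace(t0, T, n_slices + 1)
    U = [np.array(u0, dtype=float)]
    for n in range(n_slices):                      # initial coarse sweep
        U.append(euler(U[-1], ts[n], ts[n + 1], coarse_steps))

    for _ in range(n_iter):
        # These fine solves are independent across slices (the parallel part).
        F_vals = [euler(U[n], ts[n], ts[n + 1], fine_steps) for n in range(n_slices)]
        G_old = [euler(U[n], ts[n], ts[n + 1], coarse_steps) for n in range(n_slices)]
        U_new = [U[0]]
        for n in range(n_slices):                  # serial predictor-corrector sweep
            G_new = euler(U_new[n], ts[n], ts[n + 1], coarse_steps)
            U_new.append(G_new + F_vals[n] - G_old[n])
        U = U_new
    return ts, U
```

A parabolic test problem can be plugged in via the method of lines, e.g. `f = lambda t, u: L @ u` with `L` a finite-difference Laplacian (again purely illustrative).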

Our work focuses on extra-gradient learning algorithms for finding Nash equilibria in bilinear zero-sum games. The proposed method, which can formally be considered a variant of Optimistic Mirror Descent \cite{DBLP:conf/iclr/MertikopoulosLZ19}, uses a large learning rate for the intermediate gradient step, which essentially amounts to computing (approximate) best-response strategies against the profile of the previous iteration. Although such an unusually large intermediate learning step may seem counter-intuitive for an iterative algorithm, we prove that the method guarantees last-iterate convergence to an equilibrium. In particular, we show that the algorithm first reaches an $\eta^{1/\rho}$-approximate Nash equilibrium, with $\rho > 1$, by decreasing the Kullback-Leibler divergence of each iterate by at least $\Omega(\eta^{1+\frac{1}{\rho}})$ for a sufficiently small learning rate $\eta$, until the method becomes a contracting map and converges to the exact equilibrium. Furthermore, we perform experimental comparisons with the optimistic variant of the multiplicative weights update method of \cite{Daskalakis2019LastIterateCZ} and show that our algorithm has significant practical potential, since it offers substantial gains in terms of accelerated convergence.
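The update in question is easy to state for a concrete bilinear game max_x min_y x^T A y over probability simplices. The sketch below uses entropic (multiplicative-weights) mirror steps and a deliberately large learning rate for the intermediate step, so that the intermediate iterate is close to a best response to the previous profile; the step sizes, iteration count, and exact form of the update are illustrative and not necessarily those analyzed above.

```python
import numpy as np

def mw_step(p, grad, eta):
    """Entropic mirror (multiplicative-weights) step on the simplex."""
    w = p * np.exp(eta * (grad - grad.max()))   # shift for numerical stability
    return w / w.sum()

def extragradient_mw(A, T=2000, eta=0.05, eta_mid=50.0):
    """Extra-gradient dynamics for max_x min_y x^T A y with a large intermediate step."""
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n
    for _ in range(T):
        # Intermediate step: eta_mid is large, so x_mid, y_mid are near best responses
        # to the previous iterate (x, y).
        x_mid = mw_step(x, A @ y, eta_mid)       # x maximizes x^T A y
        y_mid = mw_step(y, -A.T @ x, eta_mid)    # y minimizes x^T A y
        # Main step, evaluated at the intermediate point.
        x = mw_step(x, A @ y_mid, eta)
        y = mw_step(y, -A.T @ x_mid, eta)
    return x, y
```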

This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves. The study focuses on the challenging settings where the value function or the model is parameterized by general function classes. Provably efficient algorithms for both decoupled and coordinated settings are developed. In the decoupled setting, where the agent controls a single player and plays against an arbitrary opponent, we propose a new model-free algorithm. The sample complexity is governed by the Minimax Eluder dimension -- a new dimension of the function class in Markov games. As a special case, this method improves the state-of-the-art algorithm by a $\sqrt{d}$ factor in the regret when the reward function and transition kernel are parameterized with $d$-dimensional linear features. In the coordinated setting, where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm. In the model-based algorithm, we prove that the sample complexity can be bounded by a generalization of the Witness rank to Markov games. The model-free algorithm enjoys a $\sqrt{K}$-regret upper bound, where $K$ is the number of episodes.
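As background for what "solving" such a game means, the sketch below performs classical backward induction with a known tabular model, solving a zero-sum matrix game at every state and step. This is the planning counterpart of the setting above, not the sample-efficient learning algorithms it describes, and all array shapes and names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value and maximizer's mixed strategy of the zero-sum matrix game A."""
    m, n = A.shape
    c = np.zeros(m + 1); c[-1] = -1.0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1], res.x[:m]

def solve_zero_sum_markov_game(r, P, H):
    """Backward induction for a finite-horizon simultaneous-move zero-sum Markov game.
    r[s, a, b]     : reward to the max player
    P[s, a, b, s'] : transition probabilities
    H              : horizon"""
    S, A, B = r.shape
    V = np.zeros(S)                                    # value after the final step
    policies = []
    for h in range(H - 1, -1, -1):
        Q = r + P @ V                                  # Q[s, a, b] at step h
        V_new, pi_h = np.zeros(S), np.zeros((S, A))
        for s in range(S):
            V_new[s], pi_h[s] = matrix_game_value(Q[s])  # stage matrix game
        V = V_new
        policies.append(pi_h)
    policies.reverse()
    return V, policies                                 # max player's per-step strategies
```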

We develop a novel unified randomized block-coordinate primal-dual algorithm to solve a class of nonsmooth constrained convex optimization problems, which covers different existing variants and model settings from the literature. We prove that our algorithm achieves optimal $\mathcal{O}(n/k)$ and $\mathcal{O}(n^2/k^2)$ convergence rates (up to a constant factor) in two cases: general convexity and strong convexity, respectively, where $k$ is the iteration counter and $n$ is the number of block coordinates. Our convergence rates are obtained through three criteria: primal objective residual and primal feasibility violation, dual objective residual, and primal-dual expected gap. Moreover, our rates for the primal problem hold for the last-iterate sequence. Our dual convergence guarantee additionally requires a Lipschitz continuity assumption. We specialize our algorithm to handle two important special cases, where our rates still apply. Finally, we verify our algorithm on two well-studied numerical examples and compare it with two existing methods. Our results show that the proposed method has encouraging performance on different experiments.
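To make the problem template concrete, the toy sketch below runs a simple (non-accelerated) randomized block-coordinate primal-dual scheme on min_x sum_i f_i(x_i) subject to sum_i A_i x_i = b: each iteration takes a gradient step on the augmented Lagrangian with respect to one randomly chosen block, then a dual ascent step on the constraint residual. It does not implement the unified algorithm or attain the optimal rates discussed above, and every name and step size is an illustrative placeholder.

```python
import numpy as np

def random_block_primal_dual(f_grads, A_blocks, b, x_blocks, n_iters=20000,
                             tau=1e-2, sigma=1e-2, rho=1.0, seed=0):
    """Toy randomized block-coordinate primal-dual loop for
       min_x sum_i f_i(x_i)  s.t.  sum_i A_i x_i = b."""
    rng = np.random.default_rng(seed)
    y = np.zeros_like(b, dtype=float)                        # dual variable
    r = sum(A @ x for A, x in zip(A_blocks, x_blocks)) - b   # constraint residual
    for _ in range(n_iters):
        i = rng.integers(len(x_blocks))                      # sample one block
        Ai, xi = A_blocks[i], x_blocks[i]
        g = f_grads[i](xi) + Ai.T @ (y + rho * r)            # block gradient of the aug. Lagrangian
        xi_new = xi - tau * g
        r = r + Ai @ (xi_new - xi)                           # keep the residual up to date
        x_blocks[i] = xi_new
        y = y + sigma * r                                    # dual ascent step
    return x_blocks, y
```

For instance, with quadratic blocks f_i(x_i) = 0.5 * ||x_i - c_i||^2 one would pass `f_grads[i] = lambda x, c=c_i: x - c`; small `tau` and `sigma` keep the toy iteration stable.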

We propose the first non-trivial generic decoding algorithm for codes in the sum-rank metric. The new method combines ideas of well-known generic decoders in the Hamming and rank metrics. For the same code parameters and number of errors, the new generic decoder has a larger expected complexity than the known generic decoders for the Hamming metric and a smaller one than the known rank-metric decoders. Furthermore, we give a formal hardness reduction, providing evidence that generic sum-rank decoding is computationally hard. As a by-product of the above, we solve some fundamental coding problems in the sum-rank metric: we give an algorithm to compute the exact size of a sphere of a given sum-rank radius, as well as an upper bound in closed form; and we study erasure decoding with respect to two different notions of support.
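The exact sphere size mentioned above can also be obtained by direct counting: the number of m x eta matrices over F_q of a given rank has a standard closed form, and the per-block counts can be convolved over the ell blocks. The sketch below does exactly that; it is a straightforward dynamic program, not necessarily the algorithm or the closed-form upper bound derived in the paper, and the parameter names (q, m, eta, ell, t) are illustrative.

```python
from math import prod

def num_matrices_of_rank(m, eta, r, q):
    """Number of m x eta matrices over F_q with rank exactly r."""
    if r < 0 or r > min(m, eta):
        return 0
    # Gaussian binomial [eta choose r]_q (choice of row space) times the number of
    # matrices having that fixed r-dimensional row space.
    gauss = prod(q**eta - q**i for i in range(r)) // prod(q**r - q**i for i in range(r))
    return gauss * prod(q**m - q**i for i in range(r))

def sum_rank_sphere_size(q, m, eta, ell, t):
    """Number of vectors in F_{q^m}^{eta*ell} (ell blocks of length eta) with
    sum-rank weight exactly t, via convolution of per-block rank distributions."""
    per_block = [num_matrices_of_rank(m, eta, r, q) for r in range(min(m, eta) + 1)]
    dp = [1] + [0] * t            # dp[w]: vectors over the blocks seen so far with weight w
    for _ in range(ell):
        new = [0] * (t + 1)
        for w in range(t + 1):
            if dp[w]:
                for r, cnt in enumerate(per_block):
                    if w + r <= t:
                        new[w + r] += dp[w] * cnt
        dp = new
    return dp[t]
```

As a sanity check, eta = 1 recovers Hamming spheres and ell = 1 recovers rank-metric spheres.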

In this paper we consider multi-objective reinforcement learning, where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner; e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives an (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax-optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}\cdot H^2 SAK}\bigr)$, where $d$ is the number of objectives, $S$ is the number of states, $A$ is the number of actions, $H$ is the length of the horizon, and $K$ is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and is then able to accommodate an arbitrary preference vector up to $\epsilon$ error. Our proposed algorithm is provably efficient with a nearly optimal trajectory complexity $\widetilde{\mathcal{O}}\bigl({\min\{d,S\}\cdot H^3 SA}/{\epsilon^2}\bigr)$. This result partly resolves an open problem raised by \citet{jin2020reward}.
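The reward construction above is simple to state in code: given the pre-specified multi-objective rewards and the preference vector revealed for an episode, the scalar reward is their inner product, and the agent can plan against it. The sketch below does finite-horizon value iteration on an estimated tabular model with that scalarized reward; it only illustrates the online protocol and omits the exploration machinery that gives the algorithm above its regret guarantee. Shapes and names are illustrative.

```python
import numpy as np

def plan_for_preference(P_hat, R_multi, w, H):
    """Greedy finite-horizon plan for one episode under preference w.
    P_hat   : (S, A, S) estimated transition probabilities
    R_multi : (S, A, d) pre-specified multi-objective rewards
    w       : (d,) preference vector revealed at the start of the episode"""
    S, A, _ = P_hat.shape
    R = R_multi @ w                           # scalarized reward <w, r(s, a)>, shape (S, A)
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = R + P_hat @ V                     # Q[s, a] = r(s, a) + E[V(s')]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```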

Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots.
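The PSRO loop itself is easy to write down for a game small enough that exact best responses are available; the point of P2SRO is precisely that in large games the best-response step is an expensive reinforcement learning run, and those runs can be pipelined. In the matrix-game skeleton below the meta-Nash solver is plain fictitious play and the "oracle" is an argmax, both stand-ins for the components used at scale; names and iteration counts are illustrative.

```python
import numpy as np

def meta_nash_fp(A, iters=5000):
    """Approximate Nash of the restricted zero-sum meta-game A via fictitious play."""
    m, n = A.shape
    rc, cc = np.ones(m), np.ones(n)
    for _ in range(iters):
        rc[np.argmax(A @ (cc / cc.sum()))] += 1
        cc[np.argmin((rc / rc.sum()) @ A)] += 1
    return rc / rc.sum(), cc / cc.sum()

def psro_matrix_game(U, iters=20):
    """PSRO skeleton on a zero-sum matrix game U (row player maximizes)."""
    rows, cols = [0], [0]                          # initial policy populations
    for _ in range(iters):
        A = U[np.ix_(rows, cols)]                  # restricted empirical game
        p, q = meta_nash_fp(A)                     # meta-Nash over the populations
        br_row = int(np.argmax(U[:, cols] @ q))    # best response to the column meta-strategy
        br_col = int(np.argmin(p @ U[rows, :]))    # best response to the row meta-strategy
        if br_row in rows and br_col in cols:
            break                                  # nothing new to add
        if br_row not in rows: rows.append(br_row)
        if br_col not in cols: cols.append(br_col)
    return rows, p, cols, q
```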

Efficient exploration remains a major challenge for reinforcement learning. One reason is that the variability of the returns often depends on the current state and action, and is therefore heteroscedastic. Classical exploration strategies such as upper confidence bound algorithms and Thompson sampling fail to appropriately account for heteroscedasticity, even in the bandit setting. Motivated by recent findings that address this issue in bandits, we propose to use Information-Directed Sampling (IDS) for exploration in reinforcement learning. As our main contribution, we build on recent advances in distributional reinforcement learning and propose a novel, tractable approximation of IDS for deep Q-learning. The resulting exploration strategy explicitly accounts for both parametric uncertainty and heteroscedastic observation noise. We evaluate our method on Atari games and demonstrate a significant improvement over alternative approaches.
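The core idea, choosing the action that minimizes the ratio of squared estimated regret to information gain, can be shown on a toy Gaussian bandit with action-dependent noise. The sketch below is one simple deterministic variant with a conservative gap estimate and a Gaussian information-gain term; it is not the distributional approximation for deep Q-learning described above, and the environment, prior, and constants are illustrative.

```python
import numpy as np

def ids_gaussian_bandit(means, noise_std, T=2000, beta=2.0, seed=0):
    """Deterministic IDS on a heteroscedastic Gaussian bandit."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    noise_std = np.asarray(noise_std, dtype=float)
    mu, var = np.zeros(len(means)), np.ones(len(means))      # N(0, 1) priors per arm
    for _ in range(T):
        sd = np.sqrt(var)
        gap = (mu + beta * sd).max() - (mu - beta * sd)      # conservative regret estimate
        info = 0.5 * np.log1p(var / noise_std**2)            # info gain of one more pull
        a = int(np.argmin(np.square(gap) / info))            # minimize the information ratio
        y = means[a] + noise_std[a] * rng.standard_normal()  # heteroscedastic observation
        prec = 1.0 / var[a] + 1.0 / noise_std[a]**2          # Gaussian conjugate update
        mu[a] = (mu[a] / var[a] + y / noise_std[a]**2) / prec
        var[a] = 1.0 / prec
    return mu, var
```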

We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy whose 'temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism-based and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
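Since the abstract emphasizes that the method amounts to adding a bonus to the reward and solving a Bellman equation, a tabular sketch is easy to give. The version below uses a generic count-based bonus, a log-sum-exp ("soft") Bellman backup, and a Boltzmann policy at a fixed temperature tau; the actual bonus, the optimization or annealing of the risk-seeking parameter, and the regret analysis above are more refined than this illustration, and all names are placeholders.

```python
import numpy as np
from scipy.special import logsumexp

def k_style_planner(P_hat, R_hat, counts, H, tau=1.0, c=1.0):
    """Solve a soft Bellman equation with a count-based reward bonus (tabular sketch).
    P_hat  : (S, A, S) empirical transitions,  R_hat : (S, A) empirical rewards,
    counts : (S, A) visit counts,  H : horizon,  tau : Boltzmann temperature."""
    S, A, _ = P_hat.shape
    bonus = c / np.sqrt(np.maximum(counts, 1))           # optimism bonus per (s, a)
    K = np.zeros((H, S, A))
    V = np.zeros(S)
    for h in range(H - 1, -1, -1):
        K[h] = R_hat + bonus + P_hat @ V                 # Bellman backup for the K-values
        V = tau * logsumexp(K[h] / tau, axis=1)          # soft maximum over actions
    # Boltzmann exploration policy at temperature tau, one distribution per (h, s).
    policy = np.exp(K / tau - logsumexp(K / tau, axis=2, keepdims=True))
    return K, policy
```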

In this paper, we study the optimal convergence rate for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely when the function $F(\mathbf{x}) \triangleq \sum_{i=1}^{m} f_i(\mathbf{x})$ is (i) strongly convex and smooth, (ii) strongly convex, (iii) smooth, or (iv) just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors) with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions of the proposed setup, such as proximal-friendly functions, time-varying graphs, and improvement of the condition numbers.
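The dual approach mentioned above is easy to illustrate on the consensus problem min_x sum_i 0.5*(x_i - c_i)^2 subject to x_i = x_j along the edges of the network, encoded through the graph Laplacian L: every dual gradient evaluation is a multiplication by L, i.e. only an exchange of values with neighbours. The sketch below runs Nesterov-accelerated gradient ascent on that dual with scalar local variables and closed-form local minimizers; it is a minimal illustration under these simplifying assumptions, not the general algorithm or the complexity analysis above.

```python
import numpy as np

def distributed_dual_nesterov(L, c, step, iters=500):
    """Accelerated dual ascent for min_x sum_i 0.5*(x_i - c_i)^2 s.t. L x = 0 (consensus).
    L    : graph Laplacian (the interaction matrix)
    c    : local targets, one scalar per node
    step : dual step size, roughly 1 / lambda_max(L)^2"""
    c = np.asarray(c, dtype=float)
    lam = np.zeros(len(c))                    # dual variables
    y = np.zeros(len(c))                      # extrapolated dual point
    for k in range(iters):
        x = c - L @ y                         # local minimizers of the Lagrangian at y
        lam_next = y + step * (L @ x)         # dual gradient = consensus violation L x
        y = lam_next + (k / (k + 3)) * (lam_next - lam)   # Nesterov momentum
        lam = lam_next
    return c - L @ lam                        # primal point recovered from the dual
```

A reasonable step size here is `step = 1.0 / np.linalg.eigvalsh(L).max() ** 2`, since the dual gradient is affine with Lipschitz constant equal to the largest eigenvalue of L squared.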
