日韩在线精品小视频_中文熟妇亚洲视频观看_精品处破在线观看美女视频_欧洲激情一区二区三区_97亚洲精品国偷自产在线_在线国产一区二区三区_久久国产精品久久精品国产四

Network load balancers are central components in modern data centers, that cooperatively distribute workloads of high arrival rates across application servers, thereby contribute to offering scalable services. The independent and "selfish" load balancing strategy is not necessarily the globally optimal one. This paper represents the load balancing problem as a cooperative team-game with limited observations over system states, and adopts multi-agent reinforcement learning methods to make fair load balancing decisions without inducing additional processing latency. On both a simulation and an emulation system, the proposed method is evaluated against other load balancing algorithms, including state-of-the-art heuristics and learning-based strategies. Experiments under different settings and complexities show the advantageous performance of the proposed method.

相關內容

Performer

關注 10

學成 · 部分可觀測馬爾可夫決策過程 · 回合 · 強化學習 · Processing（編程語言） ·

2022 年 4 月 20 日

Mingling Foresight with Imagination: Model-Based Cooperative Multi-Agent Reinforcement Learning

Zhiwei Xu,Dapeng Li,Bin Zhang,Yuan Zhan,Yunpeng Bai,Guoliang Fan

from arxiv, 10 pages, 7 figures, 2 tables

Recently, model-based agents have achieved better performance compared with model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is very difficult to learn the model of the environment. When model-based methods are applied to multi-agent tasks, the significant compounding error may hinder the learning process. In this paper, we propose an implicit model-based multi-agent reinforcement learning method based on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states, which makes agents have foresight. Our method can be applied to any multi-agent value decomposition method. The experimental results show that our method improves the sample efficiency in partially observable Markov decision process domains.

學成 · INFORMS · MoDELS · 深度強化學習 · 強化學習 ·

2022 年 4 月 19 日

Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning

Jinmei Liu,Zhi Wang,Chunlin Chen,Daoyi Dong

from arxiv, 16 pages, 6 figures, under review

Bayesian policy reuse (BPR) is a general policy transfer framework for selecting a source policy from an offline library by inferring the task belief based on some observation signals and a trained observation model. In this paper, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal that contains limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and instantaneous, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of the tabular-based observation model, which may be expensive and even infeasible to learn and maintain, especially when using the state transition sample as the signal. Hence, we propose a scalable observation model based on fitting state transition functions of source tasks from only a small number of samples, which can generalize to any signals observed in the target task. Moreover, we extend the offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which can avoid negative transfer when faced with new unknown tasks. Experimental results show that our method can consistently facilitate faster and more efficient policy transfer.

學成 · 貢獻度分配問題 · 可約的 · 強化學習 · 極大 ·

2022 年 4 月 19 日

Multi-UAV Collision Avoidance using Multi-Agent Reinforcement Learning with Counterfactual Credit Assignment

Shuangyao Huang,Haibo Zhang,Zhiyi Huang

Multi-UAV collision avoidance is a challenging task for UAV swarm applications due to the need of tight cooperation among swarm members for collision-free path planning. Centralized Training with Decentralized Execution (CTDE) in Multi-Agent Reinforcement Learning is a promising method for multi-UAV collision avoidance, in which the key challenge is to effectively learn decentralized policies that can maximize a global reward cooperatively. We propose a new multi-agent critic-actor learning scheme called MACA for UAV swarm collision avoidance. MACA uses a centralized critic to maximize the discounted global reward that considers both safety and energy efficiency, and an actor per UAV to find decentralized policies to avoid collisions. To solve the credit assignment problem in CTDE, we design a counterfactual baseline that marginalizes both an agent's state and action, enabling to evaluate the importance of an agent in the joint observation-action space. To train and evaluate MACA, we design our own simulation environment MACAEnv to closely mimic the realistic behaviors of a UAV swarm. Simulation results show that MACA achieves more than 16% higher average reward than two state-of-the-art MARL algorithms and reduces failure rate by 90% and response time by over 99% compared to a conventional UAV swarm collision avoidance algorithm in all test scenarios.

潛變量/隱變量 · 生成模型 · MoDELS · Performer · 學成 ·

2022 年 4 月 18 日

Training and Evaluation of Deep Policies using Reinforcement Learning and Generative Models

Ali Ghadirzadeh,Petra Poklukar,Karol Arndt,Chelsea Finn,Ville Kyrki,Danica Kragic,M?rten Bj?rkman

from arxiv, arXiv admin note: substantial text overlap with arXiv:2007.13134

We present a data-efficient framework for solving sequential decision-making problems which exploits the combination of reinforcement learning (RL) and latent variable generative models. The framework, called GenRL, trains deep policies by introducing an action latent variable such that the feed-forward policy search can be divided into two parts: (i) training a sub-policy that outputs a distribution over the action latent variable given a state of the system, and (ii) unsupervised training of a generative model that outputs a sequence of motor actions conditioned on the latent action variable. GenRL enables safe exploration and alleviates the data-inefficiency problem as it exploits prior knowledge about valid sequences of motor actions. Moreover, we provide a set of measures for evaluation of generative models such that we are able to predict the performance of the RL policy training prior to the actual training on a physical robot. We experimentally determine the characteristics of generative models that have most influence on the performance of the final policy training on two robotics tasks: shooting a hockey puck and throwing a basketball. Furthermore, we empirically demonstrate that GenRL is the only method which can safely and efficiently solve the robotics tasks compared to two state-of-the-art RL methods.

學成 · Neural Networks · 強化學習 · 深度強化學習 · 知識 (knowledge) ·

2022 年 4 月 14 日

Methodical Advice Collection and Reuse in Deep Reinforcement Learning

Sahir,Ercüment ?lhan,Srijita Das,Matthew E. Taylor

from arxiv, To be published in ALA2022: Adaptive and Learning Agents Workshop 2022 at AAMAS

Reinforcement learning (RL) has shown great success in solving many challenging tasks via use of deep neural networks. Although using deep learning for RL brings immense representational power, it also causes a well-known sample-inefficiency problem. This means that the algorithms are data-hungry and require millions of training samples to converge to an adequate policy. One way to combat this issue is to use action advising in a teacher-student framework, where a knowledgeable teacher provides action advice to help the student. This work considers how to better leverage uncertainties about when a student should ask for advice and if the student can model the teacher to ask for less advice. The student could decide to ask for advice when it is uncertain or when both it and its model of the teacher are uncertain. In addition to this investigation, this paper introduces a new method to compute uncertainty for a deep RL agent using a secondary neural network. Our empirical results show that using dual uncertainties to drive advice collection and reuse may improve learning performance across several Atari games.

任務對話系統 · INTERACT · 學成 · 話題 · 情景 ·

2022 年 4 月 7 日

Interacting with Non-Cooperative User: A New Paradigm for Proactive Dialogue Policy

Wenqiang Lei,Yao Zhang,Feifan Song,Hongru Liang,Jiaxin Mao,Jiancheng Lv,Zhenglu Yang,Tat-Seng Chua

from arxiv, Accepted to SIGIR 2022

Proactive dialogue system is able to lead the conversation to a goal topic and has advantaged potential in bargain, persuasion and negotiation. Current corpus-based learning manner limits its practical application in real-world scenarios. To this end, we contribute to advance the study of the proactive dialogue policy to a more natural and challenging setting, i.e., interacting dynamically with users. Further, we call attention to the non-cooperative user behavior -- the user talks about off-path topics when he/she is not satisfied with the previous topics introduced by the agent. We argue that the targets of reaching the goal topic quickly and maintaining a high user satisfaction are not always converge, because the topics close to the goal and the topics user preferred may not be the same. Towards this issue, we propose a new solution named I-Pro that can learn Proactive policy in the Interactive setting. Specifically, we learn the trade-off via a learned goal weight, which consists of four factors (dialogue turn, goal completion difficulty, user satisfaction estimation, and cooperative degree). The experimental results demonstrate I-Pro significantly outperforms baselines in terms of effectiveness and interpretability.

負例 · INFORMS · 圖 · MoDELS · 樣本 ·

2020 年 3 月 12 日

Reinforced Negative Sampling over Knowledge Graph for Recommendation

Xiang Wang,Yaokun Xu,Xiangnan He,Yixin Cao,Meng Wang,Tat-Seng Chua

from arxiv, WWW 2020 oral presentation

Properly handling missing data is a fundamental challenge in recommendation. Most present works perform negative sampling from unobserved data to supply the training of recommender models with negative signals. Nevertheless, existing negative sampling strategies, either static or adaptive ones, are insufficient to yield high-quality negative samples --- both informative to model training and reflective of user real needs. In this work, we hypothesize that item knowledge graph (KG), which provides rich relations among items and KG entities, could be useful to infer informative and factual negative samples. Towards this end, we develop a new negative sampling model, Knowledge Graph Policy Network (KGPolicy), which works as a reinforcement learning agent to explore high-quality negatives. Specifically, by conducting our designed exploration operations, it navigates from the target positive interaction, adaptively receives knowledge-aware negative signals, and ultimately yields a potential negative item to train the recommender. We tested on a matrix factorization (MF) model equipped with KGPolicy, and it achieves significant improvements over both state-of-the-art sampling methods like DNS and IRGAN, and KG-enhanced recommender models like KGAT. Further analyses from different angles provide insights of knowledge-aware sampling. We release the codes and datasets at //github.com/xiangwang1223/kgpolicy.

學成 · 強化學習 · Performer · Better · state-of-the-art ·

2020 年 2 月 10 日

Q-value Path Decomposition for Deep Multiagent Reinforcement Learning

Yaodong Yang,Jianye Hao,Guangyong Chen,Hongyao Tang,Yingfeng Chen,Yujing Hu,Changjie Fan,Zhongyu Wei

Recently, deep multiagent reinforcement learning (MARL) has become a highly active research area as many real-world problems can be inherently viewed as multiagent systems. A particularly interesting and widely applicable class of problems is the partially observable cooperative multiagent setting, in which a team of agents learns to coordinate their behaviors conditioning on their private observations and commonly shared global reward signals. One natural solution is to resort to the centralized training and decentralized execution paradigm. During centralized training, one key challenge is the multiagent credit assignment: how to allocate the global rewards for individual agent policies for better coordination towards maximizing system-level's benefits. In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system's global Q-values into individual agents' Q-values. Unlike previous works which restrict the representation relation of the individual Q-values and the global one, we leverage the integrated gradient attribution technique into deep MARL to directly decompose global Q-values along trajectory paths to assign credits for agents. We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that QPD achieves the state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms.

強化學習 · 學成 · tuning · 回合 · 有向 ·

2020 年 1 月 19 日

A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions

Amit Kumar Mondal,Nadeem Jamali

Reinforcement learning is one of the core components in designing an artificial intelligent system emphasizing real-time response. Reinforcement learning influences the system to take actions within an arbitrary environment either having previous knowledge about the environment model or not. In this paper, we present a comprehensive study on Reinforcement Learning focusing on various dimensions including challenges, the recent development of different state-of-the-art techniques, and future directions. The fundamental objective of this paper is to provide a framework for the presentation of available methods of reinforcement learning that is informative enough and simple to follow for the new researchers and academics in this domain considering the latest concerns. First, we illustrated the core techniques of reinforcement learning in an easily understandable and comparable way. Finally, we analyzed and depicted the recent developments in reinforcement learning approaches. My analysis pointed out that most of the models focused on tuning policy values rather than tuning other things in a particular state of reasoning.

2018 年 1 月 5 日

Deep Reinforcement Learning for List-wise Recommendations

Xiangyu Zhao,Liang Zhang,Zhuoye Ding,Dawei Yin,Yihong Zhao,Jiliang Tang

Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wide recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.