亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in a POMDP given trajectories with only partial state observations generated by a different and unknown policy that may depend on the unobserved state. We tackle two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how to best estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible by the existence of so-called bridge functions. We then show how to construct semiparametrically efficient estimators in these settings. We term the resulting framework proximal reinforcement learning (PRL). We demonstrate the benefits of PRL in an extensive simulation study and on the problem of sepsis management.

相關內容

Modern statistical estimation is often performed in a distributed setting where each sample belongs to a single user who shares their data with a central server. Users are typically concerned with preserving the privacy of their samples, and also with minimizing the amount of data they must transmit to the server. We give improved private and communication-efficient algorithms for estimating several popular measures of the entropy of a distribution. All of our algorithms have constant communication cost and satisfy local differential privacy. For a joint distribution over many variables whose conditional independence is given by a tree, we describe algorithms for estimating Shannon entropy that require a number of samples that is linear in the number of variables, compared to the quadratic sample complexity of prior work. We also describe an algorithm for estimating Gini entropy whose sample complexity has no dependence on the support size of the distribution and can be implemented using a single round of concurrent communication between the users and the server. In contrast, the previously best-known algorithm has high communication cost and requires the server to facilitate interaction between the users. Finally, we describe an algorithm for estimating collision entropy that generalizes the best known algorithm to the private and communication-efficient setting.

For large-scale cyber-physical systems, the collaboration of spatially distributed sensors is often needed to perform the state estimation process. Privacy concerns naturally arise from disclosing sensitive measurement signals to a cloud estimator that predicts the system state. To solve this issue, we propose a differentially private set-based estimation protocol that preserves the privacy of the measurement signals. Compared to existing research, our approach achieves less privacy loss and utility loss using a numerically optimized truncated noise distribution. The proposed estimator is perturbed by weaker noise than the analytical approaches in the literature to guarantee the same level of privacy, therefore improving the estimation utility. Numerical and comparison experiments with truncated Laplace noise are presented to support our approach. Zonotopes, a less conservative form of set representation, are used to represent estimation sets, giving set operations a computational advantage. The privacy-preserving noise anonymizes the centers of these estimated zonotopes, concealing the precise positions of the estimated zonotopes.

Recent work has focused on the potential and pitfalls of causal identification in observational studies with multiple simultaneous treatments. Building on previous work, we show that even if the conditional distribution of unmeasured confounders given treatments were known exactly, the causal effects would not in general be identifiable, although they may be partially identified. Given these results, we propose a sensitivity analysis method for characterizing the effects of potential unmeasured confounding, tailored to the multiple treatment setting, that can be used to characterize a range of causal effects that are compatible with the observed data. Our method is based on a copula factorization of the joint distribution of outcomes, treatments, and confounders, and can be layered on top of arbitrary observed data models. We propose a practical implementation of this approach making use of the Gaussian copula, and establish conditions under which causal effects can be bounded. We also describe approaches for reasoning about effects, including calibrating sensitivity parameters, quantifying robustness of effect estimates, and selecting models that are most consistent with prior hypotheses.

As IoT devices are becoming widely deployed, there exist many threats to IoT-based systems due to their inherent vulnerabilities. One effective approach to improving IoT security is to deploy IoT honeypot systems, which can collect attack information and reveal the methods and strategies used by attackers. However, building high-interaction IoT honeypots is challenging due to the heterogeneity of IoT devices. Vulnerabilities in IoT devices typically depend on specific device types or firmware versions, which encourages attackers to perform pre-attack checks to gather device information before launching attacks. Moreover, conventional honeypots are easily detected because their replying logic differs from that of the IoT devices they try to mimic. To address these problems, we develop an adaptive high-interaction honeypot for IoT devices, called HoneyIoT. We first build a real device based attack trace collection system to learn how attackers interact with IoT devices. We then model the attack behavior through markov decision process and leverage reinforcement learning techniques to learn the best responses to engage attackers based on the attack trace. We also use differential analysis techniques to mutate response values in some fields to generate high-fidelity responses. HoneyIoT has been deployed on the public Internet. Experimental results show that HoneyIoT can effectively bypass the pre-attack checks and mislead the attackers into uploading malware. Furthermore, HoneyIoT is covert against widely used reconnaissance and honeypot detection tools.

Diabetes is a global health priority, especially in low- and-middle-income countries, where over 50% of premature deaths are attributed to high blood glucose. Several studies have demonstrated the feasibility of using Community Health Worker (CHW) programs to provide affordable and culturally tailored solutions for early detection and management of diabetes. Yet, scalable models to design and implement CHW programs while accounting for screening, management, and patient enrollment decisions have not been proposed. We introduce an optimization framework to determine personalized CHW visits that maximize glycemic control at a community-level. Our framework explicitly models the trade-off between screening new patients and providing management visits to individuals who are already enrolled in treatment. We account for patients' motivational states, which affect their decisions to enroll or drop out of treatment and, therefore, the effectiveness of the intervention. We incorporate these decisions by modeling patients as utility-maximizing agents within a bi-level provider problem that we solve using approximate dynamic programming. By estimating patients' health and motivational states, our model builds visit plans that account for patients' tradeoffs when deciding to enroll in treatment, leading to reduced dropout rates and improved resource allocation. We apply our approach to generate CHW visit plans using operational data from a social enterprise serving low-income neighborhoods in urban areas of India. Through extensive simulation experiments, we find that our framework requires up to 73.4% less capacity than the best naive policy to achieve the same performance in terms of glycemic control. Our experiments also show that our solution algorithm can improve upon naive policies by up to 124.5% using the same CHW capacity.

We present a full implementation and simulation of a novel quantum reinforcement learning method. Our work is a detailed and formal proof of concept for how quantum algorithms can be used to solve reinforcement learning problems and shows that, given access to error-free, efficient quantum realizations of the agent and environment, quantum methods can yield provable improvements over classical Monte-Carlo based methods in terms of sample complexity. Our approach shows in detail how to combine amplitude estimation and Grover search into a policy evaluation and improvement scheme. We first develop quantum policy evaluation (QPE) which is quadratically more efficient compared to an analogous classical Monte Carlo estimation and is based on a quantum mechanical realization of a finite Markov decision process (MDP). Building on QPE, we derive a quantum policy iteration that repeatedly improves an initial policy using Grover search until the optimum is reached. Finally, we present an implementation of our algorithm for a two-armed bandit MDP which we then simulate.

Vanilla Reinforcement Learning (RL) can efficiently solve complex tasks but does not provide any guarantees on system behavior. Yet, for real-world systems, which are often safety-critical, such guarantees on safety specifications are necessary. To bridge this gap, we propose a safe RL procedure for continuous action spaces with verified probabilistic guarantees specified via temporal logic. First, our approach probabilistically verifies a candidate controller with respect to a temporal logic specification while randomizing the controller's inputs within an expansion set. Then, we use RL to improve the performance of this probabilistically verified controller and explore in the given expansion set around the controller's input. Finally, we calculate probabilistic safety guarantees with respect to temporal logic specifications for the learned agent. Our approach is efficiently implementable for continuous action and state spaces and separates safety verification and performance improvement into two distinct steps. We evaluate our approach on an evasion task where a robot has to reach a goal while evading a dynamic obstacle with a specific maneuver. Our results show that our safe RL approach leads to efficient learning while probablistically maintaining safety specifications.

The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques on data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems that heavily reply on model assumptions, new developments from reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which is the setting for many of the commonly used RL approaches. Various algorithms are then introduced with a focus on value and policy based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the application of these RL algorithms in a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.

Reinforcement learning is one of the core components in designing an artificial intelligent system emphasizing real-time response. Reinforcement learning influences the system to take actions within an arbitrary environment either having previous knowledge about the environment model or not. In this paper, we present a comprehensive study on Reinforcement Learning focusing on various dimensions including challenges, the recent development of different state-of-the-art techniques, and future directions. The fundamental objective of this paper is to provide a framework for the presentation of available methods of reinforcement learning that is informative enough and simple to follow for the new researchers and academics in this domain considering the latest concerns. First, we illustrated the core techniques of reinforcement learning in an easily understandable and comparable way. Finally, we analyzed and depicted the recent developments in reinforcement learning approaches. My analysis pointed out that most of the models focused on tuning policy values rather than tuning other things in a particular state of reasoning.

北京阿比特科技有限公司