
We study dynamic discrete choice models, where a commonly studied problem involves estimating parameters of agent reward functions (also known as "structural" parameters), using agent behavioral data. Maximum likelihood estimation for such models requires dynamic programming, which is limited by the curse of dimensionality. In this work, we present a novel algorithm that provides a data-driven method for selecting and aggregating states, which lowers the computational and sample complexity of estimation. Our method works in two stages. In the first stage, we use a flexible inverse reinforcement learning approach to estimate agent Q-functions. We use these estimated Q-functions, along with a clustering algorithm, to select a subset of states that are the most pivotal for driving changes in Q-functions. In the second stage, with these selected "aggregated" states, we conduct maximum likelihood estimation using a commonly used nested fixed-point algorithm. The proposed two-stage approach mitigates the curse of dimensionality by reducing the problem dimension. Theoretically, we derive finite-sample bounds on the associated estimation error, which also characterize the trade-off among computational complexity, estimation error, and sample complexity. We demonstrate the empirical performance of the algorithm in two classic dynamic discrete choice estimation applications.
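To make the first stage concrete, the sketch below clusters states by their estimated Q-value profiles and uses the cluster labels as aggregated states. The use of KMeans, the array layout of q_hat, and the number of aggregated states are illustrative assumptions rather than the paper's exact choices.

```python
# Sketch of the first stage: cluster states by their estimated Q-value
# profiles to obtain an aggregated (lower-dimensional) state space.
# Assumptions: q_hat is an (n_states, n_actions) array of Q-function
# estimates from the inverse-RL step; KMeans is an illustrative choice
# of clustering algorithm, and n_aggregated controls the reduced dimension.
import numpy as np
from sklearn.cluster import KMeans

def aggregate_states(q_hat: np.ndarray, n_aggregated: int, seed: int = 0) -> np.ndarray:
    """Map each original state to one of n_aggregated cluster labels."""
    km = KMeans(n_clusters=n_aggregated, n_init=10, random_state=seed)
    return km.fit_predict(q_hat)  # shape: (n_states,)

# Toy usage: 1000 states, 5 actions, aggregated into 20 groups before the
# nested fixed-point maximum likelihood step.
rng = np.random.default_rng(0)
q_hat = rng.normal(size=(1000, 5))
labels = aggregate_states(q_hat, n_aggregated=20)
```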

Related content

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{5/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
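As a rough illustration of the sampling step, the sketch below performs one Langevin (noisy gradient) update on the parameters of a linear Q-function; acting greedily with respect to the sampled parameters then implements Thompson sampling without a Gaussian approximation. The feature map phi, the TD-style squared loss standing in for the negative log-posterior, and the step size are assumptions made here for concreteness, not the paper's exact formulation.

```python
# One Langevin (SGLD-style) update of linear Q-function parameters.
# Assumptions: phi(s, a) is a known feature map; the squared TD error below
# stands in for the (unnormalised) negative log-posterior; step size and
# noise scale follow the standard Langevin update.
import numpy as np

def langevin_q_update(theta, batch, phi, gamma=0.99, step=1e-3, rng=None):
    """One noisy gradient step on Q-parameters theta."""
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(theta)
    for (s, a, r, s_next, actions) in batch:
        q_sa = phi(s, a) @ theta
        target = r + gamma * max(phi(s_next, b) @ theta for b in actions)
        grad += (q_sa - target) * phi(s, a)      # gradient of 0.5 * TD^2
    grad /= len(batch)
    noise = rng.normal(scale=np.sqrt(2.0 * step), size=theta.shape)
    return theta - step * grad + noise           # Langevin step

# Acting: the sampled theta defines a Q-function; choosing argmax_a Q(s, a)
# with respect to it is the Thompson-sampling action selection.
```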

High dynamic range (HDR) imaging is an important task in image processing that aims to generate well-exposed images in scenes with varying illumination. Although existing multi-exposure fusion methods have achieved impressive results, generating high-quality HDR images in dynamic scenes is still difficult. The primary challenges are ghosting artifacts caused by object motion between low dynamic range images and distorted content in under- and overexposed regions. In this paper, we propose a deep progressive feature aggregation network for improving HDR imaging quality in dynamic scenes. To address the issue of object motion, our method implicitly samples high-correspondence features and aggregates them in a coarse-to-fine manner for alignment. In addition, our method adopts a densely connected network structure based on the discrete wavelet transform, which aims to decompose the input features into multiple frequency subbands and adaptively restore corrupted contents. Experiments show that our proposed method can achieve state-of-the-art performance under different scenes, compared to other promising HDR imaging methods. Specifically, the HDR images generated by our method contain cleaner and more detailed content, with fewer distortions, leading to better visual quality.
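To illustrate the frequency-subband idea, the sketch below applies a single-level 2D Haar discrete wavelet transform to a feature map. The choice of Haar filters and a single decomposition level is an illustrative assumption; in the paper the transform operates inside a densely connected network.

```python
# Minimal sketch of decomposing a feature map into frequency subbands with a
# single-level 2D Haar discrete wavelet transform. Haar filters and a single
# level are illustrative assumptions made for this example only.
import numpy as np

def haar_dwt2(x: np.ndarray):
    """Split an (H, W) array with even H, W into LL, LH, HL, HH subbands."""
    a, b = x[0::2, :], x[1::2, :]          # even / odd row pairs
    lo, hi = (a + b) / 2.0, (a - b) / 2.0  # low- / high-pass along rows
    def split_cols(y):
        c, d = y[:, 0::2], y[:, 1::2]
        return (c + d) / 2.0, (c - d) / 2.0
    ll, lh = split_cols(lo)
    hl, hh = split_cols(hi)
    return ll, lh, hl, hh                  # each of shape (H//2, W//2)

feature = np.random.rand(64, 64)
ll, lh, hl, hh = haar_dwt2(feature)        # low-frequency content sits in ll
```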

This article describes an R package bqror that estimates Bayesian quantile regression for ordinal models introduced in Rahman (2016). The paper classifies ordinal models into two types and offers computationally efficient, yet simple, Markov chain Monte Carlo (MCMC) algorithms for estimating ordinal quantile regression. The generic ordinal model with 3 or more outcomes (labeled the ORI model) is estimated by a combination of Gibbs sampling and the Metropolis-Hastings algorithm, whereas an ordinal model with exactly 3 outcomes (labeled the ORII model) is estimated using Gibbs sampling only. In line with the Bayesian literature, we suggest using the marginal likelihood for comparing alternative quantile regression models and explain how to compute it. The models and their estimation procedures are illustrated via multiple simulation studies and implemented in two applications. The article also describes several other functions contained within the bqror package, which are necessary for estimation, inference, and assessing model fit.
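Since the package itself is written in R, the Python sketch below only illustrates the data-generating process underlying ordinal quantile regression: a latent variable with asymmetric Laplace errors at quantile p, mapped to ordinal outcomes through cut-points, drawn via the standard normal-exponential mixture representation. The coefficient values, cut-points, and sample size are illustrative.

```python
# Sketch of the ordinal quantile regression data model: latent
# z_i = x_i' beta + eps_i with asymmetric Laplace errors at quantile p,
# mapped to ordinal outcomes via cut-points. The normal-exponential mixture
# is the standard representation used in Bayesian quantile regression;
# beta, the cut-points, and n are illustrative values.
import numpy as np

def simulate_ordinal_qr(n=500, p=0.5, beta=(0.8, -0.5), cuts=(-1.0, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(beta)))
    theta = (1 - 2 * p) / (p * (1 - p))            # mixture location weight
    tau2 = 2.0 / (p * (1 - p))                     # mixture scale weight
    w = rng.exponential(scale=1.0, size=n)
    eps = theta * w + np.sqrt(tau2 * w) * rng.normal(size=n)
    z = X @ np.asarray(beta) + eps                 # latent utility
    y = np.digitize(z, bins=np.asarray(cuts)) + 1  # ordinal outcome in {1, 2, 3}
    return X, y

X, y = simulate_ordinal_qr()
```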

Training multiple agents to coordinate is an important problem with applications in robotics, game theory, economics, and social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and the strategy fine-tuning (SFT) challenges, two coordination issues at which current offline MARL algorithms fail. To address this problem, we propose a simple model-based approach that generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. Our resulting method, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO), outperforms prevalent learning methods in challenging offline multi-agent MuJoCo tasks, even under severe partial observability and with learned world models.
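The sketch below shows the core model-based step in spirit: rolling a learned world model forward under the agents' current policies to generate synthetic joint transitions for PPO-style fine-tuning. The world-model and policy interfaces, as well as the rollout horizon, are hypothetical placeholders rather than the paper's actual components.

```python
# Sketch of generating synthetic multi-agent rollouts from a learned world
# model so that on-policy agents can fine-tune offline. The world_model /
# policy interfaces and the horizon are hypothetical placeholders.
import numpy as np

def synthetic_rollouts(world_model, policies, start_obs, horizon=25, rng=None):
    """Roll the learned model forward under the agents' current policies."""
    rng = rng or np.random.default_rng()
    rollout, obs = [], start_obs                       # obs: dict agent -> array
    for _ in range(horizon):
        actions = {aid: policies[aid](obs[aid], rng) for aid in policies}
        next_obs, rewards, done = world_model.step(obs, actions)
        rollout.append((obs, actions, rewards, next_obs, done))
        if done:
            break
        obs = next_obs
    return rollout   # added to the data used for PPO-style policy updates
```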

Despite recent advancements, NLP models continue to be vulnerable to bias. This bias often originates from the uneven distribution of real-world data and can propagate through the annotation process. The increasing integration of these models into our lives calls for methods to mitigate bias without prohibitive annotation costs. While active learning (AL) has shown promise in training models with a small amount of annotated data, AL's reliance on the model's behavior for selective sampling can lead to an accumulation of unwanted bias rather than bias mitigation. However, infusing clustering with AL can overcome the bias issue of both AL and traditional annotation methods while exploiting AL's annotation efficiency. In this paper, we propose a novel adaptive clustering-based active learning algorithm, D-CALM, that dynamically adjusts clustering and annotation efforts in response to an estimated classifier error rate. Experiments on eight datasets for a diverse set of text classification tasks, including emotion, hate speech, dialog act, and book type detection, demonstrate that our proposed algorithm significantly outperforms baseline AL approaches with both pretrained transformers and traditional Support Vector Machines. D-CALM showcases robustness against different measures of information gain and, as evident from our analysis of label and error distributions, can significantly reduce unwanted model bias.
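One round of clustering-guided selection can be sketched as follows: cluster the unlabeled pool, score each cluster with an estimate of the current classifier's error rate, and spend more of the annotation budget on high-error clusters. The use of mean predictive uncertainty as the error proxy, KMeans as the clustering step, and the proportional budget rule are illustrative assumptions, not necessarily D-CALM's exact choices.

```python
# Sketch of one round of clustering-guided active learning: cluster the
# unlabeled pool, score clusters by an estimated error rate of the current
# classifier (mean predictive uncertainty is an illustrative proxy), and
# allocate the annotation budget toward high-error clusters.
import numpy as np
from sklearn.cluster import KMeans

def select_for_annotation(X_pool, clf, n_clusters=10, budget=50, seed=0):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_pool)
    uncertainty = 1.0 - clf.predict_proba(X_pool).max(axis=1)
    cluster_err = np.array([uncertainty[labels == c].mean()
                            for c in range(n_clusters)])
    alloc = np.round(budget * cluster_err / cluster_err.sum()).astype(int)
    picks = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        top = idx[np.argsort(-uncertainty[idx])[:alloc[c]]]  # most uncertain first
        picks.extend(top.tolist())
    return picks   # indices of pool examples to send for annotation
```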

Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach and a transformer architecture, showing competitive performance against several benchmarks. However, DT lacks stitching ability -- one of the critical abilities for offline RL to learn the optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset only contains sub-optimal trajectories. On the other hand, conventional RL approaches based on Dynamic Programming (such as Q-learning) do not have the same limitation; however, they suffer from unstable learning behaviours, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT) to address the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). QDT utilises the Dynamic Programming results to relabel the return-to-go in the training data and then trains the DT with the relabelled data. Our approach efficiently combines the benefits of the two methods, with each compensating for the other's shortcomings, to achieve better performance. We demonstrate this empirically in both simple toy environments and the more complex D4RL benchmark, observing competitive performance gains.
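The relabelling idea can be sketched as a backward pass that lifts the return-to-go targets using a value function learned by dynamic programming. The specific max-with-value rule below is one plausible instantiation offered for illustration, not necessarily the paper's exact relabelling rule.

```python
# Sketch of relabelling return-to-go (RTG) with a value function learned via
# Q-learning / dynamic programming, so the Decision Transformer sees targets
# consistent with stitching better sub-trajectories together. The backward
# max-with-value rule is one plausible instantiation, used for illustration.
import numpy as np

def relabel_rtg(rewards, states, value_fn, gamma=1.0):
    """rewards, states: per-timestep sequences; value_fn(s) -> scalar estimate."""
    T = len(rewards)
    rtg = np.zeros(T)
    tail = 0.0
    for t in reversed(range(T)):
        # Take the better of what the trajectory actually achieved after t
        # and what the learned value function says is achievable from s_{t+1}.
        future = max(tail, value_fn(states[t + 1])) if t + 1 < T else 0.0
        rtg[t] = rewards[t] + gamma * future
        tail = rtg[t]
    return rtg   # used in place of the dataset RTG when training the DT
```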

Most reinforcement learning algorithms treat the context under which they operate as a stationary, isolated and undisturbed environment. However, in the real world, the environment is constantly changing due to a variety of external influences. To address this problem, we study Markov Decision Processes (MDPs) under the influence of an external temporal process. We formalize this notion and discuss conditions under which the problem becomes tractable and admits suitable solutions. We propose a policy iteration algorithm to solve this problem and theoretically analyze its performance.
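For concreteness, if the external temporal process is observed and Markov on a finite set, one simple formulation runs standard policy iteration on the product of environment states and process states. This augmentation, and the tabular setting below, are assumptions made here for illustration, not necessarily the paper's formulation.

```python
# Standard tabular policy iteration over an augmented state space
# (environment state x external process state), shown only to make the
# setting concrete; the augmentation is an illustrative assumption.
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """P: (A, S, S) transition tensor over augmented states; R: (S, A) rewards."""
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(S), :]
        R_pi = R[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement.
        Q = R.T + gamma * (P @ V)          # shape (A, S)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```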

While distributional reinforcement learning (RL) has demonstrated empirical success, the question of when and why it is beneficial has remained unanswered. In this work, we provide one explanation for the benefits of distributional RL through the lens of small-loss bounds, which scale with the instance-dependent optimal cost. If the optimal cost is small, our bounds are stronger than those from non-distributional approaches. As a warmup, we show that learning the cost distribution leads to small-loss regret bounds in contextual bandits (CB), and we find that distributional CB empirically outperforms the state-of-the-art on three challenging tasks. For online RL, we propose a distributional version-space algorithm that constructs confidence sets using maximum likelihood estimation, and we prove that it achieves small-loss regret in tabular MDPs and enjoys small-loss PAC bounds in latent variable models. Building on similar insights, we propose a distributional offline RL algorithm based on the pessimism principle and prove that it enjoys small-loss PAC bounds, which exhibit a novel robustness property. For both online and offline RL, our results provide the first theoretical benefits of learning distributions even when we only need the mean for making decisions.

This paper presents an algorithm that generates the conditional moment inequalities that characterize the identified set of the common parameter of various semi-parametric panel multinomial choice models. I consider both static and dynamic models, under various weak stochastic restrictions on the distribution of observed and unobserved components of the models. For a broad class of such stochastic restrictions, the paper demonstrates that the inequalities characterizing the identified set of the common parameter can be obtained as solutions of multiple objective linear programs (MOLPs), thereby transforming the task of finding these inequalities into a purely computational problem. The algorithm that I provide reproduces many well-known results, including the conditional moment inequalities derived in Manski (1987), Pakes and Porter (2023), and Khan, Ponomareva, and Tamer (2023). Moreover, I use the algorithm to generate some new results, by providing characterizations of the identified set in some cases that were left open in Pakes and Porter (2023) and Khan, Ponomareva, and Tamer (2023), as well as characterizations of the identified set under alternative stochastic restrictions.

Flexible district heating grids form an important part of future, low-carbon energy systems. We examine probabilistic state estimation in such grids, i.e., we aim to estimate the posterior probability distribution over all grid state variables such as pressures, temperatures, and mass flows, conditional on measurements of a subset of these states. Since the posterior state distribution does not belong to a standard class of probability distributions, we use Markov Chain Monte Carlo (MCMC) sampling in the space of network heat exchanges and evaluate the samples in the grid state space to estimate the posterior. Converting the heat exchange samples into grid states by solving the non-linear grid equations makes this approach computationally burdensome. However, we propose to speed it up by employing a deep neural network that is trained to approximate the solution computed by the exact but slow non-linear solver. This novel approach is shown to deliver highly accurate posterior distributions both for classic tree-shaped as well as meshed heating grids, at significantly reduced computational costs that are acceptable for online control. Our state estimation approach thus enables tightening the safety margins for temperature and pressure control, and thereby a more efficient grid operation.
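The surrogate-accelerated sampling can be sketched as a random-walk Metropolis loop in the space of heat exchanges, where a trained network replaces the slow non-linear grid solver inside the likelihood. The Gaussian measurement model, the surrogate interface, and the proposal scale are illustrative assumptions.

```python
# Sketch of random-walk Metropolis in the space of network heat exchanges,
# with a trained neural-network surrogate standing in for the slow non-linear
# grid solver inside the likelihood. Measurement model, surrogate interface,
# and proposal scale are illustrative assumptions.
import numpy as np

def log_likelihood(q, surrogate, y_obs, obs_idx, sigma):
    """q: heat-exchange vector; surrogate(q) -> full grid state vector."""
    state = surrogate(q)                              # NN instead of solver
    resid = y_obs - state[obs_idx]                    # measured subset only
    return -0.5 * np.sum((resid / sigma) ** 2)

def metropolis(surrogate, y_obs, obs_idx, q0, sigma=0.05, step=0.02,
               n_samples=5000, rng=None):
    rng = rng or np.random.default_rng()
    q, ll = q0.copy(), log_likelihood(q0, surrogate, y_obs, obs_idx, sigma)
    samples = []
    for _ in range(n_samples):
        prop = q + step * rng.normal(size=q.shape)    # random-walk proposal
        ll_prop = log_likelihood(prop, surrogate, y_obs, obs_idx, sigma)
        if np.log(rng.uniform()) < ll_prop - ll:      # accept / reject
            q, ll = prop, ll_prop
        samples.append(surrogate(q))                  # posterior over grid states
    return np.array(samples)
```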
