
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states. This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified, while in the outer problem, a measure of fit is maximized. Several approaches have been proposed to alleviate the computational burden of this nested-loop structure, but these methods still suffer from high complexity when the state space is either discrete with large cardinality or continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL) literature emphasize policy estimation at the expense of reduced reward estimation accuracy. In this paper we propose a single-loop estimation algorithm with finite-time guarantees that is equipped to deal with high-dimensional state spaces without compromising reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, we show that the algorithm approximates the maximum likelihood estimator sublinearly. Finally, by using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.
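A minimal tabular sketch of the single-loop idea, not the paper's algorithm: one soft Bellman improvement step is alternated with one likelihood-gradient step, and for a linear reward the log-likelihood gradient reduces to a feature-expectation mismatch. The exact (rather than stochastic) gradient is used for clarity, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, d, gamma = 20, 4, 6, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition kernel
phi = rng.normal(size=(nS, nA, d))              # linear reward features
theta_true = rng.normal(size=d)
mu0 = np.ones(nS) / nS                          # initial state distribution

def logsumexp_rows(Q):
    m = Q.max(axis=1)
    return m + np.log(np.exp(Q - m[:, None]).sum(axis=1))

def soft_policy(Q):
    e = np.exp(Q - Q.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def occupancy(pi):                              # normalized discounted occupancy
    P_pi = np.einsum('sap,sa->sp', P, pi)
    dvis = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, (1 - gamma) * mu0)
    return dvis / dvis.sum()

# "Expert" demonstrations from the soft-optimal policy of the true reward.
Q = np.zeros((nS, nA))
for _ in range(300):
    Q = phi @ theta_true + gamma * P @ logsumexp_rows(Q)
pi_star = soft_policy(Q)
s_demo = rng.choice(nS, size=2000, p=occupancy(pi_star))
a_demo = np.array([rng.choice(nA, p=pi_star[s]) for s in s_demo])
feat_demo = phi[s_demo, a_demo].mean(axis=0)

# Single loop: one policy-improvement step, then one likelihood-gradient step.
theta, Q = np.zeros(d), np.zeros((nS, nA))
for _ in range(500):
    Q = phi @ theta + gamma * P @ logsumexp_rows(Q)   # one soft Bellman step
    pi = soft_policy(Q)
    feat_pi = np.einsum('s,sa,sad->d', occupancy(pi), pi, phi)
    theta += 0.1 * (feat_demo - feat_pi)              # one gradient step
```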

Related content

We take a systematic look at the problem of storing whole files in a cache with limited capacity in the context of optimistic learning, where the caching policy has access to a prediction oracle (provided by, e.g., a Neural Network). The successive file requests are assumed to be generated by an adversary, and no assumption is made on the accuracy of the oracle. In this setting, we provide a universal lower bound for prediction-assisted online caching and proceed to design a suite of policies with a range of performance-complexity trade-offs. All proposed policies offer sublinear regret bounds commensurate with the accuracy of the oracle. Our results substantially improve upon all recently-proposed online caching policies, which, being unable to exploit the oracle predictions, offer only $O(\sqrt{T})$ regret. In this pursuit, we design, to the best of our knowledge, the first comprehensive optimistic Follow-the-Perturbed-Leader policy, which generalizes beyond the caching problem. We also study the problem of caching files with different sizes and the bipartite network caching problem. Finally, we evaluate the efficacy of the proposed policies through extensive numerical experiments using real-world traces.
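A toy sketch loosely in the spirit of an optimistic Follow-the-Perturbed-Leader cache: requests accumulate into counts, the oracle supplies a (possibly wrong) hint for the next request, and the perturbation is scaled by the accumulated prediction error, one plausible way to make the policy trust an accurate oracle more. The request stream, hint accuracy, and scaling rule are all illustrative, not the paper's exact tuning.

```python
import numpy as np

rng = np.random.default_rng(1)
N, C, T = 100, 10, 5000        # catalogue size, cache slots, horizon
counts = np.zeros(N)           # cumulative request counts
err_sum = 0.0                  # cumulative squared prediction error
hits = 0
for t in range(T):
    x = rng.zipf(1.5) % N                      # next request (unseen by policy)
    hint = x if rng.random() < 0.8 else rng.integers(N)
    pred = np.zeros(N)
    pred[hint] = 1.0                           # oracle hint for the next request
    # Perturbation grows with accumulated oracle error: a perfect oracle keeps
    # the noise small, so the cache follows counts + hint almost greedily.
    noise = rng.normal(size=N) * np.sqrt(1.0 + err_sum)
    cache = np.argpartition(counts + pred + noise, -C)[-C:]
    hits += x in cache
    truth = np.zeros(N)
    truth[x] = 1.0
    err_sum += float(((pred - truth) ** 2).sum())
    counts[x] += 1
print(f"hit ratio: {hits / T:.3f}")
```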

This paper considers the state estimation problem for nonlinear dynamic systems with unknown but bounded noises. The set membership filter (SMF) is a popular algorithm for this problem. In the set membership setting, we investigate the filtering problem in which the state estimate is required to satisfy a linear or nonlinear equality constraint. We propose a consensus alternating direction method of multipliers (ADMM) based SMF algorithm for nonlinear dynamic systems. To deal with the nonlinearity, instead of linearizing the nonlinear system, a semi-infinite programming (SIP) approach is used to transform the nonlinear system into a linear one, which allows us to obtain a more accurate estimation ellipsoid. To solve the SIP, an ADMM algorithm is proposed to handle the state estimation constraints, and each iteration of the algorithm can be solved efficiently. Finally, the proposed filter is applied to typical numerical examples to demonstrate its effectiveness.
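A generic ADMM sketch of the constrained-estimation core only: the center of an ellipsoidal estimate is mapped, in the Mahalanobis metric, onto a linear equality constraint $Ax = b$ by splitting the quadratic objective from the constraint indicator. The paper's SIP-based ellipsoid updates and consensus structure are omitted, and all matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 2
c = rng.normal(size=n)                                  # unconstrained estimate (ellipsoid center)
S = rng.normal(size=(n, n))
Pinv = S.T @ S + np.eye(n)                              # inverse shape matrix of the ellipsoid
A = rng.normal(size=(m, n))
b = rng.normal(size=m)                                  # equality constraint Ax = b

rho = 1.0
x, z, u = c.copy(), c.copy(), np.zeros(n)
lhs = 2 * Pinv + rho * np.eye(n)
AAt_inv = np.linalg.inv(A @ A.T)
for _ in range(200):
    # x-step: minimize (x - c)^T Pinv (x - c) + (rho/2)||x - z + u||^2
    x = np.linalg.solve(lhs, 2 * Pinv @ c + rho * (z - u))
    # z-step: Euclidean projection of x + u onto {z : Az = b}
    v = x + u
    z = v - A.T @ (AAt_inv @ (A @ v - b))
    # dual update
    u = u + x - z
print("constraint residual:", np.linalg.norm(A @ z - b))
```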

Partitioning algorithms play a key role in many scientific and engineering disciplines. A partitioning algorithm divides a set into a number of disjoint subsets or partitions. Often, the quality of the resulting partitions is measured by the amount of impurity in each partition: the smaller the impurity, the higher the quality of the partitions. In general, for a given impurity measure specified by a function of the partitions, finding the minimum impurity partitions is an NP-hard problem. Let $M$ be the number of $N$-dimensional elements in a set and $K$ be the number of desired partitions; then an exhaustive search over all the possible partitions to find a minimum-impurity partition has complexity $O(K^M)$, which quickly becomes impractical for many applications with modest values of $K$ and $M$. Thus, many approximate algorithms with polynomial time complexity have been proposed, but few provide a bounded guarantee. In this paper, an upper bound and a lower bound for a class of impurity functions are constructed. Based on these bounds, we propose a low-complexity partitioning algorithm with a bounded guarantee based on the maximum likelihood principle. Theoretical analyses of the bounded guarantee of the algorithm are given for two well-known impurity functions, the Gini index and entropy. When $K \geq N$, the proposed algorithm achieves state-of-the-art results in terms of both approximation guarantee and polynomial time complexity $O(NM)$. In addition, a heuristic greedy-merge algorithm with time complexity $O((N-K)N^2+NM)$ is proposed for $K<N$. Although the greedy-merge algorithm does not provide a bounded guarantee, its performance is comparable to that of the state-of-the-art methods. Our results also generalize some well-known information-theoretic bounds such as Fano's inequality and Boyd-Chiang's bound.
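A KL-means-style sketch of the maximum-likelihood assignment idea: each element (here a distribution over $N$ outcomes) joins the partition whose centroid gives it the highest expected log-likelihood, which locally decreases the weighted entropy impurity. The data, weights, and stopping rule are illustrative; the paper's bound constructions are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, K = 200, 8, 4                      # elements, dimension, partitions
w = rng.random(M)
w /= w.sum()                             # element weights
p = rng.dirichlet(np.ones(N), size=M)    # each element: a distribution over N outcomes

def entropy_impurity(assign):
    total = 0.0
    for k in range(K):
        mask = assign == k
        if not mask.any():
            continue
        wk = w[mask].sum()
        q = (w[mask, None] * p[mask]).sum(0) / wk       # partition centroid
        total -= wk * (q * np.log(q + 1e-12)).sum()     # weighted entropy
    return total

assign = rng.integers(K, size=M)
for _ in range(50):                      # alternate centroid / ML-assignment steps
    q = np.zeros((K, N))
    for k in range(K):
        mask = assign == k
        if mask.any():
            q[k] = (w[mask, None] * p[mask]).sum(0) / w[mask].sum()
    # ML step: assign each element to the partition maximizing its log-likelihood.
    new = np.argmax(p @ np.log(q + 1e-12).T, axis=1)
    if (new == assign).all():
        break
    assign = new
print("entropy impurity:", entropy_impurity(assign))
```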

This work considers Gaussian process interpolation with a periodized version of the Matérn covariance function (Stein, 1999, Section 6.7) with Fourier coefficients $\phi(\alpha^2 + j^2)^{-\nu - 1/2}$. Convergence rates are studied for the joint maximum likelihood estimation of $\nu$ and $\phi$ when the data is sampled according to the model. The mean integrated squared error is also analyzed with fixed and estimated parameters, showing that maximum likelihood estimation yields asymptotically the same error as if the ground truth were known. Finally, the case where the observed function is a "deterministic" element of a continuous Sobolev space is also considered, suggesting that bounding assumptions on some parameters can lead to different estimates.
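A sketch of the model and the joint MLE of $(\nu, \phi)$: the covariance is built from the stated Fourier coefficients (truncated at an illustrative level $J$), data are sampled from it on a regular grid, and the Gaussian likelihood is maximized numerically. The fixed $\alpha$, the jitter, and the optimizer are assumptions for the sketch.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, J, alpha = 60, 200, 1.0
t = np.linspace(0, 1, n, endpoint=False)
lags = t[:, None] - t[None, :]
j = np.arange(1, J + 1)
cos_tab = np.cos(2 * np.pi * j[:, None, None] * lags)   # (J, n, n) cosine table

def cov(nu, phi):
    # k(s - t) = sum_j phi (alpha^2 + j^2)^(-nu - 1/2) e^{2 pi i j (s - t)},
    # truncated at |j| <= J and written in real form.
    cj = phi * (alpha**2 + j**2) ** (-nu - 0.5)
    c0 = phi * alpha ** (-2 * nu - 1)
    return c0 + 2 * np.einsum('j,jkl->kl', cj, cos_tab)

def nll(log_params, y):                                  # negative log-likelihood
    nu, phi = np.exp(log_params)
    K = cov(nu, phi) + 1e-8 * np.eye(n)
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, y)
    return np.log(np.diag(L)).sum() + 0.5 * a @ a

nu_true, phi_true = 1.5, 1.0
y = np.linalg.cholesky(cov(nu_true, phi_true) + 1e-8 * np.eye(n)) @ rng.normal(size=n)
res = minimize(nll, x0=np.log([0.8, 0.5]), args=(y,), method='Nelder-Mead')
print("MLE (nu, phi):", np.exp(res.x))
```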

Motivated by recent progress on online linear programming (OLP), we study the online decision making problem (ODMP) as a natural generalization of OLP. In ODMP, there exists a single decision maker who makes a series of decisions spread out over a total of $T$ time stages. At each time stage, the decision maker makes a decision based on information obtained up to that point without seeing into the future. The task of the decision maker is to maximize the accumulated reward while overall meeting some predetermined $m$-dimensional long-term goal (linking) constraints. ODMP significantly broadens the modeling framework of OLP by allowing more general feasible regions (for local and goal constraints) potentially involving both discreteness and nonlinearity in each local decision making problem. We propose a Fenchel dual-based online algorithm for ODMP. At each time stage, the proposed algorithm requires solving a potentially nonconvex optimization problem over the local feasible set and a convex optimization problem over the goal set. Under the uniform random permutation model, we show that our algorithm achieves $O(\sqrt{mT})$ constraint violation deterministically in meeting the long-term goals, and $O(\sqrt{m\log m}\sqrt{T})$ competitive difference in expected reward with respect to the optimal offline decisions. We also extend our results to the grouped random permutation model.
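A plain dual-descent sketch of the online scheme: a dual price vector on the $m$ goal constraints is maintained, each stage solves the (here discrete, hence potentially nonconvex) local problem by enumeration against current prices, and the prices take a projected gradient step. The paper's algorithm additionally involves a convex optimization over a general goal set; the budgets, step size, and data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
T, m, n_opts = 1000, 3, 5
b = 0.4 * T * np.ones(m)            # long-term goal: total resource budgets
p = np.zeros(m)                     # dual prices on the goal constraints
eta = 1.0 / np.sqrt(T)
reward, usage = 0.0, np.zeros(m)
for t in range(T):
    r = rng.random(n_opts)                   # stage rewards of each discrete option
    g = rng.random((n_opts, m))              # goal-resource usage of each option
    # Local step: maximize the Lagrangian over the discrete local feasible set.
    k = np.argmax(r - g @ p)
    reward += r[k]
    usage += g[k]
    # Dual step: projected gradient ascent on the Fenchel/Lagrange dual.
    p = np.maximum(0.0, p + eta * (g[k] - b / T))
print("reward:", reward, "goal violation:", np.maximum(usage - b, 0.0))
```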

Multistate Markov models are a canonical parametric approach for modeling observed or latent stochastic processes supported on a finite state space. Continuous-time Markov processes describe data that are observed irregularly over time, as is often the case in longitudinal medical and biological data sets, for example. Assuming that a continuous-time Markov process is time-homogeneous, a closed-form likelihood function can be derived from the Kolmogorov forward equations -- a system of differential equations with a well-known matrix-exponential solution. Unfortunately, however, the forward equations do not admit an analytical solution for continuous-time, time-inhomogeneous Markov processes, and so researchers and practitioners often make the simplifying assumption that the process is piecewise time-homogeneous. In this paper, we provide intuitions and illustrations of the potential biases in parameter estimation that may ensue in the more realistic scenario that the piecewise-homogeneous assumption is violated, and we advocate for a solution for likelihood computation in a truly time-inhomogeneous fashion. Particular focus is afforded to multistate Markov models that allow for state label misclassification, a setting that applies more broadly to hidden Markov models (HMMs) and in which Bayesian computation bypasses the need for the computationally demanding numerical gradient approximations required to obtain maximum likelihood estimates (MLEs).
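A sketch of the closed-form likelihood in the time-homogeneous case, the very setting the abstract says is tractable: transition probabilities over each irregular gap come from the matrix exponential of the rate matrix. The three-state progressive model, rates, and observation times are illustrative; the paper's point is that the inhomogeneous case needs more care than this.

```python
import numpy as np
from scipy.linalg import expm

# One subject observed at irregular times (illustrative data).
states = np.array([0, 0, 1, 1, 2])
times = np.array([0.0, 0.7, 1.1, 2.6, 3.0])

def rate_matrix(log_rates):
    q01, q10, q12 = np.exp(log_rates)        # off-diagonal transition rates
    return np.array([[-q01, q01, 0.0],
                     [q10, -(q10 + q12), q12],
                     [0.0, 0.0, 0.0]])       # state 2 absorbing

def log_lik(log_rates):
    Q = rate_matrix(log_rates)
    ll = 0.0
    for s0, s1, dt in zip(states[:-1], states[1:], np.diff(times)):
        P = expm(Q * dt)                     # matrix-exponential solution of the
        ll += np.log(P[s0, s1] + 1e-300)     # Kolmogorov forward equations
    return ll

print(log_lik(np.log([0.5, 0.2, 0.8])))
```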

In longitudinal panels and other regression models with unobserved effects, fixed effects estimation is often paired with cluster-robust variance estimation (CRVE) in order to account for heteroskedasticity and un-modeled dependence among the errors. CRVE is asymptotically consistent as the number of independent clusters increases, but can be biased downward for sample sizes often found in applied work, leading to hypothesis tests with overly liberal rejection rates. One solution is to use bias-reduced linearization (BRL), which corrects the CRVE so that it is unbiased under a working model, and t-tests with Satterthwaite degrees of freedom. We propose a generalization of BRL that can be applied in models with arbitrary sets of fixed effects, where the original BRL method is undefined, and describe how to apply the method when the regression is estimated after absorbing the fixed effects. We also propose a small-sample test for multiple-parameter hypotheses, which generalizes the Satterthwaite approximation for t-tests. In simulations covering a variety of study designs, we find that conventional cluster-robust Wald tests can severely over-reject while the proposed small-sample test maintains Type I error very close to nominal levels.
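A sketch of the textbook BRL/CR2 adjustment under an identity working model: each cluster's residuals are premultiplied by $(I - H_{gg})^{-1/2}$ before entering the sandwich "meat", which removes the downward bias of the plain CRVE under that working model. The simulated balanced design is illustrative; the paper's contribution (absorbed fixed effects, multi-parameter tests) is not shown.

```python
import numpy as np

rng = np.random.default_rng(6)
G, n_per, p = 12, 20, 3
X = rng.normal(size=(G * n_per, p))
cl = np.repeat(np.arange(G), n_per)                 # cluster labels
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=G)[cl] + rng.normal(size=G * n_per)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
meat = np.zeros((p, p))
for g in range(G):
    idx = cl == g
    Xg, eg = X[idx], e[idx]
    # CR2 adjustment: symmetric inverse square root of (I - H_gg).
    Hgg = Xg @ XtX_inv @ Xg.T
    vals, vecs = np.linalg.eigh(np.eye(idx.sum()) - Hgg)
    Ag = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    ug = Xg.T @ (Ag @ eg)
    meat += np.outer(ug, ug)
V_cr2 = XtX_inv @ meat @ XtX_inv                    # sandwich variance estimate
print("CR2 standard errors:", np.sqrt(np.diag(V_cr2)))
```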

Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, because function approximation in Markov decision processes (MDPs) can be viewed as inducing a partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.
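One concrete instance of a history-based feature abstraction, shown only to fix intuition: a sliding window stacks the last $k$ observations into a single feature vector that any memory-less learner can consume. Window stacking is just one of the mappings such a framework can cover; the names here are illustrative.

```python
from collections import deque
import numpy as np

k, obs_dim = 4, 3
history = deque([np.zeros(obs_dim)] * k, maxlen=k)

def psi(obs):
    """History-based abstraction: concatenate the last k observations."""
    history.append(np.asarray(obs, dtype=float))
    return np.concatenate(history)

feat = psi([0.1, -0.2, 0.5])
print(feat.shape)   # (k * obs_dim,) -- input to a memory-less RL learner
```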

We consider space-time tracking-type distributed optimal control problems for the wave equation in the space-time domain $Q:= \Omega \times (0,T) \subset {\mathbb{R}}^{n+1}$, where the control is assumed to be in the energy space $[H_{0;,0}^{1,1}(Q)]^*$, rather than in $L^2(Q)$, which is more common. While the latter ensures a unique state in the Sobolev space $H^{1,1}_{0;0,}(Q)$, this does not define a solution isomorphism. Hence we use an appropriate state space $X$ such that the wave operator becomes an isomorphism from $X$ onto $[H_{0;,0}^{1,1}(Q)]^*$. Using space-time finite element spaces of piecewise linear continuous basis functions on completely unstructured but shape regular simplicial meshes, we derive a priori estimates for the error $\|\widetilde{u}_{\varrho h}-\overline{u}\|_{L^2(Q)}$ between the computed space-time finite element solution $\widetilde{u}_{\varrho h}$ and the target function $\overline{u}$ with respect to the regularization parameter $\varrho$ and the space-time finite element mesh size $h$, depending on the regularity of the desired state $\overline{u}$. These estimates lead to the optimal choice $\varrho=h^2$ in order to define the regularization parameter $\varrho$ for a given space-time finite element mesh size $h$, or to determine the required mesh size $h$ when $\varrho$ is a given constant representing the costs of the control. The theoretical results are supported by numerical examples with targets of different regularities, including discontinuous targets. Furthermore, an adaptive space-time finite element scheme is proposed and numerically analyzed.

This paper is concerned with constructing a confidence interval for a target policy's value offline based on pre-collected observational data in infinite horizon settings. Most of the existing works assume the absence of unmeasured variables that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at //github.com/Mamba413/cope.
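For orientation only, a baseline importance-sampling off-policy value estimate with a known behaviour policy; it handles no confounding at all, which is precisely the gap the paper's mediator-based estimator addresses. The bandit-style data and policies are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
nS, nA, T = 5, 2, 5000
b = np.full((nS, nA), 1.0 / nA)                 # behaviour policy (known, uniform)
pi = np.tile([0.8, 0.2], (nS, 1))               # target policy to evaluate
s = rng.integers(nS, size=T)
a = np.array([rng.choice(nA, p=b[si]) for si in s])
r = rng.normal(loc=(a == 0).astype(float))      # action 0 is better on average
w = pi[s, a] / b[s, a]                          # importance weights
print("IS estimate of target value:", np.mean(w * r))
print("behaviour policy value:", np.mean(r))
```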
