亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically, that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays, thereby causing the algorithm to believe that the arm is sub-optimal. To alleviate the fragility issues exposed, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also provide a sharp trade-off between the amount of UCB exploration and the tail exponent of the resulting regret distribution.

相關內容

In an earlier paper (//doi.org/10.1137/21M1393315), the Switch Point Algorithm was developed for solving optimal control problems whose solutions are either singular or bang-bang or both singular and bang-bang, and which possess a finite number of jump discontinuities in an optimal control at the points in time where the solution structure changes. The class of control problems that were considered had a given initial condition, but no terminal constraint. The theory is now extended to include problems with both initial and terminal constraints, a structure that often arises in boundary-value problems. Substantial changes to the theory are needed to handle this more general setting. Nonetheless, the derivative of the cost with respect to a switch point is again the jump in the Hamiltonian at the switch point.

Linearizability is a standard correctness criterion for concurrent algorithms, typically proved by establishing the algorithms' linearization points (LP). However, LPs often hinder abstraction, and for some algorithms such as the timestamped stack, it is unclear how to even identify their LPs. In this paper, we show how to develop declarative proofs of linearizability by foregoing LPs and instead employing axiomatization of so-called visibility relations. While visibility relations have been considered before for the timestamped stack, our study is the first to show how to derive the axiomatization systematically and intuitively from the sequential specification of the stack. In addition to the visibility relation, a novel separability relation emerges to generalize real-time precedence of procedure invocation. The visibility and separability relations have natural definitions for the timestamped stack, and enable a novel proof that reduces the algorithm to a simplified form where the timestamps are generated atomically.

Storage codes on graphs are an instance of \emph{codes with locality}, which are used in distributed storage schemes to provide local repairability. Specifically, the nodes of the graph correspond to storage servers, and the neighbourhood of each server constitute the set of servers it can query to repair its stored data in the event of a failure. A storage code on a graph with $n$-vertices is a set of $n$-length codewords over $\field_q$ where the $i$th codeword symbol is stored in server $i$, and it can be recovered by querying the neighbours of server $i$ according to the underlying graph. In this work, we look at binary storage codes whose repair function is the parity check, and characterise the tradeoff between the locality of the code and its rate. Specifically, we show that the maximum rate of a code on $n$ vertices with locality $r$ is bounded between $1-1/n\lceil n/(r+1)\rceil$ and $1-1/n\lceil n/(r+1)\rceil$. The lower bound on the rate is derived by constructing an explicit family of graphs with locality $r$, while the upper bound is obtained via a lower bound on the binary-field rank of a class of symmetric binary matrices. Our upper bound on maximal rate of a storage code matches the upper bound on the larger class of codes with locality derived by Tamo and Barg. As a corollary to our result, we obtain the following asymptotic separation result: given a sequence $r(n), n\geq 1$, there exists a sequence of graphs on $n$-vertices with storage codes of rate $1-o(1)$ if and only if $r(n)=\omega(1)$.

Subgradient methods are the natural extension to the non-smooth case of the classical gradient descent for regular convex optimization problems. However, in general, they are characterized by slow convergence rates, and they require decreasing step-sizes to converge. In this paper we propose a subgradient method with constant step-size for composite convex objectives with $\ell_1$-regularization. If the smooth term is strongly convex, we can establish a linear convergence result for the function values. This fact relies on an accurate choice of the element of the subdifferential used for the update, and on proper actions adopted when non-differentiability regions are crossed. Then, we propose an accelerated version of the algorithm, based on conservative inertial dynamics and on an adaptive restart strategy, that is guaranteed to achieve a linear convergence rate in the strongly convex case. Finally, we test the performances of our algorithms on some strongly and non-strongly convex examples.

In this paper, we address the following problem: Given an offline demonstration dataset from an imperfect expert, what is the best way to leverage it to bootstrap online learning performance in MDPs. We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses the offline dataset, and information about the expert's behavioral policy used to generate the offline dataset. Its cumulative Bayesian regret goes down to zero exponentially fast in N, the offline dataset size if the expert is competent enough. Since this algorithm is computationally impractical, we then propose the iRLSVI algorithm that can be seen as a combination of the RLSVI algorithm for online RL, and imitation learning. Our empirical results show that the proposed iRLSVI algorithm is able to achieve significant reduction in regret as compared to two baselines: no offline data, and offline dataset but used without information about the generative policy. Our algorithm bridges online RL and imitation learning for the first time.

We evaluate benchmark deep reinforcement learning (DRL) algorithms on the task of portfolio optimisation under a simulator. The simulator is based on correlated geometric Brownian motion (GBM) with the Bertsimas-Lo (BL) market impact model. Using the Kelly criterion (log utility) as the objective, we can analytically derive the optimal policy without market impact and use it as an upper bound to measure performance when including market impact. We found that the off-policy algorithms DDPG, TD3 and SAC were unable to learn the right Q function due to the noisy rewards and therefore perform poorly. The on-policy algorithms PPO and A2C, with the use of generalised advantage estimation (GAE), were able to deal with the noise and derive a close to optimal policy. The clipping variant of PPO was found to be important in preventing the policy from deviating from the optimal once converged. In a more challenging environment where we have regime changes in the GBM parameters, we found that PPO, combined with a hidden Markov model (HMM) to learn and predict the regime context, is able to learn different policies adapted to each regime. Overall, we find that the sample complexity of these algorithms is too high, requiring more than 2m steps to learn a good policy in the simplest setting, which is equivalent to almost 8,000 years of daily prices.

This work addresses a version of the two-armed Bernoulli bandit problem where the sum of the means of the arms is one (the symmetric two-armed Bernoulli bandit). In a regime where the gap between these means goes to zero as the number of prediction periods approaches infinity, i.e., the difficulty of detecting the gap increases as the sample size increases, we obtain the leading order terms of the minmax optimal regret and pseudoregret for this problem by associating each of them with a solution of a linear heat equation. Our results improve upon the previously known results; specifically, we explicitly compute these leading order terms in three different scaling regimes for the gap. Additionally, we obtain new non-asymptotic bounds for any given time horizon. Although optimal player strategies are not known for more general bandit problems, there is significant interest in considering how regret accumulates under specific player strategies, even when they are not known to be optimal. We expect that the methods of this paper should be useful in settings of that type.

Efficient learning in multi-armed bandit mechanisms such as pay-per-click (PPC) auctions typically involves three challenges: 1) inducing truthful bidding behavior (incentives), 2) using personalization in the users (context), and 3) circumventing manipulations in click patterns (corruptions). Each of these challenges has been studied orthogonally in the literature; incentives have been addressed by a line of work on truthful multi-armed bandit mechanisms, context has been extensively tackled by contextual bandit algorithms, while corruptions have been discussed via a recent line of work on bandits with adversarial corruptions. Since these challenges co-exist, it is important to understand the robustness of each of these approaches in addressing the other challenges, provide algorithms that can handle all simultaneously, and highlight inherent limitations in this combination. In this work, we show that the most prominent contextual bandit algorithm, $\epsilon$-greedy can be extended to handle the challenges introduced by strategic arms in the contextual multi-arm bandit mechanism setting. We further show that $\epsilon$-greedy is inherently robust to adversarial data corruption attacks and achieves performance that degrades linearly with the amount of corruption.

The last decade has seen many attempts to generalise the definition of modes, or MAP estimators, of a probability distribution $\mu$ on a space $X$ to the case that $\mu$ has no continuous Lebesgue density, and in particular to infinite-dimensional Banach and Hilbert spaces $X$. This paper examines the properties of and connections among these definitions. We construct a systematic taxonomy -- or `periodic table' -- of modes that includes the established notions as well as large hitherto-unexplored classes. We establish implications between these definitions and provide counterexamples to distinguish them. We also distinguish those definitions that are merely `grammatically correct' from those that are `meaningful' in the sense of satisfying certain `common-sense' axioms for a mode, among them the correct handling of discrete measures and those with continuous Lebesgue densities. However, despite there being 17 such `meaningful' definitions of mode, we show that none of them satisfy the `merging property', under which the modes of $\mu|_{A}$, $\mu|_{B}$ and $\mu|_{A \cup B}$ enjoy a straightforward relationship for well-separated positive-mass events $A,B \subseteq X$.

We introduce a new quantum algorithm for computing the Betti numbers of a simplicial complex. In contrast to previous quantum algorithms that work by estimating the eigenvalues of the combinatorial Laplacian, our algorithm is an instance of the generic Incremental Algorithm for computing Betti numbers that incrementally adds simplices to the simplicial complex and tests whether or not they create a cycle. In contrast to existing quantum algorithms for computing Betti numbers that work best when the complex has close to the maximal number of simplices, our algorithm works best for sparse complexes. To test whether a simplex creates a cycle, we introduce a quantum span-program algorithm. We show that the query complexity of our span program is parameterized by quantities called the effective resistance and effective capacitance of the boundary of the simplex. Unfortunately, we also prove upper and lower bounds on the effective resistance and capacitance, showing both quantities can be exponentially large with respect to the size of the complex, implying that our algorithm would have to run for exponential time to exactly compute Betti numbers. However, as a corollary to these bounds, we show that the spectral gap of the combinatorial Laplacian can be exponentially small. As the runtime of all previous quantum algorithms for computing Betti numbers are parameterized by the inverse of the spectral gap, our bounds show that all quantum algorithms for computing Betti numbers must run for exponentially long to exactly compute Betti numbers. Finally, we prove some novel formulas for effective resistance and effective capacitance to give intuition for these quantities.

北京阿比特科技有限公司