This paper introduces and studies the $k\texttt{-experts}$ problem -- a generalization of the classic Prediction with Expert Advice (i.e., the $\texttt{Experts}$) problem. Unlike the $\texttt{Experts}$ problem, where the learner chooses exactly one expert, in this problem the learner selects a subset of $k$ experts from a pool of $N$ experts at each round. The reward obtained by the learner at any round depends on the rewards of the selected experts. The $k\texttt{-experts}$ problem arises in many practical settings, including online ad placement, personalized news recommendation, and paging. Our primary goal is to design an online learning policy with small regret. In this pursuit, we propose $\texttt{SAGE}$ ($\textbf{Sa}$mpled Hed$\textbf{ge}$) -- a framework for designing efficient online learning policies by leveraging statistical sampling techniques. We show that, for many related problems, $\texttt{SAGE}$ improves upon the state-of-the-art bounds for regret and computational complexity. Furthermore, going beyond the notion of regret, we characterize the mistake bounds achievable by online learning policies for a class of stable loss functions. We conclude the paper by establishing a tight regret lower bound for a variant of the $k\texttt{-experts}$ problem and carrying out experiments with standard datasets.
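As a point of reference for the sampling idea, here is a minimal sketch of a sampled-Hedge-style policy for the $k\texttt{-experts}$ setting: exponential weights are maintained over the $N$ experts, and each round a subset of $k$ experts is drawn from the induced distribution. The learning rate, the full-information reward model, and the max-reward aggregation are illustrative assumptions, not the paper's exact $\texttt{SAGE}$ algorithm.

```python
# Minimal sampled-Hedge sketch for the k-experts setting (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
N, k, T, eta = 10, 3, 1000, 0.1
weights = np.ones(N)

total_reward = 0.0
for t in range(T):
    rewards = rng.random(N)                 # stand-in for the adversary's per-expert rewards
    probs = weights / weights.sum()
    chosen = rng.choice(N, size=k, replace=False, p=probs)  # sample k distinct experts
    total_reward += rewards[chosen].max()   # e.g., max-reward aggregation over the subset
    weights *= np.exp(eta * rewards)        # full-information Hedge update
print(total_reward / T)
```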
A reinforcement learning (RL) policy trained in a nominal environment could fail in a new or perturbed environment due to dynamic variations. Existing robust methods try to obtain a fixed policy for all envisioned dynamic variation scenarios through robust or adversarial training. These methods can lead to conservative performance due to their emphasis on the worst case, and often involve tedious modifications to the training environment. We propose an approach to robustifying a pre-trained non-robust RL policy with $\mathcal{L}_1$ adaptive control. Leveraging the capability of an $\mathcal{L}_1$ control law to quickly estimate and actively compensate for dynamic variations, our approach can significantly improve the robustness of an RL policy trained in a standard (i.e., non-robust) way, either in a simulator or in the real world. Numerical experiments are provided to validate the efficacy of the proposed approach.
We introduce a procedure for conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for statistical learning. On standard examples, this bound scales as $d/n$, with $d$ the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches that reduce to the sequential problem, our bounds remove suboptimal $\log n$ factors and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by the leverage scores of the covariates, nearly matching the optimal risk in the well-specified case without conditions on the noise variance or the approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to the calibration of probabilistic predictions that relies on virtual samples, and it can be computed by solving two logistic regressions. It achieves a non-asymptotic excess risk of $O((d + B^2R^2)/n)$, where $R$ bounds the norm of the features and $B$ that of the comparison parameter; by contrast, no within-model estimator can achieve a better rate than $\min(BR/\sqrt{n}, d e^{BR}/n)$ in general. This provides a more practical alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly answering a question raised by Foster et al. (2018).
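The remark that SMP "can be computed by solving two logistic regressions" suggests the following rough sketch of the virtual-sample idea: for each candidate label of a test point, refit the model on the training data augmented with that virtual labeled point, score the label under its own augmented fit, and normalize. The use of sklearn, the weak regularization, and the exact normalization are assumptions for illustration, not the paper's precise estimator.

```python
# Rough sketch of SMP-style calibration for logistic regression via virtual samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def smp_predict_proba(X_train, y_train, x_test):
    """Return a calibrated P(y=1 | x_test) from two augmented logistic fits."""
    scores = {}
    for y_virtual in (0, 1):
        X_aug = np.vstack([X_train, x_test])
        y_aug = np.append(y_train, y_virtual)
        clf = LogisticRegression(C=1e6).fit(X_aug, y_aug)  # ~unpenalized MLE
        # likelihood of the virtual label at x_test under its own augmented fit
        scores[y_virtual] = clf.predict_proba(x_test.reshape(1, -1))[0, y_virtual]
    return scores[1] / (scores[0] + scores[1])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=200) > 0).astype(int)
print(smp_predict_proba(X, y, rng.normal(size=3)))
```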
We consider distributed online min-max resource allocation with a set of parallel agents and a parameter server. Our goal is to minimize the pointwise maximum over a set of time-varying convex and decreasing cost functions, without a priori information about these functions. We propose a novel online algorithm, termed Distributed Online resource Re-Allocation (DORA), where non-stragglers learn to relinquish resources and share them with stragglers. A notable feature of DORA is that it requires neither gradient calculations nor projection operations, unlike most existing online optimization strategies. This allows it to substantially reduce the computation overhead in large-scale and distributed networks. We show that the dynamic regret of the proposed algorithm is upper bounded by $O\left(T^{\frac{3}{4}}(1+P_T)^{\frac{1}{4}}\right)$, where $T$ is the total number of rounds and $P_T$ is the path-length of the instantaneous minimizers. We further consider an application to the bandwidth allocation problem in distributed online machine learning. Our numerical study demonstrates the efficacy of the proposed solution and its performance advantage over gradient- and/or projection-based resource allocation algorithms in reducing wall-clock time.
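A heavily simplified toy sketch of the reallocation intuition (not the paper's DORA algorithm): per-agent resource shares are nudged from non-stragglers toward stragglers based only on observed costs, with no gradients or projections. The step size `alpha`, the cost model, and the cost-weighted redistribution rule are all assumptions for illustration.

```python
# Toy gradient-free, projection-free reallocation sketch (illustrative only).
import numpy as np

def reallocate(shares, costs, alpha=0.1):
    """Shift resource toward higher-cost (straggler) agents; shares remain on the simplex."""
    straggler_weight = costs / costs.sum()
    give = alpha * shares                  # every agent relinquishes a fraction...
    return shares - give + give.sum() * straggler_weight  # ...pooled and re-shared by need

shares = np.full(4, 0.25)                  # equal initial bandwidth shares
for _ in range(50):
    costs = 1.0 / shares                   # stand-in: cost decreasing in allocated resource
    shares = reallocate(shares, costs)
print(shares, shares.sum())
```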
A reinforcement learning (RL) control policy trained in a nominal environment could fail in a new or perturbed environment due to dynamic variations. For controlling systems with continuous state and action spaces, we propose an add-on approach to robustifying a pre-trained RL policy by augmenting it with an $\mathcal{L}_1$ adaptive controller ($\mathcal{L}_1$AC). Leveraging the capability of an $\mathcal{L}_1$AC for fast estimation of and active compensation for dynamic variations, the proposed approach can improve the robustness of an RL policy that is trained, either in a simulator or in the real world, without consideration of a broad class of dynamic variations. Numerical and real-world experiments empirically demonstrate the efficacy of the proposed approach in robustifying RL policies trained using both model-free and model-based methods. A video of the experiments on a real Pendubot setup is available at https://youtu.be/xgOB9vpyUgE.
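A minimal one-dimensional sketch of the add-on idea: the RL action is augmented with an $\mathcal{L}_1$-style compensation term built from a state predictor, a piecewise-constant disturbance estimate, and a low-pass filter. The gains, the toy dynamics, the adaptation law's exact form, and the stand-in "RL policy" are illustrative assumptions, not the paper's setup.

```python
# Toy 1-D sketch of L1-adaptive augmentation of a pre-trained policy (illustrative only).
import numpy as np

dt, a_true, T = 0.01, -1.0, 2000
x, x_hat, u_l1, sigma_hat = 0.0, 0.0, 0.0, 0.0
a_pred, k_filter = -1.0, 20.0                 # predictor pole and filter bandwidth

rl_policy = lambda x: -2.0 * x                # stand-in for a pre-trained RL policy

for t in range(T):
    sigma = 2.0 * np.sin(0.01 * t)            # unknown (matched) dynamic variation
    u = rl_policy(x) + u_l1                   # RL action + L1 compensation
    x += dt * (a_true * x + u + sigma)        # true (perturbed) plant
    x_hat += dt * (a_pred * x_hat + u + sigma_hat - (x_hat - x))  # state predictor
    sigma_hat = -(x_hat - x) / dt             # piecewise-constant adaptation (toy form)
    u_l1 += dt * k_filter * (-sigma_hat - u_l1)   # low-pass filtered compensation
print(abs(x))                                 # state stays near zero despite sigma
```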
The approximate Carath\'eodory theorem states that given a compact convex set $\mathcal{C}\subset\mathbb{R}^n$ and $p\in\left[2,+\infty\right[$, each point $x^*\in\mathcal{C}$ can be approximated to $\epsilon$-accuracy in the $\ell_p$-norm as the convex combination of $\mathcal{O}(pD_p^2/\epsilon^2)$ vertices of $\mathcal{C}$, where $D_p$ is the diameter of $\mathcal{C}$ in the $\ell_p$-norm. A solution satisfying these properties can be built using probabilistic arguments or by applying mirror descent to the dual problem. We revisit the approximate Carath\'eodory problem by solving the primal problem via the Frank-Wolfe algorithm, providing a simplified analysis and leading to an efficient practical method. Furthermore, improved cardinality bounds are derived naturally using existing convergence rates of the Frank-Wolfe algorithm in different scenarios: when $x^*$ is in the interior of $\mathcal{C}$, when $x^*$ is the convex combination of a subset of vertices with small diameter, or when $\mathcal{C}$ is uniformly convex. We also propose cardinality bounds for $p\in\left[1,2\right[\cup\{+\infty\}$ via a nonsmooth variant of the algorithm. Lastly, we address the problem of finding sparse approximate projections onto $\mathcal{C}$ in the $\ell_p$-norm, for $p\in\left[1,+\infty\right]$.
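A minimal sketch of the Frank-Wolfe approach in the Euclidean case ($p=2$): minimize $f(x)=\|x-x^*\|_2^2$ over $\mathrm{conv}(V)$; after $t$ steps the iterate is a convex combination of at most $t+1$ vertices, giving sparsity for free. Representing $\mathcal{C}$ by an explicit vertex list $V$ and the standard $2/(t+2)$ step size are assumptions made for illustration.

```python
# Frank-Wolfe sketch for approximate Caratheodory in the l2-norm.
import numpy as np

def fw_caratheodory(V, x_star, steps=50):
    """V: (m, n) array of vertices; returns sparse convex weights over V and the iterate."""
    weights = np.zeros(len(V))
    weights[0] = 1.0                       # start at an arbitrary vertex
    x = V[0].astype(float)
    for t in range(1, steps + 1):
        grad = 2.0 * (x - x_star)
        i = int(np.argmin(V @ grad))       # linear minimization oracle over the vertices
        gamma = 2.0 / (t + 2)              # standard Frank-Wolfe step size
        weights *= (1 - gamma)
        weights[i] += gamma
        x = (1 - gamma) * x + gamma * V[i]
    return weights, x

rng = np.random.default_rng(0)
V = rng.normal(size=(100, 20))             # vertex set of a random polytope
x_star = rng.dirichlet(np.ones(100)) @ V   # a point guaranteed to lie in conv(V)
w, x = fw_caratheodory(V, x_star)
print(np.linalg.norm(x - x_star), np.count_nonzero(w))  # accuracy vs. cardinality
```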
Accelerated gradient methods are the cornerstone of large-scale, data-driven optimization problems that arise naturally in machine learning and other fields concerned with data analysis. We introduce a gradient-based optimization framework for achieving acceleration, based on the recently introduced notion of fixed-time stability of dynamical systems. The method is a generalization of simple gradient-based methods, suitably scaled to achieve convergence to the optimizer in fixed time, independent of the initialization. We achieve this by first leveraging a continuous-time framework for designing fixed-time stable dynamical systems, and then providing a consistent discretization strategy such that the equivalent discrete-time algorithm tracks the optimizer in a practically fixed number of iterations. We also provide a theoretical analysis of the convergence behavior of the proposed gradient flows, and of their robustness to additive disturbances, for a range of functions that are strongly convex, strictly convex, or possibly nonconvex but satisfy the Polyak-{\L}ojasiewicz inequality. We also show that the regret bound on the convergence rate is constant by virtue of the fixed-time convergence. The hyperparameters have intuitive interpretations and can be tuned to meet the requirements on the desired convergence rates. We validate the accelerated convergence properties of the proposed schemes on a range of numerical examples against state-of-the-art optimization algorithms. Our work provides insights into developing novel optimization algorithms via the discretization of continuous-time flows.
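A hedged sketch of the general idea: a gradient flow whose magnitude is rescaled by powers of the gradient norm, discretized with forward Euler. The particular exponents, gains, and step size below are illustrative assumptions and not necessarily the flows analyzed in the paper.

```python
# Forward-Euler discretization of a rescaled ("fixed-time"-style) gradient flow.
import numpy as np

def fixed_time_gd(grad, x0, c1=1.0, c2=1.0, e1=0.8, e2=1.5, step=0.01, iters=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm < 1e-12:
            break
        # two rescaled terms: one dominates far from the optimizer, one near it
        x -= step * (c1 * norm**(e1 - 1.0) + c2 * norm**(e2 - 1.0)) * g
    return x

quad_grad = lambda x: 4.0 * x              # gradient of f(x) = 2 ||x||^2
print(fixed_time_gd(quad_grad, np.array([10.0, -5.0])))  # converges near the origin
```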
Escaping saddle points is a central research topic in nonconvex optimization. In this paper, we propose a simple gradient-based algorithm such that for a smooth function $f\colon\mathbb{R}^n\to\mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations. Compared to the previous state-of-the-art algorithms by Jin et al. with $\tilde{O}((\log n)^{4}/\epsilon^{2})$ or $\tilde{O}((\log n)^{6}/\epsilon^{1.75})$ iterations, our algorithm is polynomially better in terms of $\log n$ and matches their complexities in terms of $1/\epsilon$. For the stochastic setting, our algorithm outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}((\log n)^{2}/\epsilon^{4})$ iterations. Technically, our main contribution is the idea of implementing a robust Hessian power method using only gradients, which can find negative curvature near saddle points and achieves a polynomial speedup in $\log n$ compared to perturbed gradient descent methods. Finally, we also perform numerical experiments that support our results.
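A minimal sketch of the key primitive named above: a Hessian power method implemented with gradients only, via the finite-difference identity $H(x)v \approx (\nabla f(x + rv) - \nabla f(x))/r$. Power iteration on the shifted matrix $cI - H$ amplifies the most negative curvature direction near a saddle. The shift $c$, radius $r$, and iteration count are illustrative assumptions.

```python
# Gradient-only negative-curvature search via a finite-difference Hessian power method.
import numpy as np

def hvp(grad, x, v, r=1e-5):
    """Gradient-only Hessian-vector product via finite differences."""
    return (grad(x + r * v) - grad(x)) / r

def negative_curvature(grad, x, dim, c=10.0, iters=100, seed=0):
    v = np.random.default_rng(seed).normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = c * v - hvp(grad, x, v)        # power step on the shifted matrix (c I - H)
        v /= np.linalg.norm(v)
    rayleigh = v @ hvp(grad, x, v)         # curvature along the returned direction
    return v, rayleigh

# saddle at the origin: f(x) = x_0^2 - x_1^2, so H = diag(2, -2)
grad = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])
v, lam = negative_curvature(grad, np.zeros(2), dim=2)
print(v, lam)                              # expect v ~ (0, +-1) and curvature ~ -2
```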
This dissertation studies a fundamental open challenge in deep learning theory: why do deep networks generalize well even while being overparameterized, unregularized, and fitting the training data to zero error? In the first part of the thesis, we will empirically study how training deep networks via stochastic gradient descent implicitly controls the networks' capacity. Subsequently, to show how this leads to better generalization, we will derive {\em data-dependent} {\em uniform-convergence-based} generalization bounds with improved dependencies on the parameter count. Uniform convergence has in fact been the most widely used tool in the deep learning literature, thanks to its simplicity and generality. Given its popularity, in this thesis, we will also take a step back to identify the fundamental limits of uniform convergence as a tool to explain generalization. In particular, we will show that in some example overparameterized settings, {\em any} uniform convergence bound will provide only a vacuous generalization bound. With this realization in mind, in the last part of the thesis, we will change course and introduce an {\em empirical} technique to estimate generalization using unlabeled data. Our technique does not rely on any notion of uniform-convergence-based complexity and is remarkably precise. We will theoretically show why our technique enjoys such precision. We will conclude by discussing how future work could explore novel ways to incorporate distributional assumptions in generalization bounds (such as in the form of unlabeled data) and explore other tools to derive bounds, perhaps by modifying uniform convergence or by developing completely new tools altogether.
In this monograph, I introduce the basic concepts of Online Learning through a modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issues of tuning the parameters of the algorithms and of learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are dealt with through convex surrogate losses and through randomization. The bandit setting is also briefly discussed, touching on the problems of adversarial and stochastic multi-armed bandits. These notes do not require prior knowledge of convex analysis, and all the required mathematical tools are rigorously explained. Moreover, all the proofs have been carefully chosen to be as simple and as short as possible.
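A minimal sketch of one instantiation mentioned above: Online Mirror Descent with the entropic regularizer on the simplex, which recovers the classic Hedge / exponentiated-gradient update. The losses and the fixed learning rate are illustrative stand-ins.

```python
# Entropic Online Mirror Descent (Hedge) on the probability simplex.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 5, 1000, 0.1
p = np.full(d, 1.0 / d)                    # play a point on the simplex
learner_loss = 0.0
cum_losses = np.zeros(d)

for t in range(T):
    loss = rng.random(d)                   # stand-in for adversarial linear losses
    learner_loss += p @ loss               # expected loss of the randomized play
    cum_losses += loss
    p *= np.exp(-eta * loss)               # entropic OMD = multiplicative weights update
    p /= p.sum()
print(learner_loss - cum_losses.min())     # regret against the best fixed expert
```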
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism- and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
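A minimal tabular sketch of the recipe stated above: add a bonus to the reward at each state-action pair, solve the resulting (soft) Bellman equation for the K-values, and act via a Boltzmann policy whose temperature is the risk-seeking parameter $\tau$. The bonus form, the value of $\tau$, and the toy MDP are illustrative assumptions, not the paper's exact choices.

```python
# Tabular K-learning-style sketch: bonused rewards, soft Bellman backup, Boltzmann policy.
import numpy as np

S, A, gamma, tau = 4, 2, 0.9, 1.0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # toy transition kernel P[s, a, s']
R = rng.random((S, A))                          # toy mean rewards
counts = np.ones((S, A))                        # visit counts driving the bonus
bonus = 1.0 / np.sqrt(counts)                   # optimism bonus (stand-in form)

K = np.zeros((S, A))
for _ in range(200):                            # value iteration on bonused rewards
    V = tau * np.log(np.exp(K / tau).sum(axis=1))   # soft (log-sum-exp) backup
    K = R + bonus + gamma * P @ V

policy = np.exp(K / tau)                        # Boltzmann policy with temperature tau
policy /= policy.sum(axis=1, keepdims=True)
print(policy)
```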