We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. The measure is strictly stronger than the standard static regret that benchmarks the learner's performance with a fixed compared policy. We consider three foundational models of online MDPs, including episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For these three models, we propose novel online ensemble algorithms and establish their dynamic regret guarantees respectively, in which the results for episodic (loop-free) SSP are provably minimax optimal in terms of time horizon and certain non-stationarity measure. Furthermore, when the online environments encountered by the learner are predictable, we design improved algorithms and achieve better dynamic regret bounds for the episodic (loop-free) SSP; and moreover, we demonstrate impossibility results for the infinite-horizon MDPs.
A reliable supply with electric power is vital for our society. Transmission line failures are among the biggest threats for power grid stability as they may lead to a splitting of the grid into mutual asynchronous fragments. New conceptual methods are needed to assess system stability that complement existing simulation models. In this article we propose a combination of network science metrics and machine learning models to predict the risk of desynchronisation events. Network science provides metrics for essential properties of transmission lines such as their redundancy or centrality. Machine learning models perform inherent feature selection and thus reveal key factors that determine network robustness and vulnerability. As a case study, we train and test such models on simulated data from several synthetic test grids. We find that the integrated models are capable of predicting desynchronisation events after line failures with an average precision greater than $0.996$ when averaging over all data sets. Learning transfer between different data sets is generally possible, at a slight loss of prediction performance. Our results suggest that power grid desynchronisation is essentially governed by only a few network metrics that quantify the networks ability to reroute flow without creating exceedingly high static line loadings.
We consider the problem of fairly allocating items to a set of individuals, when the items are arriving online. A central solution concept in fair allocation is competitive equilibrium: every individual is endowed with a budget of faux currency, and the resulting competitive equilibrium is used to allocate. For the online fair allocation context, the PACE algorithm of Gao et al. [2021] leverages the dual averaging algorithm to approximate competitive equilibria. The authors show that, when items arrive i.i.d, the algorithm asymptotically achieves the fairness and efficiency guarantees of the offline competitive equilibrium allocation. However, real-world data is typically not stationary. One could instead model the data as adversarial, but this is often too pessimistic in practice. Motivated by this consideration, we study an online fair allocation setting with nonstationary item arrivals. To address this setting, we first develop new online learning results for the dual averaging algorithm under nonstationary input models. We show that the dual averaging iterates converge in mean square to both the underlying optimal solution of the "true" stochastic optimization problem as well as the "hindsight" optimal solution of the finite-sum problem given by the sample path. Our results apply to several nonstationary input models: adversarial corruption, ergodic input, and block-independent (including periodic) input. Here, the bound on the mean square error depends on a nonstationarity measure of the input. We recover the classical bound when the input data is i.i.d. We then show that our dual averaging results imply that the PACE algorithm for online fair allocation simultaneously achieves "best of both worlds" guarantees against any of these input models. Finally, we conduct numerical experiments which show strong empirical performance against nonstationary inputs.
CVaR (Conditional Value at Risk) is a risk metric widely used in finance. However, dynamically optimizing CVaR is difficult since it is not a standard Markov decision process (MDP) and the principle of dynamic programming fails. In this paper, we study the infinite-horizon discrete-time MDP with a long-run CVaR criterion, from the view of sensitivity-based optimization. By introducing a pseudo CVaR metric, we derive a CVaR difference formula which quantifies the difference of long-run CVaR under any two policies. The optimality of deterministic policies is derived. We obtain a so-called Bellman local optimality equation for CVaR, which is a necessary and sufficient condition for local optimal policies and only necessary for global optimal policies. A CVaR derivative formula is also derived for providing more sensitivity information. Then we develop a policy iteration type algorithm to efficiently optimize CVaR, which is shown to converge to local optima in the mixed policy space. We further discuss some extensions including the mean-CVaR optimization and the maximization of CVaR. Finally, we conduct numerical experiments relating to portfolio management to demonstrate the main results. Our work may shed light on dynamically optimizing CVaR from a sensitivity viewpoint.
Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved towards characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., $S^6A^4 \,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves -- by at least a factor of $S^5A^3$ -- upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called {\em reference-advantage decomposition}), the proposed algorithm employs an {\em early-settled} reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.
We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states. Motivated by the recently proposed proximal causal inference framework, we develop a non-parametric identification result for estimating the policy value via a sequence of so-called V-bridge functions with the help of time-dependent proxy variables. We then develop a fitted-Q-evaluation-type algorithm to estimate V-bridge functions recursively, where a non-parametric instrumental variable (NPIV) problem is solved at each step. By analyzing this challenging sequential NPIV problem, we establish the finite-sample error bounds for estimating the V-bridge functions and accordingly that for evaluating the policy value, in terms of the sample size, length of horizon and so-called (local) measure of ill-posedness at each step. To the best of our knowledge, this is the first finite-sample error bound for OPE in POMDPs under non-parametric models.
As a powerful Bayesian non-parameterized algorithm, the Gaussian process (GP) has performed a significant role in Bayesian optimization and signal processing. GPs have also advanced online decision-making systems because their posterior distribution has a closed-form solution. However, its training and inference process requires all historic data to be stored and the GP model to be trained from scratch. For those reasons, several online GP algorithms, such as O-SGPR and O-SVGP, have been specifically designed for streaming settings. In this paper, we present a new theoretical framework for online GPs based on the online probably approximately correct (PAC) Bayes theory. The framework offers both a guarantee of generalized performance and good accuracy. Instead of minimizing the marginal likelihood, our algorithm optimizes both the empirical risk function and a regularization item, which is in proportion to the divergence between the prior distribution and posterior distribution of parameters. In addition to its theoretical appeal, the algorithm performs well empirically on several regression datasets. Compared to other online GP algorithms, ours yields a generalization guarantee and very competitive accuracy.
A mixture of experts models the conditional density of a response variable using a mixture of regression models with covariate-dependent mixture weights. We extend the finite mixture of experts model by allowing the parameters in both the mixture components and the weights to evolve in time by following random walk processes. Inference for time-varying parameters in richly parameterized mixture of experts models is challenging. We propose a sequential Monte Carlo algorithm for online inference and based on a tailored proposal distribution built on ideas from linear Bayes methods and the EM algorithm. The method gives a unified treatment for mixtures with time-varying parameters, including the special case of static parameters. We assess the properties of the method on simulated data and on industrial data where the aim is to predict software faults in a continuously upgraded large-scale software project.
We introduce a simple but general online learning framework in which a learner plays against an adversary in a vector-valued game that changes every round. Even though the learner's objective is not convex-concave (and so the minimax theorem does not apply), we give a simple algorithm that can compete with the setting in which the adversary must announce their action first, with optimally diminishing regret. We demonstrate the power of our framework by using it to (re)derive optimal bounds and efficient algorithms across a variety of domains, ranging from multicalibration to a large set of no regret algorithms, to a variant of Blackwell's approachability theorem for polytopes with fast convergence rates. As a new application, we show how to ``(multi)calibeat'' an arbitrary collection of forecasters -- achieving an exponentially improved dependence on the number of models we are competing against, compared to prior work.
Conformal prediction is a widely used method to quantify uncertainty in settings where the data is independent and identically distributed (IID), or more generally, exchangeable. Conformal prediction takes in a pre-trained classifier, a calibration dataset and a confidence level as inputs, and returns a function which maps feature vectors to subsets of classes. The output of the returned function for a new feature vector (i.e., a test data point) is guaranteed to contain the true class with the pre-specified confidence. Despite its success and usefulness in IID settings, extending conformal prediction to non-exchangeable (e.g., Markovian) data in a manner that provably preserves all desirable theoretical properties has largely remained an open problem. As a solution, we extend conformal prediction to the setting of a Hidden Markov Model (HMM) with unknown parameters. The key idea behind the proposed method is to partition the non-exchangeable Markovian data from the HMM into exchangeable blocks by exploiting the de Finetti's Theorem for Markov Chains discovered by Diaconis and Freedman (1980). The permutations of the exchangeable blocks are then viewed as randomizations of the observed Markovian data from the HMM. The proposed method provably retains all desirable theoretical guarantees offered by the classical conformal prediction framework.
Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.