We propose an algorithm that uses linear function approximation (LFA) for stochastic shortest path (SSP). Under minimal assumptions, it obtains sublinear regret, is computationally efficient, and uses stationary policies. To our knowledge, this is the first such algorithm in the LFA literature (for SSP or other formulations). Our algorithm is a special case of a more general one, which achieves regret scaling as the square root of the number of episodes, given access to a certain computation oracle.
This work provides a theoretical framework for the pose estimation problem using total least squares for vector observations from landmark features. First, the optimization framework is formulated with observation vectors extracted from point cloud features. Then, error-covariance expressions are derived. The attitude and position solutions obtained via the derived optimization framework are proven to reach the bounds defined by the Cram\'er-Rao lower bound under the small-angle approximation of attitude errors. The measurement data for the simulation of this problem are provided through a series of vector observation scans, and a fully populated observation noise-covariance matrix is assumed as the weight in the cost function to cover the most general case of sensor uncertainty. Here, previous derivations are expanded for the pose estimation problem to include more general error correlations than earlier treatments, which assume isotropic noise. The proposed solution is simulated in a Monte Carlo framework to validate the error-covariance analysis.
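For concreteness, the following is a minimal sketch of the vector-observation pose problem in its simplest form: the classical SVD-based (Kabsch/Horn) least-squares solution for rotation and translation under an isotropic-noise assumption. The paper's framework generalizes this to a fully populated, correlated noise covariance used as the weight in the cost function; the function and variable names below are illustrative only.

```python
import numpy as np

def pose_from_correspondences(body_pts, ref_pts):
    """Estimate rotation R and translation t minimizing sum ||ref - (R @ body + t)||^2.

    Classical SVD (Kabsch/Horn) solution under an isotropic-noise assumption;
    the paper's framework generalizes this to a fully populated noise covariance.
    """
    mu_b = body_pts.mean(axis=0)
    mu_r = ref_pts.mean(axis=0)
    B = (ref_pts - mu_r).T @ (body_pts - mu_b)      # cross-covariance matrix
    U, _, Vt = np.linalg.svd(B)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # enforce det(R) = +1
    R = U @ D @ Vt
    t = mu_r - R @ mu_b
    return R, t

# Quick Monte-Carlo-style check with synthetic landmark vectors.
rng = np.random.default_rng(0)
body = rng.normal(size=(50, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1                              # make it a proper rotation
t_true = np.array([1.0, -2.0, 0.5])
ref = body @ R_true.T + t_true + 0.01 * rng.normal(size=body.shape)
R_hat, t_hat = pose_from_correspondences(body, ref)
```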
We revisit the finite-time analysis of policy gradient methods in one of the simplest settings: finite state and action MDPs with a policy class consisting of all stochastic policies and with exact gradient evaluations. Recent work has viewed this setting as an instance of smooth non-linear optimization and established sub-linear convergence rates with small step-sizes. Here, we take a different perspective based on connections with policy iteration and show that many variants of policy gradient methods succeed with large step-sizes and attain a linear rate of convergence.
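As a minimal illustration of the setting (exact gradients, softmax parameterization over all stochastic policies on a tabular MDP), the sketch below performs one exact policy-gradient step via the policy gradient theorem; it is not the paper's analysis, and the arrays P, r, mu are assumed placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def exact_pg_step(theta, P, r, mu, gamma, eta):
    """One exact softmax policy-gradient update on a tabular MDP.

    P: (S, A, S) transition kernel, r: (S, A) rewards, mu: initial distribution.
    """
    S, A = r.shape
    pi = softmax(theta)                                   # policy, shape (S, A)
    P_pi = np.einsum('sa,sat->st', pi, P)                 # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact policy evaluation
    Q = r + gamma * P @ V                                 # shape (S, A)
    adv = Q - V[:, None]
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)   # discounted state visitation
    grad = d[:, None] * pi * adv                          # softmax policy-gradient theorem
    return theta + eta * grad, mu @ V                     # updated logits, objective value
```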
In this paper we prove the efficacy of a simple greedy algorithm for a finite horizon online resource allocation/matching problem, when the corresponding static planning linear program (SPP) exhibits a non-degeneracy condition called the general position gap (GPG). The key intuition that we formalize is that the solution of the reward-maximizing SPP is the same as that of a feasibility LP restricted to the optimal basic activities, and under GPG this solution can be tracked with bounded regret by a greedy algorithm, i.e., without the commonly used technique of periodically resolving the SPP. The goal of the decision maker is to combine resources (from a finite set of resource types) into configurations (from a finite set of feasible configurations), where each configuration is specified by the number of resources consumed of each type and a reward. The resources are further subdivided into three types: offline (whose quantity is known and available at time 0), online-queueable (which arrive online and can be stored in a buffer), and online-nonqueueable (which arrive online and must be matched on arrival or lost). Under GPG we prove that (i) our greedy algorithm achieves bounded any-time regret of $\mathcal{O}(1/\epsilon_0)$ for matching reward ($\epsilon_0$ is a measure of the GPG) when no configuration contains both an online-queueable and an online-nonqueueable resource, and (ii) $\mathcal{O}(\log t)$ expected any-time regret otherwise (we also prove a matching lower bound). By considering the three types of resources, our matching framework encompasses several well-studied problems such as dynamic multi-sided matching, network revenue management, online stochastic packing, and multiclass queueing systems.
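Below is a hedged sketch of the first ingredient described above: solving a static planning LP with an off-the-shelf solver and reading off the optimal basic activities whose quotas a greedy rule could then track without resolving. The consumption matrix, rates, and rewards are illustrative placeholders, not the paper's instance.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative static planning LP: choose configuration rates x >= 0 to maximize
# reward subject to resource-availability rates (placeholder data, not the paper's).
rewards = np.array([3.0, 2.0, 4.0])          # reward per unit of each configuration
consumption = np.array([[1, 0, 1],           # resources of type i consumed per unit of config j
                        [0, 1, 1],
                        [1, 1, 0]], dtype=float)
rates = np.array([1.0, 0.8, 1.2])            # availability / arrival rates of resource types

res = linprog(-rewards, A_ub=consumption, b_ub=rates, bounds=[(0, None)] * 3)
x_star = res.x
basic_activities = np.flatnonzero(x_star > 1e-9)   # configurations a greedy rule would track
print(x_star, basic_activities)
```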
In this paper we study a multi-class, multi-server queueing system with stochastic rewards of job-server assignments following a bilinear model in feature vectors representing jobs and servers. Our goal is regret minimization against an oracle policy that has complete information about the system parameters. We propose a scheduling algorithm that uses a linear bandit algorithm along with dynamic allocation of jobs to servers. For the baseline setting, in which mean job service times are identical for all jobs, we show that our algorithm has sub-linear regret, as well as a sub-linear bound on the mean queue length, in the time horizon. We further show that similar bounds hold under more general assumptions, allowing for non-identical mean job service times for different job classes and a time-varying set of server classes. We also show that better regret and mean queue length bounds can be guaranteed by an algorithm having access to the traffic intensities of job classes. We present results of numerical experiments demonstrating how the regret and mean queue length of our algorithms depend on various system parameters, and we compare their performance against a previously proposed algorithm using synthetic randomly generated data and a real-world cluster computing data trace.
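As a rough sketch of the bandit component only, the class below maintains a LinUCB-style ridge-regression estimate of a bilinear reward using the flattened outer product of job and server features; the regularization and confidence width are assumed tuning parameters, not the paper's choices, and the scheduling/queueing logic is omitted.

```python
import numpy as np

class BilinearUCB:
    """Minimal LinUCB-style estimator for bilinear job-server rewards.

    Uses the flattened outer product of job and server features as the context;
    the confidence width beta is a tunable placeholder, not the paper's choice.
    """
    def __init__(self, d_job, d_server, lam=1.0, beta=1.0):
        d = d_job * d_server
        self.A = lam * np.eye(d)       # regularized design matrix
        self.b = np.zeros(d)
        self.beta = beta

    def _phi(self, x_job, x_server):
        return np.outer(x_job, x_server).ravel()

    def ucb(self, x_job, x_server):
        phi = self._phi(x_job, x_server)
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                       # ridge estimate of the bilinear parameter
        return phi @ theta + self.beta * np.sqrt(phi @ A_inv @ phi)

    def update(self, x_job, x_server, reward):
        phi = self._phi(x_job, x_server)
        self.A += np.outer(phi, phi)
        self.b += reward * phi
```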
Bayesian persuasion is a model for understanding strategic information revelation: an agent with an informational advantage, called a sender, strategically discloses information by sending signals to another agent, called a receiver. In algorithmic Bayesian persuasion, we are interested in efficiently designing the sender's signaling schemes that lead the receiver to take action in favor of the sender. This paper studies algorithmic Bayesian-persuasion settings where the receiver's feasible actions are specified by combinatorial constraints, e.g., matroids or paths in graphs. We first show that constant-factor approximation is NP-hard even in some special cases of matroids or paths. We then propose a polynomial-time algorithm for general matroids by assuming the number of states of nature to be a constant. We finally consider a relaxed notion of persuasiveness, called CCE-persuasiveness, and present a sufficient condition for polynomial-time approximability.
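For reference, the classical single-receiver persuasion problem (without the combinatorial action constraints studied in the paper) can be written as a linear program over direct, obedient signaling schemes; the sketch below illustrates this baseline formulation with placeholder prior and utility matrices.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_signaling(mu, u_s, u_r):
    """Standard single-receiver Bayesian-persuasion LP (direct, obedient scheme).

    mu: prior over states (n_states,); u_s, u_r: (n_states, n_actions) sender/receiver
    utilities. Variables x[w, a] = P(recommend action a | state w).
    """
    n_w, n_a = u_s.shape
    c = -(mu[:, None] * u_s).ravel()              # maximize sender's expected utility

    A_ub, b_ub = [], []
    for a in range(n_a):                          # obedience: recommended a beats any a'
        for a2 in range(n_a):
            if a2 == a:
                continue
            row = np.zeros((n_w, n_a))
            row[:, a] = -mu * (u_r[:, a] - u_r[:, a2])
            A_ub.append(row.ravel())
            b_ub.append(0.0)

    A_eq = np.zeros((n_w, n_w * n_a))             # each state's recommendations sum to 1
    for w in range(n_w):
        A_eq[w, w * n_a:(w + 1) * n_a] = 1.0
    b_eq = np.ones(n_w)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (n_w * n_a))
    return res.x.reshape(n_w, n_a)
```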
Given a set P of n points in the plane, the unit-disk graph G_{r}(P) with respect to a parameter r is an undirected graph whose vertex set is P such that an edge connects two points p, q \in P if the Euclidean distance between p and q is at most r (the weight of the edge is 1 in the unweighted case and is the distance between p and q in the weighted case). Given a value \lambda>0 and two points s and t of P, we consider the following reverse shortest path problem: compute the smallest r such that the shortest path length between s and t in G_r(P) is at most \lambda. In this paper, we present an O(\lfloor \lambda \rfloor \cdot n \log n)-time algorithm and an O(n^{5/4} \log^{7/4} n)-time algorithm for the unweighted case, as well as an O(n^{5/4} \log^{5/2} n)-time algorithm for the weighted case. We also consider the L_1 version of the problem, where the distance between two points is measured by the L_1 metric; we solve the problem in O(n \log^3 n) time for both the unweighted and weighted cases.
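A brute-force baseline helps fix ideas (it is not the paper's subquadratic algorithm): since the unweighted shortest-path length can only change when r crosses a pairwise distance, one can binary-search over the sorted pairwise distances, testing each candidate r with a BFS. The sketch below assumes a feasible r exists.

```python
import numpy as np
from collections import deque

def reverse_shortest_path_radius(points, s, t, lam):
    """Smallest r such that the BFS distance from s to t in the unweighted
    unit-disk graph G_r(P) is at most lam. Brute-force O(n^2 log n) baseline,
    not the paper's subquadratic algorithms; assumes a feasible r exists."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)

    def hops(r):
        adj = dist2 <= r * r
        d = [-1] * n
        d[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in np.flatnonzero(adj[u]):
                if d[v] < 0:
                    d[v] = d[u] + 1
                    q.append(v)
        return d[t] if d[t] >= 0 else float('inf')

    candidates = np.unique(np.sqrt(dist2[np.triu_indices(n, k=1)]))
    lo, hi = 0, len(candidates) - 1          # binary search over critical radii
    while lo < hi:
        mid = (lo + hi) // 2
        if hops(candidates[mid]) <= lam:
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]
```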
Spatially inhomogeneous functions, which may be smooth in some regions and rough in other regions, are modelled naturally in a Bayesian manner using so-called Besov priors, which are given by random wavelet expansions with Laplace-distributed coefficients. This paper studies theoretical guarantees for such prior measures: specifically, we examine their frequentist posterior contraction rates in the setting of non-linear inverse problems with Gaussian white noise. Our results are first derived under a general local Lipschitz assumption on the forward map. We then verify the assumption for two non-linear inverse problems arising from elliptic partial differential equations: the Darcy flow model from geophysics and a model for the Schr\"odinger equation appearing in tomography. In the course of the proofs, we also obtain novel concentration inequalities for penalized least squares estimators with $\ell^1$ wavelet penalty, which have a natural interpretation as maximum a posteriori (MAP) estimators. The true parameter is assumed to belong to some spatially inhomogeneous Besov class $B^{\alpha}_{11}$, $\alpha>0$. In a setting with direct observations, we complement these upper bounds with a lower bound on the rate of contraction for arbitrary Gaussian priors. An immediate consequence of our results is that while Laplace priors can achieve minimax-optimal rates over $B^{\alpha}_{11}$-classes, Gaussian priors are limited to a contraction rate that is slower by a polynomial factor. This gives information-theoretic justification for the intuition that Laplace priors are more compatible with the $\ell^1$ regularity structure of the underlying parameter.
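To make the prior concrete, the sketch below draws a sample from one common construction of a $B^{\alpha}_{11}$ Besov prior in one dimension: a Haar wavelet expansion with i.i.d. Laplace coefficients and level-dependent scaling. Normalization conventions vary across the literature, so the scaling exponent used here is an assumption rather than the paper's exact choice.

```python
import numpy as np

def sample_besov_prior(alpha, J, rng=None):
    """Draw from a B^alpha_{11} Besov prior via a Haar wavelet expansion with
    i.i.d. Laplace coefficients on [0, 1), evaluated on a grid of 2^J points.

    Level-j coefficients are scaled by 2^{-j (alpha + 1/2 - 1)} (a usual d=1,
    p=1 convention; constants differ between papers)."""
    rng = np.random.default_rng() if rng is None else rng
    N = 2 ** J
    x = (np.arange(N) + 0.5) / N
    u = rng.laplace() * np.ones(N)                 # scaling-function (father wavelet) term
    for j in range(J):
        scale = 2.0 ** (-j * (alpha - 0.5))
        xi = rng.laplace(size=2 ** j)              # Laplace-distributed coefficients
        for k in range(2 ** j):
            # Haar mother wavelet at level j, shift k, L^2-normalized by 2^{j/2}.
            psi = np.where((x >= k / 2 ** j) & (x < (k + 0.5) / 2 ** j), 1.0,
                  np.where((x >= (k + 0.5) / 2 ** j) & (x < (k + 1) / 2 ** j), -1.0, 0.0))
            u += scale * xi[k] * 2 ** (j / 2) * psi
    return x, u
```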
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate $\tilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$, which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first (nearly) horizon-free regret bound beyond the finite-horizon MDP setting.
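The following is a minimal sketch of the mechanism the abstract describes: value iteration for an optimistic SSP in which the empirical costs are lowered by an exploration bonus and the empirical transitions are slightly skewed toward the goal so that the iteration converges. The skewing constant and the form of the bonus are placeholders, not EB-SSP's exact choices.

```python
import numpy as np

def optimistic_ssp_vi(P_hat, c_hat, bonus, eta=1e-3, tol=1e-8, max_iter=10_000):
    """Value iteration for an optimistic SSP built from empirical quantities.

    P_hat: (S, A, S+1) empirical transitions with the last state as the goal;
    c_hat: (S, A) empirical costs; bonus: (S, A) exploration bonus.
    eta skews transitions toward the goal so the iteration converges
    (eta and the bonus form are placeholders, not the paper's exact choices)."""
    S, A, _ = P_hat.shape
    P_tilde = (1.0 - eta) * P_hat
    P_tilde[:, :, -1] += eta                      # extra mass on the goal state
    c_opt = np.clip(c_hat - bonus, 0.0, None)     # optimistic (lowered) costs

    V = np.zeros(S + 1)                           # V[goal] stays 0
    for _ in range(max_iter):
        Q = c_opt + P_tilde @ V                   # (S, A) optimistic Q-values
        V_new = np.concatenate([Q.min(axis=1), [0.0]])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmin(axis=1)                    # optimistic values and greedy policy
```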
Let $P$ be a set of points in $\mathbb{R}^d$, where each point $p\in P$ has an associated transmission range $\rho(p)$. The range assignment $\rho$ induces a directed communication graph $\mathcal{G}_{\rho}(P)$ on $P$, which contains an edge $(p,q)$ iff $|pq| \leq \rho(p)$. In the broadcast range-assignment problem, the goal is to assign the ranges such that $\mathcal{G}_{\rho}(P)$ contains an arborescence rooted at a designated node, while the cost $\sum_{p \in P} \rho(p)^2$ is minimized. We study trade-offs between the stability of the solution -- the number of ranges that are modified when a point is inserted into or deleted from $P$ -- and its approximation ratio. We introduce $k$-stable algorithms, which are algorithms that modify the range of at most $k$ points when they update the solution. We also introduce the concept of a stable approximation scheme (SAS). A SAS is an update algorithm that, for any given fixed parameter $\varepsilon>0$, is $k(\varepsilon)$-stable and maintains a solution with approximation ratio $1+\varepsilon$, where the stability parameter $k(\varepsilon)$ only depends on $\varepsilon$ and not on the size of $P$. We study such trade-offs in three settings.
- In $\mathbb{R}^1$, we present a SAS with $k(\varepsilon)=O(1/\varepsilon)$, which we show is tight in the worst case. We also present a $1$-stable $(6+2\sqrt{5})$-approximation algorithm, a $2$-stable $2$-approximation algorithm, and a $3$-stable $1.97$-approximation algorithm.
- In $\mathbb{S}^1$ (where the underlying space is a circle), we prove that no SAS exists, even though an optimal solution can always be obtained by cutting the circle at an appropriate point and solving the resulting problem in $\mathbb{R}^1$.
- In $\mathbb{R}^2$, we also prove that no SAS exists, and we present an $O(1)$-stable $O(1)$-approximation algorithm.
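A small helper clarifies the objects involved: given a candidate range assignment, it evaluates the cost $\sum_{p \in P} \rho(p)^2$ and checks whether the induced directed graph reaches every point from the designated root (i.e., contains the required arborescence). This is only an evaluation routine, not one of the update algorithms above.

```python
import numpy as np
from collections import deque

def broadcast_cost_and_feasible(points, rho, root):
    """Cost sum(rho^2) of a range assignment and whether the induced directed
    graph G_rho(P), with edge (p, q) iff |pq| <= rho[p], reaches every point
    from the designated root (i.e., contains the required arborescence)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    reachable = {root}
    q = deque([root])
    while q:
        p = q.popleft()
        for t in np.flatnonzero(dist[p] <= rho[p]):
            if t not in reachable:
                reachable.add(t)
                q.append(t)
    cost = float(np.sum(np.asarray(rho) ** 2))
    return cost, len(reachable) == n
```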
We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
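As a hedged sketch of the recipe described above (add a bonus to the reward and solve a Bellman equation), the code below runs a finite-horizon soft (log-sum-exp) Bellman recursion at temperature tau, the risk-seeking parameter, and returns the induced Boltzmann policy; the bonus and tau here are placeholders rather than the paper's tuned schedule.

```python
import numpy as np

def k_learning_backup(P, r, bonus, tau, L):
    """Finite-horizon soft Bellman recursion in the spirit of K-learning:
    bonus-augmented rewards, log-sum-exp backup at temperature tau (the
    risk-seeking parameter), Boltzmann policy over the resulting K-values.
    The bonus and tau are placeholders, not the paper's tuned schedule."""
    S, A, _ = P.shape
    K = np.zeros((L + 1, S, A))
    for l in range(L - 1, -1, -1):
        # Soft value of the next step: tau * logsumexp(K / tau) over actions,
        # computed with the usual max-shift for numerical stability.
        m = K[l + 1].max(axis=1, keepdims=True)
        soft_V = (m + tau * np.log(np.exp((K[l + 1] - m) / tau).sum(axis=1, keepdims=True))).ravel()
        K[l] = r + bonus + P @ soft_V
    logits = K[0] / tau                           # Boltzmann policy at temperature tau
    policy = np.exp(logits - logits.max(axis=1, keepdims=True))
    policy /= policy.sum(axis=1, keepdims=True)
    return K, policy
```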