A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts: - They hold in expectation, with no restrictions on the class of algorithms under consideration. - They hold globally, and do not rely on the notion of localization used by Foster et al. (2021). - Most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.
This paper explores minimum sensing navigation of robots in environments cluttered with obstacles. The general objective is to find a path plan to a goal region that requires minimal sensing effort. In [1], the information-geometric RRT* (IG-RRT*) algorithm was proposed to efficiently find such a path. However, like any stochastic sampling-based planner, the computational complexity of IG-RRT* grows quickly, impeding its use with a large number of nodes. To remedy this limitation, we suggest running IG-RRT* with a moderate number of nodes, and then using a smoothing algorithm to adjust the path obtained. To develop a smoothing algorithm, we explicitly formulate the minimum sensing path planning problem as an optimization problem. For this formulation, we introduce a new safety constraint to impose a bound on the probability of collision with obstacles in continuous-time, in contrast to the common discrete-time approach. The problem is amenable to solution via the convex-concave procedure (CCP). We develop a CCP algorithm for the formulated optimization and use this algorithm for path smoothing. We demonstrate the efficacy of the proposed approach through numerical simulations.
This work introduces efficient symbolic algorithms for quantitative reactive synthesis. We consider resource-constrained robotic manipulators that need to interact with a human to achieve a complex task expressed in linear temporal logic. Our framework generates reactive strategies that not only guarantee task completion but also seek cooperation with the human when possible. We model the interaction as a two-player game and consider regret-minimizing strategies to encourage cooperation. We use symbolic representation of the game to enable scalability. For synthesis, we first introduce value iteration algorithms for such games with min-max objectives. Then, we extend our method to the regret-minimizing objectives. Our benchmarks reveal that our symbolic framework not only significantly improves computation time (up to an order of magnitude) but also can scale up to much larger instances of manipulation problems with up to 2x number of objects and locations than the state of the art.
Non-convex sampling is a key challenge in machine learning, central to non-convex optimization in deep learning as well as to approximate probabilistic inference. Despite its significance, theoretically there remain many important challenges: Existing guarantees (1) typically only hold for the averaged iterates rather than the more desirable last iterates, (2) lack convergence metrics that capture the scales of the variables such as Wasserstein distances, and (3) mainly apply to elementary schemes such as stochastic gradient Langevin dynamics. In this paper, we develop a new framework that lifts the above issues by harnessing several tools from the theory of dynamical systems. Our key result is that, for a large class of state-of-the-art sampling schemes, their last-iterate convergence in Wasserstein distances can be reduced to the study of their continuous-time counterparts, which is much better understood. Coupled with standard assumptions of MCMC sampling, our theory immediately yields the last-iterate Wasserstein convergence of many advanced sampling schemes such as proximal, randomized mid-point, and Runge-Kutta integrators. Beyond existing methods, our framework also motivates more efficient schemes that enjoy the same rigorous guarantees.
Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without impairing time complexity. Yet, they generally do not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that regularized MDPs are a particular instance of robust MDPs with uncertain reward. We thus establish that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We then generalize regularized MDPs to twice regularized MDPs ($\text{R}^2$ MDPs), i.e., MDPs with $\textit{both}$ value and policy regularization. The corresponding Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees, thus reducing robustness to regularization. We numerically show this two-fold advantage on tabular and physical domains, highlighting the fact that $\text{R}^2$ preserves its efficacy in continuous environments.
This paper presents a new approach for the estimation and inference of the regression parameters in a panel data model with interactive fixed effects. It relies on the assumption that the factor loadings can be expressed as an unknown smooth function of the time average of covariates plus an idiosyncratic error term. Compared to existing approaches, our estimator has a simple partial least squares form and does neither require iterative procedures nor the previous estimation of factors. We derive its asymptotic properties by finding out that the limiting distribution has a discontinuity, depending on the explanatory power of our basis functions which is expressed by the variance of the error of the factor loadings. As a result, the usual ``plug-in" methods based on estimates of the asymptotic covariance are only valid pointwise and may produce either over- or under-coverage probabilities. We show that uniformly valid inference can be achieved by using the cross-sectional bootstrap. A Monte Carlo study indicates good performance in terms of mean squared error. We apply our methodology to analyze the determinants of growth rates in OECD countries.
Numerical predictions of quantities of interest measured within physical systems rely on the use of mathematical models that should be validated, or at best, not invalidated. Model validation usually involves the comparison of experimental data (outputs from the system of interest) and model predictions, both obtained at a specific validation scenario. The design of this validation experiment should be directly relevant to the objective of the model, that of predicting a quantity of interest at a prediction scenario. In this paper, we address two specific issues arising when designing validation experiments. The first issue consists in determining an appropriate validation scenario in cases where the prediction scenario cannot be carried out in a controlled environment. The second issue concerns the selection of observations when the quantity of interest cannot be readily observed. The proposed methodology involves the computation of influence matrices that characterize the response surface of given model functionals. Minimization of the distance between influence matrices allow one for selecting a validation experiment most representative of the prediction scenario. We illustrate our approach on two numerical examples. The first example considers the validation of a simple model based on an ordinary differential equation governing an object in free fall to put in evidence the importance of the choice of the validation experiment. The second numerical experiment focuses on the transport of a pollutant and demonstrates the impact that the choice of the quantity of interest has on the validation experiment to be performed.
As data-driven methods are deployed in real-world settings, the processes that generate the observed data will often react to the decisions of the learner. For example, a data source may have some incentive for the algorithm to provide a particular label (e.g. approve a bank loan), and manipulate their features accordingly. Work in strategic classification and decision-dependent distributions seeks to characterize the closed-loop behavior of deploying learning algorithms by explicitly considering the effect of the classifier on the underlying data distribution. More recently, works in performative prediction seek to classify the closed-loop behavior by considering general properties of the mapping from classifier to data distribution, rather than an explicit form. Building on this notion, we analyze repeated risk minimization as the perturbed trajectories of the gradient flows of performative risk minimization. We consider the case where there may be multiple local minimizers of performative risk, motivated by situations where the initial conditions may have significant impact on the long-term behavior of the system. We provide sufficient conditions to characterize the region of attraction for the various equilibria in this settings. Additionally, we introduce the notion of performative alignment, which provides a geometric condition on the convergence of repeated risk minimization to performative risk minimizers.
We study online learning problems in which a decision maker wants to maximize their expected reward without violating a finite set of $m$ resource constraints. By casting the learning process over a suitably defined space of strategy mixtures, we recover strong duality on a Lagrangian relaxation of the underlying optimization problem, even for general settings with non-convex reward and resource-consumption functions. Then, we provide the first best-of-many-worlds type framework for this setting, with no-regret guarantees under stochastic, adversarial, and non-stationary inputs. Our framework yields the same regret guarantees of prior work in the stochastic case. On the other hand, when budgets grow at least linearly in the time horizon, it allows us to provide a constant competitive ratio in the adversarial case, which improves over the best known upper bound bound of $O(\log m \log T)$. Moreover, our framework allows the decision maker to handle non-convex reward and cost functions. We provide two game-theoretic applications of our framework to give further evidence of its flexibility. In doing so, we show that it can be employed to implement budget-pacing mechanisms in repeated first-price auctions.
This paper is concerned with the convergence of a series associated with a certain version of the convexification method. That version has been recently developed by the research group of the first author for solving coefficient inverse problems. The convexification method aims to construct a globally convex Tikhonov-like functional with a Carleman Weight Function in it. In the previous works the construction of the strictly convex weighted Tikhonov-like functional assumes a truncated Fourier series (i.e. a finite series instead of an infinite one) for a function generated by the total wave field. In this paper we prove a convergence property for this truncated Fourier series approximation. More precisely, we show that the residual of the approximate PDE obtained by using the truncated Fourier series tends to zero in $L^{2}$ as the truncation index in the truncated Fourier series tends to infinity. The proof relies on a convergence result in the $H^{1}$-norm for a sequence of $L^{2}$-orthogonal projections on finite-dimensional subspaces spanned by elements of a special Fourier basis. However, due to the ill-posed nature of coefficient inverse problems, we cannot prove that the solution of that approximate PDE, which results from the minimization of that Tikhonov-like functional, converges to the correct solution.
We investigate the possibility of solving continuous non-convex optimization problems using a network of interacting quantum optical oscillators. We propose a native encoding of continuous variables in analog signals associated with the quadrature operators of a set of quantum optical modes. Optical coupling of the modes and noise introduced by vacuum fluctuations from external reservoirs or by weak measurements of the modes are used to optically simulate a diffusion process on a set of continuous random variables. The process is run sufficiently long for it to relax into the steady state of an energy potential defined on a continuous domain. As a first demonstration, we numerically benchmark solving box-constrained quadratic programming (BoxQP) problems using these settings. We consider delay-line and measurement-feedback variants of the experiment. Our benchmarking results demonstrate that in both cases the optical network is capable of solving BoxQP problems over three orders of magnitude faster than a state-of-the-art classical heuristic.