We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we also extend the first algorithm to simulator-free linear MDPs, which achieves $\tilde{\mathcal O}(K^{8/9})$ regret and greatly improves over the best existing bound $\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure by Neu & Olkhovskaya (2020), which could again be of independent interest.
We study the fundamental problem of fairly allocating a set of indivisible goods among $n$ agents with additive valuations using the desirable fairness notion of maximin share (MMS). MMS is the most popular share-based notion, in which an agent finds an allocation fair to her if she receives goods worth at least her MMS value. An allocation is called MMS if all agents receive at least their MMS value. Since MMS allocations need not exist when $n>2$, a series of works showed the existence of approximate MMS allocations with the current best factor of $\frac34 + O(\frac{1}{n})$. However, a simple example in [DFL82, BEF21, AGST23] showed the limitations of existing approaches and proved that they cannot improve this factor to $3/4 + \Omega(1)$. In this paper, we bypass these barriers to show the existence of $(\frac{3}{4} + \frac{3}{3836})$-MMS allocations by developing new reduction rules and analysis techniques.
Motivated by the developing mathematics of deep learning, we build universal functions approximators of continuous maps between arbitrary Polish metric spaces $\mathcal{X}$ and $\mathcal{Y}$ using elementary functions between Euclidean spaces as building blocks. Earlier results assume that the target space $\mathcal{Y}$ is a topological vector space. We overcome this limitation by ``randomization'': our approximators output discrete probability measures over $\mathcal{Y}$. When $\mathcal{X}$ and $\mathcal{Y}$ are Polish without additional structure, we prove very general qualitative guarantees; when they have suitable combinatorial structure, we prove quantitative guarantees for H\"{o}lder-like maps, including maps between finite graphs, solution operators to rough differential equations between certain Carnot groups, and continuous non-linear operators between Banach spaces arising in inverse problems. In particular, we show that the required number of Dirac measures is determined by the combinatorial structure of $\mathcal{X}$ and $\mathcal{Y}$. For barycentric $\mathcal{Y}$, including Banach spaces, $\mathbb{R}$-trees, Hadamard manifolds, or Wasserstein spaces on Polish metric spaces, our approximators reduce to $\mathcal{Y}$-valued functions. When the Euclidean approximators are neural networks, our constructions generalize transformer networks, providing a new probabilistic viewpoint of geometric deep learning.
The known a posteriori error analysis of hybrid high-order methods (HHO) treats the stabilization contribution as part of the error and as part of the error estimator for an efficient and reliable error control. This paper circumvents the stabilization contribution on simplicial meshes and arrives at a stabilization-free error analysis with an explicit residual-based a posteriori error estimator for adaptive mesh-refining as well as an equilibrium-based guaranteed upper error bound (GUB). Numerical evidence in a Poisson model problem supports that the GUB leads to realistic upper bounds for the displacement error in the piecewise energy norm. The adaptive mesh-refining algorithm associated to the explicit residual-based a posteriori error estimator recovers the optimal convergence rates in computational benchmarks.
In this work, we give a statistical characterization of the $\gamma$-regret for arbitrary structured bandit problems, the regret which arises when comparing against a benchmark that is $\gamma$ times the optimal solution. The $\gamma$-regret emerges in structured bandit problems over a function class $\mathcal{F}$ where finding an exact optimum of $f \in \mathcal{F}$ is intractable. Our characterization is given in terms of the $\gamma$-DEC, a statistical complexity parameter for the class $\mathcal{F}$, which is a modification of the constrained Decision-Estimation Coefficient (DEC) of Foster et al., 2023 (and closely related to the original offset DEC of Foster et al., 2021). Our lower bound shows that the $\gamma$-DEC is a fundamental limit for any model class $\mathcal{F}$: for any algorithm, there exists some $f \in \mathcal{F}$ for which the $\gamma$-regret of that algorithm scales (nearly) with the $\gamma$-DEC of $\mathcal{F}$. We provide an upper bound showing that there exists an algorithm attaining a nearly matching $\gamma$-regret. Due to significant challenges in applying the prior results on the DEC to the $\gamma$-regret case, both our lower and upper bounds require novel techniques and a new algorithm.
We study partially linear models in settings where observations are arranged in independent groups but may exhibit within-group dependence. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximising a (restricted) likelihood from random effects modelling or by using generalised estimating equations. We introduce a new 'sandwich loss' whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in linear parameter estimation accuracy when they are not. Under relatively mild conditions, our estimated coefficients are asymptotically Gaussian and enjoy minimal variance among estimators with weights restricted to a given class of functions, when user-chosen regression methods are used to estimate nuisance functions. We further expand the class of functional forms for the weights that may be fitted beyond parametric models by leveraging the flexibility of modern machine learning methods within a new gradient boosting scheme for minimising the sandwich loss. We demonstrate the effectiveness of both the sandwich loss and what we call 'sandwich boosting' in a variety of settings with simulated and real-world data.
Distributional assumptions have been shown to be necessary for the robust learnability of concept classes when considering the exact-in-the-ball robust risk and access to random examples by Gourdeau et al. (2019). In this paper, we study learning models where the learner is given more power through the use of local queries, and give the first distribution-free algorithms that perform robust empirical risk minimization (ERM) for this notion of robustness. The first learning model we consider uses local membership queries (LMQ), where the learner can query the label of points near the training sample. We show that, under the uniform distribution, LMQs do not increase the robustness threshold of conjunctions and any superclass, e.g., decision lists and halfspaces. Faced with this negative result, we introduce the local equivalence query ($\mathsf{LEQ}$) oracle, which returns whether the hypothesis and target concept agree in the perturbation region around a point in the training sample, as well as a counterexample if it exists. We show a separation result: on the one hand, if the query radius $\lambda$ is strictly smaller than the adversary's perturbation budget $\rho$, then distribution-free robust learning is impossible for a wide variety of concept classes; on the other hand, the setting $\lambda=\rho$ allows us to develop robust ERM algorithms. We then bound the query complexity of these algorithms based on online learning guarantees and further improve these bounds for the special case of conjunctions. We finish by giving robust learning algorithms for halfspaces on $\{0,1\}^n$ and then obtaining robustness guarantees for halfspaces in $\mathbb{R}^n$ against precision-bounded adversaries.
We provide an estimator of the covariance matrix that achieves the optimal rate of convergence (up to constant factors) in the operator norm under two standard notions of data contamination: We allow the adversary to corrupt an $\eta$-fraction of the sample arbitrarily, while the distribution of the remaining data points only satisfies that the $L_{p}$-marginal moment with some $p \ge 4$ is equivalent to the corresponding $L_2$-marginal moment. Despite requiring the existence of only a few moments, our estimator achieves the same tail estimates as if the underlying distribution were Gaussian. As a part of our analysis, we prove a dimension-free Bai-Yin type theorem in the regime $p > 4$.
While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
Canonical models of Markov decision processes (MDPs) usually consider geometric discounting based on a constant discount factor. While this standard modeling approach has led to many elegant results, some recent studies indicate the necessity of modeling time-varying discounting in certain applications. This paper studies a model of infinite-horizon MDPs with time-varying discount factors. We take a game-theoretic perspective -- whereby each time step is treated as an independent decision maker with their own (fixed) discount factor -- and we study the subgame perfect equilibrium (SPE) of the resulting game as well as the related algorithmic problems. We present a constructive proof of the existence of an SPE and demonstrate the EXPTIME-hardness of computing an SPE. We also turn to the approximate notion of $\epsilon$-SPE and show that an $\epsilon$-SPE exists under milder assumptions. An algorithm is presented to compute an $\epsilon$-SPE, of which an upper bound of the time complexity, as a function of the convergence property of the time-varying discount factor, is provided.
We consider the problem of learning from data corrupted by underrepresentation bias, where positive examples are filtered from the data at different, unknown rates for a fixed number of sensitive groups. We show that with a small amount of unbiased data, we can efficiently estimate the group-wise drop-out parameters, even in settings where intersectional group membership makes learning each intersectional rate computationally infeasible. Using this estimate for the group-wise drop-out rate, we construct a re-weighting scheme that allows us to approximate the loss of any hypothesis on the true distribution, even if we only observe the empirical error on a biased sample. Finally, we present an algorithm encapsulating this learning and re-weighting process, and we provide strong PAC-style guarantees that, with high probability, our estimate of the risk of the hypothesis over the true distribution will be arbitrarily close to the true risk.