The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map $\phi(x, a)$ that maps state-action pairs to $d$-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the ``kitchen sink'' approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a $k$-sparse linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ containing all the relevant features, and the goal is to learn a near-optimal policy in only poly$(k,\log d)$ interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist and can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible.
The Fisher-Kolmogorov equation is a diffusion-reaction PDE that is used to model the accumulation of prionic proteins, which are responsible for many different neurological disorders. Arguably the most important and most studied misfolded protein in the literature is Amyloid-$\beta$, responsible for the onset of Alzheimer's disease. Starting from medical images, we construct a reduced-order model based on a graph brain connectome. The reaction coefficient of the proteins is modelled as a random field, accounting for the many underlying physical processes that can hardly be measured. Its probability distribution is inferred by means of the Markov chain Monte Carlo (MCMC) method applied to clinical data. The resulting model is patient-specific and can be employed for predicting the disease's future development. Forward uncertainty quantification techniques (Monte Carlo and sparse grid stochastic collocation) are applied with the aim of quantifying the impact of the variability of the reaction coefficient on the progression of protein accumulation within the next 20 years.
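As a rough illustration of what a graph-based reduced-order model of this kind might look like, the sketch below integrates a network Fisher-Kolmogorov system with explicit Euler steps. Everything specific here is an assumption rather than the authors' model: the form $\dot{c} = -\kappa L c + \alpha\, c(1-c)$ with graph Laplacian $L$, the three-node graph, the parameter values, and the use of a single deterministic reaction coefficient `reaction` in place of a random field.

```python
def simulate_graph_fk(adjacency, c0, diffusion, reaction, dt, n_steps):
    """Explicit Euler for dc/dt = -diffusion * L c + reaction * c * (1 - c).

    adjacency: symmetric matrix of edge weights (hypothetical connectome).
    c0: initial concentration per node, each value in [0, 1].
    """
    n = len(c0)
    c = list(c0)
    for _ in range(n_steps):
        new_c = []
        for i in range(n):
            # (L c)_i = sum_j w_ij * (c_i - c_j): graph Laplacian applied to c
            lap = sum(adjacency[i][j] * (c[i] - c[j]) for j in range(n))
            growth = reaction * c[i] * (1.0 - c[i])  # logistic reaction term
            new_c.append(c[i] + dt * (-diffusion * lap + growth))
        c = new_c
    return c


# Hypothetical 3-node path graph, seeded with protein at node 0 only.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
c0 = [0.1, 0.0, 0.0]
c_final = simulate_graph_fk(adjacency, c0, diffusion=0.1, reaction=0.5,
                            dt=0.01, n_steps=2000)
print(c_final)
```

A forward uncertainty study in this spirit would rerun such a simulation for many sampled values of the reaction coefficient and summarize the spread of the resulting concentrations.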
Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this article we derive new sufficient and necessary conditions to ensure (i) and (ii). For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels and for controlling convergence with bounded kernels. We use these results on $\mathbb{R}^d$ to substantially broaden the known conditions for KSD separation and convergence control and to develop the first KSDs known to exactly metrize weak convergence to P. Along the way, we highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.
We propose a shared semantic map architecture to dynamically construct and configure Model Predictive Controllers (MPCs) that solve navigation problems for multiple robotic agents sharing parts of the same environment. The navigation task is represented as a route, i.e., a sequence of semantically labeled areas in the map that must be traversed in order. Each semantic label represents one or more constraints on the robots' motion behaviour in that area. The advantages of this approach are: (i) an MPC-based motion controller in each individual robot can be (re-)configured, at runtime, with the locally and temporally relevant parameters; (ii) the application can influence, also at runtime, the navigation behaviour of the robots, just by adapting the semantic labels; and (iii) the robots can reason about their need for coordination by analyzing over which horizon in time and space their routes overlap. The paper provides simulations of various representative situations, showing that runtime configuration of the MPC drastically decreases computation time, while retaining task execution performance similar to an approach in which each robot always includes all other robots in its MPC computations.
We revisit the noisy binary search model of Karp and Kleinberg, in which we have $n$ coins with unknown probabilities $p_i$ that we can flip. The coins are sorted by increasing $p_i$, and we would like to find where the probability crosses (to within $\varepsilon$) a target value $\tau$. This generalizes the fixed-noise model of Burnashev and Zigangirov, in which $p_i = \frac{1}{2} \pm \varepsilon$, to a setting where coins near the target may be indistinguishable from it. Karp and Kleinberg showed that $\Theta(\frac{1}{\varepsilon^2} \log n)$ samples are necessary and sufficient for this task. We produce a practical algorithm by solving two theoretical challenges: high-probability behavior and sharp constants. We give an algorithm that succeeds with probability $1-\delta$ from \[ \frac{1}{C_{\tau, \varepsilon}} \cdot \left(\lg n + O(\log^{2/3} n \log^{1/3} \frac{1}{\delta} + \log \frac{1}{\delta})\right) \] samples, where $C_{\tau, \varepsilon}$ is the optimal such constant achievable. For $\delta > n^{-o(1)}$ this is within $1 + o(1)$ of optimal, and for $\delta \ll 1$ it is the first bound within constant factors of optimal.
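To make the problem setup concrete, here is a minimal sketch of a naive baseline, not the algorithm of this work: ordinary binary search in which each comparison is replaced by a fixed number of flips of the middle coin. The function names, the example coin probabilities, and the choice of 200 flips per comparison are all illustrative assumptions. The baseline does not aim for the sharp constants discussed above and behaves well only when coins are well separated from $\tau$.

```python
import random


def flip(p, rng):
    """Flip a coin with heads probability p."""
    return rng.random() < p


def naive_noisy_binary_search(probs, tau, flips_per_coin, seed=0):
    """Return the estimated smallest index i with p_i >= tau.

    probs is hidden from the learner in the real model; here it is only
    used to simulate coin flips.
    """
    rng = random.Random(seed)
    lo, hi = 0, len(probs)
    while lo < hi:
        mid = (lo + hi) // 2
        heads = sum(flip(probs[mid], rng) for _ in range(flips_per_coin))
        if heads / flips_per_coin < tau:
            lo = mid + 1   # coin mid looks below the target: go right
        else:
            hi = mid       # coin mid looks at or above the target: go left
    return lo


# Hypothetical sorted coins with a clear crossing of tau = 0.5 at index 3.
coins = [0.05, 0.10, 0.15, 0.85, 0.90, 0.95]
crossing = naive_noisy_binary_search(coins, tau=0.5, flips_per_coin=200)
print(crossing)
```

This baseline spends the same number of flips at every step regardless of how informative they are, which is one reason its sample cost can be far from the bound stated above.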
In this work, we demonstrate the application of a first-order Taylor expansion to approximate a generic function $F: \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$ and utilize it in language modeling. To enhance the basic Taylor expansion, we introduce iteration and piecewise modeling, leading us to name the algorithm the Iterative Piecewise Affine (IPA) approximation. The final algorithm exhibits interesting resemblances to the Transformer decoder architecture. By comparing parameter arrangements in IPA and Transformers, we observe strikingly similar performance, with IPA outperforming Transformers by 1.5\% in the next token prediction task with cross-entropy loss for smaller sequence lengths.
Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
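The role of $2/\eta$ for plain GD can be seen on a one-dimensional quadratic, which is the local approximation mentioned above. The sketch below is only an illustration of that textbook calculation and does not cover the SAM case; the function name and the numerical values are assumptions. On $f(x) = \frac{\lambda}{2} x^2$ the GD update is $x \leftarrow (1 - \eta\lambda)\,x$, so the iterates shrink when $\lambda < 2/\eta$ and blow up when $\lambda > 2/\eta$.

```python
def gd_on_quadratic(sharpness, eta, x0=1.0, steps=100):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return |x| at the end.

    The second derivative of f equals `sharpness`, so it plays the role of
    the operator norm of the Hessian in this one-dimensional example.
    """
    x = x0
    for _ in range(steps):
        grad = sharpness * x
        x = x - eta * grad   # equivalent to x *= (1 - eta * sharpness)
    return abs(x)


eta = 0.1                      # step size, so the edge is 2 / eta = 20
below_edge = gd_on_quadratic(sharpness=19.0, eta=eta)
above_edge = gd_on_quadratic(sharpness=21.0, eta=eta)
print(below_edge, above_edge)  # contracts below the edge, diverges above it
```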
A range family $\mathcal{R}$ is a family of subsets of $\mathbb{R}^d$, like all halfplanes, or all unit disks. Given a range family $\mathcal{R}$, we consider the $m$-uniform range capturing hypergraphs $\mathcal{H}(V,\mathcal{R},m)$ whose vertex-sets $V$ are finite sets of points in $\mathbb{R}^d$ with any $m$ vertices forming a hyperedge $e$ whenever $e = V \cap R$ for some $R \in \mathcal{R}$. Given additionally an integer $k \geq 2$, we seek to find the minimum $m = m_{\mathcal{R}}(k)$ such that every $\mathcal{H}(V,\mathcal{R},m)$ admits a polychromatic $k$-coloring of its vertices, that is, where every hyperedge contains at least one point of each color. Clearly, $m_{\mathcal{R}}(k) \geq k$ and the gold standard is an upper bound $m_{\mathcal{R}}(k) = O(k)$ that is linear in $k$. A $t$-shallow hitting set in $\mathcal{H}(V,\mathcal{R},m)$ is a subset $S \subseteq V$ such that $1 \leq |e \cap S| \leq t$ for each hyperedge $e$; i.e., every hyperedge is hit at least once but at most $t$ times by $S$. We show for several range families $\mathcal{R}$ the existence of $t$-shallow hitting sets in every $\mathcal{H}(V,\mathcal{R},m)$ with $t$ being a constant only depending on $\mathcal{R}$. This in particular proves that $m_{\mathcal{R}}(k) \leq tk = O(k)$ in such cases, improving previous polynomial bounds in $k$. Particularly, we prove this for the range families of all axis-aligned strips in $\mathbb{R}^d$, all bottomless and topless rectangles in $\mathbb{R}^2$, and for all unit-height axis-aligned rectangles in $\mathbb{R}^2$.
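The two definitions above, a $t$-shallow hitting set and a polychromatic $k$-coloring, can be checked mechanically on a small instance. The sketch below is only a definition checker under illustrative assumptions: hyperedges are given explicitly as Python sets of point indices, and the example takes six points on a line with every three consecutive points forming a hyperedge. It does not construct hitting sets for any range family.

```python
def is_t_shallow_hitting_set(hyperedges, hitting_set, t):
    """True if every hyperedge e satisfies 1 <= |e & hitting_set| <= t."""
    return all(1 <= len(e & hitting_set) <= t for e in hyperedges)


def is_polychromatic(hyperedges, coloring, k):
    """True if every hyperedge contains a point of each color 0..k-1."""
    colors = set(range(k))
    return all(colors <= {coloring[v] for v in e} for e in hyperedges)


# Hypothetical 3-uniform example: points 0..5 on a line, and each window of
# three consecutive points is a hyperedge.
hyperedges = [{0, 1, 2}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5}]

shallow = is_t_shallow_hitting_set(hyperedges, {2, 5}, t=1)
too_dense = is_t_shallow_hitting_set(hyperedges, {0, 1, 2, 3, 4, 5}, t=2)
coloring = {v: v % 3 for v in range(6)}
poly = is_polychromatic(hyperedges, coloring, k=3)
print(shallow, too_dense, poly)
```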
We consider the classical Shiryaev--Roberts martingale diffusion, $(R_t)_{t\ge0}$, restricted to the interval $[0,A]$, where $A>0$ is a preset absorbing boundary. We take yet another look at the well-known phenomenon of quasi-stationarity (time-invariant probabilistic behavior, conditional on no absorption hitherto) exhibited by the diffusion in the temporal limit, as $t\to+\infty$, for each $A>0$. We obtain new upper and lower bounds for the quasi-stationary distribution's probability density function (pdf), $q_{A}(x)$; the bounds vary in the trade-off between simplicity and tightness. The bounds imply directly the expected result that $q_{A}(x)$ converges to the pdf, $h(x)$, of the diffusion's stationary distribution, as $A\to+\infty$; the convergence is pointwise, for all $x\ge0$. The bounds also yield an explicit upper bound for the gap between $q_{A}(x)$ and $h(x)$ for a fixed $x$. By virtue of integration, the bounds for the pdf $q_{A}(x)$ translate into new bounds for the corresponding cumulative distribution function (cdf), $Q_{A}(x)$. All of our results are established explicitly, using certain recent monotonicity properties of the modified Bessel $K$ function involved in the exact closed-form formula for $q_{A}(x)$ recently obtained by Polunchenko (2017). We conclude with a discussion of potential applications of our results in quickest change-point detection: our bounds allow for a very accurate performance analysis of the so-called randomized Shiryaev--Roberts--Pollak change-point detection procedure.
Generative Flow Networks (GFlowNets), a class of generative models over discrete and structured sample spaces, have been previously applied to the problem of inferring the marginal posterior distribution over the directed acyclic graph (DAG) of a Bayesian Network, given a dataset of observations. Based on recent advances extending this framework to non-discrete sample spaces, we propose in this paper to approximate the joint posterior over not only the structure of a Bayesian Network, but also the parameters of its conditional probability distributions. We use a single GFlowNet whose sampling policy follows a two-phase process: the DAG is first generated sequentially one edge at a time, and then the corresponding parameters are picked once the full structure is known. Since the parameters are included in the posterior distribution, this leaves more flexibility for the local probability models of the Bayesian Network, making our approach applicable even to non-linear models parametrized by neural networks. We show that our method, called JSP-GFN, offers an accurate approximation of the joint posterior, while comparing favorably against existing methods on both simulated and real data.
We consider the performance of a least-squares regression model, as judged by out-of-sample $R^2$. Shapley values give a fair attribution of the performance of a model to its input features, taking into account interdependencies between features. Evaluating the Shapley values exactly requires solving a number of regression problems that is exponential in the number of features, so a Monte Carlo-type approximation is typically used. We focus on the special case of least-squares regression models, where several tricks can be used to compute and evaluate regression models efficiently. These tricks give a substantial speedup, allowing many more Monte Carlo samples to be evaluated and thus achieving better accuracy. We refer to our method as least-squares Shapley performance attribution (LS-SPA), and describe our open-source implementation.
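The generic Monte Carlo approximation mentioned above can be sketched as permutation sampling: draw random feature orderings and average each feature's marginal contribution to the performance metric. The code below is a minimal illustration of that generic estimator, not LS-SPA and not its least-squares speedups. The names `shapley_monte_carlo` and `toy_r2` are hypothetical, and `toy_r2` is a hand-written stand-in for the out-of-sample $R^2$ of a model fit on a feature subset.

```python
import random


def shapley_monte_carlo(value, n_features, n_permutations=200, seed=0):
    """Estimate Shapley values of `value` by sampling feature permutations.

    value: maps a frozenset of feature indices to a performance score.
    """
    rng = random.Random(seed)
    totals = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_permutations):
        rng.shuffle(order)
        chosen = set()
        prev = value(frozenset(chosen))
        for j in order:
            chosen.add(j)
            cur = value(frozenset(chosen))
            totals[j] += cur - prev   # marginal contribution of feature j
            prev = cur
    return [total / n_permutations for total in totals]


def toy_r2(subset):
    """Hypothetical stand-in for out-of-sample R^2 on a feature subset."""
    score = 0.0
    if 0 in subset:
        score += 0.5                  # feature 0 is informative on its own
    if 1 in subset or 2 in subset:
        score += 0.3                  # features 1 and 2 are redundant
    return score


attributions = shapley_monte_carlo(toy_r2, n_features=3)
print(attributions)
```

Within each sampled permutation the marginal contributions telescope, so the estimated attributions always sum to the score of the full feature set minus that of the empty set, even though the split between the redundant features 1 and 2 is only approximate.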