Amongst Markov chain Monte Carlo algorithms, Hamiltonian Monte Carlo (HMC) is often the algorithm of choice for complex, high-dimensional target distributions; however, its efficiency is notoriously sensitive to the choice of the integration-time tuning parameter, $T$. When integrating both forward and backward in time using the same leapfrog integration step as HMC, the set of local maxima in the potential along a path, or apogees, is the same whatever point (position and momentum) along the path is chosen to initialise the integration. We present the Apogee to Apogee Path Sampler (AAPS), which utilises this invariance to create a simple yet generic methodology for constructing a path, proposing a point from it and accepting or rejecting that proposal so as to target the intended distribution. We demonstrate empirically that AAPS has a similar efficiency to HMC but is much more robust to the setting of its equivalent tuning parameter, the number of apogees that the path crosses.
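To make the invariance concrete, here is a minimal sketch, not the authors' implementation, of a leapfrog integrator that records apogees as sign changes of $\mathrm{d}U/\mathrm{d}t = \nabla U(q)^\top p$; a standard-normal target is assumed purely for illustration.

```python
import numpy as np

def grad_U(q):
    """Gradient of the potential U(q) = q.q/2 for a standard-normal target."""
    return q

def leapfrog_path(q, p, eps, n_steps):
    """Leapfrog integration that also counts apogees: points where
    dU/dt = grad_U(q) . p switches sign from positive to negative,
    i.e. local maxima of the potential along the path."""
    n_apogees = 0
    prev_sign = np.sign(grad_U(q) @ p)
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad_U(q)   # half step in momentum
        q = q + eps * p                 # full step in position
        p = p - 0.5 * eps * grad_U(q)   # half step in momentum
        sign = np.sign(grad_U(q) @ p)
        if prev_sign > 0 and sign < 0:  # potential stopped rising: apogee
            n_apogees += 1
        prev_sign = sign
    return q, p, n_apogees

rng = np.random.default_rng(0)
q0, p0 = rng.normal(size=5), rng.normal(size=5)
# Integrating forward from (q0, p0), or backward (momentum negated) from any
# other point on the same path, crosses the same set of apogees.
*_, k = leapfrog_path(q0, p0, eps=0.1, n_steps=200)
```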
State-of-the-art reinforcement learning has enabled training agents on tasks of ever increasing complexity. However, the current paradigm tends to favor training agents from scratch on every new task or on collections of tasks with a view towards generalizing to novel task configurations. The former suffers from poor data efficiency while the latter is difficult when test tasks are out-of-distribution. Agents that can effectively transfer their knowledge about the world offer a potential solution to these issues. In this paper, we investigate transfer learning in the context of model-based agents. Specifically, we aim to understand when exactly environment models have an advantage and why. We find that a model-based approach outperforms controlled model-free baselines for transfer learning. Through ablations, we show that both the policy and the dynamics model learnt through exploration matter for successful transfer. We demonstrate our results across three domains which vary in their requirements for transfer: in-distribution procedural (Crafter), in-distribution identical (RoboDesk), and out-of-distribution (Meta-World). Our results show that intrinsic exploration combined with environment models presents a viable direction towards agents that are self-supervised and able to generalize to novel reward functions.
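The abstract leaves the exploration signal unspecified; as a hedged illustration, a common way to pair intrinsic exploration with environment models is to reward disagreement across an ensemble of learned dynamics models. The class and dimensions below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

class DynamicsEnsemble:
    """Toy linear dynamics models; disagreement across ensemble members
    serves as an intrinsic exploration bonus (a common instantiation,
    not necessarily the paper's exact recipe)."""
    def __init__(self, n_models, state_dim, action_dim, rng):
        self.W = [rng.normal(scale=0.1, size=(state_dim, state_dim + action_dim))
                  for _ in range(n_models)]

    def predict(self, s, a):
        x = np.concatenate([s, a])
        return np.stack([W @ x for W in self.W])   # (n_models, state_dim)

    def intrinsic_reward(self, s, a):
        preds = self.predict(s, a)
        return preds.var(axis=0).mean()            # disagreement = variance

rng = np.random.default_rng(0)
ens = DynamicsEnsemble(n_models=5, state_dim=4, action_dim=2, rng=rng)
r_int = ens.intrinsic_reward(np.zeros(4), np.ones(2))
```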
Learning from demonstration (LfD) has the potential to greatly increase the applicability of robotic manipulators in modern industrial applications. Recent progress in LfD methods has put more emphasis on making the learning robust than on guiding the demonstration itself in order to improve robustness. The latter is particularly important to consider when the target system reproducing the motion is structurally different from the demonstration system, as some demonstrated motions may not be reproducible. In light of this, this paper introduces a new guided learning from demonstration paradigm in which an interactive graphical user interface (GUI) guides the user during demonstration, preventing them from demonstrating non-reproducible motions. The key aspect of our approach is determining the space of reproducible motions based on a motion planning framework which finds regions in the task space where trajectories are guaranteed to be of bounded length. We evaluate our method on two different setups with a six-degree-of-freedom (DOF) UR5 as the target system. First, our method is validated using a seven-DOF Sawyer as the demonstration system. Then, an extensive user study is carried out in which several participants are asked to demonstrate, with and without guidance, a mock welding task using a hand-held tool tracked by a VICON system. With guidance, users always carried out the task successfully, compared with only 44% of the time without guidance.
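As a self-contained toy stand-in for the reproducibility check, consider flagging demonstrated waypoints by whether a 2-link planar arm can reach them at all; the paper's actual criterion is a motion-planning test for task-space regions with bounded-length trajectories, so this annulus test is only a sketch of the GUI's feedback loop.

```python
import numpy as np

def reproducible(target_xy, l1=0.4, l2=0.35):
    """Toy 2-link planar-arm stand-in for the reproducibility check: a
    demonstrated point is flagged reproducible only if it lies in the
    annulus the target arm can reach. The paper's actual criterion is a
    motion-planning test for bounded-length task-space trajectories."""
    r = np.linalg.norm(target_xy)
    return abs(l1 - l2) <= r <= l1 + l2

# A GUI could colour demonstrated waypoints by this flag in real time.
waypoints = np.array([[0.1, 0.0], [0.5, 0.3], [1.0, 0.9]])
flags = [reproducible(w) for w in waypoints]   # [True, True, False]
```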
Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills that aim to reach diverse goals. In particular, offline GCRL requires only purely pre-collected datasets to perform training tasks, without additional interactions with the environment. Although offline GCRL has become increasingly prevalent and many previous works have demonstrated its empirical success, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset only covers the policy we aim to learn. In this paper, we propose a novel provably efficient algorithm with general function approximation whose sample complexity is $\tilde{O}(\mathrm{poly}(1/\epsilon))$, where $\epsilon$ is the desired suboptimality of the learned policy. Our algorithm only requires nearly minimal assumptions on the dataset (single-policy concentrability) and on the function class (realizability). Moreover, our algorithm consists of two uninterleaved optimization steps, which we refer to as $V$-learning and policy learning, and is computationally stable since it does not involve minimax optimization. To the best of our knowledge, this is the first algorithm with general function approximation and single-policy concentrability that is both statistically efficient and computationally stable.
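The two uninterleaved steps can be caricatured on a tabular toy: first fit $V$ by regression on the offline data alone, then score logged actions by their one-step advantage and extract a policy, with no minimax optimization. This is a schematic sketch of that pattern under invented toy data, not the paper's algorithm or function class.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 10, 4, 0.9
# Offline goal-conditioned dataset of (state, action, next_state, goal)
# tuples from a random logging policy; reward = goal-reached indicator.
D = [(rng.integers(nS), rng.integers(nA), rng.integers(nS), rng.integers(nS))
     for _ in range(5000)]
reward = lambda s, g: float(s == g)

# Step 1 ("V-learning"): fit V(s, g) by repeated regression on D alone.
V = np.zeros((nS, nS))
for _ in range(100):
    tgt, cnt = np.zeros_like(V), np.full_like(V, 1e-9)
    for s, a, s2, g in D:
        tgt[s, g] += reward(s, g) + gamma * V[s2, g]
        cnt[s, g] += 1
    V = tgt / cnt   # crude empirical Bellman averaging

# Step 2 (policy learning): score logged actions by their one-step
# advantage under the fitted V and act greedily; no minimax optimization.
score = np.zeros((nS, nS, nA))
for s, a, s2, g in D:
    score[s, g, a] += reward(s, g) + gamma * V[s2, g] - V[s, g]
policy = score.argmax(axis=-1)    # goal-conditioned deterministic policy
```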
Compressed sensing allows for the recovery of sparse signals from few measurements, whose number is proportional to the sparsity of the unknown signal, up to logarithmic factors. The classical theory typically considers either random linear measurements or subsampled isometries and has found many applications, including accelerated magnetic resonance imaging, which is modeled by the subsampled Fourier transform. In this work, we develop a general theory of infinite-dimensional compressed sensing for abstract inverse problems, possibly ill-posed, involving an arbitrary forward operator. This is achieved by considering a generalized restricted isometry property, and a quasi-diagonalization property of the forward map. As a notable application, for the first time, we obtain rigorous recovery estimates for the sparse Radon transform (i.e., with a finite number of angles $\theta_1,\dots,\theta_m$), which models computed tomography. In the case when the unknown signal is $s$-sparse with respect to an orthonormal basis of compactly supported wavelets, we prove exact recovery under the condition \[ m\gtrsim s, \] up to logarithmic factors.
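For intuition on the $m \gtrsim s$ regime, here is a generic sparse-recovery sketch with Gaussian measurements and iterative soft-thresholding (ISTA); the paper's setting, with ill-posed forward operators, the sparse Radon transform, and wavelet sparsity, is substantially more general than this toy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 256, 5
m = 4 * s * int(np.log(n))          # m ~ s log n measurements

x = np.zeros(n)
x[rng.choice(n, s, replace=False)] = rng.normal(size=s)   # s-sparse signal
A = rng.normal(size=(m, n)) / np.sqrt(m)                  # Gaussian sensing map
y = A @ x

# ISTA: iterative soft-thresholding for min ||Ax - y||^2 + lam * ||x||_1
lam, step = 1e-3, 1.0 / np.linalg.norm(A, 2) ** 2
xh = np.zeros(n)
for _ in range(2000):
    z = xh - step * A.T @ (A @ xh - y)
    xh = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print(np.linalg.norm(xh - x) / np.linalg.norm(x))   # small relative error
```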
We introduce the framework of performative reinforcement learning, where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction (Perdomo et al., 2020), we introduce the concept of a performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof utilizes the dual perspective of the reinforcement learning problem and may be of independent interest in analyzing the convergence of other algorithms with decision-dependent environments. We then extend our results to the setting where the learner performs only gradient ascent steps instead of fully optimizing the objective, and to the setting where the learner has access to a finite number of trajectories from the changed environment. For both settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate the dependence of convergence on various parameters, e.g., regularization, smoothness, and the number of samples.
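A toy version of the repeated-optimization scheme: at each round the deployed policy shifts the environment, and we re-solve an entropy-regularized tabular MDP until the updates stop changing, i.e., until a performatively stable policy is reached. The performative response below acts on rewards only and is an invented stand-in; the paper also lets the transition dynamics shift.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, tau, alpha = 6, 3, 0.9, 1.0, 0.5

R0 = rng.normal(size=(nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # fixed dynamics in this toy

def soft_solve(R, iters=500):
    """Entropy(tau)-regularized value iteration; returns a softmax policy."""
    V = np.zeros(nS)
    for _ in range(iters):
        Q = R + gamma * P @ V
        V = tau * np.log(np.exp(Q / tau).sum(axis=1))
    Q = R + gamma * P @ V
    pi = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)
    return pi / pi.sum(axis=1, keepdims=True)

# Repeated optimization: deploy pi, environment responds, re-optimize.
pi = np.full((nS, nA), 1.0 / nA)
for t in range(50):
    R_t = R0 - alpha * pi      # toy performative map: popular actions
    new_pi = soft_solve(R_t)   # become less rewarding once deployed
    gap = np.abs(new_pi - pi).max()
    pi = new_pi
print(f"final update gap: {gap:.2e}")   # ~0: performatively stable policy
```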
Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.
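Policy-value misalignment as described can be probed by checking whether the policy head's preferred move agrees with the move ranked best by a one-step value lookahead. The sketch below is schematic; AlphaZero proper evaluates positions through search trees, and the paper's exact metric may differ.

```python
import numpy as np

def misaligned(policy_probs, next_state_values):
    """A state is policy-value misaligned when the policy head's top move
    differs from the move ranked best by the value head one step ahead.
    `next_state_values` holds V(s') from the opponent's perspective, so
    they are negated to rank moves for the player about to act."""
    policy_move = int(np.argmax(policy_probs))
    value_move = int(np.argmax(-np.asarray(next_state_values)))
    return policy_move != value_move

# Toy check on a three-move state: the policy prefers move 0, but the
# value head rates the position after move 1 as worst for the opponent.
print(misaligned([0.7, 0.2, 0.1], [0.4, -0.9, 0.3]))   # True
```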
Reinforcement Learning is a powerful tool to model decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring-state-based, model-free exploration, guided by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better choices when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in a discrete environment by 49\% in terms of evaluation reward return.
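A hedged sketch of the survey idea, with names and details that are illustrative rather than the authors' exact ${\rho}$-explore: instead of a uniformly random exploratory action, perturb the current state within a radius $\rho$ and act greedily on the Q-values averaged over the perturbed copies.

```python
import numpy as np

def survey_explore(q_fn, state, rho=0.1, n_samples=8, rng=None):
    """Pick an exploratory action by surveying nearby states: perturb
    `state` within an L-infinity ball of radius rho, average the Q-values
    of the perturbed copies, and act greedily on that average.
    Illustrative sketch only; names and details are not the authors'."""
    rng = rng or np.random.default_rng()
    neighbors = state + rng.uniform(-rho, rho, size=(n_samples, state.size))
    avg_q = np.mean([q_fn(s) for s in neighbors], axis=0)
    return int(np.argmax(avg_q))

# Toy Q-function: a fixed linear map from a 4-d state to 3 action values.
W = np.random.default_rng(0).normal(size=(4, 3))
action = survey_explore(lambda s: s @ W, np.zeros(4))
```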
Estimating time-varying graphical models is of paramount importance in various social, financial, biological, and engineering systems, since the evolution of such networks can be utilized, for example, to spot trends, detect anomalies, predict vulnerability, and evaluate the impact of interventions. Existing methods require extensive tuning of parameters that control the graph sparsity and temporal smoothness. Furthermore, these methods are computationally burdensome, with time complexity $O(NP^3)$ for $P$ variables and $N$ time points. As a remedy, we propose a low-complexity tuning-free Bayesian approach, named BASS. Specifically, we impose temporally-dependent spike-and-slab priors on the graphs such that they are sparse and vary smoothly across time. A variational inference algorithm is then derived to learn the graph structures from the data automatically. Owing to the pseudo-likelihood and the mean-field approximation, the time complexity of BASS is only $O(NP^2)$. Additionally, by identifying the frequency-domain resemblance to the time-varying graphical models, we show that BASS can be extended to learning frequency-varying inverse spectral density matrices, and yields graphical models for multivariate stationary time series. Numerical results on both synthetic and real data show that BASS can better recover the underlying true graphs, while being more efficient than the existing methods, especially in high-dimensional cases.
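The pseudo-likelihood behind BASS's efficiency decouples the joint graph estimate into per-node conditional regressions. Below is a minimal static illustration in the style of neighbourhood selection, with BASS's temporally-dependent spike-and-slab priors and variational updates omitted; the hard threshold is an illustrative stand-in for them.

```python
import numpy as np

def pseudo_likelihood_graph(X, thresh=0.2):
    """One time slice of a pseudo-likelihood graph estimate: regress each
    variable on all the others and symmetrize the support. BASS keeps
    this node-wise decomposition but replaces the hard threshold with
    temporally-dependent spike-and-slab priors learned variationally."""
    n, p = X.shape
    B = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        coef, *_ = np.linalg.lstsq(X[:, others], X[:, j], rcond=None)
        B[j, others] = coef
    A = (np.abs(B) + np.abs(B.T)) / 2 > thresh   # symmetrized adjacency
    np.fill_diagonal(A, False)
    return A

X = np.random.default_rng(0).normal(size=(200, 10))
adj = pseudo_likelihood_graph(X)                 # boolean adjacency matrix
```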
The kernel Maximum Mean Discrepancy~(MMD) is a popular multivariate distance metric between distributions that has found utility in two-sample testing. The usual kernel-MMD test statistic is a degenerate U-statistic under the null, and thus it has an intractable limiting distribution. Hence, to design a level-$\alpha$ test, one usually selects the rejection threshold as the $(1-\alpha)$-quantile of the permutation distribution. The resulting nonparametric test has finite-sample validity but suffers from large computational cost, since every permutation takes quadratic time. We propose the cross-MMD, a new quadratic-time MMD test statistic based on sample-splitting and studentization. We prove that under mild assumptions, the cross-MMD has a limiting standard Gaussian distribution under the null. Importantly, we also show that the resulting test is consistent against any fixed alternative, and when using the Gaussian kernel, it has minimax rate-optimal power against local alternatives. For large sample sizes, our new cross-MMD provides a significant speedup over the MMD, for only a slight loss in power.
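A sketch of the sample-splitting and studentization construction as the abstract describes it, for equal sample sizes and a Gaussian kernel; the paper's exact centring and scaling may differ in detail.

```python
import numpy as np

def gauss_kernel(A, B, bw=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw**2))

def cross_mmd_stat(X, Y, bw=1.0):
    """Sample-splitting + studentization sketch of the cross-MMD for
    equal sample sizes; under H0 it is approximately N(0, 1) per the
    abstract. Still quadratic time, but no permutations are needed."""
    n = min(len(X), len(Y)) // 2
    X1, X2, Y1, Y2 = X[:n], X[n:2*n], Y[:n], Y[n:2*n]
    h = (gauss_kernel(X1, X2, bw).mean(1) - gauss_kernel(X1, Y2, bw).mean(1)
         - gauss_kernel(Y1, X2, bw).mean(1) + gauss_kernel(Y1, Y2, bw).mean(1))
    return np.sqrt(n) * h.mean() / h.std(ddof=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
Y = rng.normal(loc=0.5, size=(400, 2))
T = cross_mmd_stat(X, Y)       # reject at level alpha if T > z_{1-alpha}
```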
Model complexity remains a key feature of any proposed data generating mechanism. Measures of complexity can be extended to complex patterns such as signals in time and graphs. In this paper, we are concerned with the well-studied class of exchangeable graphs. Exchangeability for graphs implies a distributional invariance under node permutation and is a suitable default model that can be widely used for network data. For this well-studied class of graphs, we choose to quantify model complexity via the (Shannon) entropy, resulting in the graphon entropy. We estimate the entropy of the generating mechanism of a given graph, instead of choosing a specific graph descriptor suitable only for one graph generating mechanism. In this manner, we naturally consider the global properties of a graph and capture its important graph-theoretic and topological properties. For an increasingly complex set of generating mechanisms, we propose a set of estimators of graphon entropy as measures of complexity for real-world graphs. We determine the large-sample properties of such estimators and discuss their usage for characterizing evolving real-world graphs.
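In the simplest exchangeable model, a constant graphon $w(x,y) \equiv p$ (Erdős–Rényi), the graphon entropy reduces to the Bernoulli entropy of the edge density, so a plug-in estimator is immediate. This degenerate case is only the first rung of the increasingly complex mechanisms the paper treats.

```python
import numpy as np

def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def er_graphon_entropy(A):
    """Plug-in estimate for the constant graphon w = p: the entropy of
    the generating mechanism is h(p), estimated via the empirical edge
    density. Richer graphons need the paper's estimators, not this toy."""
    n = A.shape[0]
    p_hat = A[np.triu_indices(n, k=1)].mean()
    return bernoulli_entropy(p_hat)

rng = np.random.default_rng(0)
n, p = 300, 0.1
A = (rng.random((n, n)) < p).astype(int)
A = np.triu(A, 1); A = A + A.T                  # symmetric simple graph
print(er_graphon_entropy(A), bernoulli_entropy(p))   # estimate vs truth
```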