We consider the problem of offline reinforcement learning where only a set of system transitions is made available for policy optimization. Following recent advances in the field, we consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts. This approach is vulnerable to exploiting model errors which can lead to catastrophic failures on the real system. The standard solution is to rely on ensembles for uncertainty heuristics and to avoid exploiting the model where it is too uncertain. We challenge the popular belief that we must resort to ensembles by showing that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark. We also analyze static metrics of model-learning and conclude on the important model properties for the final performance of the agent.
We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.
The transformer is a fundamental building block in deep learning, and the attention mechanism is the transformer's core component. Self-supervised speech representation learning (SSRL) represents a popular use-case for the transformer architecture. Due to transformers' acausal behavior, the use of transformers for SSRL has been predominantly focused on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The attention module proposed in this paper includes two components, streaming attention (SA) and low-latency streaming attention (LLSA). The SA represents our proposal for an efficient streaming SSRL implementation, while the LLSA solves the latency build-up problem of other streaming attention architectures, such as the masked acausal attention (MAA), guaranteeing a latency equal to one layer even when multiple layers are stacked. We present a comparative analysis between the vanilla attention, which we will refer here as acausal attention (AA), the SA, and the LLSA, by training a streaming SSRL with automatic speech recognition as downstream task. When training on librispeech-clean-100 and testing on librispeech-test-clean, our low-latency attention module has a word error rate (WER) of 5.84%, which represents a significant improvement over the MAA (WER = 13.82%). Our implementation also reduces the inference latency from 1.92 to 0.16 seconds. The proposed low-latency module preserves many of the benefits of conventional acausal transformers, but also enables latency characteristics that make it applicable to real-time streaming applications.
When complex Bayesian models exhibit implausible behaviour, one solution is to assemble available information into an informative prior. Challenges arise as prior information is often only available for the observable quantity, or some model-derived marginal quantity, rather than directly pertaining to the natural parameters in our model. We propose a method for translating available prior information, in the form of an elicited distribution for the observable or model-derived marginal quantity, into an informative joint prior. Our approach proceeds given a parametric class of prior distributions with as yet undetermined hyperparameters, and minimises the difference between the supplied elicited distribution and corresponding prior predictive distribution. We employ a global, multi-stage Bayesian optimisation procedure to locate optimal values for the hyperparameters. Three examples illustrate our approach: a cure-fraction survival model, where censoring implies that the observable quantity is a priori a mixed discrete/continuous quantity; a setting in which prior information pertains to $R^{2}$ -- a model-derived quantity; and a nonlinear regression model.
This work presents an abstract framework for the design, implementation, and analysis of the multiscale spectral generalized finite element method (MS-GFEM), a particular numerical multiscale method originally proposed in [I. Babuska and R. Lipton, Multiscale Model.\;\,Simul., 9 (2011), pp.~373--406]. MS-GFEM is a partition of unity method employing optimal local approximation spaces constructed from local spectral problems. We establish a general local approximation theory demonstrating exponential convergence with respect to local degrees of freedom under certain assumptions, with explicit dependence on key problem parameters. Our framework applies to a broad class of multiscale PDEs with $L^{\infty}$-coefficients in both continuous and discrete, finite element settings, including highly indefinite problems (convection-dominated diffusion, as well as the high-frequency Helmholtz, Maxwell and elastic wave equations with impedance boundary conditions), and higher-order problems. Notably, we prove a local convergence rate of $O(e^{-cn^{1/d}})$ for MS-GFEM for all these problems, improving upon the $O(e^{-cn^{1/(d+1)}})$ rate shown by Babuska and Lipton. Moreover, based on the abstract local approximation theory for MS-GFEM, we establish a unified framework for showing low-rank approximations to multiscale PDEs. This framework applies to the aforementioned problems, proving that the associated Green's functions admit an $O(|\log\epsilon|^{d})$-term separable approximation on well-separated domains with error $\epsilon>0$. Our analysis improves and generalizes the result in [M. Bebendorf and W. Hackbusch, Numerische Mathematik, 95 (2003), pp.~1-28] where an $O(|\log\epsilon|^{d+1})$-term separable approximation was proved for Poisson-type problems.
Predictions in the form of probability distributions are crucial for decision-making. Quantile regression enables this within spatial interpolation settings for merging remote sensing and gauge precipitation data. However, ensemble learning of quantile regression algorithms remains unexplored in this context. Here, we address this gap by introducing nine quantile-based ensemble learners and applying them to large precipitation datasets. We employed a novel feature engineering strategy, reducing predictors to distance-weighted satellite precipitation at relevant locations, combined with location elevation. Our ensemble learners include six stacking and three simple methods (mean, median, best combiner), combining six individual algorithms: quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). These algorithms serve as both base learners and combiners within different stacking methods. We evaluated performance against QR using quantile scoring functions in a large dataset comprising 15 years of monthly gauge-measured and satellite precipitation in contiguous US (CONUS). Stacking with QR and QRNN yielded the best results across quantile levels of interest (0.025, 0.050, 0.075, 0.100, 0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800, 0.900, 0.925, 0.950, 0.975), surpassing the reference method by 3.91% to 8.95%. This demonstrates the potential of stacking to improve probabilistic predictions in spatial interpolation and beyond.
In many communication contexts, the capabilities of the involved actors cannot be known beforehand, whether it is a cell, a plant, an insect, or even a life form unknown to Earth. Regardless of the recipient, the message space and time scale could be too fast, too slow, too large, or too small and may never be decoded. Therefore, it pays to devise a way to encode messages agnostic of space and time scales. We propose the use of fractal functions as self-executable infinite-frequency carriers for sending messages, given their properties of structural self-similarity and scale invariance. We call it `fractal messaging'. Starting from a spatial embedding, we introduce a framework for a space-time scale-free messaging approach to this challenge. When considering a space and time-agnostic framework for message transmission, it would be interesting to encode a message such that it could be decoded at several spatio-temporal scales. Hence, the core idea of the framework proposed herein is to encode a binary message as waves along infinitely many frequencies (in power-like distributions) and amplitudes, transmit such a message, and then decode and reproduce it. To do so, the components of the Weierstrass function, a known fractal, are used as carriers of the message. Each component will have its amplitude modulated to embed the binary stream, allowing for a space-time-agnostic approach to messaging.
Data augmentation is arguably the most important regularization technique commonly used to improve generalization performance of machine learning models. It primarily involves the application of appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because of the time-consuming trial and error procedures for creating and testing different candidate augmentations and their hyperparameters manually. Automated data augmentation methods aim to automate the process. State-of-the-art approaches typically rely on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration and data synthesis techniques. We present extensive discussion of techniques for realizing each of the major subtasks of the data augmentation process: search space design, hyperparameter optimization and model evaluation. Finally, we carried out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.
Existing schemes for demonstrating quantum computational advantage are subject to various practical restrictions, including the hardness of verification and challenges in experimental implementation. Meanwhile, analog quantum simulators have been realized in many experiments to study novel physics. In this work, we propose a quantum advantage protocol based on single-step Feynman-Kitaev verification of an analog quantum simulation, in which the verifier need only run an $O(\lambda^2)$-time classical computation, and the prover need only prepare $O(1)$ samples of a history state and perform $O(\lambda^2)$ single-qubit measurements, for a security parameter $\lambda$. We also propose a near-term feasible strategy for honest provers and discuss potential experimental realizations.
Graph-centric artificial intelligence (graph AI) has achieved remarkable success in modeling interacting systems prevalent in nature, from dynamical systems in biology to particle physics. The increasing heterogeneity of data calls for graph neural architectures that can combine multiple inductive biases. However, combining data from various sources is challenging because appropriate inductive bias may vary by data modality. Multimodal learning methods fuse multiple data modalities while leveraging cross-modal dependencies to address this challenge. Here, we survey 140 studies in graph-centric AI and realize that diverse data types are increasingly brought together using graphs and fed into sophisticated multimodal models. These models stratify into image-, language-, and knowledge-grounded multimodal learning. We put forward an algorithmic blueprint for multimodal graph learning based on this categorization. The blueprint serves as a way to group state-of-the-art architectures that treat multimodal data by choosing appropriately four different components. This effort can pave the way for standardizing the design of sophisticated multimodal architectures for highly complex real-world problems.
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.