Particle filtering methods are well developed for continuous state-space models. When dealing with discrete spaces on bounded domains, particle filtering methods can still be applied to sample from and marginalise over the unknown hidden states. Nevertheless, problems such as particle degradation can arise in this context and be even more severe than they are within the continuous-state domain: proposed particles can easily be incompatible with the data, and the discrete system can often leave all particles with zero weight. However, if the boundaries of the discrete hidden space are known, then these can be used to prevent particle collapse. In this paper we introduce the Lifebelt Particle Filter (LBPF), a novel method for robust likelihood estimation when low-valued count data arise. The LBPF combines a standard particle filter with one (or more) \textit{lifebelt particles} which, by construction, tend to remain compatible with the data. A mixture of resampled and non-resampled particles allows for the preservation of the lifebelt particle, which, together with the remaining particle swarm, provides samples from the filtering distribution and can be used to generate estimates of the likelihood. The LBPF can be used within a pseudo-marginal scheme to draw inference on static parameters, $ \boldsymbol{\theta} $, governing a discrete state-space model with low-valued counts. We present an applied case: estimating a parameter governing the probabilities and timings of deaths and recoveries of hospitalised patients during an epidemic.
We study the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a unified theoretical and empirical analysis of how different properties of the data distribution influence the performance of Q-learning-based algorithms. We connect different lines of research, as well as validate and extend previous results. We start by reviewing theoretical bounds on the performance of approximate dynamic programming algorithms. We then introduce a novel four-state MDP specifically tailored to highlight the impact of the data distribution on the performance of Q-learning-based algorithms with function approximation, both online and offline. Finally, we experimentally assess the impact of the data distribution properties on the performance of two offline Q-learning-based algorithms under different environments. According to our results: (i) high entropy data distributions are well-suited for learning in an offline manner; and (ii) a certain degree of data diversity (data coverage) and data quality (closeness to optimal policy) are jointly desirable for offline learning.
Recent advances in Transformer architectures have driven their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering other practical issues, such as robustness to contaminated samples. Recent work by Nguyen et al. (2022) has shown that the self-attention mechanism, which is at the core of the Transformer architecture, can be viewed as a non-parametric estimator based on kernel density estimation (KDE). This motivates us to leverage a set of robust kernel density estimation methods for alleviating the issue of data contamination. Specifically, we introduce a series of self-attention mechanisms that can be incorporated into different Transformer architectures and discuss the special properties of each method. We then perform extensive empirical studies on language modeling and image classification tasks. Our methods demonstrate robust performance in multiple scenarios while maintaining competitive results on clean datasets.
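To make the kernel view of self-attention concrete, here is a minimal sketch. It is not one of the robust mechanisms introduced in the paper. It only checks a standard identity: when all keys share the same norm, softmax attention weights coincide with those of a Nadaraya-Watson kernel regression using a Gaussian kernel with $\sigma^2 = \sqrt{d}$. The function names, the toy data, and the unit-norm assumption on the keys are all assumptions made for illustration.

```python
import math
import random


def softmax_attention(q, keys, values):
    """Scaled dot-product attention output for a single query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    dim = len(values[0])
    return [sum(wj * v[c] for wj, v in zip(w, values)) / total for c in range(dim)]


def nadaraya_watson(q, keys, values, sigma2):
    """Kernel regression at q with kernel exp(-||q - k||^2 / (2 * sigma2))."""
    logw = [-sum((a - b) ** 2 for a, b in zip(q, k)) / (2.0 * sigma2) for k in keys]
    top = max(logw)
    w = [math.exp(l - top) for l in logw]
    total = sum(w)
    dim = len(values[0])
    return [sum(wj * v[c] for wj, v in zip(w, values)) / total for c in range(dim)]


def unit(v):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


rng = random.Random(0)
d = 4
keys = [unit([rng.gauss(0, 1) for _ in range(d)]) for _ in range(6)]  # equal-norm keys (assumption)
values = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]
query = [rng.gauss(0, 1) for _ in range(d)]

att = softmax_attention(query, keys, values)
nw = nadaraya_watson(query, keys, values, sigma2=math.sqrt(d))
print(max(abs(a - b) for a, b in zip(att, nw)))  # gap between the two outputs
```

The two outputs agree because expanding $\|q - k\|^2$ leaves only the $q \cdot k$ term after normalisation: the $\|q\|^2$ term is shared by all keys and the $\|k\|^2$ term is constant by assumption.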
This paper proposes an online, provably robust, and scalable Bayesian approach for changepoint detection. The resulting algorithm has key advantages over previous work: it provides provable robustness by leveraging the generalised Bayesian perspective, and also addresses the scalability issues of previous attempts. Specifically, the proposed generalised Bayesian formalism leads to conjugate posteriors whose parameters are available in closed form by leveraging diffusion score matching. The resulting algorithm is exact, can be updated through simple algebra, and is more than 10 times faster than its closest competitor.
We consider the differentially private estimation of multiple quantiles (MQ) of a distribution from a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem. We establish that the resulting method is closely related to the recently published ad hoc algorithm JointExp. In particular, they share the same computational complexity and a similar efficiency. We prove the statistical consistency of these two algorithms for continuous distributions. Furthermore, we demonstrate both theoretically and empirically that this method performs poorly on peaked distributions, with a degradation that can become catastrophic in the presence of atoms. Its smoothed version (i.e. obtained by applying a max kernel to its output density) would solve this problem, but implementing it remains an open challenge. As a proxy, we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.
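As a rough illustration of why atoms are problematic, the sketch below implements the classical exponential mechanism for a single quantile. It is neither JointExp, the IS mechanism, nor HSJointExp. The function name `dp_quantile`, the clipping bounds `lo` and `hi`, and the toy datasets are assumptions made for this example.

```python
import math
import random


def dp_quantile(data, q, eps, lo, hi, rng):
    """Exponential-mechanism estimate of the q-quantile of data clipped to [lo, hi]."""
    xs = sorted(min(max(x, lo), hi) for x in data)
    n = len(xs)
    edges = [lo] + xs + [hi]  # n + 1 candidate intervals between order statistics
    logw = []
    for i in range(n + 1):
        length = edges[i + 1] - edges[i]
        if length <= 0.0:
            logw.append(float("-inf"))  # empty interval: never selected
        else:
            # i data points lie below interval i; the rank utility has sensitivity 1
            utility = -abs(i - q * n)
            logw.append(math.log(length) + 0.5 * eps * utility)
    top = max(logw)
    weights = [math.exp(l - top) for l in logw]
    i = rng.choices(range(n + 1), weights=weights)[0]
    return rng.uniform(edges[i], edges[i + 1])  # uniform draw inside the chosen interval


rng = random.Random(0)
spread = [j / 100 for j in range(101)]  # well-spread data on [0, 1]
atom = [0.5] * 101                      # all mass on a single atom
print(dp_quantile(spread, 0.5, 50.0, 0.0, 1.0, rng))
print(dp_quantile(atom, 0.5, 50.0, 0.0, 1.0, rng))
```

With the atom dataset, every interval between data points has zero length, so only the two outer intervals can be selected and the output is spread over the whole of `[lo, hi]` rather than concentrating near 0.5.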
We consider the problem of learning the dynamics of a linear system when one has access to data generated by an auxiliary system that shares similar (but not identical) dynamics, in addition to data from the true system. We use a weighted least squares approach, and provide a finite sample error bound of the learned model as a function of the number of samples and various system parameters from the two systems as well as the weight assigned to the auxiliary data. We show that the auxiliary data can help to reduce the intrinsic system identification error due to noise, at the price of adding a portion of error that is due to the differences between the two system models. We further provide a data-dependent bound that is computable when some prior knowledge about the systems is available. This bound can also be used to determine the weight that should be assigned to the auxiliary data during the model training stage.
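The weighted least squares idea can be sketched for a scalar system $x_{t+1} = a x_t + w_t$. This is a simplified illustration, not the paper's estimator or its bounds: the scalar setting, the parameter values, and the names `simulate`, `weighted_ls`, and the weight `q` are assumptions.

```python
import random


def simulate(a, n, rng, noise_std=1.0):
    """Return n pairs (x_t, x_{t+1}) generated by x_{t+1} = a * x_t + w_t."""
    x, pairs = 0.0, []
    for _ in range(n):
        x_next = a * x + rng.gauss(0.0, noise_std)
        pairs.append((x, x_next))
        x = x_next
    return pairs


def weighted_ls(true_pairs, aux_pairs, q):
    """Minimise sum_true (x' - a x)^2 + q * sum_aux (x' - a x)^2 over a."""
    sxy = sum(x * y for x, y in true_pairs) + q * sum(x * y for x, y in aux_pairs)
    sxx = sum(x * x for x, _ in true_pairs) + q * sum(x * x for x, _ in aux_pairs)
    return sxy / sxx


rng = random.Random(0)
true_pairs = simulate(0.80, 20, rng)   # few samples from the true system
aux_pairs = simulate(0.85, 2000, rng)  # many samples from a similar system
for q in (0.0, 0.1, 1.0):
    print(q, weighted_ls(true_pairs, aux_pairs, q))
```

In this scalar case the estimate is a weighted average of the two single-system least squares estimates. Increasing `q` therefore reduces the noise-driven error, at the price of pulling the estimate towards the auxiliary parameter.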
The two-sample problem, which consists in testing whether independent samples on $\mathbb{R}^d$ are drawn from the same (unknown) distribution, finds applications in many areas. Its study in high dimension is the subject of much attention, especially because the information acquisition processes at work in the Big Data era often involve various, poorly controlled sources, leading to datasets that may exhibit a strong sampling bias. While classic methods relying on the computation of a discrepancy measure between the empirical distributions face the curse of dimensionality, we develop an alternative approach based on statistical learning and extending rank tests, which are capable of detecting small departures from the null assumption in the univariate case when appropriately designed. To overcome the lack of a natural order on $\mathbb{R}^d$ when $d\geq 2$, it is implemented in two steps. Each sample is assigned a label (positive vs. negative) and divided into two parts; a preorder on $\mathbb{R}^d$ defined by a real-valued scoring function is then learned by means of a bipartite ranking algorithm applied to the first part, and a rank test is applied to the scores of the remaining observations to detect possible differences in distribution. Because it learns how to project the data onto the real line nearly like (any monotone transform of) the likelihood ratio between the original multivariate distributions would do, the approach is not much affected by the dimensionality, ignoring ranking model bias issues, and preserves the advantages of univariate rank tests. Nonasymptotic error bounds are proved based on recent concentration results for two-sample linear rank-processes, and an experimental study shows that the proposed approach surpasses alternative methods standing as natural competitors.
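The two-step procedure can be sketched as follows. This is a toy illustration only: the scoring function is a crude linear projection onto the difference of sample means, standing in for a proper bipartite ranking algorithm, and the rank test is a Mann-Whitney statistic with a normal approximation. All function names and the simulated Gaussian data are assumptions.

```python
import math
import random


def sample(n, d, shift, rng):
    """Draw n points in R^d with independent N(shift, 1) coordinates."""
    return [[rng.gauss(shift, 1.0) for _ in range(d)] for _ in range(n)]


def fit_linear_scorer(pos, neg):
    """Stand-in for bipartite ranking: project onto the difference of means."""
    d = len(pos[0])
    w = [sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
         for j in range(d)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def mann_whitney_z(pos_scores, neg_scores):
    """Normal-approximation z-score of the Mann-Whitney U statistic."""
    n, m = len(pos_scores), len(neg_scores)
    u = sum(1.0 if p > s else 0.5 if p == s else 0.0
            for p in pos_scores for s in neg_scores)
    mean = n * m / 2.0
    std = math.sqrt(n * m * (n + m + 1) / 12.0)
    return (u - mean) / std


def two_stage_test(xs, ys):
    """Step 1: learn a scorer on the first halves. Step 2: rank test on the rest."""
    hx, hy = len(xs) // 2, len(ys) // 2
    score = fit_linear_scorer(xs[:hx], ys[:hy])
    return mann_whitney_z([score(x) for x in xs[hx:]],
                          [score(y) for y in ys[hy:]])


rng = random.Random(0)
xs = sample(200, 5, 1.0, rng)  # "positive" sample, shifted mean
ys = sample(200, 5, 0.0, rng)  # "negative" sample
print(two_stage_test(xs, ys))  # a large z-score suggests different distributions
```

Splitting the data keeps the rank test valid: the scores of the held-out observations are computed with a scoring function that was learned without them.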
This paper presents a new method to estimate systematic errors in the maximum-likelihood regression of count data. The method is applicable in particular to X-ray spectra in situations where the Poisson log-likelihood, or the Cash goodness-of-fit statistic, indicates a poor fit that is attributable to overdispersion of the data. Overdispersion in Poisson data is treated as an intrinsic model variance that can be estimated from the best-fit model, using the maximum-likelihood Cmin statistic. The paper also studies the effects of such systematic errors on the Delta C likelihood-ratio statistic, which can be used to test for the presence of a nested model component in the regression of Poisson count data. The paper introduces an overdispersed chi-square distribution that results from the convolution of a chi-square distribution, which models the usual Delta C statistic, and a zero-mean Gaussian, which models the overdispersion in the data. This is proposed as the distribution of choice for the Delta C statistic in the presence of systematic errors. The methods presented in this paper are applied to XMM-Newton data of the quasar 1ES 1553+113 that were used to detect absorption lines from an intervening warm-hot intergalactic medium (WHIM). This case study illustrates how systematic errors can be estimated from the data, and how they affect the detection of a nested component, such as an absorption line, with the Delta C statistic.
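The convolution described above can be illustrated by Monte Carlo: a draw from the overdispersed chi-square is the sum of an independent chi-square variate and a zero-mean Gaussian variate. This is a sketch only. The symbol `sigma` for the Gaussian standard deviation, its value, and the function names are assumptions, and the paper's own parameterisation may differ.

```python
import random


def sample_overdispersed_chi2(k, sigma, n, rng):
    """Draw n values of chi2(k) + N(0, sigma^2), with independent terms."""
    # A chi-square with k degrees of freedom is a Gamma(shape=k/2, scale=2) variate.
    return [rng.gammavariate(k / 2.0, 2.0) + rng.gauss(0.0, sigma) for _ in range(n)]


def tail_prob(samples, threshold):
    """Empirical probability of exceeding the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)


rng = random.Random(0)
plain = sample_overdispersed_chi2(1, 0.0, 20000, rng)  # ordinary chi-square, 1 d.o.f.
over = sample_overdispersed_chi2(1, 1.0, 20000, rng)   # with Gaussian overdispersion

# 2.706 is the 90th percentile of a chi-square with 1 degree of freedom
print(tail_prob(plain, 2.706), tail_prob(over, 2.706))
```

The heavier upper tail of the overdispersed samples means that a threshold calibrated on the plain chi-square would be exceeded more often than its nominal rate when systematic errors are present.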
A confidence sequence (CS) is a sequence of confidence intervals that is valid at arbitrary data-dependent stopping times. These are useful in applications such as A/B testing, multi-armed bandits, off-policy evaluation, and election auditing. We present three approaches to constructing a confidence sequence for the population mean, under the minimal assumption that only an upper bound $\sigma^2$ on the variance is known. While previous works rely on light-tail assumptions like boundedness or subGaussianity (under which all moments of a distribution exist), the confidence sequences in our work are able to handle data from a wide range of heavy-tailed distributions. The best among our three methods -- the Catoni-style confidence sequence -- performs remarkably well in practice, essentially matching the state-of-the-art methods for $\sigma^2$-subGaussian data, and provably attains the $\sqrt{\log \log t/t}$ lower bound due to the law of the iterated logarithm. Our findings have important implications for sequential experimentation with unbounded observations, since the $\sigma^2$-bounded-variance assumption is more realistic and easier to verify than $\sigma^2$-subGaussianity (which implies the former). We also extend our methods to data with infinite variance but a finite $p$-th central moment ($1<p<2$).
Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome -- e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness left over so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.
In large-scale systems there are fundamental challenges when centralised techniques are used for task allocation. The number of interactions is limited by resource constraints such as those on computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system, sharing tasks across many agents. However, this also increases the resource cost of communications and synchronisation, and is difficult to scale. In this paper we present four algorithms to solve these problems. The combination of these algorithms enables each agent to improve its task allocation strategy through reinforcement learning, while changing how much it explores the system in response to how optimal it believes its current strategy is, given its past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, limiting agents to local rather than system-wide knowledge. We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, which then carry out those subtasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to within 6.7% of the theoretical optimum within the system configurations considered. It provides 5x better performance recovery than no-knowledge retention approaches when system connectivity is impacted, and is tested against systems of up to 100 agents with less than a 9% impact on the algorithms' performance.