Density deconvolution addresses the estimation of the unknown (probability) density function $f$ of a random signal from data that are observed with independent additive random noise. This is a classical problem in statistics, for which frequentist and Bayesian nonparametric approaches are available to deal with static or batch data. In this paper, we consider the problem of density deconvolution in a streaming or online setting, where noisy data arrive progressively with no predetermined sample size, and we develop a sequential nonparametric approach to estimate $f$. By relying on a quasi-Bayesian sequential approach, often referred to as Newton's algorithm, we obtain estimates of $f$ that are easy to evaluate and computationally efficient, with a computational cost that remains constant as the amount of data increases, which is critical in the streaming setting. Large-sample asymptotic properties of the proposed estimates are studied, yielding provable guarantees for the estimation of $f$ at a point (local) and on an interval (uniform). In particular, we establish local and uniform central limit theorems, providing corresponding asymptotic credible intervals and bands. We empirically validate our methods on synthetic and real data, considering the common settings of Laplace and Gaussian noise distributions, and compare them with the kernel-based approach and a Bayesian nonparametric approach based on a Dirichlet process mixture prior.
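A minimal numerical sketch of a Newton-style recursive deconvolution update, discretised on a grid. The uniform initialisation, the weight sequence $a_n = 1/(n+1)$, the grid, and the Laplace noise scale are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Sketch: quasi-Bayesian (Newton-style) recursion for the signal density f,
# where each noisy observation y_n updates f via a convex combination with a
# normalised "posterior" term built from the known noise density.

def laplace_pdf(u, b=1.0):
    return np.exp(-np.abs(u) / b) / (2.0 * b)

def newton_deconvolution(y_stream, grid, noise_scale=1.0):
    dx = grid[1] - grid[0]
    f = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))  # uniform initial estimate
    for n, y in enumerate(y_stream, start=1):
        a_n = 1.0 / (n + 1.0)                             # decreasing weight sequence (assumed)
        lik = laplace_pdf(y - grid, b=noise_scale)        # noise density evaluated at y - x
        post = lik * f
        post /= post.sum() * dx                           # normalised update term
        f = (1.0 - a_n) * f + a_n * post                  # constant cost per observation
    return f

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)                       # latent signal
y = x + rng.laplace(0.0, 1.0, size=x.size)                # noisy observations
grid = np.linspace(-6, 6, 401)
f_hat = newton_deconvolution(y, grid)
print(f_hat.sum() * (grid[1] - grid[0]))                  # ~1, sanity check
```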
Modelling multivariate spatio-temporal data with complex dependency structures is a challenging task, but it can be simplified by assuming that the original variables are generated from independent latent components. If these components are found, they can be modelled univariately. Blind source separation aims to recover the latent components by estimating the unmixing transformation based on the observed data only. Current methods for spatio-temporal blind source separation are restricted to linear unmixing, and nonlinear variants have not been implemented. In this paper, we extend the identifiable variational autoencoder to the nonlinear nonstationary spatio-temporal blind source separation setting and demonstrate its performance using comprehensive simulation studies. Additionally, we introduce two alternative methods for latent dimension estimation, a crucial task for obtaining the correct latent representation. Finally, we illustrate the proposed methods using a meteorological application, where we estimate the latent dimension and the latent components, interpret the components, and show how nonstationarity can be accounted for and prediction accuracy improved by using the proposed nonlinear blind source separation method for preprocessing.
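A highly simplified identifiable-VAE-style sketch in PyTorch, not the authors' implementation: a Gaussian encoder and decoder plus a conditionally Gaussian prior on the latents given an auxiliary variable $u$ (here assumed to be spatial coordinates and time), which is what makes the nonlinear unmixing identifiable in this family of models.

```python
import torch
import torch.nn as nn

# Sketch of an identifiable VAE: q(z|x,u), p(x|z) and an auxiliary-conditioned
# prior p(z|u). Network sizes and the Gaussian likelihood are assumptions.
class IVAE(nn.Module):
    def __init__(self, p, d_latent, d_aux, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(p + d_aux, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, p))
        self.prior = nn.Sequential(nn.Linear(d_aux, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2 * d_latent))

    def elbo(self, x, u):
        mu_q, logvar_q = self.encoder(torch.cat([x, u], dim=-1)).chunk(2, dim=-1)
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)  # reparameterisation
        x_rec = self.decoder(z)
        mu_p, logvar_p = self.prior(u).chunk(2, dim=-1)
        rec = -0.5 * ((x - x_rec) ** 2).sum(-1)                        # Gaussian log-lik (unit var)
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum(-1)
        return (rec - kl).mean()

model = IVAE(p=5, d_latent=3, d_aux=3)
x = torch.randn(128, 5)                 # observed spatio-temporal measurements
u = torch.randn(128, 3)                 # auxiliary variable: (longitude, latitude, time), assumed
loss = -model.elbo(x, u)
loss.backward()
```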
We study the problem of modeling and inference for spatio-temporal count processes. Our approach uses parsimonious parameterisations of multivariate autoregressive count time series models, including possible regression on covariates. We control the number of parameters by specifying spatial neighbourhood structures for the possibly huge parameter matrices that account for spatio-temporal dependencies. This work is motivated by real data applications that call for suitable models. Extensive simulation studies show that our approach yields reliable estimators.
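An illustrative simulation sketch of one parsimonious specification of this kind (an assumed form, not necessarily the paper's exact model): a linear multivariate Poisson autoregression whose parameter matrices are restricted to a row-normalised first-order neighbourhood matrix, so the number of free parameters does not grow with the number of locations.

```python
import numpy as np

# Sketch: lambda_t = delta + A y_{t-1} + B lambda_{t-1}, with A proportional to a
# neighbourhood weight matrix W and B diagonal, keeping the parameterisation small.
def simulate(W, T=200, delta=0.5, a=0.3, b=0.4, seed=0):
    rng = np.random.default_rng(seed)
    d = W.shape[0]
    A, B = a * W, b * np.eye(d)            # neighbourhood-structured parameter matrices
    lam = np.full(d, delta)
    y = np.zeros((T, d), dtype=int)
    for t in range(T):
        y[t] = rng.poisson(lam)
        lam = delta + A @ y[t] + B @ lam   # conditional-mean recursion
    return y

d = 10
W = np.diag(np.ones(d - 1), 1) + np.diag(np.ones(d - 1), -1)  # path-graph neighbours
W /= W.sum(axis=1, keepdims=True)                              # row-normalise
counts = simulate(W)
print(counts.mean(axis=0))
```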
This paper investigates the estimation of the interaction function for a class of McKean-Vlasov stochastic differential equations. The estimation is based on observations of the associated particle system at time $T$, in the scenario where both the time horizon $T$ and the number of particles $N$ tend to infinity. Our proposed method achieves polynomial rates of convergence for the resulting estimator, under the assumption of exponentially decaying tails for the interaction function. Additionally, we conduct a thorough analysis of the transform of the associated invariant density as a complex function, providing essential insights for our main results.
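A sketch of the kind of data the estimator works from: an Euler-Maruyama simulation of the interacting $N$-particle system observed at time $T$. The confining potential, interaction function, and step size below are illustrative choices, not those of the paper.

```python
import numpy as np

# Sketch: dX_i = -[V'(X_i) + (1/N) sum_j W'(X_i - X_j)] dt + sigma dB_i, in 1d.
def simulate_particles(N=200, T=20.0, dt=0.01, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    grad_V = lambda x: x                       # V(x) = x^2 / 2 (assumed confinement)
    grad_W = lambda r: 0.5 * r                 # W(r) = r^2 / 4 (assumed interaction)
    x = rng.normal(size=N)
    for _ in range(int(T / dt)):
        drift = -grad_V(x) - grad_W(x[:, None] - x[None, :]).mean(axis=1)
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=N)
    return x                                   # particle positions at time T

particles = simulate_particles()
print(particles.mean(), particles.std())       # empirical summary of the observed cloud
```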
While score-based diffusion models have achieved exceptional sampling quality, their sampling speeds are often limited by the high computational burden of score function evaluations. Despite remarkable recent empirical advances in speeding up score-based samplers, theoretical understanding of acceleration techniques remains largely limited. To bridge this gap, we propose a novel training-free acceleration scheme for stochastic samplers. Under minimal assumptions -- namely, $L^2$-accurate score estimates and a finite second-moment condition on the target distribution -- our accelerated sampler provably achieves $\varepsilon$-accuracy in total variation within $\widetilde{O}(d^{5/4}/\sqrt{\varepsilon})$ iterations, thereby significantly improving upon the $\widetilde{O}(d/\varepsilon)$ iteration complexity of standard score-based samplers. Notably, our convergence theory does not rely on restrictive assumptions on the target distribution or higher-order score estimation guarantees.
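For context, a sketch of the standard (non-accelerated) stochastic sampler being improved upon: an Euler-Maruyama discretisation of the reverse-time SDE. The Ornstein-Uhlenbeck forward process, the step schedule, and the analytic score for a Gaussian target are assumptions for the demo; in practice the score would be a learned network, and the paper's acceleration modifies this baseline.

```python
import numpy as np

# Baseline reverse-SDE sampler for the forward process dX = -X dt + sqrt(2) dB.
# For a standard-Gaussian target the score is score(x, t) = -x, used here only
# as a runnable stand-in for a learned score estimate.
def reverse_sde_sampler(score_fn, n_samples, d, T=5.0, n_steps=400, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.normal(size=(n_samples, d))        # start from the (approximate) prior N(0, I)
    for k in range(n_steps):
        t = T - k * dt
        drift = x + 2.0 * score_fn(x, t)       # reverse drift: -f(x) + g^2 * score
        x = x + drift * dt + np.sqrt(2.0 * dt) * rng.normal(size=(n_samples, d))
    return x

samples = reverse_sde_sampler(lambda x, t: -x, n_samples=2000, d=2)
print(samples.mean(axis=0), samples.std(axis=0))   # should be close to 0 and 1
```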
Radial basis functions (RBFs) play an important role in function interpolation, in particular on arbitrary sets of interpolation nodes. The accuracy of the interpolation depends on a parameter called the shape parameter. There are many approaches in the literature for choosing it so as to increase the accuracy of interpolation while avoiding instability issues. However, finding the optimal shape parameter value in general remains a challenge. In this work, we present a novel approach to determine the shape parameter in RBFs. First, we construct an optimisation problem to obtain a shape parameter that leads to an interpolation matrix with bounded condition number; then, we introduce a data-driven method that controls the condition number of the interpolation matrix to avoid numerically unstable interpolations while keeping very good accuracy. In addition, we propose a fall-back procedure to enforce a strict upper bound on the condition number, as well as a learning strategy that improves the performance of the data-driven method by learning from previously run simulations. We present numerical test cases to assess the performance of the proposed methods in interpolation tasks and in an RBF-based finite difference (RBF-FD) method, in one and two space dimensions.
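A sketch of the underlying idea of tying the shape parameter to the conditioning of the interpolation matrix. The Gaussian RBF, the target condition number, and the simple log-bisection search are illustrative assumptions, not the paper's optimisation problem or data-driven method.

```python
import numpy as np

# For the Gaussian RBF, cond(A) decreases as the shape parameter eps grows
# (peaked basis) and blows up as eps -> 0 (flat basis), so we can search for
# the smallest eps whose interpolation matrix stays below a condition target.
def gaussian_rbf_matrix(nodes, eps):
    r2 = (nodes[:, None] - nodes[None, :]) ** 2
    return np.exp(-(eps ** 2) * r2)

def shape_by_condition(nodes, kappa_target=1e12, eps_lo=1e-3, eps_hi=1e2, iters=60):
    for _ in range(iters):
        eps = np.sqrt(eps_lo * eps_hi)                 # bisection in log(eps)
        kappa = np.linalg.cond(gaussian_rbf_matrix(nodes, eps))
        if kappa > kappa_target:
            eps_lo = eps        # too ill-conditioned: increase eps
        else:
            eps_hi = eps        # well-conditioned: try a flatter (smaller eps) basis
    return eps_hi               # smallest tested eps keeping cond(A) below the target

nodes = np.sort(np.random.default_rng(0).uniform(0, 1, 30))
eps = shape_by_condition(nodes)
A = gaussian_rbf_matrix(nodes, eps)
coeffs = np.linalg.solve(A, np.sin(2 * np.pi * nodes))  # interpolate f(x) = sin(2*pi*x)
print(eps, np.linalg.cond(A))
```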
By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance. In a second follow-up experiment, we evaluate numerical imputation of one-hot encoded categorical attributes. We reach the following conclusions. Firstly, missing-indicators generally increase classification performance. Secondly, with missing-indicators, nearest neighbour and iterative imputation do not lead to better performance than simple mean/mode imputation. Thirdly, for decision trees, pruning is necessary to prevent overfitting. Fourthly, the thresholds above which missing-indicators are more likely than not to improve performance are lower for categorical attributes than for numerical attributes. Lastly, mean imputation of numerical attributes preserves some of the information from missing values. Consequently, when not using missing-indicators it can be advantageous to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
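A small scikit-learn sketch of the missing-indicator approach combined with mean imputation; the toy data, the choice of classifier, and the pruning strength are illustrative (the paper's experiments use twenty real-life datasets).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing values in both attributes.
X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [4.0, np.nan]])
y = np.array([0, 1, 0, 1])

# add_indicator=True appends one binary column per attribute with missing values,
# so the classifier sees both the imputed value and the fact that it was missing.
clf = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),  # pruned tree (assumed setting)
)
clf.fit(X, y)
print(clf.predict(X))
```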
For an infinite class of finite graphs of unbounded size, we define a limit object, to be called a $\textit{wide limit}$, relative to some computationally restricted class of functions. The limit object is a first order Boolean-valued structure. The first order properties of the wide limit then reflect how a computationally restricted viewer "sees" a generic member of the class. The construction uses arithmetic forcing with random variables [Kraj\'i\v{c}ek, Forcing with random variables and proof complexity 2011]. We give sufficient conditions for universal and existential sentences to be valid in the limit, provide several examples, and prove that such a limit object can then be expanded to a model of weak arithmetic. To illustrate the concept we give an example in which the wide limit relates to total NP search problems. In particular, we take the wide limit of all maps from $\{0,\dots,k-1\}$ to $\{0,\dots,\lfloor k/2\rfloor-1\}$ to obtain a model of $\forall \text{PV}_1(f)$ where the problem $\textbf{RetractionWeakPigeon}$ is total but $\textbf{WeakPigeon}$, the complete problem for $\textbf{PWPP}$, is not. Thus, we obtain a new proof of this unprovability and show it implies that $\textbf{WeakPigeon}$ is not many-one reducible to $\textbf{RetractionWeakPigeon}$ in the oracle setting.
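To help parse the final claim, the two search problems can be recalled in their usual total-function formulations; this is a hedged recap, and the precise relational/oracle versions used in the wide-limit construction may differ in detail.

```latex
\begin{itemize}
  \item $\textbf{WeakPigeon}$: given $f\colon\{0,\dots,k-1\}\to\{0,\dots,\lfloor k/2\rfloor-1\}$,
        find $x\neq y$ with $f(x)=f(y)$.
  \item $\textbf{RetractionWeakPigeon}$: given $f\colon\{0,\dots,k-1\}\to\{0,\dots,\lfloor k/2\rfloor-1\}$
        and $g\colon\{0,\dots,\lfloor k/2\rfloor-1\}\to\{0,\dots,k-1\}$,
        find $x$ with $g(f(x))\neq x$.
\end{itemize}
```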
The attention score matrix ${\rm SoftMax}(XY^T)$ encodes relational similarity patterns between objects and is extremely popular in machine learning. However, the complexity required to calculate it grows quadratically with the problem size, making it a computationally heavy solution. In this article, we propose a linear-time approximation of the attention score normalization constants for embedding vectors with bounded norms. We show on several pre-trained embeddings that the accuracy of our estimation formula surpasses that of competing kernel methods, sometimes by orders of magnitude. Building on this result, we design a linear-time, task-agnostic embedding algorithm based on the optimization of the attention scores. The proposed algorithm is highly interpretable and easily adapted to arbitrary embedding problems. We consider a few use cases and observe similar or higher performance and lower computational time compared with comparable embedding algorithms.
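A sketch of the exact, quadratic-cost quantity being approximated: the per-row normalization constants of ${\rm SoftMax}(XY^T)$. The paper's linear-time estimator is not reproduced here; the unit-norm embeddings are an assumption standing in for the bounded-norm condition.

```python
import numpy as np

# Exact normalization constants Z_i = sum_j exp(x_i . y_j); cost is O(n * m * d),
# i.e. quadratic in the number of objects, which is what motivates the approximation.
def softmax_normalisation_constants(X, Y):
    return np.exp(X @ Y.T).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64)); X /= np.linalg.norm(X, axis=1, keepdims=True)  # bounded norms
Y = rng.normal(size=(1000, 64)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
Z = softmax_normalisation_constants(X, Y)
print(Z[:5])
```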
We develop and analyze data subsampling techniques for Poisson regression, the standard model for count data $y\in\mathbb{N}$. In particular, we consider the Poisson generalized linear model with identity (ID) and square-root link functions. We consider the method of coresets, which are small weighted subsets that approximate the loss function of Poisson regression up to a factor of $1\pm\varepsilon$. We show $\Omega(n)$ lower bounds against coresets for Poisson regression that continue to hold against arbitrary data reduction techniques up to logarithmic factors. By introducing a novel complexity parameter and a domain shifting approach, we show that sublinear coresets with $1\pm\varepsilon$ approximation guarantee exist when the complexity parameter is small. In particular, the dependence on the number of input points can be reduced to polylogarithmic. We show that the dependence on other input parameters can also be bounded sublinearly, though not always logarithmically. In particular, we show that the square-root link admits an $O(\log(y_{\max}))$ dependence, where $y_{\max}$ denotes the largest count present in the data, while the ID-link requires a $\Theta(\sqrt{y_{\max}/\log(y_{\max})})$ dependence. As an auxiliary result for proving the tightness of the bound with respect to $y_{\max}$ in the case of the ID-link, we show an improved bound on the principal branch of the Lambert $W_0$ function, which may be of independent interest. We further show the limitations of our analysis when $p$th degree root-link functions for $p\geq 3$ are considered, which indicate that other analytical or computational methods would be required if such a generalization is even possible.
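A sketch of what the $1\pm\varepsilon$ coreset guarantee refers to for the ID-link: a weighted subset whose weighted loss approximates the full Poisson regression loss for every parameter vector. The uniform subsampling below is only a runnable stand-in for the paper's construction and carries no such guarantee in general; the data and weights are illustrative.

```python
import numpy as np

# Poisson regression loss with identity link: sum_i [mu_i - y_i log mu_i], mu_i = x_i^T beta.
def poisson_id_loss(beta, X, y, weights=None):
    mu = X @ beta
    ll = mu - y * np.log(mu)
    return ll @ weights if weights is not None else ll.sum()

rng = np.random.default_rng(0)
n, d = 20000, 5
X = np.abs(rng.normal(size=(n, d))) + 0.1          # keep x_i^T beta > 0 for positive beta
beta_true = np.abs(rng.normal(size=d))
y = rng.poisson(X @ beta_true)

m = 500                                             # subset ("coreset") size
idx = rng.choice(n, size=m, replace=False)          # uniform stand-in, not sensitivity sampling
w = np.full(m, n / m)                               # reweight so totals are comparable

beta = np.abs(rng.normal(size=d))                   # evaluate at an arbitrary positive beta
full = poisson_id_loss(beta, X, y)
core = poisson_id_loss(beta, X[idx], y[idx], w)
print(abs(core - full) / full)                      # empirical relative error at this beta
```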
There exists an extremely wide array of LLM benchmarking tasks, yet a single number is often the most actionable for decision-making, especially by non-experts. No aggregation scheme exists that is not Elo-based, which can be costly or time-consuming. Here we propose a method to aggregate performance across a general space of benchmarks, nicknamed Project "MPG" for Model Performance and Goodness, a nod to a car metric widely understood to be important yet crude and inaccurate. We create two numbers: a "Goodness" number (answer accuracy) and a "Fastness" number (cost or QPS). We compare models against each other and present rankings according to our general metric as well as its subdomains. We find a significant raw Pearson correlation between our scores and those of Chatbot Arena, improving on the correlation of the MMLU leaderboard with Chatbot Arena.
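An illustrative sketch only: the abstract does not spell out the aggregation rule, so here "Goodness" is taken as a plain mean of per-benchmark accuracies and "Fastness" as queries per second; all model names and numbers are placeholders, not reported results.

```python
import numpy as np

# Placeholder inputs (hypothetical models and scores, for illustration only).
benchmark_accuracy = {                      # rows: models, columns: benchmarks
    "model_a": [0.81, 0.67, 0.73],
    "model_b": [0.77, 0.70, 0.69],
    "model_c": [0.62, 0.58, 0.64],
}
queries_per_second = {"model_a": 4.0, "model_b": 9.5, "model_c": 20.0}
arena_score = {"model_a": 1250, "model_b": 1210, "model_c": 1130}   # placeholder Elo-style scores

goodness = {m: float(np.mean(a)) for m, a in benchmark_accuracy.items()}  # "Goodness"
fastness = queries_per_second                                             # "Fastness"

ranking = sorted(goodness, key=goodness.get, reverse=True)
print("ranking by Goodness:", ranking)

models = list(goodness)
r = np.corrcoef([goodness[m] for m in models], [arena_score[m] for m in models])[0, 1]
print("Pearson correlation with (placeholder) arena scores:", round(r, 3))
```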