$f$-divergences, which quantify discrepancy between probability distributions, are ubiquitous in information theory, machine learning, and statistics. While there are numerous methods for estimating $f$-divergences from data, a limit distribution theory, which quantifies fluctuations of the estimation error, remains largely undeveloped. As limit theorems are pivotal for valid statistical inference, we close this gap by developing a general methodology for deriving distributional limits for $f$-divergences based on the functional delta method and Hadamard directional differentiability. Focusing on four prominent $f$-divergences -- Kullback-Leibler divergence, $\chi^2$ divergence, squared Hellinger distance, and total variation distance -- we identify sufficient conditions on the population distributions for the existence of distributional limits and characterize the limiting variables. These results are used to derive one- and two-sample limit theorems for Gaussian-smoothed $f$-divergences, both under the null and under the alternative. Finally, an application of the limit distribution theory to auditing differential privacy is proposed and analyzed for significance level and power against local alternatives.
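To make the Gaussian-smoothing construction concrete, the following is a minimal illustrative sketch, not the paper's estimator or inference procedure: a plug-in Monte Carlo estimate of the smoothed KL divergence $D_{\mathrm{KL}}(P*\mathcal{N}_\sigma \,\|\, Q*\mathcal{N}_\sigma)$ from samples, where the Gaussian-mixture densities, function names, and toy data are all assumptions of this sketch.

```python
# Minimal sketch (not the paper's construction): plug-in Monte Carlo estimate
# of the Gaussian-smoothed KL divergence between two empirical measures.
import numpy as np
from scipy.special import logsumexp

def log_mixture_density(y, centers, sigma):
    """Log-density of the empirical measure of `centers` convolved with N(0, sigma^2 I)."""
    d = centers.shape[1]
    sq = ((y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (m, n) squared distances
    log_kernel = -sq / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    return logsumexp(log_kernel, axis=1) - np.log(centers.shape[0])

def smoothed_kl(x, y, sigma, n_mc=2000, rng=None):
    """Plug-in estimate of KL(P*N_sigma || Q*N_sigma) from samples x ~ P, y ~ Q."""
    rng = np.random.default_rng(rng)
    # Draw from the smoothed empirical P: pick a sample point, add Gaussian noise.
    idx = rng.integers(len(x), size=n_mc)
    z = x[idx] + sigma * rng.standard_normal((n_mc, x.shape[1]))
    return np.mean(log_mixture_density(z, x, sigma) - log_mixture_density(z, y, sigma))

# Example: two shifted Gaussians in R^2 (toy data)
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))
y = rng.standard_normal((500, 2)) + 0.5
print(smoothed_kl(x, y, sigma=1.0, rng=1))
```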
In the analysis of the $h$-version of the finite-element method (FEM), with fixed polynomial degree $p$, applied to the Helmholtz equation with wavenumber $k\gg 1$, the $\textit{asymptotic regime}$ is when $(hk)^p C_{\rm sol}$ is sufficiently small and the sequence of Galerkin solutions is quasioptimal; here $C_{\rm sol}$ is the norm of the Helmholtz solution operator, normalised so that $C_{\rm sol} \sim k$ for nontrapping problems. In the $\textit{preasymptotic regime}$, one expects that if $(hk)^{2p}C_{\rm sol}$ is sufficiently small, then (for physical data) the relative error of the Galerkin solution is controllably small. In this paper, we prove the natural error bounds in the preasymptotic regime for the variable-coefficient Helmholtz equation in the exterior of a Dirichlet, or Neumann, or penetrable obstacle (or combinations of these), and with the radiation condition $\textit{either}$ realised exactly using the Dirichlet-to-Neumann map on the boundary of a ball $\textit{or}$ approximated either by a radial perfectly-matched layer (PML) or an impedance boundary condition. Previously, such bounds for $p>1$ were only available for Dirichlet obstacles with the radiation condition approximated by an impedance boundary condition. Our result is obtained via a novel generalisation of the "elliptic-projection" argument (the argument used to obtain the result for $p=1$), which can be applied to a wide variety of abstract Helmholtz-type problems.
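As one concrete reading of the phrase "natural error bounds in the preasymptotic regime", the following schematic statement is a sketch only: the weighted norm and the precise form of the right-hand side are assumptions drawn from the standard $p=1$ theory, not quoted from this paper.

```latex
% Schematic preasymptotic bound (constants and data-regularity hypotheses
% suppressed; the weighted norm \|v\|_{H^1_k}^2 := \|\nabla v\|_{L^2}^2
% + k^2 \|v\|_{L^2}^2 is an assumption of this sketch):
% if (hk)^{2p} C_{sol} is sufficiently small, then, for physical data,
\[
  \frac{\|u - u_h\|_{H^1_k}}{\|u\|_{H^1_k}}
  \;\lesssim\; (hk)^{p} \;+\; (hk)^{2p}\, C_{\rm sol},
\]
% so that the relative error of the Galerkin solution u_h is controllably
% small throughout the preasymptotic regime.
```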
Ordinary differential equations (ODEs) can provide mechanistic models of temporally local changes of processes, where parameters are often informed by external knowledge. While ODEs are popular in systems modeling, they are less established for statistical modeling of longitudinal cohort data, e.g., in a clinical setting. Yet, modeling of local changes could also be attractive for assessing the trajectory of an individual in a cohort in the immediate future, given their current status, where ODE parameters could be informed by further characteristics of the individual. However, several hurdles have so far limited such use of ODEs, compared to regression-based function fitting approaches. The potentially higher level of noise in cohort data might be detrimental to ODEs, as the shape of the ODE solution heavily depends on the initial value. In addition, larger numbers of variables multiply such problems and might be difficult for ODEs to handle. To address this, we propose to use each observation in the course of time as the initial value to obtain multiple local ODE solutions, and to build a combined estimator of the underlying dynamics. Neural networks are used for obtaining a low-dimensional latent space for dynamic modeling from a potentially large number of variables, and for obtaining patient-specific ODE parameters from baseline variables. Simultaneous identification of dynamic models and of a latent space is enabled by recently developed differentiable programming techniques. We illustrate the proposed approach in an application with spinal muscular atrophy patients and a corresponding simulation study. In particular, modeling of local changes in health status at any point in time is contrasted with the interpretation of functions obtained from a global regression. This more generally highlights how different application settings might demand different modeling strategies.
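A minimal sketch of the multiple-local-solutions idea described above (illustrative only, without the neural latent space or patient-specific parameters; the logistic dynamics and all names are assumptions of this sketch): each observation $(t_i, y_i)$ serves as an initial value for a local ODE solve, and the local solutions are averaged into a combined estimate of the trajectory.

```python
# Minimal sketch: local ODE solutions started at each observation, averaged.
import numpy as np
from scipy.integrate import solve_ivp

def f(t, y, theta):
    # Assumed toy dynamics (logistic growth); in the paper, parameters are learned.
    return theta[0] * y * (1.0 - y / theta[1])

def combined_estimate(t_obs, y_obs, theta, t_grid):
    curves = []
    for t0, y0 in zip(t_obs, y_obs):
        t_fwd = t_grid[t_grid >= t0]          # evaluate the local solution forward in time
        if len(t_fwd) < 2:
            continue
        sol = solve_ivp(f, (t0, t_grid[-1]), [y0], t_eval=t_fwd, args=(theta,))
        padded = np.full(len(t_grid), np.nan)
        padded[t_grid >= t0] = sol.y[0]
        curves.append(padded)
    return np.nanmean(np.stack(curves), axis=0)   # pointwise average of local solutions

rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 10.0, 8)
y_obs = 1.0 / (1.0 + 9.0 * np.exp(-0.8 * t_obs)) + 0.05 * rng.standard_normal(8)  # noisy data
t_grid = np.linspace(0.0, 10.0, 101)
print(combined_estimate(t_obs, y_obs, (0.8, 1.0), t_grid)[:5])
```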
The paper studies nonstationary high-dimensional vector autoregressions of order $k$, VAR($k$). Additional deterministic terms such as trend or seasonality are allowed. The number of time periods, $T$, and the number of coordinates, $N$, are assumed to be large and of the same order. Under this regime the first-order asymptotics of the Johansen likelihood ratio (LR), Pillai-Bartlett, and Hotelling-Lawley tests for cointegration are derived: the test statistics converge to nonrandom integrals. For more refined analysis, the paper proposes and analyzes a modification of the Johansen test. The new test for the absence of cointegration converges to the partial sum of the Airy$_1$ point process. Supporting Monte Carlo simulations indicate that the same behavior persists universally in many situations beyond those considered in our theorems. The paper presents empirical implementations of the approach for the analysis of S$\&$P$100$ stocks and of cryptocurrencies. The latter example has a strong presence of multiple cointegrating relationships, while the results for the former are consistent with the null of no cointegration.
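For concreteness, here is a minimal sketch of the classical Johansen trace statistic in the regime where $N$ and $T$ are of the same order; this is the unmodified statistic on a toy random-walk panel of our own choosing, without deterministic terms, and it does not implement the paper's modified test.

```python
# Minimal sketch: Johansen trace LR statistic for a VAR(1), no deterministic terms.
import numpy as np

def johansen_lr(Y):
    """Y: (T+1, N) levels. Returns the trace LR statistic for cointegration rank 0."""
    dY = np.diff(Y, axis=0)           # first differences, (T, N)
    L = Y[:-1]                        # lagged levels
    T = dY.shape[0]
    S00 = dY.T @ dY / T
    S11 = L.T @ L / T
    S01 = dY.T @ L / T
    # Eigenvalues of S11^{-1} S10 S00^{-1} S01 (the canonical correlations squared)
    M = np.linalg.solve(S11, S01.T) @ np.linalg.solve(S00, S01)
    lam = np.clip(np.linalg.eigvals(M).real, 0.0, 1.0 - 1e-12)
    return -T * np.log1p(-lam).sum()

rng = np.random.default_rng(0)
T, N = 400, 100                       # N and T large and of the same order
Y = np.cumsum(rng.standard_normal((T + 1, N)), axis=0)   # pure random walks: no cointegration
print(johansen_lr(Y) / (N * T))       # first-order limit is a nonrandom constant
```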
In this paper, the stability of $\theta$-methods for delay differential equations is studied based on the test equation $y'(t)=-A y(t) + B y(t-\tau)$, where $\tau$ is a constant delay and $A$ is a positive definite matrix. We mainly consider the case where the matrices $A$ and $B$ are not simultaneously diagonalizable, and we use the concept of the field of values to prove a sufficient condition for the unconditional stability of these methods, as well as a second condition that also guarantees their stability, but subject to a restriction on the step size. The results obtained are also simplified for the case where the matrices $A$ and $B$ are simultaneously diagonalizable, and compared with other similar works for the general case. We also present several numerical examples in which the theory discussed here is applied to parabolic problems given by delay partial differential equations with a diffusion term and a delayed term.
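A minimal sketch of the $\theta$-method applied to the test equation above, under the common assumption (ours, for this sketch) that the delay is an integer multiple of the step size, $\tau = mh$; the matrices below are a toy non-commuting (hence not simultaneously diagonalizable) pair, not taken from the paper.

```python
# Minimal sketch: theta-method for y'(t) = -A y(t) + B y(t - tau), tau = m*h.
import numpy as np

def theta_method_dde(A, B, phi, tau, h, T, theta=0.5):
    """phi: history function on [-tau, 0]; returns the time grid and the solution."""
    m = round(tau / h)
    n_steps = round(T / h)
    d = A.shape[0]
    y = np.empty((n_steps + m + 1, d))
    for j in range(m + 1):                       # fill history y(-tau), ..., y(0)
        y[j] = phi(-tau + j * h)
    I = np.eye(d)
    M = I + h * theta * A                        # implicit part: (I + h*theta*A) y_{n+1}
    for n in range(m, m + n_steps):
        rhs = (y[n] - h * (1 - theta) * (A @ y[n])
               + h * theta * (B @ y[n + 1 - m])
               + h * (1 - theta) * (B @ y[n - m]))
        y[n + 1] = np.linalg.solve(M, rhs)
    return np.linspace(-tau, T, n_steps + m + 1), y

# Toy pair: A, B do not commute, so they are not simultaneously diagonalizable.
A = np.array([[2.0, 1.0], [0.0, 2.0]])
B = np.array([[0.5, 0.0], [0.3, 0.5]])
t, y = theta_method_dde(A, B, lambda s: np.ones(2), tau=1.0, h=0.1, T=10.0)
print(np.abs(y[-1]))                             # decays when the stability condition holds
```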
Score-based diffusion models have emerged as one of the most promising frameworks for deep generative modelling, due to their state-of-the-art performance in many generation tasks while relying on mathematical foundations such as stochastic differential equations (SDEs) and ordinary differential equations (ODEs). Empirically, it has been reported that ODE-based samples are inferior to SDE-based samples. In this paper we rigorously describe the range of dynamics and approximations that arise when training score-based diffusion models, including the true SDE dynamics, the neural approximations, the various approximate particle dynamics that result, as well as their associated Fokker--Planck equations and the neural network approximations of these Fokker--Planck equations. We systematically analyse the difference between the ODE and SDE dynamics of score-based diffusion models, and link it to an associated Fokker--Planck equation. We derive a theoretical upper bound on the Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms of a Fokker--Planck residual. We also demonstrate numerically, via explicit comparisons, that conventional score-based diffusion models can exhibit significant differences between ODE- and SDE-induced distributions. Moreover, we show numerically that reducing the Fokker--Planck residual, by adding it as an additional regularisation term, closes the gap between ODE- and SDE-induced distributions. Our experiments suggest that this regularisation can improve the distribution generated by the ODE, although this can come at the cost of degraded SDE sample quality.
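To illustrate the ODE/SDE comparison in a setting where the score is known exactly, here is a minimal sketch (a toy one-dimensional Gaussian model of our own choosing, not the paper's neural experiments) that samples with both the reverse SDE and the probability-flow ODE; with the exact score, both recover the data distribution.

```python
# Minimal sketch: reverse SDE vs. probability-flow ODE with an exact score.
import numpy as np

beta = 1.0                 # constant noise schedule beta(t) = beta
s0_sq = 4.0                # data distribution N(0, s0_sq)
T, n_steps, n_samp = 5.0, 500, 20000
dt = T / n_steps
rng = np.random.default_rng(0)

def sigma_sq(t):           # marginal variance of the forward (VP-type) process
    return s0_sq * np.exp(-beta * t) + (1.0 - np.exp(-beta * t))

def score(x, t):           # exact score of the Gaussian marginal p_t
    return -x / sigma_sq(t)

x_sde = rng.standard_normal(n_samp) * np.sqrt(sigma_sq(T))   # start from p_T
x_ode = x_sde.copy()
for i in range(n_steps):   # integrate backwards from T to 0
    t = T - i * dt
    drift = -0.5 * beta * x_sde - beta * score(x_sde, t)     # reverse-SDE drift
    x_sde = x_sde - dt * drift + np.sqrt(beta * dt) * rng.standard_normal(n_samp)
    x_ode = x_ode - dt * (-0.5 * beta * x_ode - 0.5 * beta * score(x_ode, t))
print(x_sde.var(), x_ode.var())   # both should be close to s0_sq = 4.0
```

With a learned, imperfect score the two induced distributions differ, which is the gap the Fokker--Planck residual bounds.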
The task of out-of-distribution (OOD) detection is crucial for deploying machine learning models in real-world settings. In this paper, we observe that the singular value distributions of in-distribution (ID) and OOD features are quite different: the OOD feature matrix tends to have a larger dominant singular value than the ID feature matrix, and the class predictions of OOD samples are largely determined by it. This observation motivates us to propose \texttt{RankFeat}, a simple yet effective \emph{post hoc} approach for OOD detection that removes the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature. \texttt{RankFeat} achieves \emph{state-of-the-art} performance and reduces the average false positive rate (FPR95) by 17.90\% compared with the previous best method. The success of \texttt{RankFeat} motivates us to investigate whether a similar phenomenon exists in the parameter matrices of neural networks. We thus propose \texttt{RankWeight}, which removes the rank-1 weight from the parameter matrices of a single deep layer. \texttt{RankWeight} is also \emph{post hoc} and only requires computing the rank-1 matrix once. As a standalone approach, \texttt{RankWeight} is highly competitive with other methods across various backbones. Moreover, \texttt{RankWeight} enjoys flexible compatibility with a wide range of OOD detection methods. Combining \texttt{RankWeight} and \texttt{RankFeat} sets a new \emph{state of the art}, achieving an FPR95 as low as 16.13\% on the ImageNet-1k benchmark. Extensive ablation studies and comprehensive theoretical analyses are presented to support the empirical results.
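A minimal sketch of the \texttt{RankFeat} operation as described above: subtract the rank-1 matrix $s_1 u_1 v_1^{\top}$ from a high-level feature matrix before pooling. The backbone shapes and the downstream energy score are illustrative assumptions of this sketch, not a reproduction of the paper's pipeline.

```python
# Minimal sketch: remove the largest rank-1 component of a feature matrix.
import numpy as np

def rankfeat(feat):
    """feat: (C, H*W) high-level feature matrix of one sample."""
    U, S, Vt = np.linalg.svd(feat, full_matrices=False)
    return feat - S[0] * np.outer(U[:, 0], Vt[0])   # subtract s1 * u1 v1^T

def energy_score(feat, W, b):
    """Illustrative OOD score from pooled RankFeat features; higher = more ID-like."""
    pooled = rankfeat(feat).mean(axis=1)            # global average pooling
    logits = W @ pooled + b
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum())     # stable logsumexp (energy)

rng = np.random.default_rng(0)
feat = rng.standard_normal((2048, 49))              # e.g. a ResNet final feature map
W, b = rng.standard_normal((1000, 2048)) / 45.0, np.zeros(1000)
print(energy_score(feat, W, b))
```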
We consider the gradient descent flow widely used for the minimization of the $\mathcal{L}^2$ cost function in Deep Learning networks, and introduce two modified versions: one adapted for the overparametrized setting, and the other for the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized setting, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate. We point out relations of the latter to sub-Riemannian geometry.
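For orientation, here is the standard object being modified, in schematic notation that is our assumption for this sketch rather than the paper's: the $\mathcal{L}^2$ cost and its plain gradient descent flow.

```latex
% Schematic setting (notation assumed for this sketch): trainable parameters
% \theta \in \mathbb{R}^K, stacked network outputs f(\theta) \in \mathbb{R}^Q,
% stacked training targets y \in \mathbb{R}^Q.
\[
  \mathcal{C}(\theta) \;=\; \tfrac12\,\bigl\| f(\theta) - y \bigr\|^2,
  \qquad
  \dot\theta(t) \;=\; -\,\nabla_\theta\,\mathcal{C}\bigl(\theta(t)\bigr)
  \;=\; -\,Df(\theta(t))^{T}\bigl( f(\theta(t)) - y \bigr).
\]
% The modified flows replace this Euclidean gradient by one defined through
% the pullback (overparametrized, K > Q) or pushforward (underparametrized,
% K < Q) bundle structure induced by Df; the rank condition concerns Df.
```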
We propose a new randomized method for solving systems of nonlinear equations, which can find sparse solutions or solutions under certain simple constraints. The scheme only takes gradients of component functions and uses Bregman projections onto the solution space of a Newton equation. In the special case of Euclidean projections, the method is known as the nonlinear Kaczmarz method. Furthermore, if the component functions are nonnegative, we are in the setting of optimization under the interpolation assumption and the method reduces to SGD with the recently proposed stochastic Polyak step size. For general Bregman projections, our method is a stochastic mirror descent with a novel adaptive step size. We prove that in the convex setting each iteration of our method results in a smaller Bregman distance to exact solutions than the standard Polyak step. Our generalization to Bregman projections comes at the price that a convex one-dimensional optimization problem needs to be solved in each iteration, which can typically be done with globalized Newton iterations. Convergence is proved in two classical settings of nonlinearity: for convex nonnegative functions, and locally for functions which fulfill the tangential cone condition. Finally, we show examples in which the proposed method outperforms similar methods with the same memory requirements.
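A minimal sketch of the Euclidean special case named above, the nonlinear Kaczmarz method (equivalently, SGD with the stochastic Polyak step size when the components are nonnegative); the quadratic test system and all names are illustrative choices of ours.

```python
# Minimal sketch: nonlinear Kaczmarz iteration x <- x - f_i(x)/||grad f_i(x)||^2 * grad f_i(x).
import numpy as np

def nonlinear_kaczmarz(fs, grads, x0, n_iters=2000, rng=None):
    rng = np.random.default_rng(rng)
    x = x0.copy()
    for _ in range(n_iters):
        i = rng.integers(len(fs))               # sample a component equation
        g = grads[i](x)
        fx = fs[i](x)
        if np.dot(g, g) > 1e-14:
            # Euclidean projection onto the linearization {y : f_i(x) + <g, y - x> = 0}
            x = x - (fx / np.dot(g, g)) * g
    return x

# Toy system: f_i(x) = 0.5 * ((a_i . x)^2 - b_i^2) = 0, consistent by construction.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5)); x_star = rng.standard_normal(5); b = A @ x_star
fs = [lambda x, a=a, bi=bi: 0.5 * ((a @ x) ** 2 - bi ** 2) for a, bi in zip(A, b)]
grads = [lambda x, a=a: (a @ x) * a for a in A]
x = nonlinear_kaczmarz(fs, grads, x0=x_star + 0.1 * rng.standard_normal(5))
print(max(abs(f(x)) for f in fs))               # residuals near zero
```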
We present methodology for constructing pointwise confidence intervals for the cumulative distribution function and the quantiles of mixing distributions on the unit interval from binomial mixture distribution samples. No assumptions are made on the shape of the mixing distribution. The confidence intervals are constructed by inverting exact tests of composite null hypotheses regarding the mixing distribution. Our method may be applied to any deconvolution approach that produces test statistics whose distribution is stochastically monotone for stochastic increase of the mixing distribution. We propose a hierarchical Bayes approach, which uses finite Polya trees for modelling the mixing distribution, that provides stable and accurate deconvolution estimates without the need for additional tuning parameters. Our main technical result establishes the stochastic monotonicity property of the test statistics produced by the hierarchical Bayes approach. Exploiting the stochastic monotonicity requirement, we explicitly derive the smallest asymptotic confidence intervals that may be constructed using our methodology, raising the question of whether it is possible to construct smaller confidence intervals for the mixing distribution without making parametric assumptions on its shape.
A recent line of work has shown the unconditional advantage of constant-depth quantum computation, or $\mathsf{QNC^0}$, over $\mathsf{NC^0}$, $\mathsf{AC^0}$, and related models of classical computation. Problems exhibiting this advantage include search and sampling tasks related to the parity function, and it is natural to ask whether $\mathsf{QNC^0}$ can be used to help compute parity itself. We study $\mathsf{AC^0\circ QNC^0}$ -- a hybrid circuit model where $\mathsf{AC^0}$ operates on measurement outcomes of a $\mathsf{QNC^0}$ circuit -- and conjecture that $\mathsf{AC^0\circ QNC^0}$ cannot achieve even $\Omega(1)$ correlation with parity. As evidence for this conjecture, we prove: $\bullet$ When the $\mathsf{QNC^0}$ circuit is ancilla-free, this model achieves only negligible correlation with parity. $\bullet$ For the general (non-ancilla-free) case, we show via a connection to nonlocal games that the conjecture holds for any class of postprocessing functions that has approximate degree $o(n)$ and is closed under restrictions, even when the $\mathsf{QNC^0}$ circuit is given arbitrary quantum advice. By known results this confirms the conjecture for linear-size $\mathsf{AC^0}$ circuits. $\bullet$ Towards a switching lemma for $\mathsf{AC^0\circ QNC^0}$, we study the effect of quantum preprocessing on the decision tree complexity of Boolean functions. We find that from this perspective, nonlocal channels are no better than randomness: a Boolean function $f$ precomposed with an $n$-party nonlocal channel is equal in distribution to a randomized decision tree with worst-case depth at most $\mathrm{DT}_\mathrm{depth}[f]$. Our results suggest that while $\mathsf{QNC^0}$ is surprisingly powerful for search and sampling, that power is "locked away" in the global correlations of its output, inaccessible to simple classical computation for solving decision problems.