The Kullback-Leibler (KL) divergence is frequently used in data science. For discrete distributions on large state spaces, approximations of probability vectors may result in a few small negative entries, rendering the KL divergence undefined. We address this problem by introducing a parameterized family of substitute divergence measures, the shifted KL (sKL) divergence measures. Our approach is generic and does not increase the computational overhead. We show that the sKL divergence shares important theoretical properties with the KL divergence and discuss how its shift parameters should be chosen. If Gaussian noise is added to a probability vector, we prove that the average sKL divergence converges to the KL divergence for small enough noise. We also show that our method solves the problem of negative entries in an application from computational oncology, the optimization of Mutual Hazard Networks for cancer progression using tensor-train approximations.
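For concreteness, here is a minimal sketch of one plausible parameterization of an sKL divergence, in which a small shift is added to both probability vectors before forming the usual KL sum; the paper's exact definition and its recommended choice of shift parameters may differ.

import numpy as np

def skl_divergence(p, q, shift=1e-6):
    # Shifted KL sketch: adding a shift to both vectors keeps the logarithm
    # defined even when an approximation leaves a few tiny negative entries in p.
    # `shift` may be a scalar or a vector of per-entry shift parameters.
    ps, qs = p + shift, q + shift
    if np.any(ps <= 0) or np.any(qs <= 0):
        raise ValueError("shift parameters are too small for the given vectors")
    return float(np.sum(ps * np.log(ps / qs)))

# Example: an approximated probability vector with one small negative entry.
p_approx = np.array([0.5, 0.3, 0.2 + 1e-9, -1e-9])
q = np.array([0.4, 0.3, 0.2, 0.1])
print(skl_divergence(p_approx, q))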
The generating function approach to entropy has become popular, as it generates several well-known entropy measures discussed in the literature. In this work, we define the weighted cumulative residual entropy generating function (WCREGF) and study its properties. We then introduce the dynamic weighted cumulative residual entropy generating function (DWCREGF). It is shown that the DWCREGF determines the distribution uniquely. We study some characterization results using the relationship between the DWCREGF and the hazard rate and/or the mean residual life function. Using a characterization based on the DWCREGF, we develop a new goodness-of-fit test for the Rayleigh distribution. A Monte Carlo simulation study is conducted to evaluate the proposed test. Finally, the test is illustrated using two real data sets.
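The DWCREGF-based statistic itself is not reproduced here; the sketch below only illustrates the generic Monte Carlo calibration loop for a Rayleigh goodness-of-fit test, with a placeholder Kolmogorov-Smirnov-type distance to the fitted Rayleigh CDF standing in for the paper's statistic.

import numpy as np

def rayleigh_cdf(x, sigma):
    return 1.0 - np.exp(-x**2 / (2.0 * sigma**2))

def test_statistic(x):
    # Placeholder KS-type distance to the fitted Rayleigh CDF (the paper uses
    # a DWCREGF-based statistic instead).
    x = np.sort(x)
    n = len(x)
    sigma_hat = np.sqrt(np.sum(x**2) / (2.0 * n))   # Rayleigh maximum likelihood estimate
    F = rayleigh_cdf(x, sigma_hat)
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return max(np.max(ecdf_hi - F), np.max(F - ecdf_lo))

def mc_critical_value(n, level=0.05, reps=10_000):
    # Monte Carlo calibration: simulate the statistic under the Rayleigh null
    # and take the (1 - level) quantile as the critical value.
    rng = np.random.default_rng(0)
    stats = [test_statistic(rng.rayleigh(scale=1.0, size=n)) for _ in range(reps)]
    return float(np.quantile(stats, 1.0 - level))

print(mc_critical_value(n=30))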
Complex interval arithmetic is a powerful tool for the analysis of computational errors. The naturally arising rectangular, polar, and circular (together called primitive) interval types are not closed under simple arithmetic operations, and their use yields overly relaxed bounds. The later-introduced polygonal type, on the other hand, allows arbitrarily precise representation of the above operations at a higher computational cost. We propose the polyarcular interval type as an effective extension of the previous types. The polyarcular interval can represent all primitive intervals and most of their arithmetic combinations precisely and has an approximation capability competing with that of the polygonal interval. In particular, in antenna tolerance analysis it can achieve perfect accuracy at a lower computational cost than the polygonal type, which we show in a relevant case study. In this paper, we present a rigorous analysis of the arithmetic properties of all five interval types, involving a new algebro-geometric method of boundary analysis.
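As a numerical illustration of why the primitive types yield relaxed bounds (not of the polyarcular arithmetic itself), the sketch below samples two rectangular complex intervals and compares the area of the rectangular hull of their product with a rough Monte Carlo estimate of the true product set's area; the hull is noticeably larger because the product set is not a rectangle.

import numpy as np

rng = np.random.default_rng(0)

def sample_rect(re_lo, re_hi, im_lo, im_hi, n=100_000):
    # Uniform samples from a rectangular complex interval [re_lo, re_hi] x [im_lo, im_hi].
    return rng.uniform(re_lo, re_hi, n) + 1j * rng.uniform(im_lo, im_hi, n)

a = sample_rect(1.0, 2.0, 1.0, 2.0)
b = sample_rect(-1.0, 1.0, 2.0, 3.0)
prod = a * b

# Area of the rectangular hull of the sampled product set.
dre = prod.real.max() - prod.real.min()
dim = prod.imag.max() - prod.imag.min()
hull_area = dre * dim

# Rough Monte Carlo estimate of the product set's area via occupied histogram cells.
hist, _, _ = np.histogram2d(prod.real, prod.imag, bins=100)
cell_area = (dre / 100) * (dim / 100)
set_area = np.count_nonzero(hist) * cell_area

print(f"rectangular hull area ~ {hull_area:.2f}, sampled product-set area ~ {set_area:.2f}")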
We focus on the fundamental mathematical structure of score-based generative models (SGMs). We first formulate SGMs in terms of the Wasserstein proximal operator (WPO) and demonstrate that, via mean-field games (MFGs), the WPO formulation reveals mathematical structure that describes the inductive bias of diffusion and score-based models. In particular, MFGs yield optimality conditions in the form of a pair of coupled partial differential equations: a forward-controlled Fokker-Planck (FP) equation, and a backward Hamilton-Jacobi-Bellman (HJB) equation. Via a Cole-Hopf transformation and taking advantage of the fact that the cross-entropy can be related to a linear functional of the density, we show that the HJB equation is an uncontrolled FP equation. Second, with the mathematical structure at hand, we present an interpretable kernel-based model for the score function which dramatically improves the performance of SGMs in terms of training samples and training time. In addition, the WPO-informed kernel model is explicitly constructed to avoid the recently studied memorization effects of score-based generative models. The mathematical form of the new kernel-based models in combination with the use of the terminal condition of the MFG reveals new explanations for the manifold learning and generalization properties of SGMs, and provides a resolution to their memorization effects. Finally, our mathematically informed, interpretable kernel-based model suggests new scalable bespoke neural network architectures for high-dimensional applications.
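As an illustration of the Cole-Hopf step (stated here for the drift-free quadratic HJB equation; the MFG optimality system in the paper additionally carries drift and potential terms), the change of variables $\phi = e^{-U/\beta}$ removes the quadratic nonlinearity:
\[
  \partial_t U + \tfrac{\beta}{2}\,\Delta U \;=\; \tfrac{1}{2}\,\lVert \nabla U \rVert^2
  \quad\Longrightarrow\quad
  \partial_t \phi + \tfrac{\beta}{2}\,\Delta \phi \;=\; 0 ,
\]
leaving a linear, heat-type (uncontrolled Fokker-Planck) equation for $\phi$.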
A deep generative model yields an implicit estimator for the unknown distribution or density function of the observation. This paper investigates some statistical properties of the implicit density estimator pursued by VAE-type methods from a nonparametric density estimation framework. More specifically, we obtain convergence rates of the VAE-type density estimator under the assumption that the underlying true density function belongs to a locally H\"{o}lder class. Remarkably, a near minimax optimal rate with respect to the Hellinger metric can be achieved by the simplest network architecture, a shallow generative model with a one-dimensional latent variable.
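For reference, the classical minimax benchmark alluded to here: for densities that are $\beta$-H\"{o}lder on a $d$-dimensional domain, the optimal Hellinger rate of estimation is, up to logarithmic factors,
\[
  \epsilon_n \;\asymp\; n^{-\beta/(2\beta + d)} .
\]
This is the standard benchmark for the nonparametric problem; whether the paper's locally H\"{o}lder setting modifies it beyond logarithmic factors is not asserted here.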
We propose a generalization of nonlinear stability of numerical one-step integrators to Riemannian manifolds in the spirit of Butcher's notion of B-stability. Taking inspiration from Simpson-Porco and Bullo, we introduce non-expansive systems on such manifolds and define B-stability of integrators. In this first exposition, we provide concrete results for a geodesic version of the Implicit Euler (GIE) scheme. We prove that the GIE method is B-stable on Riemannian manifolds with non-positive sectional curvature. We show through numerical examples that the GIE method is expansive when applied to a certain non-expansive vector field on the 2-sphere, and that the GIE method does not necessarily possess a unique solution for large enough step sizes. Finally, we derive a new improved global error estimate for general Lie group integrators.
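Below is a minimal sketch of one plausible geodesic implicit Euler step on the unit 2-sphere, assuming the defining relation $\log_{y_{n+1}}(y_n) = -h\, f(y_{n+1})$ and solving it by fixed-point iteration; the rotation field used here is a hypothetical stand-in, not the non-expansive vector field studied in the paper.

import numpy as np

def exp_map(p, v):
    # Exponential map on the unit sphere: follow the geodesic from p with initial velocity v.
    nv = np.linalg.norm(v)
    if nv < 1e-14:
        return p
    return np.cos(nv) * p + np.sin(nv) * (v / nv)

def log_map(p, q):
    # Logarithm map on the unit sphere: tangent vector at p pointing towards q, with geodesic length.
    cos_t = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-14:
        return np.zeros_like(p)
    w = q - cos_t * p
    return theta * w / np.linalg.norm(w)

def f(y):
    # Hypothetical tangent vector field: rotation about the z-axis.
    return np.cross(np.array([0.0, 0.0, 1.0]), y)

def gie_step(y_n, h, iters=50):
    # One geodesic implicit Euler step: find y with log_y(y_n) = -h f(y) by fixed-point iteration.
    y = y_n.copy()
    for _ in range(iters):
        y = exp_map(y, log_map(y, y_n) + h * f(y))
    return y

y = np.array([1.0, 0.0, 0.0])
for _ in range(10):
    y = gie_step(y, h=0.1)
print(y, np.linalg.norm(y))  # iterate stays on the unit sphere up to rounding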
We formulate a uniform tail bound for empirical processes indexed by a class of functions, in terms of the individual deviations of the functions rather than the worst-case deviation in the considered class. The tail bound is established by introducing an initial "deflation" step to the standard generic chaining argument. The resulting tail bound is the sum of the complexity of the "deflated function class", in terms of a generalization of Talagrand's $\gamma$ functional, and the deviation of the function instance, both of which are formulated based on the natural seminorm induced by the corresponding Cram\'{e}r functions. We also provide approximations for this seminorm when the function class lies in a given (exponential-type) Orlicz space, which can be used to make the complexity term and the deviation term more explicit.
Block Principal Component Analysis (Block PCA) of a data matrix A, where loadings Z are determined by maximization of $\|AZ\|^2$ over unit-norm orthogonal loadings, is difficult to use for the design of sparse PCA by $\ell_1$ regularization, due to the difficulty of handling both the orthogonality constraint on loadings and the non-differentiable $\ell_1$ penalty. Our objective in this paper is to relax the orthogonality constraint on loadings by introducing new objective functions expvar(Y) which measure the part of the variance of the data matrix A explained by correlated components Y = AZ. We first propose a comprehensive study of the mathematical and numerical properties of expvar(Y) for two existing definitions (Zou et al. [2006], Shen and Huang [2008]) and four new definitions. We then show that only two of these explained variance definitions are fit for use as objective functions in block PCA formulations free of orthogonality constraints.
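For reference, here is a sketch of one of the two existing definitions, the QR-based adjusted variance of Zou et al. [2006], for correlated components Y = AZ; the remaining definitions studied in the paper are not reproduced here.

import numpy as np

def adjusted_explained_variance(A, Z):
    # Adjusted variance of Zou et al. [2006] for possibly correlated components Y = A Z:
    # QR-decompose Y and sum the squared diagonal of R, which removes variance
    # already explained by preceding components. A is assumed column-centered.
    Y = A @ Z
    _, R = np.linalg.qr(Y)
    return float(np.sum(np.diag(R) ** 2))

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
Z = rng.standard_normal((10, 3))      # non-orthogonal loadings
Z /= np.linalg.norm(Z, axis=0)        # unit-norm columns
print(adjusted_explained_variance(A - A.mean(axis=0), Z))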
We introduce a method to construct a stochastic surrogate model from the results of dimensionality reduction in forward uncertainty quantification. The hypothesis is that the high-dimensional input augmented by the output of a computational model admits a low-dimensional representation. This assumption can be met by numerous uncertainty quantification applications with physics-based computational models. The proposed approach differs from a sequential application of dimensionality reduction followed by surrogate modeling, as we "extract" a surrogate model from the results of dimensionality reduction in the input-output space. This feature becomes desirable when the input space is genuinely high-dimensional. The proposed method also diverges from the Probabilistic Learning on Manifold, as a reconstruction mapping from the feature space to the input-output space is circumvented. The final product of the proposed method is a stochastic simulator that propagates a deterministic input into a stochastic output, preserving the convenience of a sequential "dimensionality reduction + Gaussian process regression" approach while overcoming some of its limitations. The proposed method is demonstrated through two uncertainty quantification problems characterized by high-dimensional input uncertainties.
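As a toy stand-in (not the paper's construction), the sketch below illustrates the general idea of obtaining a stochastic simulator from dimensionality reduction in the joint input-output space: a low-rank-plus-diagonal Gaussian model is fitted to the joint samples and outputs are drawn conditionally on the input block.

import numpy as np

def fit_joint_gaussian_lowrank(X, Y, r):
    # Toy "DR in the joint input-output space" surrogate: PCA on Z = [X, Y] with r
    # components, yielding a low-rank-plus-diagonal Gaussian model of the joint vector.
    Z = np.hstack([X, Y])
    mu = Z.mean(axis=0)
    Zc = Z - mu
    _, s, Vt = np.linalg.svd(Zc / np.sqrt(Z.shape[0]), full_matrices=False)
    W = Vt[:r].T * s[:r]                           # principal directions scaled by std. dev.
    resid = Zc - (Zc @ Vt[:r].T) @ Vt[:r]
    noise = resid.var(axis=0) + 1e-8               # per-coordinate residual variance
    return mu, W @ W.T + np.diag(noise)

def sample_y_given_x(mu, cov, x, dx, n_samples=100, rng=None):
    # Stochastic simulator: sample the output block conditionally on the input block.
    rng = np.random.default_rng() if rng is None else rng
    mx, my = mu[:dx], mu[dx:]
    Cxx, Cxy = cov[:dx, :dx], cov[:dx, dx:]
    Cyx, Cyy = cov[dx:, :dx], cov[dx:, dx:]
    K = Cyx @ np.linalg.inv(Cxx)
    mean = my + K @ (x - mx)
    cond_cov = Cyy - K @ Cxy
    cond_cov = 0.5 * (cond_cov + cond_cov.T)       # symmetrize for numerical safety
    return rng.multivariate_normal(mean, cond_cov, size=n_samples)

# Usage on toy data: X are inputs, Y are outputs of the computational model.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
Y = np.sin(X[:, :2]).sum(axis=1, keepdims=True) + 0.1 * rng.standard_normal((500, 1))
mu, cov = fit_joint_gaussian_lowrank(X, Y, r=5)
samples = sample_y_given_x(mu, cov, X[0], dx=20)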
We consider nonparametric Bayesian inference in a multidimensional diffusion model with reflecting boundary conditions based on discrete high-frequency observations. We prove a general posterior contraction rate theorem in $L^2$-loss, which is applied to Gaussian priors. The resulting posteriors, as well as their posterior means, are shown to converge to the ground truth at the minimax optimal rate over H\"older smoothness classes in any dimension. Of independent interest and as part of our proofs, we show that certain frequentist penalized least squares estimators are also minimax optimal.
Collecting large quantities of high-quality data is often prohibitively expensive or impractical, and a crucial bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources like public datasets, data collected under different circumstances, or synthesized by generative models. Blurring distinctions, we refer to such data as `surrogate data'. We define a simple scheme for integrating surrogate data into training and use both theoretical models and empirical studies to explore its behavior. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution; $(ii)$ In order to reap this benefit, it is crucial to use optimally weighted empirical risk minimization; $(iii)$ The test error of models trained on mixtures of real and surrogate data is well described by a scaling law. This can be used to predict the optimal weighting and the gain from surrogate data.
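A minimal sketch of weighted empirical risk minimization over a mixture of real and surrogate data; ridge regression and selection of the surrogate weight by validation error on held-out real data are illustrative choices here, not the paper's scaling-law prescription.

import numpy as np

def weighted_ridge(X_real, y_real, X_surr, y_surr, alpha, lam=1e-3):
    # Weighted ERM for ridge regression: real samples get weight (1 - alpha)/n,
    # surrogate samples get weight alpha/m.
    n, m = len(X_real), len(X_surr)
    w = np.concatenate([np.full(n, (1 - alpha) / n), np.full(m, alpha / m)])
    X = np.vstack([X_real, X_surr])
    y = np.concatenate([y_real, y_surr])
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw + lam * np.eye(X.shape[1]), Xw.T @ y)

def pick_alpha(X_real, y_real, X_surr, y_surr, X_val, y_val):
    # Sweep the surrogate weight and keep the value with the lowest validation
    # error on held-out real data (a stand-in for the scaling-law prediction).
    grid = np.linspace(0.0, 1.0, 21)
    errs = [np.mean((X_val @ weighted_ridge(X_real, y_real, X_surr, y_surr, a) - y_val) ** 2)
            for a in grid]
    return grid[int(np.argmin(errs))]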