Making use of a newly developed package in the computer algebra system SageMath, we show how to perform a full asymptotic analysis by means of the Mellin transform with explicit error bounds. As an application of the method, we answer a question of B\'ona and DeJonge on 132-avoiding permutations with a unique longest increasing subsequence that can be translated into an inequality for a certain binomial sum.
Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data.' We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze mathematically this method under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original ones. We trace back this behavior to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.
An essential problem in statistics and machine learning is the estimation of expectations involving PDFs with intractable normalizing constants. The self-normalized importance sampling (SNIS) estimator, which normalizes the IS weights, has become the standard approach due to its simplicity. However, the SNIS has been shown to exhibit high variance in challenging estimation problems, e.g, involving rare events or posterior predictive distributions in Bayesian statistics. Further, most of the state-of-the-art adaptive importance sampling (AIS) methods adapt the proposal as if the weights had not been normalized. In this paper, we propose a framework that considers the original task as estimation of a ratio of two integrals. In our new formulation, we obtain samples from a joint proposal distribution in an extended space, with two of its marginals playing the role of proposals used to estimate each integral. Importantly, the framework allows us to induce and control a dependency between both estimators. We propose a construction of the joint proposal that decomposes in two (multivariate) marginals and a coupling. This leads to a two-stage framework suitable to be integrated with existing or new AIS and/or variational inference (VI) algorithms. The marginals are adapted in the first stage, while the coupling can be chosen and adapted in the second stage. We show in several examples the benefits of the proposed methodology, including an application to Bayesian prediction with misspecified models.
To analyze the topological properties of the given discrete data, one needs to consider a continuous transform called filtration. Persistent homology serves as a tool to track changes of homology in the filtration. The outcome of the topological analysis of data varies depending on the choice of filtration, making the selection of filtration crucial. Filtration learning is an attempt to find an optimal filtration that minimizes the loss function. Exact Multi-parameter Persistent Homology (EMPH) has been recently proposed, particularly for topological time-series analysis, that utilizes the exact formula of rank invariant instead of calculating it. In this paper, we propose a framework for filtration learning of EMPH. We formulate an optimization problem and propose an algorithm for solving the problem. We then apply the proposed algorithm to several classification problems. Particularly, we derive the exact formula of the gradient of the loss function with respect to the filtration parameter, which makes it possible to directly update the filtration without using automatic differentiation, significantly enhancing the learning process.
In the study of extremes, the presence of asymptotic independence signifies that extreme events across multiple variables are probably less likely to occur together. Although well-understood in a bivariate context, the concept remains relatively unexplored when addressing the nuances of joint occurrence of extremes in higher dimensions. In this paper, we propose a notion of mutual asymptotic independence to capture the behavior of joint extremes in dimensions larger than two and contrast it with the classical notion of (pairwise) asymptotic independence. Furthermore, we define $k$-wise asymptotic independence which lies in between pairwise and mutual asymptotic independence. The concepts are compared using examples of Archimedean, Gaussian and Marshall-Olkin copulas among others. Notably, for the popular Gaussian copula, we provide explicit conditions on the correlation matrix for mutual asymptotic independence to hold; moreover, we are able to compute exact tail orders for various tail events.
This note discusses a simple modification of cross-conformal prediction inspired by recent work on e-values. The precursor of conformal prediction developed in the 1990s by Gammerman, Vapnik, and Vovk was also based on e-values and is called conformal e-prediction in this note. Replacing e-values by p-values led to conformal prediction, which has important advantages over conformal e-prediction without obvious disadvantages. The situation with cross-conformal prediction is, however, different: whereas for cross-conformal prediction validity is only an empirical fact (and can be broken with excessive randomization), this note draws the reader's attention to the obvious fact that cross-conformal e-prediction enjoys a guaranteed property of validity.
This paper studies the influence of probabilism and non-determinism on some quantitative aspect X of the execution of a system modeled as a Markov decision process (MDP). To this end, the novel notion of demonic variance is introduced: For a random variable X in an MDP M, it is defined as 1/2 times the maximal expected squared distance of the values of X in two independent execution of M in which also the non-deterministic choices are resolved independently by two distinct schedulers. It is shown that the demonic variance is between 1 and 2 times as large as the maximal variance of X in M that can be achieved by a single scheduler. This allows defining a non-determinism score for M and X measuring how strongly the difference of X in two executions of M can be influenced by the non-deterministic choices. Properties of MDPs M with extremal values of the non-determinism score are established. Further, the algorithmic problems of computing the maximal variance and the demonic variance are investigated for two random variables, namely weighted reachability and accumulated rewards. In the process, also the structure of schedulers maximizing the variance and of scheduler pairs realizing the demonic variance is analyzed.
In this paper, we propose nonparametric estimators for varextropy function of an absolutely continuous random variable. Consistency of the estimators is established under suitable regularity conditions. Moreover, a simulation study is performed to compare the performance of the proposed estimators based on mean squared error (MSE) and bias. Furthermore, by using the proposed estimators some tests are constructed for uniformity. It is shown that the varextropybased test proposed here performs well compared to the power of the other uniformity hypothesis tests.
Rational approximation is a powerful tool to obtain accurate surrogates for nonlinear functions that are easy to evaluate and linearize. The interpolatory adaptive Antoulas--Anderson (AAA) method is one approach to construct such approximants numerically. For large-scale vector- and matrix-valued functions, however, the direct application of the set-valued variant of AAA becomes inefficient. We propose and analyze a new sketching approach for such functions called sketchAAA that, with high probability, leads to much better approximants than previously suggested approaches while retaining efficiency. The sketching approach works in a black-box fashion where only evaluations of the nonlinear function at sampling points are needed. Numerical tests with nonlinear eigenvalue problems illustrate the efficacy of our approach, with speedups above 200 for sampling large-scale black-box functions without sacrificing on accuracy.
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.