We explore the analytic properties of the density function $h(x;\gamma,\alpha)$, $x \in (0,\infty)$, $\gamma > 0$, $0 < \alpha < 1$, which arises from the domain of attraction problem for a statistic interpolating between the supremum and the sum of random variables. The parameter $\alpha$ controls the interpolation between these two cases, while $\gamma$ parametrises the type of extreme value distribution from which the underlying random variables are drawn. For $\alpha = 0$ the Fr\'echet density applies, whereas for $\alpha = 1$ we identify a particular Fox H-function; these functions are a natural extension of hypergeometric functions into the realm of fractional calculus. In contrast, for intermediate $\alpha$ an entirely new function appears, which is not among the extensions of the hypergeometric function considered to date. We derive series, integral and continued fraction representations of this latter function.
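For concreteness, the boundary case $\alpha = 0$ referred to above corresponds to the Fr\'echet law; assuming the standard normalisation (no scale or location parameters), its density with shape parameter $\gamma > 0$ is
$$ f(x;\gamma)=\gamma\, x^{-\gamma-1}\exp\!\left(-x^{-\gamma}\right), \qquad x>0. $$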
Many applications rely on solving time-dependent partial differential equations (PDEs) that include second derivatives. Summation-by-parts (SBP) operators are crucial for developing stable, high-order accurate numerical methods for such problems. Conventionally, SBP operators are built on the assumption that the solution is well approximated by polynomials, and they are therefore constructed to be exact for polynomials. However, this assumption falls short for a range of problems for which other approximation spaces are better suited. We recently addressed this issue and developed a theory of first-derivative SBP operators based on general function spaces, coined function-space SBP (FSBP) operators. In this paper, we extend the theory of FSBP operators to second derivatives. The resulting second-derivative FSBP operators maintain the desired mimetic properties of existing polynomial SBP operators while offering greater flexibility, since they apply to a broader range of function spaces. We establish the existence of these operators and detail a straightforward methodology for constructing them. By exploring various function spaces, including trigonometric, exponential, and radial basis functions, we illustrate the versatility of our approach. We showcase the superior performance of these non-polynomial FSBP operators over traditional polynomial-based operators for a suite of one- and two-dimensional problems, encompassing a boundary layer problem and the viscous Burgers' equation. The work presented here opens up the use of second-derivative SBP operators based on suitable function spaces, paving the way for a wide range of applications in the future.
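As a point of reference only (this is the classical polynomial special case, not the FSBP construction developed in the paper), the following numpy sketch assembles the standard second-order accurate first-derivative SBP operator, verifies its SBP property and polynomial exactness, and forms a wide-stencil second-derivative operator from it; the grid size and all matrices are illustrative assumptions.

```python
import numpy as np

# Classical second-order, polynomial-exact SBP building blocks (illustrative baseline).
N, h = 20, 1.0 / 20
x = np.linspace(0.0, 1.0, N + 1)

H = h * np.diag([0.5] + [1.0] * (N - 1) + [0.5])   # SBP norm (quadrature) matrix
Q = np.zeros((N + 1, N + 1))
Q[0, 0], Q[0, 1] = -0.5, 0.5
Q[-1, -2], Q[-1, -1] = -0.5, 0.5
for i in range(1, N):
    Q[i, i - 1], Q[i, i + 1] = -0.5, 0.5

B = np.diag([-1.0] + [0.0] * (N - 1) + [1.0])
D1 = np.linalg.solve(H, Q)                         # first-derivative SBP operator

assert np.allclose(Q + Q.T, B)                     # SBP property: Q + Q^T = B
assert np.allclose(D1 @ np.ones_like(x), 0.0)      # exact for constants
assert np.allclose(D1 @ x, 1.0)                    # exact for linear functions

# A (wide-stencil) second-derivative SBP operator: D2 = D1 D1 satisfies
# H D2 = -M + B D1 with M = D1^T H D1 symmetric positive semi-definite.
D2 = D1 @ D1
M = D1.T @ H @ D1
assert np.allclose(H @ D2, -M + B @ D1)
print("SBP identities verified")
```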
Hidden Markov models (HMMs) are flexible tools for clustering dependent data coming from unknown populations, allowing nonparametric modelling of the population densities. Identifiability fails when the data are in fact independent, and we study the frontier between learnable and unlearnable two-state nonparametric HMMs. Interesting new phenomena, absent from the multinomial setting, emerge when the cluster distributions are modelled via density functions (the 'emission' densities) belonging to standard smoothness classes. Notably, in contrast to the multinomial setting previously considered, identifying a direction separating the two emission densities becomes a critical, and challenging, issue. Surprisingly, it is possible to "borrow strength" from estimators of the smoother density to improve estimation of the other. We conduct a precise analysis of minimax rates, exhibiting a transition that depends on the relative smoothness of the two emission densities.
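As a minimal illustration of the setting (the transition matrix and emission densities below are arbitrary assumptions, not taken from the paper), the following sketch simulates a two-state HMM with smooth emissions; a transition matrix with identical rows would make the observations i.i.d., which is the regime where identifiability fails.

```python
import numpy as np

# Simulate a two-state HMM with smooth emission densities (illustrative choices).
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],      # transition matrix; identical rows would give i.i.d. data,
              [0.2, 0.8]])     # the unidentifiable boundary case

def emit(state):
    # two smooth densities on [0, 1], one per hidden state
    return rng.beta(2, 5) if state == 0 else rng.beta(4, 4)

n = 5000
states = np.empty(n, dtype=int)
obs = np.empty(n)
states[0] = rng.integers(2)
obs[0] = emit(states[0])
for t in range(1, n):
    states[t] = rng.choice(2, p=P[states[t - 1]])
    obs[t] = emit(states[t])
# `obs` is what an estimator observes; `states` remains latent.
```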
Context. Algorithmic racism is the term used to describe the behavior of technological solutions that disadvantage users based on their ethnicity. Lately, various data-driven software systems have been reported to discriminate against Black people, either because of biased data sets or because of prejudice propagated by software professionals in their code. As a result, Black people are experiencing disadvantages in accessing technology-based services, such as housing, banking, and law enforcement. Goal. This study aims to explore algorithmic racism from the perspective of software professionals. Method. A survey questionnaire was administered to explore software practitioners' understanding of algorithmic racism, and data analysis was conducted using descriptive statistics and coding techniques. Results. We obtained answers from a sample of 73 software professionals discussing their understanding of, and perspectives on, algorithmic racism in software development. Our results demonstrate that the effects of algorithmic racism are well known among practitioners. However, there is no consensus on how the problem can be effectively addressed in software engineering. In this paper, some solutions to the problem are proposed based on the professionals' narratives. Conclusion. Combining technical and social strategies, including training on structural racism for software professionals, is the most promising way to address the algorithmic racism problem and its effects on the software solutions delivered to our society.
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to the intrinsic data structure. In real-world applications, however, the assumption that the data lie exactly on a low-dimensional manifold is stringent. This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, whose intrinsic dimension can be characterized by a new complexity notion -- the effective Minkowski dimension. We prove that the sample complexity of deep nonparametric regression depends only on the effective Minkowski dimension of $\mathcal{S}$, denoted by $p$. We further illustrate our theoretical findings by considering nonparametric regression with an anisotropic Gaussian random design $N(0,\Sigma)$, where $\Sigma$ is full rank. When the eigenvalues of $\Sigma$ decay exponentially or polynomially, the effective Minkowski dimension of such a Gaussian random design is $p=\mathcal{O}(\sqrt{\log n})$ or $p=\mathcal{O}(n^\gamma)$, respectively, where $n$ is the sample size and $\gamma\in(0,1)$ is a small constant depending on the polynomial decay rate. Our theory shows that, when the manifold assumption does not hold, deep neural networks can still adapt to the effective Minkowski dimension of the data and circumvent the curse of ambient dimensionality for moderate sample sizes.
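A rough numerical illustration of the idea, under assumed exponentially decaying eigenvalues and using a greedy covering heuristic rather than the paper's definition: samples from an anisotropic Gaussian in $\mathbb{R}^{50}$ can be covered by far fewer $\epsilon$-balls than the ambient dimension alone would suggest.

```python
import numpy as np

# Gauge how many epsilon-balls cover samples from an anisotropic Gaussian whose
# covariance eigenvalues decay exponentially (illustrative proxy for a small
# "effective" dimension; not the formal definition used in the paper).
rng = np.random.default_rng(0)
d, n = 50, 2000
eigvals = 2.0 ** (-np.arange(d))                       # exponentially decaying spectrum (assumed)
X = rng.standard_normal((n, d)) * np.sqrt(eigvals)     # samples from N(0, diag(eigvals))

def greedy_cover_size(X, eps):
    """Greedy epsilon-net: number of centers needed so every sample is within eps of one."""
    remaining = X.copy()
    count = 0
    while len(remaining):
        center = remaining[0]
        dist = np.linalg.norm(remaining - center, axis=1)
        remaining = remaining[dist > eps]
        count += 1
    return count

for eps in (1.0, 0.5, 0.25):
    print(eps, greedy_cover_size(X, eps))   # grows far more slowly than eps**(-d)
```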
We study statistical inference for the optimal transport (OT) map (also known as the Brenier map) from a known absolutely continuous reference distribution onto an unknown finitely discrete target distribution. We derive limit distributions for the $L^p$-error with arbitrary $p \in [1,\infty)$ and for linear functionals of the empirical OT map, together with their moment convergence. The former has a non-Gaussian limit, whose explicit density is derived, while the latter attains asymptotic normality. For both cases, we also establish consistency of the nonparametric bootstrap. The derivation of our limit theorems relies on new stability estimates of functionals of the OT map with respect to the dual potential vector, which may be of independent interest. We also discuss applications of our limit theorems to the construction of confidence sets for the OT map and inference for a maximum tail correlation.
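For intuition about the semi-discrete setting (quadratic cost, with an assumed uniform reference distribution and arbitrary atoms; this is a generic stochastic ascent of the semi-dual, not the inference procedure of the paper), a minimal sketch of computing an empirical OT map through the dual potential vector is:

```python
import numpy as np

# Semi-discrete OT: reference rho = Unif[0,1]^2, discrete target on atoms y_j with
# weights q_j. Fit the dual potential vector psi by stochastic ascent of the semi-dual;
# the fitted Brenier-type map sends x to the atom minimizing |x - y_j|^2 / 2 - psi_j.
rng = np.random.default_rng(0)
m = 5
y = rng.random((m, 2))                 # target atoms (assumed)
q = np.full(m, 1.0 / m)                # target weights (assumed uniform)
psi = np.zeros(m)

for it in range(1, 200001):
    x = rng.random(2)                                    # sample from the reference
    j = np.argmin(0.5 * np.sum((x - y) ** 2, axis=1) - psi)
    grad = q.copy()
    grad[j] -= 1.0                                       # stochastic gradient of the semi-dual
    psi += (1.0 / np.sqrt(it)) * grad

def T_hat(x):
    """Estimated OT map: send x to its atom in the Laguerre-cell sense."""
    return y[np.argmin(0.5 * np.sum((x - y) ** 2, axis=1) - psi)]
```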
We consider the problem of recovering conditional independence relationships between $p$ jointly distributed Hilbertian random elements given $n$ realizations thereof. We operate in the sparse high-dimensional regime, where $n \ll p$ and no element is related to more than $d \ll p$ other elements. In this context, we propose an infinite-dimensional generalization of the graphical lasso. We prove model selection consistency under natural assumptions and extend many classical results to infinite dimensions. In particular, we do not require finite truncation or additional structural restrictions. The plug-in nature of our method makes it applicable to any observational regime, whether sparse or dense, and indifferent to serial dependence. Importantly, our method can be understood as naturally arising from a coherent maximum likelihood philosophy.
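For readers unfamiliar with the finite-dimensional baseline, the sketch below runs the classical graphical lasso on simulated Gaussian data with a sparse precision matrix; the operator-valued, infinite-dimensional extension proposed in the paper is not shown, and all parameter choices are illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Classical graphical lasso: recover the sparsity pattern of the precision matrix,
# which encodes conditional independences among the p coordinates.
rng = np.random.default_rng(0)
p, n = 10, 500
Theta = np.eye(p)                          # sparse ground-truth precision matrix (assumed)
Theta[0, 1] = Theta[1, 0] = 0.4
Theta[2, 3] = Theta[3, 2] = -0.3
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=n)

model = GraphicalLasso(alpha=0.05).fit(X)
edges = np.abs(model.precision_) > 1e-3    # estimated conditional-dependence graph
np.fill_diagonal(edges, False)
print(np.argwhere(np.triu(edges)))         # estimated edge list
```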
We study the complexity of randomized computation of integrals depending on a parameter, with integrands from Sobolev spaces. That is, for $r,d_1,d_2\in{\mathbb N}$, $1\le p,q\le \infty$, $D_1= [0,1]^{d_1}$, and $D_2= [0,1]^{d_2}$ we are given $f\in W_p^r(D_1\times D_2)$ and we seek to approximate $$ Sf=\int_{D_2}f(s,t)dt\quad (s\in D_1), $$ with error measured in the $L_q(D_1)$-norm. Our results extend previous work of Heinrich and Sindambiwe (J.\ Complexity, 15 (1999), 317--341) for $p=q=\infty$ and Wiegand (Shaker Verlag, 2006) for $1\le p=q<\infty$. Wiegand's analysis was carried out under the assumption that $W_p^r(D_1\times D_2)$ is continuously embedded in $C(D_1\times D_2)$ (embedding condition). We also study the case in which the embedding condition does not hold. For this purpose a new ingredient is developed -- a stochastic discretization technique. The paper is based on Part I, where vector-valued mean computation -- the finite-dimensional counterpart of parametric integration -- was studied. In Part I, a basic problem of Information-Based Complexity concerning the power of adaption for linear problems in the randomized setting was solved. Here a further aspect of this problem is settled.
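As a plain baseline (ordinary Monte Carlo, not the adaptive or stochastic-discretization algorithms analysed in the paper), the following sketch approximates $Sf(s)$ simultaneously for a grid of parameter values $s$, using an assumed smooth test integrand and reusing one set of random nodes.

```python
import numpy as np

# Plain Monte Carlo parametric integration:
#   Sf(s) = \int_{[0,1]^{d2}} f(s, t) dt,  approximated for each s on a grid.
rng = np.random.default_rng(0)
d2, n_nodes = 2, 10000

def f(s, t):                                 # smooth test integrand (assumed)
    return np.exp(-s * np.sum(t ** 2, axis=-1))

s_grid = np.linspace(0.0, 1.0, 11)           # parameter values s in D1 = [0,1]
t_nodes = rng.random((n_nodes, d2))          # i.i.d. uniform nodes in D2 = [0,1]^2
Sf_hat = np.array([f(s, t_nodes).mean() for s in s_grid])
print(Sf_hat)                                # approximates s -> \int f(s, t) dt
```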
The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? We argue that the answer to both questions lies in the interaction of the gradients of different examples during training. Intuitively, if the per-example gradients are well-aligned, that is, if they are coherent, then one may expect GD to be (algorithmically) stable, and hence generalize well. We formalize this argument with an easy-to-compute and interpretable metric for coherence, and show that the metric takes on very different values on real and random datasets for several common vision networks. The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others, why early stopping works, and why it is possible to learn from noisy labels. Moreover, since the theory provides a causal explanation of how GD finds a well-generalizing solution when one exists, it motivates a class of simple modifications to GD that attenuate memorization and improve generalization. Generalization in deep learning is an extremely broad phenomenon and therefore requires an equally general explanation. We conclude with a survey of alternative lines of attack on this problem, and argue on this basis that the proposed approach is the most viable one.
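A simple proxy for coherence, which may differ from the exact metric defined in the paper, is the ratio $\|\tfrac{1}{n}\sum_i g_i\|^2 / \tfrac{1}{n}\sum_i \|g_i\|^2$ of per-example gradients $g_i$: it is close to 1 when the gradients are well aligned and close to $1/n$ when they are uncorrelated. The toy sketch below (logistic regression on assumed synthetic data) shows that this proxy is markedly larger for consistent labels than for random ones.

```python
import numpy as np

# Coherence-style proxy on per-example gradients of a logistic model (illustrative).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y_real = (X @ w_true > 0).astype(float)          # labels consistent with an underlying rule
y_rand = rng.integers(0, 2, n).astype(float)     # pure-noise labels

def per_example_grads(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))           # logistic predictions
    return (p - y)[:, None] * X                  # per-example gradients of the log-loss

def coherence(G):
    return np.linalg.norm(G.mean(axis=0)) ** 2 / np.mean(np.sum(G ** 2, axis=1))

w0 = np.zeros(d)
print("real labels:  ", coherence(per_example_grads(w0, X, y_real)))
print("random labels:", coherence(per_example_grads(w0, X, y_rand)))
```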
This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
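A standard signal-propagation exercise, not the book's full effective-theory calculation, illustrates the criticality tuning: in a deep tanh network, sub-critical weight variance makes the forward signal collapse exponentially with depth, while at and above the critical value $C_W = 1$ it remains of appreciable size (above criticality it saturates at a nontrivial fixed point).

```python
import numpy as np

# Track the typical preactivation scale through a deep tanh MLP for weight
# variances C_W below, at, and above the critical value C_W = 1 (illustrative).
rng = np.random.default_rng(0)
width, depth = 1000, 50
x = rng.standard_normal(width)

for C_W in (0.5, 1.0, 1.5):
    z = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(C_W / width)
        z = W @ np.tanh(z)
    print(f"C_W = {C_W}: typical preactivation scale = {np.sqrt(np.mean(z**2)):.3e}")
```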
This paper focuses on the expected difference in borrowers' repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence their estimation error can be substantial. We therefore propose an alternative approach to constructing the estimators so that this error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the classical and the proposed estimators in terms of their power for estimating the causal quantities. The comparison is carried out across a wide range of models, including linear regression models, tree-based models, and neural network-based models, on simulated datasets exhibiting different levels of causality, degrees of nonlinearity, and distributional properties. Most importantly, we apply our approach to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly large when the causal effects are accounted for correctly.
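A toy numerical illustration of why confounding matters here (this is not the paper's estimator; the data-generating process and the use of a known propensity score are assumptions of the sketch): a naive group-mean difference overstates the effect of the credit decision, whereas a simple inverse-propensity-weighted estimate approximately recovers it.

```python
import numpy as np

# Confounded credit-decision simulation: a latent creditworthiness variable u drives
# both the approval decision and repayment, biasing the naive comparison.
rng = np.random.default_rng(0)
n = 100000
u = rng.normal(size=n)                          # confounder (e.g., creditworthiness)
p_treat = 1.0 / (1.0 + np.exp(-2.0 * u))        # lenders approve good risks more often
d = rng.random(n) < p_treat                     # credit decision (treatment indicator)
tau = 1.0                                       # true causal effect of approval on repayment
y = tau * d + 2.0 * u + rng.normal(size=n)      # repayment outcome

naive = y[d].mean() - y[~d].mean()              # confounded estimate (biased upward)
w = d / p_treat + (~d) / (1.0 - p_treat)        # inverse-propensity weights (true propensity)
ipw = np.average(y, weights=d * w) - np.average(y, weights=(~d) * w)
print(f"naive: {naive:.2f}, IPW: {ipw:.2f}, truth: {tau}")
```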