Any discrete distribution with support on $\{0,\ldots, d\}$ can be constructed as the distribution of a sum of Bernoulli variables. We prove that the class of $d$-dimensional Bernoulli variables $\boldsymbol{X}=(X_1,\ldots, X_d)$ whose sums $\sum_{i=1}^dX_i$ have the same distribution $p$ is a convex polytope $\mathcal{P}(p)$, and we find its extremal points analytically. Our main result is that the Hausdorff measure of the polytopes $\mathcal{P}(p)$, $p\in \mathcal{D}_d$, is a continuous function $l(p)$ over $\mathcal{D}_d$ and is the density of a finite measure $\mu_s$ on $\mathcal{D}_d$ that is absolutely continuous with respect to the Hausdorff measure. We also prove that the measure $\mu_s$, normalized over the simplex $\mathcal{D}_d$, belongs to the class of Dirichlet distributions. We observe that the symmetric binomial distribution is the mean of this Dirichlet distribution on $\mathcal{D}_d$ and that, as $d$ increases, the mean converges to the mode.
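As a concrete illustration of the opening claim (a minimal sketch of ours, not taken from the paper): given a target law $p$ on $\{0,\ldots,d\}$, drawing $S \sim p$ and then placing $S$ ones uniformly at random yields an exchangeable Bernoulli vector whose sum has distribution $p$, i.e., one point of $\mathcal{P}(p)$.

```python
import numpy as np

def sample_bernoulli_with_sum_law(p, d, rng):
    """Draw X = (X_1, ..., X_d) in {0,1}^d whose sum S = X_1 + ... + X_d
    has the prescribed distribution p on {0, ..., d}.

    This exchangeable construction (draw S ~ p, then place S ones
    uniformly at random) realizes one element of the polytope P(p)."""
    s = rng.choice(d + 1, p=p)                     # S ~ p
    x = np.zeros(d, dtype=int)
    x[rng.choice(d, size=s, replace=False)] = 1    # uniform subset of size S
    return x

# example: p uniform on {0,1,2,3}, d = 3
rng = np.random.default_rng(0)
p = np.full(4, 0.25)
draws = np.array([sample_bernoulli_with_sum_law(p, 3, rng) for _ in range(10000)])
print(np.bincount(draws.sum(axis=1), minlength=4) / 10000)  # approximately p
```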
We systematically investigate the preservation of differential privacy in functional data analysis, beginning with functional mean estimation and extending to varying coefficient model estimation. Our work introduces a distributed learning framework involving multiple servers, each responsible for collecting several sparsely observed functions. This hierarchical setup introduces a mixed notion of privacy. Within each function, user-level differential privacy is applied to $m$ discrete observations. At the server level, central differential privacy is deployed to account for the centralised nature of data collection. Across servers, only private information is exchanged, adhering to federated differential privacy constraints. To address this complex hierarchy, we employ minimax theory to reveal several fundamental phenomena: the transition from sparse to dense regimes of functional data analysis, the respective costs of user-level, central, and federated differential privacy, and the intricate interplay between the regimes of functional data analysis and privacy preservation. To the best of our knowledge, this is the first study to rigorously examine functional data estimation under multiple privacy constraints. Our theoretical findings are complemented by efficient private algorithms and extensive numerical evidence, providing a comprehensive exploration of this challenging problem.
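To make the hierarchy concrete, here is a minimal scalar sketch (our simplification, not the paper's functional estimator; all names and parameters are illustrative): each server privatizes a clipped mean of its users' summaries with Laplace noise, and only those privatized estimates are exchanged across servers.

```python
import numpy as np

def private_server_mean(values, clip=1.0, epsilon=1.0, rng=None):
    """Central-DP estimate of one server's mean: clip each user summary
    to [-clip, clip] and add Laplace noise calibrated to the sensitivity
    2*clip/n of the clipped mean (pure epsilon-DP)."""
    rng = rng or np.random.default_rng()
    values = np.clip(np.asarray(values), -clip, clip)
    n = len(values)
    return values.mean() + rng.laplace(scale=2 * clip / (n * epsilon))

def federated_mean(servers, **kwargs):
    """Across servers, only the privatized per-server estimates move."""
    return float(np.mean([private_server_mean(v, **kwargs) for v in servers]))

# three servers, each holding noisy user summaries of the same signal
rng = np.random.default_rng(0)
servers = [0.3 + 0.1 * rng.standard_normal(500) for _ in range(3)]
print(federated_mean(servers, clip=1.0, epsilon=1.0))  # approximately 0.3
```

A functional version would privatize basis coefficients of each curve in the same fashion, which is where the sparse-versus-dense phenomena enter.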
One can recover vectors from $\mathbb{R}^m$ with arbitrary precision, using only $\lceil \log_2(m+1)\rceil +1$ continuous measurements that are chosen adaptively. This surprising result is explained and discussed, and we present applications to infinite-dimensional approximation problems.
Stable distributions are a celebrated class of probability laws used in various fields. The $\alpha$-stable process, and its exponentially tempered counterpart, the Classical Tempered Stable (CTS) process, are also prominent examples of L\'evy processes. Simulating these processes is critical for many applications, yet it remains computationally challenging due to their infinite jump activity. This survey provides an overview of the key properties of these objects, offering a roadmap for practitioners. The first part reviews the stability property; sampling algorithms are provided along with numerical illustrations. CTS processes are then presented, with the Baeumer-Meerschaert algorithm for increment simulation, and a computational analysis is provided with numerical illustrations across different time scales.
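For the stability part, the classical Chambers-Mallows-Stuck sampler is easy to state in the symmetric case ($\beta = 0$); the survey's algorithms also cover the skewed and tempered variants. A minimal sketch:

```python
import numpy as np

def symmetric_stable(alpha, size=1, rng=None):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable draws
    (skewness beta = 0, unit scale); valid for 0 < alpha <= 2, alpha != 1.
    At alpha = 1 the symmetric law is Cauchy, sampled as tan(U)."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # U ~ Unif(-pi/2, pi/2)
    w = rng.exponential(1.0, size)                 # W ~ Exp(1), independent
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos((1 - alpha) * u) / w) ** ((1 - alpha) / alpha))

# sanity check: alpha = 2 reduces to a Gaussian with variance 2
x = symmetric_stable(2.0, 100000)
print(x.var())  # approximately 2
```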
We consider the problem of estimating the error when solving a system of differential-algebraic equations. Richardson extrapolation is a classical technique that can be used to judge when computational errors are irrelevant and to estimate the discretization error. We have simulated molecular dynamics with constraints using the GROMACS library and found that the output is not always amenable to Richardson extrapolation. We derive and illustrate Richardson extrapolation using a variety of numerical experiments, and we identify two necessary conditions that are not always satisfied by the GROMACS library.
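In its simplest form, Richardson extrapolation assumes the computed value admits an expansion $A(h) = A + Ch^p + O(h^{p+1})$ in the step size $h$ and that rounding errors are negligible; two runs at steps $h$ and $h/2$ then give the error estimate $A - A(h/2) \approx (A(h/2) - A(h))/(2^p - 1)$. A self-contained sketch (ours, with explicit Euler, $p = 1$):

```python
import numpy as np

def euler(f, y0, t1, n):
    """Explicit Euler with n steps on [0, t1]; global order p = 1."""
    y, h = y0, t1 / n
    for _ in range(n):
        y = y + h * f(y)
    return y

# Richardson error estimate for the fine solution:
# A - A(h/2) ~ (A(h/2) - A(h)) / (2^p - 1)
f = lambda y: -y                      # y' = -y, exact solution exp(-t)
coarse = euler(f, 1.0, 1.0, 100)      # step h
fine = euler(f, 1.0, 1.0, 200)        # step h/2
p = 1
est = (fine - coarse) / (2 ** p - 1)
true_err = np.exp(-1.0) - fine
print(est, true_err)                  # the two should roughly agree
```

When the expansion fails, or rounding noise dominates, the estimate is meaningless; conditions of this kind are what the abstract alludes to.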
We study a cost-aware programming language for higher-order recursion dubbed $\textbf{PCF}_\mathsf{cost}$ in the setting of synthetic domain theory (SDT). Our main contribution relates the denotational cost semantics of $\textbf{PCF}_\mathsf{cost}$ to its computational cost semantics, a new kind of dynamic semantics for program execution that serves as a mathematically natural alternative to operational semantics in SDT. In particular we prove an internal, cost-sensitive version of Plotkin's computational adequacy theorem, giving a precise correspondence between the denotational and computational semantics for complete programs at base type. The constructions and proofs of this paper take place in the internal dependent type theory of an SDT topos extended by a phase distinction in the sense of Sterling and Harper. By controlling the interpretation of cost structure via the phase distinction in the denotational semantics, we show that $\textbf{PCF}_\mathsf{cost}$ programs also evince a noninterference property of cost and behavior. We verify the axioms of the type theory by means of a model construction based on relative sheaf models of SDT.
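As a loose, purely illustrative analogue of a computational cost semantics (nothing like the paper's internal SDT construction), one can instrument an evaluator for a PCF-like fragment so that it returns value-cost pairs, charging one unit per successor and per $\beta$-step; recursion is omitted for brevity.

```python
# terms: ("num", n), ("var", x), ("lam", x, body), ("app", f, a), ("suc", e)
def eval_cost(t, env):
    """Evaluate a term, returning a (value, cost) pair."""
    tag = t[0]
    if tag == "num":                       # numeral: free
        return t[1], 0
    if tag == "var":                       # variable lookup: free
        return env[t[1]], 0
    if tag == "lam":                       # build a closure: free
        return ("clo", t[1], t[2], env), 0
    if tag == "suc":                       # successor: one step
        v, c = eval_cost(t[1], env)
        return v + 1, c + 1
    if tag == "app":                       # beta-reduction: one step
        (_, x, body, cenv), c1 = eval_cost(t[1], env)
        a, c2 = eval_cost(t[2], env)
        v, c3 = eval_cost(body, {**cenv, x: a})
        return v, c1 + c2 + c3 + 1
    raise ValueError(tag)

# (lam x. suc (suc x)) 2  ==> value 4 at cost 3 (two sucs + one beta)
term = ("app", ("lam", "x", ("suc", ("suc", ("var", "x")))), ("num", 2))
print(eval_cost(term, {}))  # (4, 3)
```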
Clustering and outlier detection are two important tasks in data mining. Outliers frequently interfere with how clustering algorithms measure the similarity between objects, resulting in unreliable clustering results. Currently, only a few clustering algorithms (e.g., DBSCAN) can detect outliers and thereby eliminate this interference. For other clustering algorithms, it is tedious to run a separate outlier detection step before each clustering process. Equipping more clustering algorithms with outlier detection ability is therefore worthwhile. Although a common strategy lets clustering algorithms detect outliers based on the distance between objects and clusters, it conflicts with improving the performance of clustering algorithms on datasets with outliers. In this paper, we propose a novel outlier detection approach, called ODAR, for clustering. ODAR maps outliers and normal objects into two separated clusters via feature transformation, so any clustering algorithm can detect outliers by identifying the corresponding cluster. Experiments show that ODAR is robust across diverse datasets. With the help of ODAR, the clustering algorithms achieve the best results on 7 out of 10 datasets compared with baseline methods, with accuracy improvements of at least 5%.
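The abstract does not spell out ODAR's transformation, but the general recipe can be sketched with a generic stand-in feature (our choice, not the authors'): map each object to its mean distance to its $k$ nearest neighbours, so that outliers and inliers form two separable clusters for any off-the-shelf clusterer.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

def knn_outlier_features(X, k=10):
    """Map each object to its mean distance to its k nearest neighbours;
    outliers get large values, so a 2-cluster split can flag them."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)             # first column is the point itself
    return dist[:, 1:].mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # inliers
               rng.uniform(-8, 8, (10, 2))])    # scattered outliers
F = knn_outlier_features(X)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(F)
outlier_label = labels[F[:, 0].argmax()]        # cluster of the largest feature
print(np.where(labels == outlier_label)[0])     # indices flagged as outliers
```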
We develop graph-based tests for spherical symmetry of a multivariate distribution using a method based on data augmentation. These tests are constructed using a new notion of signs and ranks that are computed along a path obtained by optimizing an objective function based on pairwise dissimilarities among the observations in the augmented data set. The resulting tests based on these signs and ranks are exactly distribution-free: irrespective of the dimension of the data, the null distributions of the test statistics remain the same. These tests can be conveniently used for high-dimensional data, even when the dimension is much larger than the sample size. Under appropriate regularity conditions, we prove the consistency of these tests in the high-dimensional asymptotic regime, where the dimension grows to infinity while the sample size may or may not grow with it. We also propose a generalization of our methods to handle situations where the center of symmetry is not specified by the null hypothesis. Several simulated data sets and a real data set are analyzed to demonstrate the utility of the proposed tests.
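A toy stand-in for the general recipe (our simplification; the paper's sign/rank construction along an optimized path is more refined): augment the data with draws from a spherical reference, order all points along a heuristic short path through the pairwise distances, and calibrate a run-type statistic by permutation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def greedy_path(D):
    """Heuristic short Hamiltonian path: repeatedly hop to the nearest
    unvisited point (a stand-in for the paper's optimized path)."""
    path, free = [0], set(range(1, len(D)))
    while free:
        nxt = min(free, key=lambda j: D[path[-1], j])
        path.append(nxt)
        free.remove(nxt)
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))          # observed data (here truly spherical)
Y = rng.standard_normal((50, 20))          # augmented spherical reference
Z, labels = np.vstack([X, Y]), np.r_[np.zeros(50), np.ones(50)]
order = greedy_path(cdist(Z, Z))

runs = lambda lab: int((np.diff(lab) != 0).sum()) + 1
obs = runs(labels[order])
null = [runs(rng.permutation(labels)[order]) for _ in range(2000)]
p_value = np.mean([r <= obs for r in null])  # few runs => samples separate
print(obs, p_value)
```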
Under a multinormal distribution with an arbitrary unknown covariance matrix, the main purpose of this paper is to propose a framework that reconciles Bayesian, frequentist, and Fisherian reporting of $p$-values, Neyman-Pearson's optimal theory, and Wald's decision theory for the problems of testing the mean against restricted alternatives (closed convex cones). To proceed, we study the tests constructed via the likelihood ratio (LR) and the union-intersection (UI) principles. For the problems of testing against restricted alternatives, we first show that the LRT and the UIT are not proper Bayes tests; however, they are shown to be the integrated LRT and the integrated UIT, respectively. For the problem of testing against the positive orthant alternative, the null distributions of both the LRT and the UIT depend on the unknown nuisance covariance matrix, so Fisher's approach of reporting $p$-values is difficult to adopt. On the other hand, according to the definition of the level of significance, both the LRT and the UIT are shown to be power-dominated by the corresponding LRT and UIT for testing against the half-space alternative; hence both are $\alpha$-inadmissible, a result that runs against common statistical sense. Neither Fisher's approach of reporting $p$-values alone nor Neyman-Pearson's optimal theory for the power function alone is a satisfactory criterion for evaluating the performance of tests. Wald's decision theory via $d$-admissibility may shed light on resolving these challenging issues by balancing type I error and power.
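The difficulty is easiest to see against the simplified known-covariance case, which can be simulated directly (our illustration; the paper treats unknown covariance, where the null law depends on the nuisance parameter). For testing $\mu = 0$ against the positive orthant $\mu \ge 0$ with $\Sigma = I$, the LRT statistic is $n\sum_i \max(\bar X_i, 0)^2$, whose null law is a chi-bar-squared mixture:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, B = 3, 50, 20000

def lrt_orthant(xbar, n):
    """LRT statistic for H0: mu = 0 vs H1: mu >= 0 (known Sigma = I):
    the MLE under H1 is the positive part of the sample mean."""
    return n * np.sum(np.maximum(xbar, 0.0) ** 2)

# null distribution (a chi-bar-squared mixture) by simulation:
# under H0 the sample mean is N(0, I/n)
null = np.array([lrt_orthant(rng.normal(0, 1 / np.sqrt(n), d), n)
                 for _ in range(B)])
obs = lrt_orthant(rng.normal(0.3, 1 / np.sqrt(n), d), n)  # data with mu = 0.3
print("p-value ~", (null >= obs).mean())
```

With $\Sigma$ unknown, no such fixed reference distribution exists, which is precisely the obstruction to reporting $p$-values discussed above.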
Many articles have recently been devoted to Mahler equations, partly because of their links with other branches of mathematics such as automata theory. Hahn series (a generalization of the Puiseux series allowing arbitrary exponents of the indeterminate as long as the set supporting them is well-ordered) play a central role in the theory of Mahler equations. In this paper, we address the following fundamental question: is there an algorithm to compute the Hahn series solutions of a given linear Mahler equation? What makes this question interesting is that the Hahn series appearing in this context can have complicated supports with infinitely many accumulation points. Our (positive) answer to this question involves, among other things, the construction of a computable well-ordered receptacle for the supports of the potential Hahn series solutions.
We prove new complexity results for computational problems in certain wreath products of groups and (as an application) for free solvable groups. For a finitely generated group we study the so-called power word problem (does a given expression $u_1^{k_1} \ldots u_d^{k_d}$, where $u_1, \ldots, u_d$ are words over the group generators and $k_1, \ldots, k_d$ are binary encoded integers, evaluate to the group identity?) and the knapsack problem (does a given equation $u_1^{x_1} \ldots u_d^{x_d} = v$, where $u_1, \ldots, u_d,v$ are words over the group generators and $x_1,\ldots,x_d$ are variables, have a solution in the natural numbers?). We prove that the power word problem for wreath products of the form $G \wr \mathbb{Z}$ with $G$ nilpotent, and for iterated wreath products of free abelian groups, belongs to $\mathsf{TC}^0$. As an application of the latter, the power word problem for free solvable groups is in $\mathsf{TC}^0$. On the other hand, we show that for wreath products $G \wr \mathbb{Z}$, where $G$ is a so-called uniformly strongly efficiently non-solvable group (these form a large subclass of non-solvable groups), the power word problem is $\mathsf{coNP}$-hard. For the knapsack problem we show $\mathsf{NP}$-completeness for iterated wreath products of free abelian groups, and hence for free solvable groups. Moreover, the knapsack problem for every wreath product $G \wr \mathbb{Z}$, where $G$ is uniformly strongly efficiently non-solvable, is $\Sigma^p_2$-hard.
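To see why the binary encoding of the $k_i$ is the crux, consider the toy case of a free abelian group $\mathbb{Z}^r$ (far simpler than the wreath products treated above): a word reduces to its exponent vector, so a power word can be evaluated without ever expanding the powers.

```python
# Power word problem in the free abelian group Z^r: a word is just its
# exponent vector, so u_1^{k_1} ... u_d^{k_d} evaluates to sum_i k_i * v_i.
# The k_i may be astronomically large (binary-encoded) without expanding
# the word; Python integers are arbitrary precision. Toy case only.

def word_to_vector(word, r):
    """word: list of (generator index, +1/-1) letters."""
    v = [0] * r
    for g, s in word:
        v[g] += s
    return v

def power_word_is_identity(factors, r):
    """factors: list of (word, k). Decide u_1^{k_1}...u_d^{k_d} = 1 in Z^r."""
    total = [0] * r
    for word, k in factors:
        v = word_to_vector(word, r)
        total = [t + k * x for t, x in zip(total, v)]
    return all(t == 0 for t in total)

# (a b a^-1)^(2^100) * b^(-2^100) is the identity in Z^2
w = [(0, 1), (1, 1), (0, -1)]   # a b a^-1, exponent vector (0, 1)
print(power_word_is_identity([(w, 2**100), ([(1, -1)], 2**100)], 2))  # True
```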