The objective of clusterability evaluation is to check whether a clustering structure exists within a data set. A crucial yet often-overlooked step in cluster analysis, such a test should be conducted before applying any clustering algorithm: if a data set is unclusterable, any subsequent clustering analysis cannot yield valid results. Despite its importance, most existing studies focus on numerical data, leaving clusterability evaluation for categorical data an open problem. Here we present TestCat, a testing-based approach that assesses the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly correlated attribute pairs; the sum of the chi-squared statistics over all attribute pairs is therefore employed as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms solutions based on existing clusterability evaluation methods for numerical data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.
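To make the test statistic concrete, here is a minimal sketch (not the authors' implementation) that sums Pearson chi-squared statistics over all attribute pairs of a categorical data frame; TestCat's analytical $p$-value calibration is not reproduced, and the toy data frame is purely illustrative.

```python
import pandas as pd
from itertools import combinations
from scipy.stats import chi2_contingency

def pairwise_chi2_sum(df: pd.DataFrame) -> float:
    """Sum of Pearson chi-squared statistics over all attribute pairs.

    Only the raw statistic is computed here; the analytical p-value
    calibration described in the abstract is not reproduced.
    """
    total = 0.0
    for a, b in combinations(df.columns, 2):
        table = pd.crosstab(df[a], df[b])
        stat, _, _, _ = chi2_contingency(table)
        total += stat
    return total

# Illustrative toy data: two strongly associated attributes plus one noise attribute.
toy = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red", "blue"],
    "shape": ["round", "round", "square", "square", "round", "square"],
    "noise": ["a", "b", "a", "b", "b", "a"],
})
print(pairwise_chi2_sum(toy))
```

Larger values of the statistic reflect more strongly associated attribute pairs, which is the signal the test uses as evidence of clusterability.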
Markov categories have recently turned out to be a powerful high-level framework for probability and statistics. They accommodate purely categorical definitions of notions like conditional probability and almost sure equality, as well as proofs of fundamental results such as the Hewitt-Savage 0/1 Law, the de Finetti Theorem and the Ergodic Decomposition Theorem. In this work, we develop additional relevant notions from probability theory in the setting of Markov categories. This comprises improved versions of previously introduced definitions of absolute continuity and supports, as well as a detailed study of idempotents and idempotent splitting in Markov categories. Our main result on idempotent splitting is that every idempotent measurable Markov kernel between standard Borel spaces splits through another standard Borel space, and we derive this as an instance of a general categorical criterion for idempotent splitting in Markov categories.
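For readers less familiar with the terminology, the underlying (purely categorical, not Markov-specific) notions are: an idempotent is a morphism $e\colon X \to X$ with $e \circ e = e$, and it splits through an object $A$ when
\[
  e = s \circ r \quad\text{and}\quad r \circ s = \mathrm{id}_A
  \qquad\text{for some } r\colon X \to A,\ s\colon A \to X .
\]
The main result stated above asserts that when $e$ is an idempotent measurable Markov kernel between standard Borel spaces, the splitting object $A$ can be chosen to be standard Borel as well.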
We prove a discrete analogue of the composition of the fractional integral and the Caputo derivative. This result is relevant in the numerical analysis of fractional PDEs when the Caputo derivative is discretized with the so-called L1 scheme. The proof is based on asymptotic evaluation of the discrete sums using the Euler-Maclaurin summation formula.
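For context, a commonly used form of the L1 discretization of the Caputo derivative of order $\alpha \in (0,1)$ on a uniform grid $t_n = n\tau$ reads (the precise variant analyzed in the paper may differ in details):
\[
  \bigl(\partial_t^\alpha u\bigr)(t_n) \;\approx\; \frac{\tau^{-\alpha}}{\Gamma(2-\alpha)} \sum_{k=0}^{n-1} b_{n-k-1}\,\bigl(u(t_{k+1}) - u(t_k)\bigr),
  \qquad b_j = (j+1)^{1-\alpha} - j^{1-\alpha}.
\]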
Compositional data arise in many real-life applications, and versatile methods for properly analyzing this type of data in the regression context are needed. When parametric assumptions do not hold or are difficult to verify, non-parametric regression models can provide a convenient alternative method for prediction. To this end, we consider an extension of the classical $k$--$NN$ regression, termed $\alpha$--$k$--$NN$ regression, that yields a highly flexible non-parametric regression model for compositional data through the use of the $\alpha$-transformation. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and can be incorporated into the proposed models without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using these non-parametric regressions for complex relationships between the compositional response data and Euclidean predictor variables. Both suggest that $\alpha$--$k$--$NN$ regression can lead to more accurate predictions compared to current regression models, which assume a sometimes restrictive parametric relationship with the predictor variables. In addition, the $\alpha$--$k$--$NN$ regression, in contrast to current regression techniques, enjoys high computational efficiency, rendering it highly attractive for use with large-scale or big data.
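A rough sketch of the idea under simplifying assumptions: the power transformation below omits the Helmert sub-matrix used in the published $\alpha$-transformation, the prediction step skips the back-transformation onto the simplex, and the variable names are illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def alpha_power(Y, a):
    """Simplified power (alpha) transformation of compositions (rows of Y sum to 1).

    Assumes a != 0; the published alpha-transformation additionally applies a
    Helmert sub-matrix, which is omitted in this sketch. Zero components are
    handled naturally because 0**a = 0 for a > 0.
    """
    U = Y ** a
    U = U / U.sum(axis=1, keepdims=True)
    return (Y.shape[1] * U - 1.0) / a

def alpha_knn_predict(X_train, Y_train, X_new, a=0.5, k=5):
    """Average the alpha-transformed responses of the k nearest neighbours in
    predictor space (no back-transformation to the simplex here)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_new)
    Z = alpha_power(Y_train, a)
    return Z[idx].mean(axis=1)
```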
Estimating parameters from data is a fundamental problem in physics, customarily done by minimizing a loss function between a model and observed statistics. In scattering-based analysis, researchers often employ their domain expertise to select a specific range of wavevectors for analysis, a choice that can vary from case to case. We introduce an alternative paradigm that defines a probabilistic generative model from the beginning of data processing and propagates the uncertainty through to parameter estimation, termed ab initio uncertainty quantification (AIUQ). As an illustrative example, we demonstrate this approach with differential dynamic microscopy (DDM), which extracts dynamical information through Fourier analysis at a selected range of wavevectors. We first show that DDM is equivalent to fitting a temporal variogram in reciprocal space using a latent factor model as the generative model. We then derive the maximum marginal likelihood estimator, which optimally weighs information at all wavevectors, thereby eliminating the need to select a range of wavevectors. Furthermore, we substantially reduce the computational cost by utilizing the generalized Schur algorithm for Toeplitz covariances, without approximation. Simulation studies validate that AIUQ significantly improves estimation accuracy and enables model selection with automated analysis. The utility of AIUQ is also demonstrated by three distinct sets of experiments: first in an isotropic Newtonian fluid, pushing the limits of optically dense systems compared to multiple particle tracking; next in a system undergoing a sol-gel transition, automating the determination of the gelling point and critical exponent; and lastly, in discerning anisotropic diffusive behavior of colloids in a liquid crystal. These outcomes collectively underscore AIUQ's versatility in capturing system dynamics in an efficient and automated manner.
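As a point of reference for the DDM step, here is a bare-bones sketch of the image structure function on which the analysis is based; azimuthal averaging over wavevector magnitude and the AIUQ likelihood machinery are omitted, and the array shapes are assumptions.

```python
import numpy as np

def image_structure_function(frames, lags):
    """D(q, dt): mean squared modulus of the Fourier transform of frame
    differences, averaged over available time origins for each lag.

    frames: array of shape (T, H, W); returns an array of shape
    (len(lags), H, W) on the 2D wavevector grid (no azimuthal average).
    """
    T = frames.shape[0]
    out = []
    for dt in lags:
        diff = frames[dt:] - frames[:T - dt]
        fhat = np.fft.fft2(diff, axes=(-2, -1))
        out.append(np.mean(np.abs(fhat) ** 2, axis=0))
    return np.array(out)
```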
Bayesian binary regression is an active area of research owing to the computational challenges that currently available methods face in high-dimensional settings, with large datasets, or both. In the present work, we focus on the expectation propagation (EP) approximation of the posterior distribution in Bayesian probit regression under a multivariate Gaussian prior distribution. Adapting more general derivations in Anceschi et al. (2023), we show how to leverage results on the extended multivariate skew-normal distribution to derive an efficient implementation of the EP routine whose per-iteration cost scales linearly in the number of covariates. This makes EP computationally feasible even in challenging high-dimensional settings, as shown in a detailed simulation study.
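For concreteness, the model whose posterior is being approximated is standard Bayesian probit regression (the notation here is generic, not taken from the paper):
\[
  y_i \mid \beta \sim \mathrm{Bernoulli}\bigl(\Phi(x_i^\top \beta)\bigr), \qquad
  \beta \sim \mathcal{N}_p(\xi, \Omega), \qquad
  p(\beta \mid y) \propto \mathcal{N}_p(\beta; \xi, \Omega)\, \prod_{i=1}^{n} \Phi\bigl((2y_i - 1)\, x_i^\top \beta\bigr),
\]
and EP replaces each likelihood factor with a Gaussian site that is iteratively refined until convergence.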
Functional regression analysis is an established tool for many contemporary scientific applications. Regression problems involving large and complex data sets are ubiquitous, and feature selection is crucial for avoiding overfitting and achieving accurate predictions. We propose a new, flexible, and ultra-efficient approach to perform feature selection in a sparse, high-dimensional function-on-function regression problem, and we show how to extend it to the scalar-on-function framework. Our method, called FAStEN, combines functional data, optimization, and machine learning techniques to perform feature selection and parameter estimation simultaneously. We exploit the properties of Functional Principal Components and the sparsity inherent to the Dual Augmented Lagrangian problem to significantly reduce computational cost, and we introduce an adaptive scheme to improve selection accuracy. In addition, we derive asymptotic oracle properties, which guarantee estimation and selection consistency for the proposed FAStEN estimator. Through an extensive simulation study, we benchmark our approach against the best existing competitors and demonstrate a massive gain in terms of CPU time and selection performance, without sacrificing the quality of coefficient estimation. The theoretical derivations and the simulation study provide a strong motivation for our approach. Finally, we present an application to brain fMRI data from the AOMIC PIOP1 study.
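For orientation, a standard formulation of the sparse function-on-function linear model to which feature selection applies (the paper's exact model and penalty may differ in details):
\[
  y_i(t) = \alpha(t) + \sum_{j=1}^{p} \int x_{ij}(s)\, \beta_j(s, t)\, \mathrm{d}s + \varepsilon_i(t),
\]
where feature selection amounts to identifying the (few) coefficient surfaces $\beta_j$ that are not identically zero.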
Determining the number of factors in high-dimensional factor modeling is essential but challenging, especially when the data are heavy-tailed. In this paper, we introduce a new estimator based on the spectral properties of the Spearman sample correlation matrix in the high-dimensional setting, where both the dimension and the sample size tend to infinity proportionally. Our estimator is robust against heavy tails in either the common factors or the idiosyncratic errors. The consistency of our estimator is established under mild conditions. Numerical experiments demonstrate the superiority of our estimator compared to existing methods.
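A simplified illustration of the ingredients, assuming a data matrix X with observations in rows and more than two variables in columns; the crude eigenvalue-ratio heuristic below is a stand-in, not the estimator proposed in the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def eigen_ratio_heuristic(X, kmax=10):
    """Crude eigenvalue-ratio heuristic on the Spearman correlation matrix.

    X: (n, p) data matrix with p > 2 columns. Illustrative only; the paper's
    estimator rests on a careful analysis of this spectrum under
    heavy-tailed factor models.
    """
    R, _ = spearmanr(X)                       # p x p Spearman rank correlation matrix
    evals = np.sort(np.linalg.eigvalsh(R))[::-1]
    kmax = min(kmax, len(evals) - 1)
    ratios = evals[:kmax] / evals[1:kmax + 1]
    return int(np.argmax(ratios)) + 1
```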
Hawkes processes are often applied to model dependence and interaction phenomena in multivariate event data sets, such as neuronal spike trains, social interactions, and financial transactions. In the nonparametric setting, learning the temporal dependence structure of Hawkes processes is generally a computationally expensive task, all the more so with Bayesian estimation methods. In particular, for generalised nonlinear Hawkes processes, Markov chain Monte Carlo methods applied to compute the doubly intractable posterior distribution are not scalable to high-dimensional processes in practice. Recently, efficient algorithms targeting a mean-field variational approximation of the posterior distribution have been proposed. In this work, we first unify existing variational Bayes approaches under a general nonparametric inference framework, and analyse the asymptotic properties of these methods under easily verifiable conditions on the prior, the variational class, and the nonlinear model. Secondly, we propose a novel sparsity-inducing procedure, and derive an adaptive mean-field variational algorithm for the popular sigmoid Hawkes processes. Our algorithm is parallelisable and therefore computationally efficient in high-dimensional settings. Through an extensive set of numerical simulations, we also demonstrate that our procedure is able to adapt to the dimensionality of the parameter of the Hawkes process, and is partially robust to some types of model misspecification.
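For reference, a commonly used formulation of the multivariate nonlinear (generalised) Hawkes process appearing above: the conditional intensity of component $k$ is
\[
  \lambda_k(t) = \phi_k\!\Bigl(\nu_k + \sum_{l=1}^{K} \int_0^{t^-} h_{kl}(t - s)\, \mathrm{d}N_l(s)\Bigr),
\]
where $\nu_k$ is a background rate, the $h_{kl}$ are interaction functions, and $\phi_k$ is a nonlinear link; the sigmoid case takes $\phi_k(x) = \Lambda_k/(1 + e^{-x})$ for an upper bound $\Lambda_k$ (notation here is generic, not the paper's).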
For multivariate data with noise variables, tandem clustering is a well-known technique that aims to improve cluster identification by first reducing the dimension. However, the usual approach using principal component analysis (PCA) has been criticized for focusing only on inertia, so that the first components do not necessarily retain the structure of interest for clustering. To overcome this drawback, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while returning affine invariant components. Some theoretical results have already been derived and guarantee that, under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. Nevertheless, ICS has received little attention in a clustering context. Two challenges are the choice of the pair of scatter matrices and the selection of the components to retain. For clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix that captures the within-cluster structure and another that captures the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully selected subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of preserving the cluster structure present in the data. In an extensive simulation study and in empirical applications with benchmark data sets, different combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the approach with PCA.
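A compact sketch of the ICS step for one classical scatter pair, the regular covariance matrix and a fourth-moment scatter, assuming a data matrix X of shape (n, p); the paper studies other pairs (local shape, pairwise, MCD with a small subset size) and different component-selection rules.

```python
import numpy as np
from scipy.linalg import eigh

def cov4(X):
    """Fourth-moment scatter: average of d_i^2 (x_i - m)(x_i - m)^T / (p + 2),
    with d_i the Mahalanobis distance w.r.t. the regular covariance."""
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)
    return (Xc * d2[:, None]).T @ Xc / (X.shape[0] * (X.shape[1] + 2))

def ics_components(X):
    """Invariant coordinates from the scatter pair (Cov, Cov4)."""
    S1 = np.cov(X, rowvar=False)
    S2 = cov4(X)
    # Generalized eigenproblem S2 v = lambda S1 v; eigh normalises V so that V' S1 V = I.
    evals, V = eigh(S2, S1)
    Z = (X - X.mean(axis=0)) @ V
    return evals[::-1], Z[:, ::-1]      # components ordered by decreasing eigenvalue

# Hypothetical usage: retain a few leading and/or trailing components, then cluster
# them (e.g. with sklearn.cluster.KMeans) instead of the raw data.
```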
Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable because of finite sample sizes. One critical challenge in applying permutation tests to genomic studies is that an enormous number of permutations is often needed to obtain reliable estimates of very small $p$-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small $p$-values in permutation tests for paired and independent two-group genomic data. Our approaches leverage a novel framework that parameterizes the permutation sample spaces of these two designs using the Bernoulli and conditional Bernoulli distributions, respectively, combined with the cross-entropy method. We demonstrate the performance of the proposed algorithms on two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies, comparing against existing methods such as crude permutations and SAMC; the results show that our approaches achieve orders-of-magnitude gains in computational efficiency when estimating small $p$-values. Our approaches offer promising solutions for improving the computational efficiency of existing permutation test procedures and for developing new permutation-based testing methods in genomic data analysis.
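For reference, the crude Monte Carlo baseline that the proposed algorithms improve upon, shown here for the paired design whose permutation space consists of within-pair sign flips; the Bernoulli parameterization with cross-entropy tilting described above is not reproduced.

```python
import numpy as np

def crude_paired_perm_pvalue(x, y, n_perm=100_000, rng=None):
    """Crude permutation p-value for paired data via random sign flips.

    The permutation space of a paired design is the 2^n sign-flip patterns of
    the within-pair differences; each pattern corresponds to independent
    Bernoulli(1/2) flips. Estimating p-values far below 1/n_perm this way is
    hopeless, which is the problem the cross-entropy approach addresses.
    """
    rng = np.random.default_rng(rng)
    d = np.asarray(x) - np.asarray(y)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm_stats = np.abs((signs * d).mean(axis=1))
    return (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
```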