Cosine similarity is an established similarity metric for computing associations on vectors, and it is commonly used to identify related samples from biological perturbational data. The distribution of cosine similarity changes with the covariance of the data, and this in turn affects the statistical power to identify related signals. The relationship between the mean and covariance of the distribution of the data and the distribution of cosine similarity is poorly understood. In this work, we derive the asymptotic moments of cosine similarity as a function of the data and identify the criteria of the data covariance matrix that minimize the variance of cosine similarity. We find that the variance of cosine similarity is minimized when the eigenvalues of the covariance matrix are equal for centered data. One immediate application of this work is characterizing the null distribution of cosine similarity over a dataset with non-zero covariance structure. Furthermore, this result can be used to optimize over a set of transformations or representations on a dataset to maximize power, recall, or other discriminative metrics, with direct application to noisy biological data. While we consider the specific biological domain of perturbational data analysis, our result has potential application for any use of cosine similarity or Pearson's correlation on data with covariance structure.
Penalizing complexity (PC) priors is a principled framework for designing priors that reduce model complexity. PC priors penalize the Kullback-Leibler Divergence (KLD) between the distributions induced by a ``simple'' model and that of a more complex model. However, in many common cases, it is impossible to construct a prior in this way because the KLD is infinite. Various approximations are used to mitigate this problem, but the resulting priors then fail to follow the designed principles. We propose a new class of priors, the Wasserstein complexity penalization (WCP) priors, by replacing KLD with the Wasserstein distance in the PC prior framework. These priors avoid the infinite model distance issues and can be derived by following the principles exactly, making them more interpretable. Furthermore, principles and recipes to construct joint WCP priors for multiple parameters analytically and numerically are proposed and we show that they can be easily obtained, either numerically or analytically, for a general class of models. The methods are illustrated through several examples for which PC priors have previously been applied.
Distributed quantum computing, particularly distributed quantum machine learning, has gained substantial prominence for its capacity to harness the collective power of distributed quantum resources, transcending the limitations of individual quantum nodes. Meanwhile, the critical concern of privacy within distributed computing protocols remains a significant challenge, particularly in standard classical federated learning (FL) scenarios where data of participating clients is susceptible to leakage via gradient inversion attacks by the server. This paper presents innovative quantum protocols with quantum communication designed to address the FL problem, strengthen privacy measures, and optimize communication efficiency. In contrast to previous works that leverage expressive variational quantum circuits or differential privacy techniques, we consider gradient information concealment using quantum states and propose two distinct FL protocols, one based on private inner-product estimation and the other on incremental learning. These protocols offer substantial advancements in privacy preservation with low communication resources, forging a path toward efficient quantum communication-assisted FL protocols and contributing to the development of secure distributed quantum machine learning, thus addressing critical privacy concerns in the quantum computing era.
We extend the use of piecewise orthogonal collocation to computing periodic solutions of renewal equations, which are particularly important in modeling population dynamics. We prove convergence through a rigorous error analysis. Finally, we show some numerical experiments confirming the theoretical results, and a couple of applications in view of bifurcation analysis.
Strong spatial mixing (SSM) is an important quantitative notion of correlation decay for Gibbs distributions arising in statistical physics, probability theory, and theoretical computer science. A longstanding conjecture is that the uniform distribution on proper $q$-colorings on a $\Delta$-regular tree exhibits SSM whenever $q \ge \Delta+1$. Moreover, it is widely believed that as long as SSM holds on bounded-degree trees with $q$ colors, one would obtain an efficient sampler for $q$-colorings on all bounded-degree graphs via simple Markov chain algorithms. It is surprising that such a basic question is still open, even on trees, but then again it also highlights how much we still have to learn about random colorings. In this paper, we show the following: (1) For any $\Delta \ge 3$, SSM holds for random $q$-colorings on trees of maximum degree $\Delta$ whenever $q \ge \Delta + 3$. Thus we almost fully resolve the aforementioned conjecture. Our result substantially improves upon the previously best bound which requires $q \ge 1.59\Delta+\gamma^*$ for an absolute constant $\gamma^* > 0$. (2) For any $\Delta\ge 3$ and girth $g = \Omega_\Delta(1)$, we establish optimal mixing of the Glauber dynamics for $q$-colorings on graphs of maximum degree $\Delta$ and girth $g$ whenever $q \ge \Delta+3$. Our approach is based on a new general reduction from spectral independence on large-girth graphs to SSM on trees that is of independent interest. Using the same techniques, we also prove near-optimal bounds on weak spatial mixing (WSM), a closely-related notion to SSM, for the antiferromagnetic Potts model on trees.
We introduce a family of identities that express general linear non-unitary evolution operators as a linear combination of unitary evolution operators, each solving a Hamiltonian simulation problem. This formulation can exponentially enhance the accuracy of the recently introduced linear combination of Hamiltonian simulation (LCHS) method [An, Liu, and Lin, Physical Review Letters, 2023]. For the first time, this approach enables quantum algorithms to solve linear differential equations with both optimal state preparation cost and near-optimal scaling in matrix queries on all parameters.
The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. The null hypothesis is that the predictive model is calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are H\"older continuous, we propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose Adaptive T-Cal, a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which -- combined with classical tests for discrete-valued predictors -- can be used to test the calibration of virtually any probabilistic classification method.
Most state-of-the-art machine learning techniques revolve around the optimisation of loss functions. Defining appropriate loss functions is therefore critical to successfully solving problems in this field. We present a survey of the most commonly used loss functions for a wide range of different applications, divided into classification, regression, ranking, sample generation and energy based modelling. Overall, we introduce 33 different loss functions and we organise them into an intuitive taxonomy. Each loss function is given a theoretical backing and we describe where it is best used. This survey aims to provide a reference of the most essential loss functions for both beginner and advanced machine learning practitioners.
The goal of explainable Artificial Intelligence (XAI) is to generate human-interpretable explanations, but there are no computationally precise theories of how humans interpret AI generated explanations. The lack of theory means that validation of XAI must be done empirically, on a case-by-case basis, which prevents systematic theory-building in XAI. We propose a psychological theory of how humans draw conclusions from saliency maps, the most common form of XAI explanation, which for the first time allows for precise prediction of explainee inference conditioned on explanation. Our theory posits that absent explanation humans expect the AI to make similar decisions to themselves, and that they interpret an explanation by comparison to the explanations they themselves would give. Comparison is formalized via Shepard's universal law of generalization in a similarity space, a classic theory from cognitive science. A pre-registered user study on AI image classifications with saliency map explanations demonstrate that our theory quantitatively matches participants' predictions of the AI.
We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.