Categorical variables are of uttermost importance in biomedical research. When two of them are considered, it is often the case that one wants to test whether or not they are statistically dependent. We show weaknesses of classical methods -- such as Pearson's and the G-test -- and we propose testing strategies based on distances that lack those drawbacks. We first develop this theory for classical two-dimensional contingency tables, within the context of distance covariance, an association measure that characterises general statistical independence of two variables. We then apply the same fundamental ideas to one-dimensional tables, namely to the testing for goodness of fit to a discrete distribution, for which we resort to an analogous statistic called energy distance. We prove that our methodology has desirable theoretical properties, and we show how we can calibrate the null distribution of our test statistics without resorting to any resampling technique. We illustrate all this in simulations, as well as with some real data examples, demonstrating the adequate performance of our approach for biostatistical practice.
Capturing the extremal behaviour of data often requires bespoke marginal and dependence models which are grounded in rigorous asymptotic theory, and hence provide reliable extrapolation into the upper tails of the data-generating distribution. We present a toolbox of four methodological frameworks, motivated by modern extreme value theory, that can be used to accurately estimate extreme exceedance probabilities or the corresponding level in either a univariate or multivariate setting. Our frameworks were used to facilitate the winning contribution of Team Yalla to the EVA (2023) Conference Data Challenge, which was organised for the 13$^\text{th}$ International Conference on Extreme Value Analysis. This competition comprised seven teams competing across four separate sub-challenges, with each requiring the modelling of data simulated from known, yet highly complex, statistical distributions, and extrapolation far beyond the range of the available samples in order to predict probabilities of extreme events. Data were constructed to be representative of real environmental data, sampled from the fantasy country of "Utopia"
In evidence synthesis, effect modifiers are typically described as variables that induce treatment effect heterogeneity at the individual level, through treatment-covariate interactions in an outcome model parametrized at such level. As such, effect modification is defined with respect to a conditional measure, but marginal effect estimates are required for population-level decisions in health technology assessment. For non-collapsible measures, purely prognostic variables that are not determinants of treatment response at the individual level may modify marginal effects, even where there is individual-level treatment effect homogeneity. With heterogeneity, marginal effects for measures that are not directly collapsible cannot be expressed in terms of marginal covariate moments, and generally depend on the joint distribution of conditional effect measure modifiers and purely prognostic variables. There are implications for recommended practices in evidence synthesis. Unadjusted anchored indirect comparisons can be biased in the absence of individual-level treatment effect heterogeneity, or when marginal covariate moments are balanced across studies. Covariate adjustment may be necessary to account for cross-study imbalances in joint covariate distributions involving purely prognostic variables. In the absence of individual patient data for the target, covariate adjustment approaches are inherently limited in their ability to remove bias for measures that are not directly collapsible. Directly collapsible measures would facilitate the transportability of marginal effects between studies by: (1) reducing dependence on model-based covariate adjustment where there is individual-level treatment effect homogeneity or marginal covariate moments are balanced; and (2) facilitating the selection of baseline covariates for adjustment where there is individual-level treatment effect heterogeneity.
Languages can encode temporal subordination lexically, via subordinating conjunctions, and morphologically, by marking the relation on the predicate. Systematic cross-linguistic variation among the former can be studied using well-established token-based typological approaches to token-aligned parallel corpora. Variation among different morphological means is instead much harder to tackle and therefore more poorly understood, despite being predominant in several language groups. This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean, where morphological marking is particularly common. It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the many world's languages that exclusively use lexified connectors, incorporating associations between character $n$-grams and English $when$. The approach allows capturing morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.
Quantum hypothesis testing (QHT) has been traditionally studied from the information-theoretic perspective, wherein one is interested in the optimal decay rate of error probabilities as a function of the number of samples of an unknown state. In this paper, we study the sample complexity of QHT, wherein the goal is to determine the minimum number of samples needed to reach a desired error probability. By making use of the wealth of knowledge that already exists in the literature on QHT, we characterize the sample complexity of binary QHT in the symmetric and asymmetric settings, and we provide bounds on the sample complexity of multiple QHT. In more detail, we prove that the sample complexity of symmetric binary QHT depends logarithmically on the inverse error probability and inversely on the negative logarithm of the fidelity. As a counterpart of the quantum Stein's lemma, we also find that the sample complexity of asymmetric binary QHT depends logarithmically on the inverse type II error probability and inversely on the quantum relative entropy. We then provide lower and upper bounds on the sample complexity of multiple QHT, with it remaining an intriguing open question to improve these bounds. The final part of our paper outlines and reviews how sample complexity of QHT is relevant to a broad swathe of research areas and can enhance understanding of many fundamental concepts, including quantum algorithms for simulation and search, quantum learning and classification, and foundations of quantum mechanics. As such, we view our paper as an invitation to researchers coming from different communities to study and contribute to the problem of sample complexity of QHT, and we outline a number of open directions for future research.
Scopus and the Web of Science have been the foundation for research in the science of science even though these traditional databases systematically underrepresent certain disciplines and world regions. In response, new inclusive databases, notably OpenAlex, have emerged. While many studies have begun using OpenAlex as a data source, few critically assess its limitations. This study, conducted in collaboration with the OpenAlex team, addresses this gap by comparing OpenAlex to Scopus across a number of dimensions. The analysis concludes that OpenAlex is a superset of Scopus and can be a reliable alternative for some analyses, particularly at the country level. Despite this, issues of metadata accuracy and completeness show that additional research is needed to fully comprehend and address OpenAlex's limitations. Doing so will be necessary to confidently use OpenAlex across a wider set of analyses, including those that are not at all possible with more constrained databases.
We investigate the bounding problem of causal effects in experimental studies in which the outcome is truncated by death, meaning that the subject dies before the outcome can be measured. Causal effects cannot be point identified without instruments and/or tight parametric assumptions but can be bounded under mild restrictions. Previous work on partial identification under the principal stratification framework has primarily focused on the `always-survivor' subpopulation. In this paper, we present a novel nonparametric unified framework to provide sharp bounds on causal effects on discrete and continuous square-integrable outcomes. These bounds are derived on the `always-survivor', `protected', and `harmed' subpopulations and on the entire population with/without assumptions of monotonicity and stochastic dominance. The main idea depends on rewriting the optimization problem in terms of the integrated tail probability expectation formula using a set of conditional probability distributions. The proposed procedure allows for settings with any type and number of covariates, and can be extended to incorporate average causal effects and complier average causal effects. Furthermore, we present several simulation studies conducted under various assumptions as well as the application of the proposed approach to a real dataset from the National Supported Work Demonstration.
Correspondence analysis (CA) is a popular technique to visualize the relationship between two categorical variables. CA uses the data from a two-way contingency table and is affected by the presence of outliers. The supplementary points method is a popular method to handle outliers. Its disadvantage is that the information from entire rows or columns is removed. However, outliers can be caused by cells only. In this paper, a reconstitution algorithm is introduced to cope with such cells. This algorithm can reduce the contribution of cells in CA instead of deleting entire rows or columns. Thus the remaining information in the row and column involved can be used in the analysis. The reconstitution algorithm is compared with two alternative methods for handling outliers, the supplementary points method and MacroPCA. It is shown that the proposed strategy works well.
Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, with endeavours to extend this knowledge without targeting the original task resulting in a catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern 1) a taxonomy and extensive overview of the state-of-the-art, 2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner, 3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks, considering Tiny Imagenet and large-scale unbalanced iNaturalist and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compare methods in terms of required memory, computation time, and storage.
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.