In this article, we propose a two-sample test for functional observations modeled as elements of a separable Hilbert space. We present a general recipe for constructing a measure of dissimilarity between the distributions of two Hilbertian random variables and study the theoretical properties of one such measure which is constructed using Maximum Mean Discrepancy (MMD) on random linear projections of the distributions and aggregating them. We propose a data-driven estimate of this measure and use it as the test statistic. Large sample distributions of this statistic are derived both under null and alternative hypotheses. This test statistic involves a kernel function and the associated bandwidth. We prove that the resulting test has large-sample consistency for any data-driven choice of bandwidth that converges in probability to a positive number. Since the theoretical quantiles of the limiting null distribution are intractable, in practice, the test is calibrated using the permutation method. We also derive the limiting distribution of the permuted test statistic and the asymptotic power of the permutation test under local contiguous alternatives. This shows that the permutation test is consistent and statistically efficient in the Pitman sense. Extensive simulation studies are carried out and a real data set is analyzed to compare the performance of our proposed test with some state-of-the-art methods.
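The recipe in the abstract above (MMD on random one-dimensional projections, aggregated, then calibrated by permutation) can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes the functional observations are discretized on a common grid (rows of `X` and `Y`), uses a Gaussian kernel with a fixed bandwidth, a biased V-statistic for the squared MMD, and a plain average over directions. All function names and default parameters are hypothetical.

```python
import numpy as np

def mmd2_1d(x, y, bandwidth):
    """Biased (V-statistic) squared MMD between two 1-D samples, Gaussian kernel."""
    def gram(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-d**2 / (2.0 * bandwidth**2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

def projected_mmd(X, Y, directions, bandwidth=1.0):
    """Average the 1-D squared MMD over the given projection directions."""
    return float(np.mean([mmd2_1d(X @ u, Y @ u, bandwidth) for u in directions]))

def permutation_test(X, Y, n_dirs=20, n_perm=200, bandwidth=1.0, seed=0):
    """Return (observed statistic, permutation p-value) for H0: same distribution."""
    rng = np.random.default_rng(seed)
    # Assumption: directions drawn uniformly on the unit sphere of the grid space.
    directions = rng.standard_normal((n_dirs, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    observed = projected_mmd(X, Y, directions, bandwidth)
    pooled = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))  # relabel the pooled sample
        stat = projected_mmd(pooled[idx[:n]], pooled[idx[n:]], directions, bandwidth)
        count += int(stat >= observed)
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value
```

The same directions are reused for every permutation so that the observed and permuted statistics are comparable; a data-driven bandwidth (for example a median heuristic) could replace the fixed value, which the abstract indicates is admissible as long as it converges in probability to a positive number.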
Multivariate distributional forecasts have become widespread in recent years. To assess the quality of such forecasts, suitable evaluation methods are needed. In the univariate case, calibration tests based on the probability integral transform (PIT) are routinely used. However, multivariate extensions of PIT-based calibration tests face various challenges. We therefore introduce a general framework for calibration testing in the multivariate case and propose two new tests that arise from it. Both approaches use proper scoring rules and are simple to implement even in large dimensions. The first employs the PIT of the score. The second is based on comparing the expected performance of the forecast distribution (i.e., the expected score) to its actual performance based on realized observations (i.e., the realized score). The tests have good size and power properties in simulations and solve various problems of existing tests. We apply the new tests to forecast distributions for macroeconomic and financial time series data.
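As a rough illustration of the "PIT of the score" idea in the abstract above, the sketch below computes, for a multivariate Gaussian forecast, where each realized score falls within the distribution of scores the forecast expects of itself. It only shows this one ingredient, not the proposed tests. The assumptions are mine: a fixed Gaussian forecast, the logarithmic score (lower is better), and a Monte Carlo approximation of the score distribution. The function names are hypothetical.

```python
import numpy as np

def log_score(y, mean, cov_inv, log_det):
    """Negative log density of N(mean, cov) at the rows of y (lower is better)."""
    d = y.shape[-1]
    r = y - mean
    quad = np.einsum("...i,ij,...j->...", r, cov_inv, r)
    return 0.5 * (quad + log_det + d * np.log(2 * np.pi))

def score_pit(y_obs, mean, cov, n_draws=2000, seed=0):
    """Monte Carlo PIT of the realized score under the forecast's own score law."""
    rng = np.random.default_rng(seed)
    cov_inv = np.linalg.inv(cov)
    log_det = np.linalg.slogdet(cov)[1]
    # Scores the forecast would obtain if observations really came from it.
    draws = rng.multivariate_normal(mean, cov, size=n_draws)
    sim_scores = log_score(draws, mean, cov_inv, log_det)
    obs_scores = log_score(y_obs, mean, cov_inv, log_det)
    # Fraction of simulated scores not exceeding each realized score.
    return (sim_scores[None, :] <= obs_scores[:, None]).mean(axis=1)
```

If the forecast is calibrated, these values should look roughly uniform on [0, 1]; observations that are more dispersed than the forecast produce realized scores that are too large, pushing the values toward 1. Comparing the mean of `sim_scores` with the mean of `obs_scores` gives a crude version of the expected-versus-realized-score comparison mentioned in the abstract.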
Discovering causal relations from observational data is important. The existence of unobserved variables (e.g. latent confounding or mediation) can mislead causal identification. To overcome this problem, proximal causal discovery methods attempt to adjust for the bias via a proxy of the unobserved variable. In particular, hypothesis-test-based methods identify the causal edge by testing the induced violation of linearity. However, these methods apply only to discrete data with strict level constraints, which limits their practical use. In this paper, we address this limitation by extending the proximal hypothesis test to cases where the system consists of continuous variables. Our strategy is to present regularity conditions on the conditional distributions of the observed variables given the hidden factor, such that if we discretize its observed proxy with sufficiently fine, finite bins, the resulting discretization error can be effectively controlled. Based on this, we can convert the problem of testing continuous causal relations to that of testing discrete causal relations in each bin, which can be effectively solved with existing methods. The non-parametric regularity conditions we present are mild and can be satisfied by a wide range of structural causal models. Using both simulated and real-world data, we show the effectiveness of our method in recovering causal relations when unobserved variables exist.
Causal intervention is an essential tool in causal inference. It is axiomatized under the rules of do-calculus in the case of structural causal models. We provide simple axiomatizations for families of probability distributions to be different types of interventional distributions. Our axiomatizations lead to a simple and clear theory of causality that has several advantages: it does not need to make use of any modeling assumptions such as those imposed by structural causal models; it only relies on interventions on single variables; it includes most cases with latent variables and causal cycles; and more importantly, it does not assume the existence of an underlying true causal graph--in fact, a causal graph is a by-product of our theory. We show that, under our axiomatizations, the intervened distributions are Markovian to the defined intervened causal graphs, and an observed joint probability distribution is Markovian to the obtained causal graph; these results are consistent with the case of structural causal models, and as a result, the existing theory of causal inference applies. We also show that a large class of natural structural causal models satisfy the theory presented here.
Fitness functions map large combinatorial spaces of biological sequences to properties of interest. Inferring these multimodal functions from experimental data is a central task in modern protein engineering. Global epistasis models are an effective and physically-grounded class of models for estimating fitness functions from observed data. These models assume that a sparse latent function is transformed by a monotonic nonlinearity to emit measurable fitness. Here we demonstrate that minimizing contrastive loss functions, such as the Bradley-Terry loss, is a simple and flexible technique for extracting the sparse latent function implied by global epistasis. We argue by way of a fitness-epistasis uncertainty principle that the nonlinearities in global epistasis models can produce observed fitness functions that do not admit sparse representations, and thus may be inefficient to learn from observations when using a Mean Squared Error (MSE) loss (a common practice). We show that contrastive losses are able to accurately estimate a ranking function from limited data even in regimes where MSE is ineffective. We validate the practical utility of this insight by showing contrastive loss functions result in consistently improved performance on benchmark tasks.
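A small sketch may help make the contrast with MSE concrete. The Bradley-Terry loss below depends on the observed fitness values only through their ordering, so any strictly increasing nonlinearity applied to the latent function leaves the loss unchanged. This is a minimal sketch under my own assumptions (all pairs are compared, ties are ignored, and `scores` stands for the output of some hypothetical model); it is not the authors' training code.

```python
import numpy as np

def bradley_terry_loss(scores, fitness):
    """Mean Bradley-Terry loss over all pairs (i, j) with fitness[i] > fitness[j]."""
    diff = scores[:, None] - scores[None, :]    # s_i - s_j for every pair
    wins = fitness[:, None] > fitness[None, :]  # pairs where i outranks j
    # -log sigmoid(d) = log(1 + exp(-d)), computed stably with logaddexp.
    return float(np.logaddexp(0.0, -diff[wins]).mean())

# Usage: the loss is identical for the latent values and for a monotone
# transformation of them, because only the pairwise ordering is used.
latent = np.array([0.1, -1.0, 2.0, 0.5])
observed = np.exp(3.0 * latent)
same = bradley_terry_loss(latent, latent) == bradley_terry_loss(latent, observed)
```

An MSE loss against `observed` would instead force the model to reproduce the exponential distortion, which is the situation the abstract describes as potentially inefficient to learn.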
We propose a new auto-regressive model for the statistical analysis of multivariate distributional time series. The data of interest consist of a collection of multiple series of probability measures supported over a bounded interval of the real line and indexed by distinct time instants. The probability measures are modelled as random objects in the Wasserstein space. We establish the auto-regressive model in the tangent space at the Lebesgue measure by first centering all the raw measures so that their Fr\'echet means become the Lebesgue measure. Using the theory of iterated random function systems, results on the existence, uniqueness and stationarity of the solution of such a model are provided. We also propose a consistent estimator for the model coefficient. In addition to the analysis of simulated data, the proposed model is illustrated with two real data sets consisting of observations of age distributions in different countries and of the bike-sharing network in Paris. Finally, due to the positivity and boundedness constraints that we impose on the model coefficients, the proposed estimator, which is learned under these constraints, naturally has a sparse structure. This sparsity furthermore allows the proposed model to be applied to learning a graph of temporal dependency from the multivariate distributional time series.
For terminal value problems of fractional differential equations of order $\alpha \in (0,1)$ that use Caputo derivatives, shooting methods are a well developed and investigated approach. Based on recently established analytic properties of such problems, we develop a new technique to select the required initial values that solves such shooting problems quickly and accurately. Numerical experiments indicate that this new proportional secting technique converges very quickly and accurately to the solution. Run time measurements indicate a speedup factor of between 4 and 10 when compared to the standard bisection method.
Feature screening approaches are effective in selecting active features from data with ultrahigh dimensionality and increasing complexity; however, the majority of existing feature screening approaches are either restricted to a univariate response or rely on some distribution or model assumptions. In this article, we propose a novel sure independence screening approach based on the multivariate rank distance correlation (MrDc-SIS). The MrDc-SIS achieves multiple desirable properties such as being distribution-free, completely nonparametric, scale-free, robust to outliers and heavy tails, and sensitive to hidden structures. Moreover, the MrDc-SIS can be used to screen either univariate or multivariate responses and either one-dimensional or multi-dimensional predictors. We establish the asymptotic sure screening consistency property of the MrDc-SIS under a mild condition by lifting previous assumptions about the finite moments. Simulation studies demonstrate that MrDc-SIS outperforms three other closely related approaches under various settings. We also apply the MrDc-SIS approach to a multi-omics ovarian carcinoma data set downloaded from The Cancer Genome Atlas (TCGA).
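The screening step can be sketched as follows: score each predictor by a rank-based distance correlation with the response and keep the top-ranked predictors. This sketch deliberately substitutes simple componentwise ranks for whatever multivariate rank the MrDc statistic uses, and it uses the plain sample distance correlation, so it illustrates the workflow rather than reproducing MrDc-SIS. The names `to_ranks`, `distance_correlation`, and `screen` are hypothetical.

```python
import numpy as np

def to_ranks(a):
    """Componentwise ranks scaled to (0, 1]; a stand-in for a multivariate rank."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    return (np.argsort(np.argsort(a, axis=0), axis=0) + 1) / len(a)

def _centered_dist(a):
    """Double-centered pairwise Euclidean distance matrix."""
    a = a.reshape(len(a), -1)
    d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation between two samples with matching rows."""
    A, B = _centered_dist(x), _centered_dist(y)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(max(dcov2, 0.0) / denom)) if denom > 0 else 0.0

def screen(X, Y, k):
    """Return indices of the k highest-scoring predictors and all scores."""
    ry = to_ranks(Y)
    scores = np.array([distance_correlation(to_ranks(X[:, j]), ry)
                       for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:k]
    return top, scores
```

Because the scores depend on the data only through ranks, they are unchanged by strictly increasing transformations of each variable, which is one way to read the "scale-free" and outlier-robustness properties listed in the abstract.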
We present prompt distribution learning for effectively adapting a pre-trained vision-language model to address downstream recognition tasks. Our method not only learns low-bias prompts from a few samples but also captures the distribution of diverse prompts to handle the varying visual representations. In this way, we provide high-quality task-related content for facilitating recognition. This prompt distribution learning is realized by an efficient approach that learns the output embeddings of prompts instead of the input embeddings. Thus, we can employ a Gaussian distribution to model them effectively and derive a surrogate loss for efficient training. Extensive experiments on 12 datasets demonstrate that our method consistently and significantly outperforms existing methods. For example, with 1 sample per category, it relatively improves the average result by 9.1% compared to human-crafted prompts.
The dominant NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications (e.g. sentiment classification, span-prediction based question answering or machine translation). However, it builds upon the assumption that the data distribution is stationary, i.e. that the data is sampled from a fixed distribution both at training and test time. This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information. Moreover, it is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime. The first goal of this thesis is to characterize the different forms this shift can take in the context of natural language processing, and propose benchmarks and evaluation metrics to measure its effect on current deep learning architectures. We then proceed to take steps to mitigate the effect of distributional shift on NLP models. To this end, we develop methods based on parametric reformulations of the distributionally robust optimization framework. Empirically, we demonstrate that these approaches yield more robust models on a selection of realistic problems. In the third and final part of this thesis, we explore ways of efficiently adapting existing models to new domains or tasks. Our contribution to this topic takes inspiration from information geometry to derive a new gradient update rule which alleviates catastrophic forgetting issues during adaptation.
Classic machine learning methods are built on the $i.i.d.$ assumption that training and testing data are independent and identically distributed. However, in real scenarios, the $i.i.d.$ assumption can hardly be satisfied, so the performance of classic machine learning algorithms drops sharply under distributional shifts, which underlines the importance of investigating the Out-of-Distribution generalization problem. The Out-of-Distribution (OOD) generalization problem addresses the challenging setting where the testing distribution is unknown and different from the training distribution. This paper serves as the first effort to systematically and comprehensively discuss the OOD generalization problem, from definition, methodology, and evaluation to implications and future directions. Firstly, we provide the formal definition of the OOD generalization problem. Secondly, existing methods are categorized into three parts based on their positions in the whole learning pipeline, namely unsupervised representation learning, supervised model learning, and optimization, and typical methods for each category are discussed in detail. We then demonstrate the theoretical connections between the different categories, and introduce the commonly used datasets and evaluation metrics. Finally, we summarize the whole literature and raise some future directions for the OOD generalization problem. The summary of OOD generalization methods reviewed in this survey can be found at //out-of-distribution-generalization.com.