The two-sample problem consists in testing whether two independent samples are drawn from the same (unknown) probability distribution. It finds applications in many areas, ranging from clinical trials to data attribute matching. Its study in high-dimension is the subject of much attention, in particular as the information acquisition processes can involve various sources being often poorly controlled, possibly leading to datasets with strong sampling bias that may jeopardize their statistical analysis. While classic methods relying on a discrepancy measure between empirical versions of the distributions face the curse of dimensionality, we develop an alternative approach based on statistical learning and extending rank tests, known to be asymptotically optimal for univariate data when appropriately designed. Overcoming the lack of natural order on high-dimension, it is implemented in two steps. Assigning a label to each sample, and dividing them into two halves, a preorder on the feature space defined by a real-valued scoring function is learned by a bipartite ranking algorithm applied to the first halves. Next, a two-sample homogeneity rank test is applied to the (univariate) scores of the remaining observations. Because it learns how to map the data onto the real line like (any monotone transform of) the likelihood ratio between the original multivariate distributions, the approach is not affected by the dimensionality, ignores ranking model bias issues, and preserves the asymptotic optimality of univariate R-tests, capable of detecting small departures from the null assumption. Beyond a theoretical analysis establishing nonasymptotic bounds for the two types of error of the method based on recent concentration results for two-sample linear R-processes, an extensive experimental study shows higher performance of the proposed method compared to classic ones.
We consider the problem of fairly allocating a set of indivisible goods among $n$ agents with additive valuations, using the popular fairness notion of maximin share (MMS). Since MMS allocations do not always exist, a series of works provided existence and algorithms for approximate MMS allocations. The current best approximation factor, for which the existence is known, is $(\frac{3}{4} + \frac{1}{12n})$ [Garg and Taki, 2021]. Most of these results are based on complicated analyses, especially those providing better than $2/3$ factor. Moreover, since no tight example is known of the Garg-Taki algorithm, it is unclear if this is the best factor of this approach. In this paper, we significantly simplify the analysis of this algorithm and also improve the existence guarantee to a factor of $(\frac{3}{4} + \min(\frac{1}{36}, \frac{3}{16n-4}))$. For small $n$, this provides a noticeable improvement. Furthermore, we present a tight example of this algorithm, showing that this may be the best factor one can hope for with the current techniques.
Statistical analysis of large datasets is a challenge because of the limitation of computing devices' memory and excessive computation time. Divide and Conquer (DC) algorithm is an effective solution path, but the DC algorithm still has limitations for statistical inference. Empirical likelihood is an important semiparametric and nonparametric statistical method for parameter estimation and statistical inference, and the estimating equation builds a bridge between empirical likelihood and traditional statistical methods, which makes empirical likelihood widely used in various traditional statistical models. In this paper, we propose a novel approach to address the challenges posed by empirical likelihood with massive data, which is called split sample mean empirical likelihood(SSMEL), our approach provides a unique perspective for sovling big data problem. We show that the SSMEL estimator has the same estimation efficiency as the empirical likelihood estimator with the full dataset, and maintains the important statistical property of Wilks' theorem, allowing our proposed approach to be used for statistical inference. The effectiveness of the proposed approach is illustrated using simulation studies and real data analysis.
Here, we investigate whether (and how) experimental design could aid in the estimation of the precision matrix in a Gaussian chain graph model, especially the interplay between the design, the effect of the experiment and prior knowledge about the effect. Estimation of the precision matrix is a fundamental task to infer biological graphical structures like microbial networks. We compare the marginal posterior precision of the precision matrix under four priors: flat, conjugate Normal-Wishart, Normal-MGIG and a general independent. Under the flat and conjugate priors, the Laplace-approximated posterior precision is not a function of the design matrix rendering useless any efforts to find an optimal experimental design to infer the precision matrix. In contrast, the Normal-MGIG and general independent priors do allow for the search of optimal experimental designs, yet there is a sharp upper bound on the information that can be extracted from a given experiment. We confirm our theoretical findings via a simulation study comparing i) the KL divergence between prior and posterior and ii) the Stein's loss difference of MAPs between random and no experiment. Our findings provide practical advice for domain scientists conducting experiments to better infer the precision matrix as a representation of a biological network.
Causal Discovery (CD) is the process of identifying the cause-effect relationships among the variables from data. Over the years, several methods have been developed primarily based on the statistical properties of data to uncover the underlying causal mechanism. In this study we introduce the common terminologies in causal discovery, and provide a comprehensive discussion of the approaches designed to identify the causal edges in different settings. We further discuss some of the benchmark datasets available for evaluating the performance of the causal discovery algorithms, available tools to perform causal discovery readily, and the common metrics used to evaluate these methods. Finally, we conclude by presenting the common challenges involved in CD and also, discuss the applications of CD in multiple areas of interest.
We present a study of a kernel-based two-sample test statistic related to the Maximum Mean Discrepancy (MMD) in the manifold data setting, assuming that high-dimensional observations are close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, we show that when data densities are supported on a $d$-dimensional sub-manifold $\mathcal{M}$ embedded in an $m$-dimensional space, the kernel two-sample test for data sampled from a pair of distributions $p$ and $q$ that are H\"older with order $\beta$ (up to 2) is powerful when the number of samples $n$ is large such that $\Delta_2 \gtrsim n^{- { 2 \beta/( d + 4 \beta ) }}$, where $\Delta_2$ is the squared $L^2$-divergence between $p$ and $q$ on manifold. We establish a lower bound on the test power for finite $n$ that is sufficiently large, where the kernel bandwidth parameter $\gamma$ scales as $n^{-1/(d+4\beta)}$. The analysis extends to cases where the manifold has a boundary, and the data samples contain high-dimensional additive noise. Our results indicate that the kernel two-sample test does not have a curse-of-dimensionality when the data lie on or near a low-dimensional manifold. We validate our theory and the properties of the kernel test for manifold data through a series of numerical experiments.
In the task of predicting spatio-temporal fields in environmental science using statistical methods, introducing statistical models inspired by the physics of the underlying phenomena that are numerically efficient is of growing interest. Large space-time datasets call for new numerical methods to efficiently process them. The Stochastic Partial Differential Equation (SPDE) approach has proven to be effective for the estimation and the prediction in a spatial context. We present here the advection-diffusion SPDE with first order derivative in time which defines a large class of nonseparable spatio-temporal models. A Gaussian Markov random field approximation of the solution to the SPDE is built by discretizing the temporal derivative with a finite difference method (implicit Euler) and by solving the spatial SPDE with a finite element method (continuous Galerkin) at each time step. The ''Streamline Diffusion'' stabilization technique is introduced when the advection term dominates the diffusion. Computationally efficient methods are proposed to estimate the parameters of the SPDE and to predict the spatio-temporal field by kriging, as well as to perform conditional simulations. The approach is applied to a solar radiation dataset. Its advantages and limitations are discussed.
This paper proposes three new approaches for additive functional regression models with functional responses. The first one is a reformulation of the linear regression model, and the last two are on the yet scarce case of additive nonlinear functional regression models. Both proposals are based on extensions of similar models for scalar responses. One of our nonlinear models is based on constructing a Spectral Additive Model (the word "Spectral" refers to the representation of the covariates in an $\mcal{L}_2$ basis), which is restricted (by construction) to Hilbertian spaces. The other one extends the kernel estimator, and it can be applied to general metric spaces since it is only based on distances. We include our new approaches as well as real datasets in an R package. The performances of the new proposals are compared with previous ones, which we review theoretically and practically in this paper. The simulation results show the advantages of the nonlinear proposals and the small loss of efficiency when the simulation scenario is truly linear. Finally, the supplementary material provides a visualization tool for checking the linearity of the relationship between a single covariate and the response.
Informative cluster size (ICS) arises in situations with clustered data where a latent relationship exists between the number of participants in a cluster and the outcome measures. Although this phenomenon has been sporadically reported in statistical literature for nearly two decades now, further exploration is needed in certain statistical methodologies to avoid potentially misleading inferences. For inference about population quantities without covariates, inverse cluster size reweightings are often employed to adjust for ICS. Further, to study the effect of covariates on disease progression described by a multistate model, the pseudo-value regression technique has gained popularity in time-to-event data analysis. We seek to answer the question: "How to apply pseudo-value regression to clustered time-to-event data when cluster size is informative?" ICS adjustment by the reweighting method can be performed in two steps; estimation of marginal functions of the multistate model and fitting the estimating equations based on pseudo-value responses, leading to four possible strategies. We present theoretical arguments and thorough simulation experiments to ascertain the correct strategy for adjusting for ICS. A further extension of our methodology is implemented to include informativeness induced by the intra-cluster group size. We demonstrate the methods in two real-world applications: (i) to determine predictors of tooth survival in a periodontal study, and (ii) to identify indicators of ambulatory recovery in spinal cord injury patients who participated in locomotor-training rehabilitation.
Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped to reach the fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, dubbed active domain adaptation. We start from an observation that energy-based models exhibit free energy biases when training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically reveal that a simple yet efficient energy-based sampling strategy sheds light on selecting the most valuable target samples than existing approaches requiring particular architectures or computation of the distances. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of targe data that incorporate both domain characteristic and instance uncertainty into every selection round. Meanwhile, by aligning the free energy of target data compact around the source domain via a regularization term, domain gap can be implicitly diminished. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at //github.com/BIT-DA/EADA.
This PhD thesis contains several contributions to the field of statistical causal modeling. Statistical causal models are statistical models embedded with causal assumptions that allow for the inference and reasoning about the behavior of stochastic systems affected by external manipulation (interventions). This thesis contributes to the research areas concerning the estimation of causal effects, causal structure learning, and distributionally robust (out-of-distribution generalizing) prediction methods. We present novel and consistent linear and non-linear causal effects estimators in instrumental variable settings that employ data-dependent mean squared prediction error regularization. Our proposed estimators show, in certain settings, mean squared error improvements compared to both canonical and state-of-the-art estimators. We show that recent research on distributionally robust prediction methods has connections to well-studied estimators from econometrics. This connection leads us to prove that general K-class estimators possess distributional robustness properties. We, furthermore, propose a general framework for distributional robustness with respect to intervention-induced distributions. In this framework, we derive sufficient conditions for the identifiability of distributionally robust prediction methods and present impossibility results that show the necessity of several of these conditions. We present a new structure learning method applicable in additive noise models with directed trees as causal graphs. We prove consistency in a vanishing identifiability setup and provide a method for testing substructure hypotheses with asymptotic family-wise error control that remains valid post-selection. Finally, we present heuristic ideas for learning summary graphs of nonlinear time-series models.