Recombination is a fundamental evolutionary force, but it is difficult to quantify because the effect of a recombination event on patterns of variation in a sample of genetic data can be hard to discern. Estimators for the recombination rate, which are usually based on the idea of integrating over the unobserved possible evolutionary histories of a sample, can therefore be noisy. Here we consider a related question: how would an estimator behave if the evolutionary history were actually observed? This would offer an upper bound on the performance of estimators used in practice. In this paper we derive an expression for the maximum likelihood estimator of the recombination rate based on a continuously observed, multi-locus Wright--Fisher diffusion of haplotype frequencies, complementing existing work on an estimator of selection. We show that, in contrast to the case of selection, the estimator has unusual properties: the observed information matrix can explode in finite time, whereupon the recombination parameter is learned without error. We also show that the recombination estimator is robust to the presence of selection, in the sense that incorporating selection into the model leaves the estimator unchanged. We study the properties of the estimator by simulation and show that its distribution can be quite sensitive to the underlying mutation rates.
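As a toy illustration of the process underlying this diffusion (not of the paper's estimator), here is a minimal discrete-time two-locus Wright--Fisher simulation with recombination; the haplotype-frequency diffusion arises as the large-population limit of such a model. The population size, recombination probability, and initial frequencies below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy discrete-time two-locus Wright-Fisher model with recombination.
# Haplotype frequencies x = (x_AB, x_Ab, x_aB, x_ab).
N = 1000   # haploid population size (arbitrary choice)
r = 0.01   # recombination probability per generation (arbitrary choice)
x = np.array([0.4, 0.1, 0.1, 0.4])  # start with strong linkage disequilibrium

def step(x, r, N, rng):
    # Linkage disequilibrium D = x_AB * x_ab - x_Ab * x_aB.
    D = x[0] * x[3] - x[1] * x[2]
    # Recombination deterministically pulls D toward zero ...
    mean = x + r * D * np.array([-1.0, 1.0, 1.0, -1.0])
    # ... while multinomial resampling adds genetic drift.
    return rng.multinomial(N, mean) / N

for _ in range(200):
    x = step(x, r, N, rng)

# x remains a valid haplotype-frequency vector throughout.
```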
Even when the causal graph underlying our data is unknown, we can use observational data to narrow down the possible values that an average treatment effect (ATE) can take by (1) identifying the graph up to a Markov equivalence class; and (2) estimating the ATE for each graph in the class. While the PC algorithm can identify this class under strong faithfulness assumptions, it can be computationally prohibitive. Fortunately, only the local graph structure around the treatment is required to identify the set of possible ATE values, a fact exploited by local discovery algorithms to improve computational efficiency. In this paper, we introduce Local Discovery using Eager Collider Checks (LDECC), a new local causal discovery algorithm that leverages unshielded colliders to orient the treatment's parents differently from existing methods. We show that there exist graphs where LDECC exponentially outperforms existing local discovery algorithms, and vice versa. Moreover, we show that LDECC and existing algorithms rely on different faithfulness assumptions, and we leverage this insight to weaken the assumptions needed for identifying the set of possible ATE values.
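The basic statistical primitive behind orienting unshielded colliders is that in a collider $X \to Z \leftarrow Y$, $X$ and $Y$ are marginally independent but become dependent once $Z$ is conditioned on. A minimal sketch of this phenomenon with linear-Gaussian data (this illustrates the primitive only, not the LDECC algorithm; all variable names and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# X and Y are independent causes of the collider Z (X -> Z <- Y).
X = rng.standard_normal(n)
Y = rng.standard_normal(n)
Z = X + Y + 0.5 * rng.standard_normal(n)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, c):
    # Correlation of residuals after linearly regressing out c.
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return corr(ra, rb)

marginal = corr(X, Y)                # near zero: X and Y are independent
conditional = partial_corr(X, Y, Z)  # strongly negative given the collider Z
```

A conditional-independence test built on these correlations is the oracle that collider checks query in practice.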
Graphical model selection is a seemingly impossible task when many pairs of variables are never jointly observed; it requires inferring conditional dependencies with no observations of the corresponding marginal dependencies. This under-explored statistical problem arises in neuroimaging, for example, when different partially overlapping subsets of neurons are recorded in non-simultaneous sessions. We call this statistical challenge the "Graph Quilting" problem. We study it in the context of sparse inverse covariance learning, focusing on Gaussian graphical models, where we show that missing parts of the covariance matrix yield an unidentifiable precision matrix specifying the graph. Nonetheless, we show that, under mild conditions, it is possible to correctly identify edges connecting the observed pairs of nodes. Additionally, we show that we can recover a minimal superset of edges connecting variables that are never jointly observed. Thus, one can infer conditional relationships even when marginal relationships are unobserved, a surprising result. To accomplish this, we propose an $\ell_1$-regularized partially observed likelihood-based graph estimator and provide performance guarantees in the population setting and in high-dimensional finite-sample settings. We illustrate our approach using synthetic data, as well as for learning functional neural connectivity from calcium imaging data.
A new method, based on Bayesian networks, is proposed for estimating propensity scores, with the purpose of drawing causal inferences from real-world data about the average treatment effect in the case of a binary outcome and discrete covariates. The proposed method endows the estimated propensity score with maximum-likelihood properties, i.e. asymptotic efficiency, thus outperforming other available approaches. Two point estimators via inverse probability weighting are then proposed, and their main distributional properties are derived for constructing confidence intervals and for testing the hypothesis of no treatment effect. Empirical evidence of the substantial improvements offered by the proposed methodology over standard logistic modelling of the propensity score is provided in simulation settings that mimic the characteristics of a real dataset of prostate cancer patients from Milan San Raffaele Hospital.
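To illustrate the setting (binary outcome, discrete covariates), here is a minimal sketch of inverse probability weighting with a cell-frequency propensity score; with discrete covariates, per-cell treatment frequencies are the maximum-likelihood propensity estimate. This is a generic illustration, not the paper's Bayesian-network estimator, and all data-generating parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Simulated data: one binary covariate X confounds treatment T and outcome Y.
# The true ATE is 0.2 by construction.
X = rng.binomial(1, 0.5, n)
T = rng.binomial(1, np.where(X == 1, 0.8, 0.2))
Y = rng.binomial(1, 0.2 + 0.2 * T + 0.3 * X)

# With discrete covariates, the maximum-likelihood propensity score is simply
# the empirical treatment frequency within each covariate cell.
e_hat = np.array([T[X == x].mean() for x in (0, 1)])[X]

# Inverse-probability-weighted point estimate of the ATE.
ate_ipw = np.mean(T * Y / e_hat) - np.mean((1 - T) * Y / (1 - e_hat))

# The naive difference in means is biased upward by the confounder X.
ate_naive = Y[T == 1].mean() - Y[T == 0].mean()
```

The IPW estimate recovers a value near the true ATE of 0.2, while the naive contrast is substantially inflated by confounding.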
While the linear mixed model (LMM) has shown competitive performance in correcting spurious associations induced by population stratification, family structure, and cryptic relatedness, challenges remain in handling the complex structure of genotypic and phenotypic data. For example, geneticists have discovered that some clusters of phenotypes are more co-expressed than others. Hence, a joint analysis that can exploit such relatedness information in a heterogeneous data set is crucial for genetic modeling. We propose the sparse graph-structured linear mixed model (sGLMM), which incorporates the relatedness information among traits while correcting for confounding. Our method can uncover the genetic associations of a large number of phenotypes jointly while accounting for the relatedness of these phenotypes. Through extensive simulation experiments, we show that the proposed model outperforms existing approaches and can model correlation arising from both population structure and shared signals. Further, we validate the effectiveness of sGLMM on real-world genomic datasets from two species, a plant and humans. On Arabidopsis thaliana data, sGLMM outperforms all other baseline models on 63.4% of traits. We also discuss potentially causal genetic variation for human Alzheimer's disease discovered by our model and examine some of the most important genetic loci.
Modern time series analysis requires the ability to handle datasets that are inherently high-dimensional; examples include applications in climatology, where measurements from numerous sensors must be taken into account, or inventory tracking in large shops, where the dimension is defined by the number of tracked items. The standard way to mitigate computational issues arising from the high dimensionality of the data is to apply a dimension-reduction technique that preserves the structural properties of the ambient space. The dissimilarity between two time series is often measured by ``discrete'' notions of distance, e.g. the dynamic time warping distance or the discrete Fr\'echet distance. Since all these distance functions are computed directly on the points of a time series, they are sensitive to different sampling rates or gaps. The continuous Fr\'echet distance offers a popular alternative that aims to alleviate this by taking into account all points on the polygonal curve obtained by linearly interpolating between consecutive points in a sequence. We study the ability of random projections \`a la Johnson and Lindenstrauss to preserve the continuous Fr\'echet distance of polygonal curves by effectively reducing the dimension. In particular, we show that one can reduce the dimension to $O(\epsilon^{-2} \log N)$, where $N$ is the total number of input points, while preserving the continuous Fr\'echet distance between any two of the given polygonal curves within a factor of $1\pm \epsilon$. We conclude with applications to clustering.
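A minimal sketch of a Johnson--Lindenstrauss-style Gaussian projection on a point set; the paper's contribution concerns the continuous Fréchet distance between curves, whereas plain pairwise Euclidean distances are checked here only to illustrate the dimension-reduction mechanism (the dimensions and sample sizes are arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 60, 200, 32   # n points in ambient dimension d, target dimension k

X = rng.standard_normal((n, d))

# Gaussian JL map: i.i.d. N(0, 1/k) entries preserve squared norms in expectation.
P = rng.standard_normal((k, d)) / np.sqrt(k)
Y = X @ P.T

# Distortion of each pairwise Euclidean distance under the projection.
ratios = np.array([
    np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
    for i, j in combinations(range(n), 2)
])
```

With $k = O(\epsilon^{-2}\log n)$, all pairwise ratios concentrate around 1; extending such guarantees from point sets to the continuous Fréchet distance between curves is the nontrivial step the abstract addresses.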
Detecting the dimensionality of graphs is a central topic in machine learning. While the problem has been tackled both empirically and theoretically, existing methods have several drawbacks. On the one hand, empirical tools are computationally heavy and lack theoretical foundation. On the other hand, theoretical approaches do not apply to graphs with heterogeneous degree distributions, which is often the case for complex real-world networks. To address these drawbacks, we consider geometric inhomogeneous random graphs (GIRGs) as a random graph model, which captures a variety of properties observed in practice. These include a heterogeneous degree distribution and a non-vanishing clustering coefficient, i.e. the probability that two random neighbours of a vertex are adjacent. In GIRGs, $n$ vertices are distributed on a $d$-dimensional torus and weights are assigned to the vertices according to a power-law distribution. Two vertices are then connected with a probability that depends on their distance and their weights. Our first result shows that the clustering coefficient of GIRGs scales inverse-exponentially with the number of dimensions, when the latter is at most logarithmic in $n$. This gives a first theoretical explanation for the low dimensionality of real-world networks observed by Almagro et al. [Nature '22]. Our second result is a linear-time algorithm for determining the dimensionality of a given GIRG. We prove that our algorithm returns the correct number of dimensions with high probability when the input is a GIRG. As a result, our algorithm bridges the gap between theory and practice: it not only comes with a rigorous proof of correctness but also yields results comparable to those of prior empirical approaches, as indicated by our experiments on real-world instances.
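For concreteness, the clustering coefficient referenced above (the probability that two random neighbours of a vertex are adjacent) can be computed on a toy graph as follows; this is a generic illustration of the definition, not the paper's linear-time dimensionality algorithm:

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Average, over vertices of degree >= 2, of the fraction of
    neighbour pairs that are themselves adjacent."""
    per_vertex = []
    for v, nbrs in adj.items():
        if len(nbrs) < 2:
            continue  # clustering is undefined for degree < 2
        pairs = list(combinations(nbrs, 2))
        closed = sum(1 for a, b in pairs if b in adj[a])
        per_vertex.append(closed / len(pairs))
    return sum(per_vertex) / len(per_vertex)

# Toy graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc = clustering_coefficient(adj)  # (1 + 1 + 1/3) / 3 = 7/9
```

The paper's first result concerns how this quantity decays as the dimension of the underlying torus grows.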
In group sequential analysis, data are collected and analyzed in batches until pre-defined stopping criteria are met. Inference in the parametric setup typically relies on the limiting asymptotic multivariate normality of the repeatedly computed maximum likelihood estimators (MLEs), a result first rigorously proved by Jennison and Turnbull (1997) under general regularity conditions. In this work, using Stein's method, we provide optimal-order, non-asymptotic bounds on the smooth-test-function distance between the joint distribution of the group sequential MLEs and the appropriate normal distribution under the same conditions. Our results assume independent observations but allow heterogeneous (i.e., non-identically distributed) data. We examine how the resulting bounds simplify when the data come from an exponential family. Finally, we present a general result relating the multivariate Kolmogorov distance to the smooth-function distance which, in addition to extending our results to the former metric, may be of independent interest.
The twin support vector machine and its extensions have achieved great success on binary classification problems. However, they face difficulties in efficiently solving multi-class problems and in fast model selection. This work develops a fast regularization-parameter tuning algorithm for the twin multi-class support vector machine. Specifically, a novel partition strategy for the sample data set is first adopted, which forms the basis for the model construction. Then, combining linear-equation and block-matrix theory, the Lagrange multipliers are proved to be piecewise linear with respect to the regularization parameters, so that the regularization parameters can be updated continuously by solving only for the breakpoints. Next, the Lagrange multipliers are proved to equal 1 as the regularization parameter approaches infinity; thus, a simple yet effective initialization algorithm is devised. Finally, eight kinds of events are defined to identify the starting event for the next iteration. Extensive experimental results on nine UCI data sets show that the proposed method achieves comparable classification performance without solving any quadratic programming problem.
The reconfigurable intelligent surface has recently emerged as a promising technology for shaping the wireless environment by leveraging massive numbers of low-cost reconfigurable elements. Prior works mainly focus on a single-layer metasurface, which lacks the capability of suppressing multiuser interference. By contrast, we propose a stacked intelligent metasurface (SIM)-enabled transceiver design for multiuser multiple-input single-output downlink communications. Specifically, the SIM is endowed with a multilayer structure and is deployed at the base station to perform transmit beamforming directly in the electromagnetic wave domain. As a result, an SIM-enabled transceiver obviates the need for digital beamforming and operates with low-resolution digital-to-analog converters and a moderate number of radio-frequency chains, which significantly reduces the hardware cost and energy consumption, while also substantially decreasing the precoding delay thanks to the processing performed in the wave domain. To leverage the benefits of SIM-enabled transceivers, we formulate an optimization problem for maximizing the sum rate of all users by jointly designing the transmit power allocated to them and the analog beamforming in the wave domain. Numerical results based on a customized alternating optimization algorithm corroborate the effectiveness of the proposed SIM-enabled analog beamforming design as compared with various benchmark schemes. Most notably, the proposed analog beamforming scheme is capable of substantially decreasing the precoding delay compared to its digital counterpart.
This paper focuses on the expected difference in a borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence their estimation error can be substantial. We therefore propose an alternative approach to constructing estimators such that the error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the performance of the classical and proposed estimators of the causal quantities. The comparison is carried out across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approach to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.