Linkage disequilibrium score regression (LDSC) has emerged as an essential tool for genetic and genomic analyses of complex traits, utilizing high-dimensional data derived from genome-wide association studies (GWAS). LDSC computes the linkage disequilibrium (LD) scores using an external reference panel, and integrates the LD scores with only summary data from the original GWAS. In this paper, we investigate LDSC within a fixed-effect data integration framework, underscoring its ability to merge multi-source GWAS data and reference panels. In particular, we take into account the genome-wide dependence among the high-dimensional GWAS summary statistics, along with the block-diagonal dependence pattern in estimated LD scores. Our analysis uncovers several key factors of both the original GWAS and reference panel datasets that determine the performance of LDSC. We show that it is relatively feasible for LDSC-based estimators to achieve asymptotic normality when applied to genome-wide genetic variants (e.g., in genetic variance and covariance estimation), whereas it becomes considerably challenging when we focus on a much smaller subset of genetic variants (e.g., in partitioned heritability analysis). Moreover, by modeling the disparities in LD patterns across different populations, we show that LDSC can be extended to cross-ancestry analyses using data from distinct global populations (such as European and Asian). We validate our theoretical findings through extensive numerical evaluations using real genetic data from the UK Biobank study.
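For context, the regression at the heart of LDSC (in its standard single-trait form, using conventional notation rather than this paper's) relates the expected GWAS chi-square statistic of variant $j$ to its LD score $\ell_j$:
$$\mathbb{E}[\chi^2_j] = \frac{N h^2}{M}\,\ell_j + N a + 1,$$
where $N$ is the GWAS sample size, $M$ is the number of variants, $h^2$ is the SNP heritability, and $a$ captures confounding biases; regressing $\chi^2_j$ on estimated LD scores thus yields heritability estimates from summary data alone.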
We consider a general linear parabolic problem with extended time boundary conditions (including initial value problems and periodic ones), and approximate it by the implicit Euler scheme in time and the Gradient Discretisation method in space; the latter is in fact a class of methods that includes conforming and nonconforming finite elements, discontinuous Galerkin methods and several others. The main result is an error estimate which holds without any supplementary regularity hypothesis on the solution. This result states that the approximation error has the same order as the sum of the interpolation error and the conformity error. The proof relies on an inf-sup inequality in Hilbert spaces which can be used in both the continuous and the discrete frameworks. The error estimate is illustrated by numerical examples with low regularity of the solution.
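As a schematic illustration (generic notation, not the paper's), for a linear parabolic problem $\partial_t u + A u = f$ the implicit Euler scheme advances the discrete solution through
$$\frac{u^{(n+1)} - u^{(n)}}{\delta t} + A_{\mathcal D}\, u^{(n+1)} = f^{(n+1)},$$
where $A_{\mathcal D}$ denotes the gradient-discretisation counterpart of the spatial operator $A$; the error estimate above then bounds the distance between $u^{(n)}$ and the exact solution in terms of the interpolation and conformity errors of the chosen discretisation.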
The distributed task allocation problem, as one of the most interesting distributed optimization challenges, has received considerable research attention recently. Previous works mainly focused on the task allocation problem in a population of individuals with no constraints on the amount of task that can be afforded. The latter condition, however, does not always hold. In this paper, we study the task allocation problem with constraints on the affordable task amounts in a game-theoretical framework. We assume that each individual can afford a different amount of task and that the cost function is convex. To investigate the problem in the framework of population games, we construct a potential game and calculate the fitness function for each individual. We prove that when the Nash equilibrium of the potential game lies in the feasible region of the constrained task allocation problem, it is the unique globally optimal solution; otherwise, we derive the unique globally optimal solution analytically. In addition, to confirm our theoretical results, we consider exponential and quadratic cost functions for each agent. Two algorithms based on these representative cost functions are proposed to numerically seek the optimal solution to the constrained task allocation problem. We further perform Monte Carlo simulations, whose results agree with our analytical calculations.
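As a purely illustrative sketch (the coefficients, capacities, and total task amount below are invented, and a generic SLSQP solver stands in for the paper's game-theoretic algorithms), the quadratic-cost instance of the constrained allocation problem can be posed as:

```python
# Allocate a total task T among agents with quadratic costs
# c_i(x) = a_i * x**2 and per-agent capacity limits b_i.
import numpy as np
from scipy.optimize import minimize

a = np.array([1.0, 2.0, 0.5])   # cost coefficients (assumed)
b = np.array([4.0, 3.0, 2.0])   # capacity limits (assumed)
T = 6.0                          # total task amount (assumed)

res = minimize(
    lambda x: np.sum(a * x**2),                      # convex total cost
    x0=np.full(len(a), T / len(a)),                  # feasible start
    bounds=[(0.0, bi) for bi in b],                  # affordable amounts
    constraints=[{"type": "eq", "fun": lambda x: x.sum() - T}],
    method="SLSQP",
)
print(res.x, res.fun)  # optimal allocation and its cost
```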
A standard approach to solving ordinary differential equations, when they describe dynamical systems, is to adopt a Runge-Kutta or related scheme. Such schemes, however, are not applicable to the large class of equations which do not constitute dynamical systems. First, in several physical systems we encounter integro-differential equations with memory terms, where the time derivative of a state variable at a given time depends on all past states of the system. Second, there are equations whose solutions do not have well-defined Taylor series expansions. The Maxey-Riley-Gatignol equation, which describes the dynamics of an inertial particle in nonuniform and unsteady flow, displays both challenges. We use it as a test bed to address the questions we raise, but our method may be applied to all equations of this class. We show that the Maxey-Riley-Gatignol equation can be embedded into an extended Markovian system, constructed by introducing a new co-evolving state variable that encodes the memory of past states. We develop a Runge-Kutta algorithm for the resultant Markovian system. The form of the kernels involved in deriving the Runge-Kutta scheme necessitates an expansion in powers of $t^{1/2}$. Our approach naturally inherits the benefits of standard time-integrators, namely a constant memory storage cost, a linear growth of operational effort with simulation time, and the ability to restart a simulation with the final state as the new initial condition.
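Schematically, the difficulty can be written as an evolution law with a weakly singular memory kernel (generic notation; in the Maxey-Riley-Gatignol equation a kernel of this type arises from the Basset-Boussinesq history force):
$$\frac{dv}{dt} = F(v,t) + \int_0^t K(t-s)\,\frac{dv}{ds}\,ds, \qquad K(\tau) \propto \tau^{-1/2}.$$
The Markovian embedding trades the integral for an auxiliary co-evolving variable, restoring a state-space form that a Runge-Kutta scheme can step forward, while the $\tau^{-1/2}$ kernel is what makes an expansion in powers of $t^{1/2}$ the natural replacement for a Taylor series.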
The forecasting and computation of the stability of chaotic systems from partial observations are tasks for which traditional equation-based methods may not be suitable. In this computational paper, we propose data-driven methods to (i) infer the dynamics of unobserved (hidden) chaotic variables (full-state reconstruction); (ii) time forecast the evolution of the full state; and (iii) infer the stability properties of the full state. The tasks are performed with long short-term memory (LSTM) networks, which are trained with observations (data) limited to only part of the state: (i) the low-to-high resolution LSTM (LH-LSTM), which takes partial observations as training input, and requires access to the full system state when computing the loss; and (ii) the physics-informed LSTM (PI-LSTM), which is designed to combine partial observations with the integral formulation of the dynamical system's evolution equations. First, we derive the Jacobian of the LSTMs. Second, we analyse a chaotic partial differential equation, the Kuramoto-Sivashinsky (KS) equation, and the Lorenz-96 system. We show that the proposed networks can forecast the hidden variables, both time-accurately and statistically. The Lyapunov exponents and covariant Lyapunov vectors, which characterize the stability of the chaotic attractors, are correctly inferred from partial observations. Third, the PI-LSTM outperforms the LH-LSTM by successfully reconstructing the hidden chaotic dynamics when the input dimension is smaller than or comparable to the Kaplan-Yorke dimension of the attractor. This work opens new opportunities for reconstructing the full state, inferring hidden variables, and computing the stability of chaotic systems from partial data.
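As a minimal sketch of the Jacobian computation (a bare LSTM cell with invented sizes, differentiated with PyTorch autodiff; the paper's networks and derivation are more elaborate), the state-to-state Jacobian needed for Lyapunov analysis can be obtained as:

```python
# Jacobian of one recurrent step with respect to the recurrent state,
# the building block for Lyapunov-exponent computation.
import torch

cell = torch.nn.LSTMCell(input_size=3, hidden_size=8)  # sizes are assumptions
x = torch.randn(1, 3)                                  # one (partial) observation

def step(state):
    h, c = state.split(8, dim=1)       # unpack hidden and cell states
    h2, c2 = cell(x, (h, c))
    return torch.cat([h2, c2], dim=1)  # repack the full recurrent state

state0 = torch.zeros(1, 16)
J = torch.autograd.functional.jacobian(step, state0)
print(J.shape)  # (1, 16, 1, 16): sensitivity of the new state to the old one
```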
Quantization summarizes continuous distributions by calculating a discrete approximation. Among the widely adopted methods for data quantization is Lloyd's algorithm, which partitions the space into Voronoï cells, which can be seen as clusters, and constructs a discrete distribution based on their centroids and probabilistic masses. Lloyd's algorithm estimates the optimal centroids in a minimal expected distance sense, but this approach poses significant challenges in scenarios where data evaluation is costly and relates to rare events: the single cluster associated with the absence of events then takes most of the probability mass. In this context, a metamodel is required, and adapted sampling methods are necessary to increase the precision of the computations on the rare clusters.
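A minimal sketch of Lloyd's algorithm on synthetic data follows (illustrative only; in the costly, rare-event regime described above one would replace direct evaluations with a metamodel and adapted sampling):

```python
# Lloyd's algorithm: alternate nearest-centroid assignment (Voronoi cells)
# with centroid updates, then read off the discrete approximation.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))  # synthetic sample (assumed)
k = 5                              # number of cells (assumed)
centroids = data[rng.choice(len(data), size=k, replace=False)]

for _ in range(50):
    dists = np.linalg.norm(data[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)              # assign points to cells
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = data[labels == j].mean(axis=0)  # recenter

masses = np.bincount(labels, minlength=k) / len(data)
print(centroids, masses)  # discrete distribution: support points and weights
```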
A central problem in computational statistics is to convert a procedure for sampling combinatorial objects into a procedure for counting those objects, and vice versa. We consider sampling problems coming from *Gibbs distributions*, which are probability distributions of the form $\mu^\Omega_\beta(\omega) \propto e^{\beta H(\omega)}$ for $\beta$ in an interval $[\beta_{\min}, \beta_{\max}]$ and $H(\omega) \in \{0\} \cup [1, n]$. The *partition function* is the normalization factor $Z(\beta)=\sum_{\omega \in\Omega}e^{\beta H(\omega)}$. Two important parameters are the log partition ratio $q = \log \tfrac{Z(\beta_{\max})}{Z(\beta_{\min})}$ and the vector of counts $c_x = |H^{-1}(x)|$. Our first result is an algorithm to estimate the counts $c_x$ using roughly $\tilde O( \frac{q}{\epsilon^2})$ samples for general Gibbs distributions and $\tilde O( \frac{n^2}{\epsilon^2} )$ samples for integer-valued distributions (ignoring some second-order terms and parameters). We show this is optimal up to logarithmic factors. We illustrate with improved algorithms for counting connected subgraphs and perfect matchings in a graph. We develop a key subroutine for global estimation of the partition function. Specifically, we produce a data structure to estimate $Z(\beta)$ for \emph{all} values $\beta$, without further samples. Constructing the data structure requires $O(\frac{q \log n}{\epsilon^2})$ samples for general Gibbs distributions and $O(\frac{n^2 \log n}{\epsilon^2} + n \log q)$ samples for integer-valued distributions. This improves over a prior algorithm of Kolmogorov (2018), which computes the single point estimate $Z(\beta_{\max})$ using $\tilde O(\frac{q}{\epsilon^2})$ samples. We also show that this complexity is optimal as a function of $n$ and $q$ up to logarithmic terms.
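For intuition, sampling-to-counting reductions in this setting typically rest on a telescoping product over a cooling schedule $\beta_{\min} = \beta_0 < \beta_1 < \dots < \beta_k = \beta_{\max}$ (a standard identity, not necessarily the exact estimator used here):
$$\frac{Z(\beta_{\max})}{Z(\beta_{\min})} = \prod_{i=0}^{k-1} \frac{Z(\beta_{i+1})}{Z(\beta_i)}, \qquad \frac{Z(\beta_{i+1})}{Z(\beta_i)} = \mathbb{E}_{\omega \sim \mu^\Omega_{\beta_i}}\!\left[e^{(\beta_{i+1}-\beta_i) H(\omega)}\right],$$
so each ratio can be estimated from samples drawn at the preceding parameter, and the total sample complexity is governed by how the schedule covers the range $q = \log \tfrac{Z(\beta_{\max})}{Z(\beta_{\min})}$.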
Interpreting natural language is an increasingly important task in computer algorithms due to the growing availability of unstructured textual data. Natural Language Processing (NLP) applications rely on semantic networks for structured knowledge representation. The fundamental properties of semantic networks must be taken into account when designing NLP algorithms, yet they remain to be structurally investigated. We study the properties of semantic networks from ConceptNet, defined by 7 semantic relations from 11 different languages. We find that semantic networks have universal basic properties: they are sparse, highly clustered, and many exhibit power-law degree distributions. Our findings show that the majority of the considered networks are scale-free. Some networks exhibit language-specific properties determined by grammatical rules; for example, networks from highly inflected languages, such as Latin, German, French, and Spanish, show peaks in the degree distribution that deviate from a power law. We find that, depending on the semantic relation type and the language, the link formation in semantic networks is guided by different principles. In some networks the connections are similarity-based, while in others they are more complementarity-based. Finally, we demonstrate how knowledge of similarity and complementarity in semantic networks can improve NLP algorithms in missing link inference.
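As an illustrative sketch (a Barabási-Albert graph stands in for a ConceptNet-derived network, and the third-party `powerlaw` package implements the fit; neither is claimed to be the paper's pipeline), a degree-distribution power-law test can be run as:

```python
# Fit a discrete power law to a network's degree sequence and compare it
# against an exponential alternative, in the spirit of scale-free testing.
import networkx as nx
import powerlaw  # pip install powerlaw

G = nx.barabasi_albert_graph(10_000, 3)   # stand-in for a semantic network
degrees = [d for _, d in G.degree() if d > 0]

fit = powerlaw.Fit(degrees, discrete=True)
print(fit.power_law.alpha)                # estimated power-law exponent
R, p = fit.distribution_compare("power_law", "exponential")
print(R, p)                               # R > 0 favours the power law
```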
Optimal transport has gained much attention in the field of image processing, for tasks such as computer vision, image interpolation, and medical image registration. Recently, Bredies et al. (ESAIM:M2AN 54:2351-2382, 2020) and Schmitzer et al. (IEEE T MED IMAGING 39:1626-1635, 2019) established a framework of optimal transport regularization for dynamic inverse problems. In this paper, we incorporate the Wasserstein distance, together with total variation, into static inverse problems as a prior regularization. The Wasserstein distance, formulated via the Benamou-Brenier energy, measures the similarity between the given template and the reconstructed image. We also analyze the existence of solutions of this variational problem in the space of Radon measures. Moreover, a first-order primal-dual algorithm is constructed for solving this general imaging problem under a specific grid strategy. Finally, we present numerical experiments for undersampled MRI reconstruction, which show that our proposed model can recover images with high quality and structure preservation.
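For reference, the Benamou-Brenier dynamic formulation of the squared Wasserstein-2 distance invoked above reads (in its standard form):
$$W_2^2(\rho_0, \rho_1) = \min_{(\rho, v)} \int_0^1 \!\!\int_\Omega |v(x,t)|^2\, \rho(x,t)\,dx\,dt \quad \text{s.t.} \quad \partial_t \rho + \nabla \cdot (\rho v) = 0,\;\; \rho(\cdot,0)=\rho_0,\;\; \rho(\cdot,1)=\rho_1,$$
which is the energy that the proposed regularizer adds, alongside total variation, to the static inverse problem.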
Time-dependent basis reduced order models (TDB ROMs) have successfully been used for approximating the solution to nonlinear stochastic partial differential equations (PDEs). For many practical problems of interest, discretizing these PDEs results in massive matrix differential equations (MDEs) that are too expensive to solve using conventional methods. While TDB ROMs have the potential to significantly reduce this computational burden, they still suffer from the following challenges: (i) inefficiency for general nonlinearities, (ii) intrusive implementation, (iii) ill-conditioning in the presence of small singular values, and (iv) error accumulation due to a fixed rank. To address these challenges, we present a scalable method based on oblique projections for solving TDB ROMs that is computationally efficient, minimally intrusive, robust in the presence of small singular values, rank-adaptive, and highly parallelizable. These favorable properties are achieved via low-rank approximation of the time-discrete MDE. Using the discrete empirical interpolation method (DEIM), a low-rank decomposition is computed at each iteration of the time-stepping scheme, enabling a near-optimal approximation at a fraction of the cost. We coin the new approach TDB-CUR, since it is equivalent to a CUR decomposition based on sparse row and column samples of the MDE. We also propose a rank-adaptive procedure to control the error on the fly. Numerical results demonstrate the accuracy, efficiency, and robustness of the new method for a diverse set of problems.
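As a toy illustration of the underlying CUR identity (uniform random index sampling on an exactly low-rank matrix; DEIM-style selection, as used in the method, chooses indices far more carefully):

```python
# CUR approximation from sampled columns C and rows R: A ≈ C @ U @ R
# with the coupling matrix U = pinv(C) @ A @ pinv(R).
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 100))  # rank-10 matrix

cols = rng.choice(100, size=15, replace=False)  # sampled column indices
rows = rng.choice(200, size=15, replace=False)  # sampled row indices
C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(err)  # essentially machine precision, since rank(A) <= sample count
```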
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.
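As a loose sketch of the core idea (a generic multi-head self-attention layer over the node embeddings of one hyperedge, with invented sizes; this is not the authors' exact architecture):

```python
# Score a variable-sized hyperedge by self-attention over its node
# embeddings, then pool and map to an existence probability.
import torch

dim = 64                      # embedding size (assumed)
attn = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
score = torch.nn.Linear(dim, 1)

nodes = torch.randn(1, 5, dim)        # embeddings of a 5-node hyperedge
h, _ = attn(nodes, nodes, nodes)      # context-dependent ("dynamic") embeddings
prob = torch.sigmoid(score(h.mean(dim=1)))
print(prob)                           # probability that the hyperedge exists
```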