Detecting differences in gene expression is an important part of single-cell RNA sequencing experiments, and many statistical methods have been developed for this purpose. Most differential expression analyses focus on comparing expression between two groups (e.g., treatment vs. control). However, there is increasing interest in multi-condition differential expression analyses, in which expression is measured in many conditions and the aim is to accurately detect and estimate expression differences across all conditions. We show that directly modeling single-cell RNA-seq counts in all conditions simultaneously, while also inferring how expression differences are shared across conditions, leads to greatly improved performance for detecting and estimating expression differences compared to existing methods. We illustrate the potential of this new approach by analyzing data from a single-cell experiment studying the effects of cytokine stimulation on gene expression. We call our new method "Poisson multivariate adaptive shrinkage"; it is implemented in an R package available online at https://github.com/stephenslab/poisson.mash.alpha.
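To make the multi-condition count model concrete, here is a minimal simulation-style sketch in Python. It only illustrates the setting (Poisson counts per gene per condition, with effects shared across conditions) and a placeholder fixed-weight shrinkage step; the actual poisson.mash estimator learns the sharing patterns from the data, and all names and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_conditions = 2000, 8

# Hypothetical generative sketch: most genes share one expression level
# across conditions; a small subset has condition-specific log-fold changes.
base = rng.gamma(2.0, 5.0, size=n_genes)                 # baseline rates
effects = np.zeros((n_genes, n_conditions))
de_genes = rng.choice(n_genes, size=200, replace=False)  # truly DE genes
effects[de_genes] = rng.normal(0.0, 1.0, size=(200, n_conditions))
counts = rng.poisson(base[:, None] * np.exp(effects))    # observed counts

# Naive per-condition estimates vs. estimates shrunk toward the gene-wise
# mean. The 0.5 weight is a placeholder; a real empirical-Bayes method
# would infer the amount and pattern of sharing from all genes jointly.
log_hat = np.log(counts + 0.5)
shrunk = 0.5 * log_hat + 0.5 * log_hat.mean(axis=1, keepdims=True)
```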
We discuss a connection between a class of generative models, diffusion models, and nonequilibrium thermodynamics for the Fokker-Planck equation, known as stochastic thermodynamics. Using techniques from stochastic thermodynamics, we derive a speed-accuracy trade-off for diffusion models: a trade-off relationship between the speed and accuracy of data generation. Our result implies that the entropy production rate in the forward process affects the errors in data generation. From a stochastic thermodynamic perspective, our results provide quantitative insight into how best to generate data with diffusion models. The optimal learning protocol is given by the conservative force in stochastic thermodynamics and by a geodesic of the space induced by the 2-Wasserstein distance in optimal transport theory. We numerically illustrate the validity of the speed-accuracy trade-off for diffusion models with different noise schedules, such as the cosine schedule, conditional optimal transport, and optimal transport.
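For concreteness, the cosine schedule mentioned above sets the cumulative signal level ᾱ_t of the forward noising process via a squared cosine (as proposed by Nichol & Dhariwal); a small Python sketch, with the offset s = 0.008 and the 0.999 clipping taken from that common formulation rather than from this paper:

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t) of the cosine noise schedule."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

t = np.arange(0, 1001)
alpha_bar = cosine_alpha_bar(t, T=1000)
# Per-step noise variances beta_t, clipped as in the original proposal.
beta = np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
```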
Multi-way data extend two-way matrices into higher-dimensional tensors, often explored through dimension reduction techniques. In this paper, we study the Parallel Factor Analysis (PARAFAC) model for handling multi-way data, representing it more compactly through a concise set of loading matrices and scores. We assume that the data may be incomplete and may contain both rowwise and cellwise outliers: cases that deviate from the majority, as well as outlying cells dispersed throughout the data array. To address these challenges, we present a novel algorithm designed to robustly estimate both loadings and scores. Additionally, we introduce an enhanced outlier map to distinguish various patterns of outlying behavior. Through simulations and the analysis of fluorescence Excitation-Emission Matrix (EEM) data, we demonstrate the robustness of our approach. Our results underscore the effectiveness of the proposed diagnostic tools in identifying and interpreting unusual patterns within the data.
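As background, the plain (non-robust) PARAFAC/CP decomposition with R components approximates a three-way array X by a sum of R rank-one terms a_r ∘ b_r ∘ c_r. Below is a minimal alternating-least-squares sketch of that classical fit; it is not the paper's robust, missing-data-aware algorithm, only the baseline model it builds on:

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Khatri-Rao product: (J*K) x R."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def cp_als(X, R, n_iter=100, seed=0):
    """Plain CP/PARAFAC via alternating least squares (no robustness)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    X0 = X.reshape(I, J * K)                        # mode-1 unfolding
    X1 = np.moveaxis(X, 1, 0).reshape(J, I * K)     # mode-2 unfolding
    X2 = np.moveaxis(X, 2, 0).reshape(K, I * J)     # mode-3 unfolding
    for _ in range(n_iter):
        A = X0 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = X1 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = X2 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C
```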
We provide an algorithm for deciding simple grammar bisimilarity whose complexity is polynomial in the valuation of the grammar (maximum seminorm among production rules). Since the valuation is at most exponential in the size of the grammar, this gives rise to a single-exponential running time. Previously only a doubly-exponential algorithm was known. As an application, we provide a conversion from context-free session types to simple grammars whose valuation is linear in the size of the type. In this way, we provide the first polynomial-time algorithm for deciding context-free session type equivalence.
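The valuation above is built from seminorms of nonterminals, i.e., the length of a shortest word a nonterminal can derive. A small fixed-point sketch under a simplified grammar encoding (the dictionary representation and the reading of "valuation" as the maximum finite seminorm over production rules are illustrative assumptions):

```python
import math

def seminorms(grammar):
    """grammar: dict mapping nonterminal -> list of (terminal, [nonterminals]).
    Returns the seminorm of each nonterminal: the length of a shortest
    derivable word (math.inf if no finite word is derivable)."""
    norm = {X: math.inf for X in grammar}
    changed = True
    while changed:                      # Bellman-Ford-style fixed point
        changed = False
        for X, prods in grammar.items():
            for _, alpha in prods:
                cand = 1 + sum(norm[Y] for Y in alpha)
                if cand < norm[X]:
                    norm[X] = cand
                    changed = True
    return norm

# X -> a | b X X : the shortest word from X is "a", so seminorm(X) = 1.
print(seminorms({"X": [("a", []), ("b", ["X", "X"])]}))
```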
Purpose: Simulation-based digital twins represent an effort to provide high-accuracy real-time insights into operational physical processes. However, the computation time of many multi-physical simulation models is far from real-time, and may even exceed sensible time frames for producing sufficient data to train data-driven reduced-order models. This study presents TwinLab, a framework for data-efficient yet accurate training of neural-ODE-type reduced-order models with only two data sets. Design/methodology/approach: Correlations between the test errors of reduced-order models and distinct features of the corresponding training data are investigated. Having found the single best data set for training, a second data set is sought with the help of similarity and error measures to enrich the training process effectively. Findings: Adding a suitable second training data set reduces the test error by up to 49% compared to the best base reduced-order model trained with only one data set. Such a second training data set should, at a minimum, yield a good reduced-order model on its own and exhibit a high level of dissimilarity to the base training data set with regard to the respective excitation signal. Moreover, the base reduced-order model should have elevated test errors on the second data set. The relative error of the time series ranges from 0.18% to 0.49%. Prediction speed-ups of up to a factor of 36,000 are observed. Originality: The proposed computational framework facilitates the automated, data-efficient extraction of non-intrusive reduced-order models for digital twins from existing simulation models, independent of the simulation software.
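The selection heuristic described above might be sketched as follows. The concrete similarity and error measures (a correlation-based dissimilarity and plain RMSE) and all function names are stand-ins, since the paper's exact measures and the TwinLab API are not reproduced here:

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def pick_second_dataset(base_signal, base_rom_predict, candidates):
    """Rank candidate (excitation, response) pairs for enriching training.
    Prefer candidates that are dissimilar to the base excitation signal
    and on which the base reduced-order model performs poorly."""
    scores = []
    for excitation, response in candidates:
        dissimilarity = 1.0 - abs(np.corrcoef(base_signal, excitation)[0, 1])
        base_error = rmse(base_rom_predict(excitation), response)
        scores.append(dissimilarity * base_error)   # naive combination
    return int(np.argmax(scores))
```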
We introduce a fast algorithm for Gaussian process regression in low dimensions, applicable to a widely used family of non-stationary kernels. The non-stationarity of these kernels is induced by arbitrary spatially-varying vertical and horizontal scales. In particular, any stationary kernel can be accommodated as a special case, and we focus especially on the generalization of the standard Mat\'ern kernel. Our subroutine for kernel matrix-vector multiplications scales almost optimally as $O(N\log N)$, where $N$ is the number of regression points. Like the recently developed equispaced Fourier Gaussian process (EFGP) methodology, which is applicable only to stationary kernels, our approach exploits non-uniform fast Fourier transforms (NUFFTs). We offer a complete analysis controlling the approximation error of our method, and we validate the method's practical performance with numerical experiments. In particular, we demonstrate improved scalability compared to state-of-the-art rank-structured approaches in spatial dimension $d>1$.
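One standard member of this kernel family is the Gibbs kernel, in which a spatially-varying length scale $\ell(x)$ and amplitude $\sigma(x)$ replace the constants of a stationary squared-exponential kernel (the paper emphasizes a Mat\'ern generalization; the squared-exponential form is shown here for simplicity). A naive $O(N^2)$-memory, $O(N^3)$-solve sketch for intuition only; the paper's contribution, the fast NUFFT-based matrix-vector product, is not shown:

```python
import numpy as np

def gibbs_kernel(x1, x2, length, amp):
    """Non-stationary squared-exponential (Gibbs) kernel in 1D with
    input-dependent horizontal scale length(x) and vertical scale amp(x)."""
    l1, l2 = length(x1)[:, None], length(x2)[None, :]
    pref = amp(x1)[:, None] * amp(x2)[None, :]
    pref *= np.sqrt(2 * l1 * l2 / (l1**2 + l2**2))
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return pref * np.exp(-d2 / (l1**2 + l2**2))

# Naive GP regression with this kernel on toy data (assumed scale functions).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(12 * x) + 0.1 * rng.standard_normal(200)
length = lambda t: 0.05 + 0.2 * t          # spatially-varying length scale
amp = lambda t: np.ones_like(t)            # constant vertical scale
K = gibbs_kernel(x, x, length, amp) + 1e-2 * np.eye(len(x))
alpha = np.linalg.solve(K, y)              # predictive weights
```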
This work explores multi-modal inference in a simplified high-dimensional model, analytically quantifying the performance gain of multi-modal inference over analyzing each modality in isolation. We present the Bayes-optimal performance and weak-recovery thresholds in a model where the objective is to recover the latent structures from two noisy data matrices with correlated spikes. We derive the approximate message passing (AMP) algorithm for this model and characterize its performance in the high-dimensional limit via the associated state evolution. The analysis holds for a broad range of priors and noise channels, which can differ across modalities. The linearization of AMP is compared numerically to the widely used partial least squares (PLS) and canonical correlation analysis (CCA) methods, both of which are observed to suffer from a sub-optimal recovery threshold.
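A data-generation sketch of one plausible instance of such a model: two rank-one spiked matrices whose sample-side latent vectors are correlated across modalities. The Gaussian priors and noise, the exact SNR scaling, and the coupling through a single correlation parameter are all assumptions for illustration; the paper's priors and noise channels may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 1000, 800, 600
snr1, snr2, rho = 2.0, 1.5, 0.8       # per-modality SNRs, spike correlation

# Correlated sample-side latent vectors shared across the two modalities.
u1 = rng.standard_normal(n)
u2 = rho * u1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
v1, v2 = rng.standard_normal(p1), rng.standard_normal(p2)

# Rank-one spikes buried in additive Gaussian noise.
Y1 = np.sqrt(snr1 / n) * np.outer(u1, v1) + rng.standard_normal((n, p1))
Y2 = np.sqrt(snr2 / n) * np.outer(u2, v2) + rng.standard_normal((n, p2))
```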
Highly resolved finite element simulations of a laser beam welding process are considered. The thermomechanical behavior of this process is modeled with a set of thermoelasticity equations resulting in a nonlinear, nonsymmetric saddle point system. Newton's method is used to solve the nonlinearity, and suitable domain decomposition preconditioners are applied to accelerate the convergence of the iterative method used to solve all linearized systems. Since a one-level Schwarz preconditioner is, in general, not scalable, a second level has to be added. Therefore, the construction and numerical analysis of a monolithic, two-level overlapping Schwarz approach with the GDSW (Generalized Dryja-Smith-Widlund) coarse space, and an economic variant thereof, are presented here.
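To illustrate why the second level matters, here is a toy additive Schwarz preconditioner for a 1D Laplacian. The coarse space below is plain aggregation, a simple stand-in for the GDSW coarse space used in the paper, and the model problem is far simpler than the thermoelasticity saddle point system:

```python
import numpy as np

def laplacian_1d(n):
    return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def additive_schwarz(A, subdomains, Phi=None):
    """Return the action of M^{-1}. subdomains: overlapping index arrays.
    Phi: optional n x n_c coarse basis; supplying it makes the method
    two-level (local solves plus a coarse correction)."""
    solves = [(idx, np.linalg.inv(A[np.ix_(idx, idx)])) for idx in subdomains]
    A0inv = np.linalg.inv(Phi.T @ A @ Phi) if Phi is not None else None
    def apply(r):
        z = np.zeros_like(r)
        for idx, Ainv in solves:
            z[idx] += Ainv @ r[idx]                  # local subdomain solves
        if A0inv is not None:
            z += Phi @ (A0inv @ (Phi.T @ r))         # coarse correction
        return z
    return apply

n, nsub, ov = 120, 6, 4
A = laplacian_1d(n)
doms = [np.arange(max(0, i * n // nsub - ov), min(n, (i + 1) * n // nsub + ov))
        for i in range(nsub)]
Phi = np.zeros((n, nsub))                  # aggregation-based coarse space
for j in range(nsub):
    Phi[j * n // nsub:(j + 1) * n // nsub, j] = 1.0
M1 = additive_schwarz(A, doms)             # one-level: iteration counts grow
M2 = additive_schwarz(A, doms, Phi)        # two-level: scalable in nsub
```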
Pairwise sequence comparison is one of the most fundamental problems in string processing. The most common metric to quantify the similarity between sequences S and T is edit distance, d(S,T), which corresponds to the number of characters that need to be substituted, deleted from, or inserted into S to generate T. However, fewer edit operations may suffice to transform one string into the other if larger rearrangements are permitted. Block edit distance refers to such changes at the substring level (i.e., blocks), "penalizing" entire block removals, insertions, copies, and reversals with the same cost as single-character edits (Lopresti & Tomkins, 1997). Most studies of block edit distance to date have aimed only to characterize the distance itself, for applications in sequence nearest-neighbor search, without reporting the full alignment details. Although a few tools, such as GR-Aligner, attempt to solve block edit distance for genomic sequences, they have limited functionality and are no longer maintained. Here, we present SABER, an algorithm to solve block edit distance that supports block deletions, block moves, and block reversals in addition to the classical single-character edit operations. Our algorithm runs in O(m^2 · n · l_range) time for |S| = m, |T| = n, and a permitted block size range l_range, and can report all breakpoints for the block operations. We also provide an implementation of SABER, currently optimized for genomic sequences (i.e., over the DNA alphabet), although the algorithm can theoretically be used for any alphabet. SABER is available at https://github.com/BilkentCompGen/saber
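For reference, the classical single-character edit distance that block edit distance extends is the textbook Levenshtein dynamic program below; SABER's block deletions, moves, and reversals add further transitions on top of this recurrence, which are not sketched here:

```python
def edit_distance(S, T):
    """Levenshtein distance: substitutions, deletions, insertions, cost 1."""
    m, n = len(S), len(T)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                      # delete all of S[:i]
    for j in range(n + 1):
        d[0][j] = j                                      # insert all of T[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # delete
                          d[i][j - 1] + 1,                       # insert
                          d[i - 1][j - 1] + (S[i - 1] != T[j - 1]))  # substitute
    return d[m][n]

assert edit_distance("ACGT", "AGT") == 1
```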
A finite element method is introduced to track interface evolution governed by the level set equation. The method solves for the level set indicator function in a narrow band around the interface. An extension procedure, which is essential for a narrow band level set method, is introduced based on a finite element $L^2$- or $H^1$-projection combined with the ghost-penalty method. This procedure is formulated as a linear variational problem in a narrow band around the surface, making it computationally efficient and suitable for rigorous error analysis. The extension method is combined with a discontinuous Galerkin space discretization and a BDF time-stepping scheme. The paper analyzes the stability and accuracy of the extension procedure and evaluates the performance of the resulting narrow band finite element method for the level set equation through numerical experiments.
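Schematically, the transported quantity and the extension step might be written as follows. This is only a sketch: the precise ghost-penalty form $s_h$ and its scaling are assumptions reflecting one common choice, not necessarily the paper's formulation.

```latex
% Level set transport and a schematic L^2-projection extension.
% The ghost-penalty term s_h below is one common (assumed) choice.
\begin{align*}
  &\text{Level set equation:}\quad
    \partial_t \phi + \mathbf{u}\cdot\nabla\phi = 0, \\
  &\text{Extension step: find } \phi^e \in V_h^\Gamma \text{ such that}\quad
    (\phi^e, v_h)_{\Omega_\Gamma} + \gamma\, s_h(\phi^e, v_h)
    = (\phi_h, v_h)_{\Omega_\Gamma} \quad \forall\, v_h \in V_h^\Gamma, \\
  &\text{with}\quad
    s_h(u,v) = \sum_{F\in\mathcal{F}_\Gamma} h_F \int_F
      [\![\nabla u\cdot \mathbf{n}_F]\!]\,
      [\![\nabla v\cdot \mathbf{n}_F]\!]\,\mathrm{d}s .
\end{align*}
```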