Omics technologies are powerful tools for analyzing patterns in gene expression data for thousands of genes. Due to systematic variations in experiments, raw gene expression data are often obscured by undesirable technical noise. Various normalization techniques have been designed to remove these non-biological errors prior to statistical analysis. One motivation for normalizing data is the need to recover the covariance matrix used in gene network analysis. In this paper, we introduce a novel normalization technique, called the covariance shift (C-SHIFT) method. This normalization algorithm combines optimization techniques with the blessing-of-dimensionality philosophy and an energy-minimization hypothesis to recover the covariance matrix under additive noise (known in biology as the bias). It is therefore well suited to the analysis of logarithmic gene expression data. Numerical experiments on synthetic data demonstrate the method's advantage over classical normalization techniques, namely Rank, Quantile, cyclic LOESS (locally estimated scatterplot smoothing), and MAD (median absolute deviation) normalization. We also evaluate the performance of the C-SHIFT algorithm on real biological data.
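To make the additive-noise setting concrete, the following is a minimal sketch (with toy dimensions and hypothetical variable names, not the authors' implementation) of how a common per-sample bias on log-expression data distorts the gene-gene covariance, and how a crude shift removal partially corrects it:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 50, 200

# Ground-truth gene-gene covariance (toy example).
A = rng.normal(size=(n_genes, n_genes))
true_cov = A @ A.T / n_genes

# Clean log-expression data.
clean = rng.multivariate_normal(np.zeros(n_genes), true_cov, size=n_samples)

# Additive per-sample bias: each sample is shifted by a common scalar.
# This is the additive-noise (bias) model described in the abstract.
bias = rng.normal(scale=2.0, size=(n_samples, 1))
noisy = clean + bias  # broadcast across genes

naive_cov = np.cov(noisy, rowvar=False)
err = np.linalg.norm(naive_cov - true_cov) / np.linalg.norm(true_cov)
print(f"relative covariance error under additive bias: {err:.2f}")

# A crude correction: remove each sample's mean across genes before
# estimating the covariance.  This simple shift removal introduces its own
# distortion, which is what motivates more careful optimization-based
# recovery such as C-SHIFT.
centered = noisy - noisy.mean(axis=1, keepdims=True)
corrected_cov = np.cov(centered, rowvar=False)
err_c = np.linalg.norm(corrected_cov - true_cov) / np.linalg.norm(true_cov)
print(f"relative error after per-sample shift removal: {err_c:.2f}")
\end{verbatim}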
Discovery of new knowledge is increasingly data-driven, predicated on a team's ability to collaboratively create, find, analyze, retrieve, and share pertinent datasets over the duration of an investigation. This is especially true in scientific discovery, where the generation, analysis, and interpretation of data are the fundamental mechanisms by which research teams collaborate toward a shared scientific goal. Data-driven discovery in general, and scientific discovery in particular, is distinguished by complex and diverse data models and formats that evolve over the lifetime of an investigation. While databases and related information systems have the potential to be valuable tools in the discovery process, developing effective interfaces for data-driven discovery remains a roadblock to the application of database technology as an essential tool in scientific investigations. In this paper, we present a model-adaptive approach to creating interaction environments for data-driven discovery of scientific data that automatically generates interactive user interfaces for editing, searching, and viewing scientific data based entirely on introspection of an extended relational data model. We have applied model-adaptive interface generation to many active scientific investigations spanning proteomics, bioinformatics, neuroscience, occupational therapy, stem cells, and genitourinary and craniofacial development, among other domains. We present the approach, its implementation, and its evaluation through analysis of its usage in diverse scientific settings.
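As an illustration of the model-adaptive idea (a sketch of the general pattern, not the system described above), one can introspect a relational schema and mechanically derive an editing-form description from it; table and column names below are hypothetical:

\begin{verbatim}
import sqlite3

# Introspect a relational schema and derive a form description from it.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE specimen (
    id INTEGER PRIMARY KEY,
    species TEXT NOT NULL,
    collected DATE,
    weight_g REAL)""")

WIDGETS = {"INTEGER": "number", "REAL": "number",
           "TEXT": "text", "DATE": "date"}

def form_spec(table):
    """Map each column to a widget based purely on schema introspection."""
    spec = []
    for _, name, ctype, notnull, _, pk in conn.execute(
            f"PRAGMA table_info({table})"):
        if pk:
            continue  # primary keys are system-managed, not edited
        spec.append({"field": name,
                     "widget": WIDGETS.get(ctype, "text"),
                     "required": bool(notnull)})
    return spec

for field in form_spec("specimen"):
    print(field)
\end{verbatim}

Because the interface derives entirely from the schema, it adapts automatically as the data model evolves over the investigation.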
We investigate a clustering problem with data from a mixture of Gaussians that share a common but unknown, and potentially ill-conditioned, covariance matrix. We start by considering Gaussian mixtures with two equally sized components and derive a Max-Cut integer program based on maximum likelihood estimation. We prove that its solutions achieve the optimal misclassification rate when the number of samples grows linearly in the dimension, up to a logarithmic factor. However, solving the Max-Cut problem appears to be computationally intractable. To overcome this, we develop an efficient spectral algorithm that attains the optimal rate but requires a quadratic sample size. Although this sample complexity is worse than that of the Max-Cut program, we conjecture that no polynomial-time method can perform better. Furthermore, we gather numerical and theoretical evidence supporting the existence of a statistical-computational gap. Finally, we generalize the Max-Cut program to a $k$-means program that handles multi-component mixtures with possibly unequal weights. It enjoys similar optimality guarantees for mixtures of distributions that satisfy a transportation-cost inequality, encompassing Gaussian and strongly log-concave distributions.
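The generic spectral route can be sketched as follows for two components with means $\pm\mu$ (a simplified illustration assuming a well-conditioned shared covariance, not the authors' specific estimator): take the top eigenvector of the sample second-moment matrix and sign-threshold the projections.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 2000
mu = np.ones(d) / np.sqrt(d) * 3.0
labels = rng.integers(0, 2, size=n) * 2 - 1          # +/- 1
X = labels[:, None] * mu + rng.normal(size=(n, d))   # identity covariance

M = X.T @ X / n                     # sample second moment ~ Sigma + mu mu^T
eigvals, eigvecs = np.linalg.eigh(M)
v = eigvecs[:, -1]                  # top eigenvector, aligned with mu
pred = np.sign(X @ v)               # sign of projection gives the cluster

acc = max(np.mean(pred == labels), np.mean(pred == -labels))
print(f"clustering accuracy: {acc:.3f}")
\end{verbatim}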
We consider a model of energy minimization arising in the study of the mechanical behavior caused by cell contraction within a fibrous biological medium. The macroscopic model is based on the theory of non-rank-one-convex nonlinear elasticity for phase transitions. We study appropriate numerical approximations based on the discontinuous Galerkin treatment of higher gradients, which have been used successfully in numerical simulations of experiments. We show that the discrete minimizers converge in the limit to minimizers of the continuous problem. This is achieved by employing the theory of $\Gamma$-convergence of the approximate energy functionals to the continuous model as the discretization parameter tends to zero. The analysis is involved because the numerical approximations are defined in spaces of lower regularity than the space in which the minimizers of the continuous variational problem are sought. This fact leads to the development of a new approach to $\Gamma$-convergence, appropriate for discontinuous finite element discretizations, which can be applied to quite general energy minimization problems. Furthermore, the adoption of exponential terms penalizing the interpenetration of matter requires a new framework based on Orlicz spaces for discontinuous Galerkin methods, which is developed in this paper as well.
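For reference, the standard notion of $\Gamma$-convergence at work here (a textbook definition, not specific to this paper) reads: the discrete energies $E_h$ $\Gamma$-converge to $E$ on a metric space $(X,d)$ if, for every $u\in X$,
\[
\text{(i)}\;\; E(u)\le \liminf_{h\to 0} E_h(u_h)\ \text{ for every sequence } u_h\to u,
\qquad
\text{(ii)}\;\; \exists\, u_h\to u \text{ such that } \limsup_{h\to 0} E_h(u_h)\le E(u).
\]
Together with equicoercivity of $(E_h)$, these two conditions guarantee that cluster points of discrete minimizers are minimizers of $E$ and that $\min E_h\to\min E$, which is precisely the mode of convergence of discrete minimizers asserted above.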
This paper investigates a novel offline change-point detection problem from an information-theoretic perspective. In contrast to most related works, we assume that the underlying pre- and post-change distributions are unknown and can only be learned from the available training sequences. We further require the probability of the \emph{estimation error} to decay either exponentially or sub-exponentially fast (corresponding, respectively, to the large and moderate deviations regimes in information-theoretic parlance). Based on the training sequences as well as the test sequence containing a single change-point, we design a change-point estimator and show that it is optimal by establishing matching (strong) converses. This leads to a full characterization of the optimal confidence width (i.e., half the width of the confidence interval within which the true change-point is located with high probability) as a function of the undetected error, under both the large and moderate deviations regimes.
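A generic plug-in construction of this flavor (a sketch of the setting, not the paper's optimal estimator) estimates the pre- and post-change distributions from the training sequences and places the change-point where the empirical log-likelihood of the test sequence is maximized:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

train_pre = rng.integers(0, 4, size=5000)   # P: uniform on {0,1,2,3}
train_post = rng.choice(4, size=5000, p=[0.1, 0.2, 0.3, 0.4])  # Q

def empirical_pmf(x, k=4, eps=1e-6):
    counts = np.bincount(x, minlength=k) + eps  # smoothing avoids log(0)
    return counts / counts.sum()

p_hat = empirical_pmf(train_pre)
q_hat = empirical_pmf(train_post)

# Test sequence with a true change-point at t = 300.
test = np.concatenate([rng.integers(0, 4, size=300),
                       rng.choice(4, size=200, p=[0.1, 0.2, 0.3, 0.4])])

log_p, log_q = np.log(p_hat[test]), np.log(q_hat[test])
# Log-likelihood of splitting the sequence at each candidate point t.
scores = np.cumsum(log_p) + (log_q.sum() - np.cumsum(log_q))
t_hat = int(np.argmax(scores)) + 1
print(f"estimated change-point: {t_hat}")
\end{verbatim}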
We consider the problem of high-dimensional filtering of state-space models (SSMs) at discrete times. This problem is particularly challenging because analytical solutions are typically unavailable and many numerical approximation methods have a cost that scales exponentially with the dimension of the hidden state. Inspired by lag-approximation methods for the smoothing problem, we introduce a lagged approximation of the smoothing distribution that is necessarily biased. For certain classes of SSMs, particularly those that forget the initial condition exponentially fast in time, the bias of our approximation is shown to be uniformly controlled in the dimension and exponentially small in time. We develop a sequential Monte Carlo (SMC) method to recursively estimate expectations with respect to our biased filtering distributions. Moreover, we prove for a class of non-i.i.d.~SSMs that, as the dimension $d\rightarrow\infty$, the cost to achieve a stable mean square error in estimation, for classes of expectations, is of $\mathcal{O}(Nd^2)$ per unit time, where $N$ is the number of simulated samples in the SMC algorithm. Our methodology is implemented on several challenging high-dimensional examples, including the conservative shallow-water model.
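The SMC recursion that such approximations build on is the textbook bootstrap particle filter, sketched below for a one-dimensional linear-Gaussian SSM (the lag construction itself is omitted; this is the generic filter, not the paper's algorithm):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
T, N = 100, 500          # time steps, particles
a, sig_x, sig_y = 0.9, 1.0, 0.5

# Simulate x_t = a x_{t-1} + N(0, sig_x^2),  y_t = x_t + N(0, sig_y^2).
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + sig_x * rng.normal()
y = x + sig_y * rng.normal(size=T)

particles = rng.normal(size=N)
est = np.zeros(T)
for t in range(T):
    particles = a * particles + sig_x * rng.normal(size=N)   # propagate
    logw = -0.5 * ((y[t] - particles) / sig_y) ** 2          # weight
    w = np.exp(logw - logw.max())
    w /= w.sum()
    est[t] = np.sum(w * particles)                           # filtering mean
    particles = particles[rng.choice(N, size=N, p=w)]        # resample

rmse = np.sqrt(np.mean((est - x) ** 2))
print(f"RMSE of filtering mean vs. latent state: {rmse:.3f}")
\end{verbatim}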
We study the rank of the instantaneous or spot covariance matrix $\Sigma_X(t)$ of a multidimensional continuous semimartingale $X(t)$. Given high-frequency observations $X(i/n)$, $i=0,\ldots,n$, we test the null hypothesis $\mathrm{rank}(\Sigma_X(t))\le r$ for all $t$ against local alternatives where the average $(r+1)$st eigenvalue is larger than some signal detection rate $v_n$. A major problem is that the inherent averaging in local covariance statistics produces a bias that distorts the rank statistics. We show that the bias depends on the regularity and a spectral gap of $\Sigma_X(t)$. We establish explicit matrix perturbation and concentration results that provide non-asymptotic uniform critical values and optimal signal detection rates $v_n$. This leads to a rank estimation method via sequential testing. For a class of stochastic volatility models, we determine data-driven critical values via normed $p$-variations of estimated local covariance matrices. The methods are illustrated by simulations and an application to high-frequency data on U.S. government bonds.
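A minimal sketch of the local covariance statistic in question (toy constant-volatility setting, without the paper's bias corrections or critical values): estimate the spot covariance by averaging outer products of high-frequency increments over a window and inspect its spectrum.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n, d, r = 10_000, 8, 3

# Latent r-factor diffusion: increments lie in an r-dimensional subspace,
# so rank(Sigma_X) = r by construction.
B = rng.normal(size=(d, r))
dW = rng.normal(scale=np.sqrt(1 / n), size=(n, r))
dX = dW @ B.T

k = 500  # local window length
window = dX[:k]
spot_cov = n / k * (window.T @ window)   # local realized covariance

eigvals = np.linalg.eigvalsh(spot_cov)[::-1]
print("eigenvalues:", np.round(eigvals, 3))
print("estimated rank (eigenvalues above 1e-8):",
      int(np.sum(eigvals > 1e-8)))
\end{verbatim}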
Normalizing flows are a promising tool for modeling probability distributions in physical systems. While state-of-the-art flows accurately approximate distributions and energies, applications in physics additionally require smooth energies to compute forces and higher-order derivatives. Furthermore, such densities are often defined on non-trivial topologies. A recent example is Boltzmann Generators for generating 3D structures of peptides and small proteins. These generative models leverage the space of internal coordinates (dihedrals, angles, and bonds), which is a product of hypertori and compact intervals. In this work, we introduce a class of smooth mixture transformations working on both compact intervals and hypertori. In practice, mixture transformations must be inverted by root-finding methods, which has so far prevented bi-directional flow training. To overcome this, we show that parameter gradients and forces of such inverses can be computed from forward evaluations via the inverse function theorem. We demonstrate two advantages of such smooth flows: they allow training by force matching to simulation data and can be used as potentials in molecular dynamics simulations.
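The inverse-function-theorem trick can be illustrated on a toy monotone map standing in for a mixture transformation: invert by bisection, then obtain the derivative of the inverse from forward evaluations alone, since $(f^{-1})'(y) = 1/f'(f^{-1}(y))$.

\begin{verbatim}
import numpy as np

def f(x):
    return x + 0.3 * np.sin(x)       # smooth, strictly increasing on R

def f_prime(x):
    return 1.0 + 0.3 * np.cos(x)     # derivative of the forward map

def f_inverse(y, lo=-10.0, hi=10.0, iters=60):
    """Invert f by bisection (root-finding, as in the abstract)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

y = 1.7
x = f_inverse(y)
grad_inverse = 1.0 / f_prime(x)      # inverse function theorem

# Check against a finite difference of the inverse.
eps = 1e-5
fd = (f_inverse(y + eps) - f_inverse(y - eps)) / (2 * eps)
print(f"x = {x:.6f}, IFT gradient = {grad_inverse:.6f}, "
      f"finite diff = {fd:.6f}")
\end{verbatim}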
Learning the relationships between various entities from time-series data is essential in many applications. Gaussian graphical models have been studied to infer these relationships. However, existing algorithms process data in a batch at a central location, limiting their applicability in scenarios where data is gathered by different agents. In this paper, we propose a distributed sparse inverse covariance algorithm to learn the network structure (i.e., the dependencies among observed entities) in real time from data collected by distributed agents. Our approach is built on an online graphical alternating minimization algorithm, augmented with a consensus term that allows agents to learn the desired structure cooperatively. We allow the system designer to select the number of communication rounds and optimization steps per data point. We characterize the rate of convergence of our algorithm and provide simulations on synthetic datasets.
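The consensus mechanism can be sketched as follows (a simplified stand-in for the paper's online graphical alternating minimization algorithm, with hypothetical toy dimensions and a ring topology): each agent takes a gradient step on its local Gaussian log-likelihood with $\ell_1$ shrinkage, then averages with its neighbors.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
d, n_agents, n_local = 5, 4, 200

true_prec = (np.eye(d) + 0.3 * np.diag(np.ones(d - 1), 1)
             + 0.3 * np.diag(np.ones(d - 1), -1))
cov = np.linalg.inv(true_prec)
data = [rng.multivariate_normal(np.zeros(d), cov, size=n_local)
        for _ in range(n_agents)]
S = [X.T @ X / n_local for X in data]   # local sample covariances

# Ring topology: each agent averages with its two neighbors.
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = W[i, (i + 1) % n_agents] = 0.25

Theta = [np.eye(d) for _ in range(n_agents)]
step, lam = 0.05, 0.01
for _ in range(300):
    # Gradient of -logdet(Theta) + tr(S Theta), then soft thresholding.
    grads = [S[i] - np.linalg.inv(Theta[i]) for i in range(n_agents)]
    upd = [Theta[i] - step * grads[i] for i in range(n_agents)]
    upd = [np.sign(U) * np.maximum(np.abs(U) - step * lam, 0) for U in upd]
    # Consensus averaging across neighbors.
    Theta = [sum(W[i, j] * upd[j] for j in range(n_agents))
             for i in range(n_agents)]

err = np.linalg.norm(Theta[0] - true_prec) / np.linalg.norm(true_prec)
print(f"agent 0 relative error vs. true precision: {err:.3f}")
\end{verbatim}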
We consider the task of learning the parameters of a {\em single} component of a mixture model when we are given {\em side information} about that component; we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than that of solving the original problem in full, where one learns the parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each of these we show that, if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy and improved computational complexity compared with existing moment-based mixture model algorithms (e.g., tensor methods). We also illustrate several natural ways one can obtain such side information for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms, showing significant improvement in runtime and accuracy.
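A toy illustration of the matrix-based idea (a much simplified stand-in for the paper's algorithm, with hypothetical names and dimensions): each sample carries a noisy scalar hint of its membership in the target component, and weighting the second-moment matrix by these hints isolates that component's mean direction via a single eigendecomposition.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
d, n = 30, 5000
mu1, mu2 = np.zeros(d), np.zeros(d)
mu1[0], mu2[1] = 4.0, 4.0            # two well-separated component means

z = rng.integers(0, 2, size=n)       # latent component labels
X = np.where(z[:, None] == 1, mu1, mu2) + rng.normal(size=(n, d))

# Side information: noisy indicator of membership in component 1.
side = z + 0.5 * rng.normal(size=n)

w = np.maximum(side, 0.0)            # crude nonnegative weights
w /= w.sum()
M = (X * w[:, None]).T @ X           # side-information-weighted moment

v = np.linalg.eigh(M)[1][:, -1]      # top eigenvector ~ target mean
v *= np.sign(v[np.argmax(np.abs(v))])
print("cosine similarity with target mean:",
      round(float(v @ mu1 / np.linalg.norm(mu1)), 3))
\end{verbatim}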
In this paper we introduce a covariance framework for the analysis of EEG and MEG data that takes into account observed temporal stationarity on small time scales and trial-to-trial variations. We formulate a model for the covariance matrix, which is a Kronecker product of three components corresponding to space, time, and epochs/trials, and consider maximum likelihood estimation of the unknown parameter values. An iterative algorithm that finds approximations of the maximum likelihood estimates is proposed. We perform a simulation study to assess the performance of the estimator and to investigate the influence of different assumptions about the covariance factors on the estimated covariance matrix and its components. We also illustrate our method on real EEG and MEG data sets. The proposed covariance model is applicable in a variety of cases where spontaneous EEG or MEG acts as a source of noise and realistic noise covariance estimates are needed for accurate dipole localization, such as in evoked activity studies, or where the properties of spontaneous EEG or MEG are themselves the topic of interest, such as in combined EEG/fMRI experiments in which the correlation between EEG and fMRI signals is investigated.
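The Kronecker structure can be sketched as follows (toy dimensions and random factors standing in for the estimated spatial, temporal, and trial components; this illustrates the model, not the iterative estimation algorithm):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)

def random_spd(k):
    A = rng.normal(size=(k, k))
    return A @ A.T + k * np.eye(k)

n_space, n_time, n_trials = 4, 6, 3
S = random_spd(n_space)    # spatial covariance factor
T = random_spd(n_time)     # temporal covariance factor
E = random_spd(n_trials)   # epoch/trial covariance factor

Sigma = np.kron(E, np.kron(T, S))   # full (72 x 72) covariance
print("full covariance shape:", Sigma.shape)

# Parameter advantage: three small factors versus the full matrix.
n_factor = n_space**2 + n_time**2 + n_trials**2
print(f"parameters: {n_factor} in factors vs. {Sigma.size} in full")

# Sampling from the model without ever forming Sigma explicitly:
# vec(X) ~ N(0, E kron T kron S) can be drawn factor by factor.
Ls, Lt, Le = map(np.linalg.cholesky, (S, T, E))
Z = rng.normal(size=(n_space, n_time, n_trials))
X = np.einsum("ai,bj,ck,ijk->abc", Ls, Lt, Le, Z)
print("sample shape (space, time, trials):", X.shape)
\end{verbatim}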