We show that, in particular cases, external randomization can enforce the convergence of test statistics to their limiting distributions, which results in sharper inference. Our approach is based on a central limit theorem for weighted sums. We apply our method to a family of rank-based test statistics and to a family of phi-divergence test statistics, and prove that, with overwhelming probability with respect to the external randomization, the randomized statistics converge at the rate $O(1/n)$ (up to logarithmic factors) to the limiting chi-square distribution in the Kolmogorov metric.
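As a toy illustration of the mechanism (not the rank-based or phi-divergence constructions analyzed in the paper), the following sketch contrasts a plain normalized sum of lattice summands with an externally randomized, self-normalized weighted sum. Conditionally on the data, the latter is exactly standard normal, so the external weights smooth away the lattice structure that slows the classical CLT.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 20_000

# Lattice summands: for Rademacher X the classical normalized sum
# converges to N(0,1) only at rate n^{-1/2} in Kolmogorov distance.
X = rng.choice([-1.0, 1.0], size=(reps, n))
T_plain = X.sum(axis=1) / np.sqrt(n)

# External randomization: i.i.d. N(0,1) weights drawn independently of
# the data; conditionally on X, the self-normalized weighted sum is
# exactly standard normal, so the limit is attained rather than approached.
W = rng.standard_normal((reps, n))
T_rand = (W * X).sum(axis=1) / np.sqrt((X ** 2).sum(axis=1))

for name, T in [("plain", T_plain), ("randomized", T_rand)]:
    ks = stats.kstest(T, "norm").statistic
    print(f"{name:>10}: Kolmogorov distance to N(0,1) = {ks:.4f}")
```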
We analyze the dynamics of a random sequential message passing algorithm for approximate inference with large Gaussian latent variable models in a student-teacher scenario. To model nontrivial dependencies between the latent variables, we assume random covariance matrices drawn from rotation-invariant ensembles. Moreover, we consider a model-mismatch setting, in which the teacher model and the one used by the student may differ. By means of a dynamical functional approach, we obtain exact dynamical mean-field equations characterizing the dynamics of the inference algorithm. We also derive a range of model parameters for which the sequential algorithm does not converge. The boundary of this parameter range coincides with the de Almeida-Thouless (AT) stability condition of the replica-symmetric ansatz for the static probabilistic model.
We are interested in computing the treewidth $\tw(G)$ of a given graph $G$. Our approach is to design heuristic algorithms for computing a sequence of improving upper bounds and a sequence of improving lower bounds, which would hopefully converge to $\tw(G)$ from both sides. The upper bound algorithm extends and simplifies Tamaki's unpublished work on a heuristic use of the dynamic programming algorithm for deciding treewidth due to Bouchitt\'{e} and Todinca. The lower bound algorithm is based on the well-known fact that, for every minor $H$ of $G$, we have $\tw(H) \leq \tw(G)$. Starting from a greedily computed minor $H_0$ of $G$, the algorithm tries to construct a sequence of minors $H_0$, $H_1$, \ldots, $H_k$ with $\tw(H_i) < \tw(H_{i + 1})$ for $0 \leq i < k$ and, hopefully, $\tw(H_k) = \tw(G)$. We have implemented a treewidth solver based on this approach and have evaluated it on the bonus instances from the exact treewidth track of the PACE 2017 algorithm implementation challenge. The results show that our approach is extremely effective in tackling instances that are hard for conventional solvers. Our solver has an additional advantage over conventional ones in that it attaches a compact certificate to the lower bound it computes.
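For intuition, the sketch below implements a standard contraction-based lower bound in this spirit (an MMD+-style heuristic, not the paper's algorithm): every graph produced by the contractions is a minor $H$ of $G$, so the largest minimum degree seen along the way lower-bounds $\tw(G)$. The least-common-neighbors contraction rule is one common heuristic choice assumed here.

```python
import networkx as nx

def minor_lower_bound(G: nx.Graph) -> int:
    """Contraction-based treewidth lower bound (MMD+ heuristic).

    Repeatedly pick a minimum-degree vertex v, record deg(v), and contract
    v into a neighbor. Every intermediate graph H is a minor of G, and
    tw(H) >= min degree of H, so the maximum recorded degree is a valid
    lower bound on tw(G).
    """
    H = G.copy()
    lb = 0
    while H.number_of_nodes() > 1:
        v = min(H.nodes, key=H.degree)
        d = H.degree(v)
        lb = max(lb, d)
        if d == 0:
            H.remove_node(v)  # isolated vertex: drop it
            continue
        # Contract v into the neighbor sharing the fewest common
        # neighbors, a heuristic that keeps the minor dense.
        u = min(H.neighbors(v), key=lambda u: len(set(H[u]) & set(H[v])))
        H = nx.contracted_nodes(H, u, v, self_loops=False)
    return lb
```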
Orthology and paralogy relations are often inferred by methods based on gene similarity, which usually yield a graph depicting the relationships between gene pairs. Such relation graphs are known to frequently contain errors, as they cannot be explained by a gene tree that both contains the depicted orthologs/paralogs and is consistent with a species tree $S$. This idea of detecting errors through inconsistency with a species tree has mostly been studied in the presence of speciation and duplication events only. In this work, we ask: could the given set of relations be consistent if we allow lateral gene transfers in the evolutionary model? We formalize this question and provide a variety of algorithmic results regarding the underlying problems. Namely, we show that deciding if a relation graph $R$ is consistent with a given species network $N$ is NP-hard, and that it is W[1]-hard under the parameter "minimum number of transfers". However, we present an FPT algorithm based on the degree of the $DS$-tree associated with $R$. We also study analogous problems in the case where the transfer highways on a species tree are unknown.
In this paper, we study statistical inference on the similarity/distance between two time series in an uncertain environment by considering a statistical hypothesis test on the distance obtained from the Dynamic Time Warping (DTW) algorithm. The sampling distribution of the DTW distance is too complicated to derive because the distance is obtained as the solution of a complicated algorithm. To circumvent this difficulty, we propose to employ a conditional sampling distribution for the inference, which enables us to derive an exact (non-asymptotic) inference method for the DTW distance. In addition, we develop a novel computational method to compute the conditional sampling distribution. To our knowledge, this is the first method that can provide a valid $p$-value to quantify the statistical significance of the DTW distance, which is helpful for high-stakes decision making. We evaluate the performance of the proposed inference method on both synthetic and real-world datasets.
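For reference, the DTW distance whose significance is being tested is the output of the classical dynamic program sketched below (a minimal univariate version with squared-difference cost; the paper's conditional inference is performed on top of this optimization and is not shown here).

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Classical O(nm) dynamic program for the DTW distance between
    two univariate series, using squared-difference local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Best of insertion, deletion, and match moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```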
In recent years, large-scale Bayesian learning has drawn a great deal of attention. However, in the big-data era, the amount of data we face is growing much faster than our ability to deal with it. Fortunately, large-scale datasets usually possess rich internal structure and are somewhat redundant. In this paper, we attempt to simplify the Bayesian posterior by exploiting this structure. Specifically, we restrict our interest to so-called well-clustered datasets and construct an \emph{approximate posterior} according to the clustering information, where the clustering structure can be obtained efficiently via a particular clustering algorithm. When constructing the approximate posterior, the data points in the same cluster are all replaced by the centroid of the cluster. As a result, the posterior can be significantly simplified. Theoretically, we show that under certain conditions the approximate posterior we construct is close (measured by KL divergence) to the exact posterior. Furthermore, thorough experiments are conducted to validate that the constructed posterior is a good approximation to the true posterior and much easier to sample from.
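The centroid-replacement construction is simple to state in code. The sketch below uses a hypothetical one-dimensional Gaussian location model ($x_i \sim N(\theta, 1)$, $\theta \sim N(0, 1)$) purely for illustration: the clustering is computed once up front, and the $n$ likelihood terms collapse into $k$ centroid terms weighted by cluster sizes.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_approx_log_posterior(X: np.ndarray, k: int = 50):
    """Build the cluster-centroid approximate log posterior for the toy
    model x_i ~ N(theta, 1), theta ~ N(0, 1). All points in a cluster
    are replaced by the centroid, so n terms become k weighted terms."""
    km = KMeans(n_clusters=k, n_init=10).fit(X.reshape(-1, 1))
    centroids = km.cluster_centers_.ravel()
    counts = np.bincount(km.labels_, minlength=k)

    def log_post(theta: float) -> float:
        # Each centroid term is weighted by its cluster size.
        log_lik = np.sum(counts * (-0.5 * (centroids - theta) ** 2))
        log_prior = -0.5 * theta ** 2
        return log_lik + log_prior

    return log_post
```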
Cell type deconvolution is a computational approach to infer the proportions of individual cell types from bulk transcriptomics data. Though many new methods have been developed for cell type deconvolution, most of them only provide point estimates of the cell type proportions. On the other hand, estimates of the cell type proportions can be very noisy due to various sources of bias and randomness, and ignoring their uncertainty may greatly affect the validity of downstream analyses. In this paper, we propose a comprehensive statistical framework for cell type deconvolution and construct asymptotically valid confidence intervals, both for each individual's cell type proportions and for quantifying how cell type proportions change across multiple bulk individuals in downstream regression analyses. Our analysis takes into account various factors, including the biological randomness of gene expression across cells and individuals, gene-gene dependence, and cross-platform biases and sequencing errors, and avoids any parametric assumptions on the data distributions. We also provide identifiability conditions for the cell type proportions when there are arbitrary platform-specific biases across sequencing technologies.
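For context, the basic point estimator that such uncertainty quantification builds on can be sketched as a constrained regression of the bulk profile on a signature matrix. The snippet below shows a standard non-negative least squares estimate; it is not the authors' estimator and omits the confidence intervals that are the paper's contribution.

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(bulk: np.ndarray, signature: np.ndarray) -> np.ndarray:
    """Point estimate of cell type proportions by non-negative least
    squares: bulk (genes,) is regressed on signature (genes x types),
    and the coefficients are renormalized to sum to one."""
    props, _ = nnls(signature, bulk)
    return props / props.sum()
```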
This paper proposes to test the number of common factors in high-dimensional factor models by bootstrap. We provide asymptotic distributions for the eigenvalues of the bootstrapped sample covariance matrix under mild conditions. The spiked eigenvalues converge weakly to Gaussian limits after proper scaling and centralization. The limiting distribution of the largest non-spiked eigenvalue is mainly determined by the order statistics of the bootstrap resampling weights and follows an extreme value distribution. We propose two testing schemes based on the disparate behavior of the spiked and non-spiked eigenvalues. The testing procedures perform reliably even with weak factors and with cross-sectionally and serially correlated errors. Our technical proofs contribute to random matrix theory for distributions with convexly decaying densities and unbounded support, as well as for general elliptical distributions.
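The object under study can be illustrated with a minimal sketch: resample the rows of the data matrix, recompute the sample covariance, and track its top eigenvalues. The paper derives the limiting laws of these quantities and builds the tests on them; the snippet below only shows the resampling step.

```python
import numpy as np

def bootstrap_top_eigenvalues(X: np.ndarray, n_boot: int = 500,
                              k: int = 10, seed: int = 0) -> np.ndarray:
    """For each bootstrap replicate, resample the n rows of X (n x p)
    with replacement, recompute the sample covariance, and record its
    k largest eigenvalues. Spiked eigenvalues fluctuate around Gaussian
    limits, while the largest non-spiked one behaves like an extreme
    value statistic of the resampling weights."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    out = np.empty((n_boot, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        S = np.cov(X[idx], rowvar=False)
        out[b] = np.sort(np.linalg.eigvalsh(S))[::-1][:k]
    return out
```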
This paper studies the consistency and statistical inference of simulated Ising models in a high-dimensional setting. Our estimators are based on Markov chain Monte Carlo maximum likelihood estimation (MCMC-MLE) penalized by the elastic net. Under mild conditions that ensure a specific convergence rate of the MCMC method, the $\ell_{1}$ consistency of the elastic-net-penalized MCMC-MLE is proved. We further propose a decorrelated score test based on the decorrelated score function and prove the asymptotic normality of the score function, free of the influence of the many nuisance parameters, under an assumption that accelerates the convergence of the MCMC method. A one-step estimator for a single parameter of interest is proposed by linearizing the decorrelated score function to solve its root; its asymptotic normality and a confidence interval for the true value are thereby established. Finally, we use different algorithms to control the false discovery rate (FDR) via traditional $p$-values and novel $e$-values.
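As background, MCMC-MLE replaces the intractable Ising normalizing constant with an average over MCMC samples. A minimal single-site Gibbs sampler that could supply such samples is sketched below; it illustrates only the simulation step, not the penalized estimator or the decorrelated score test.

```python
import numpy as np

def gibbs_ising(J: np.ndarray, h: np.ndarray,
                n_sweeps: int = 1000, seed: int = 0) -> np.ndarray:
    """Single-site Gibbs sampler for an Ising model with
    p(s) ∝ exp(s' J s / 2 + h' s), s in {-1, +1}^p (J symmetric).
    The conditional law is p(s_i = +1 | rest) = sigmoid(2 * field_i)."""
    rng = np.random.default_rng(seed)
    p = len(h)
    s = rng.choice([-1.0, 1.0], size=p)
    for _ in range(n_sweeps):
        for i in range(p):
            # Local field on spin i, excluding any self-interaction.
            field = J[i] @ s - J[i, i] * s[i] + h[i]
            prob_up = 1.0 / (1.0 + np.exp(-2.0 * field))
            s[i] = 1.0 if rng.random() < prob_up else -1.0
    return s
```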
A recent emerging trend in the literature on learning in games has been concerned with providing faster learning dynamics for correlated and coarse correlated equilibria in normal-form games. Much less is known about the significantly more challenging setting of extensive-form games, which can capture both sequential and simultaneous moves, as well as imperfect information. In this paper we establish faster no-regret learning dynamics for \textit{extensive-form correlated equilibria (EFCE)} in multiplayer general-sum imperfect-information extensive-form games. When all players follow our accelerated dynamics, the correlated distribution of play is an $O(T^{-3/4})$-approximate EFCE, where the $O(\cdot)$ notation suppresses parameters polynomial in the description of the game. This significantly improves over the best prior rate of $O(T^{-1/2})$. To achieve this, we develop a framework for performing accelerated \emph{Phi-regret minimization} via predictions. One of our key technical contributions -- that enables us to employ our generic template -- is to characterize the stability of fixed points associated with \emph{trigger deviation functions} through a refined perturbation analysis of a structured Markov chain. Furthermore, for the simpler solution concept of extensive-form \emph{coarse} correlated equilibrium (EFCCE) we give a new succinct closed-form characterization of the associated fixed points, bypassing the expensive computation of stationary distributions required for EFCE. Our results place EFCCE closer to \emph{normal-form coarse correlated equilibria} in terms of the per-iteration complexity, although the former prescribes a much more compelling notion of correlation. Finally, experiments conducted on standard benchmarks corroborate our theoretical findings.
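The "acceleration via predictions" mechanism can be illustrated in its simplest form with a generic optimistic multiplicative-weights update over the simplex. This is not the paper's Phi-regret framework over trigger deviation functions, only the underlying predictive correction it builds on.

```python
import numpy as np

def omwu_step(log_w: np.ndarray, last_loss: np.ndarray,
              pred_loss: np.ndarray, eta: float = 0.1):
    """One step of optimistic multiplicative weights (OMWU).

    log_w:     running sum of -eta * (all previously observed losses)
    last_loss: loss vector just observed
    pred_loss: prediction of the next loss (commonly pred_loss = last_loss)
    Plays x proportional to exp(log_w - eta * pred_loss); the optimistic
    correction by the predicted loss is what yields accelerated rates
    when predictions are accurate, as in slowly changing games."""
    log_w = log_w - eta * last_loss      # incorporate the observed loss
    logits = log_w - eta * pred_loss     # hallucinate the predicted loss
    x = np.exp(logits - logits.max())    # numerically stable softmax
    return log_w, x / x.sum()
```

Setting pred_loss to the most recently observed loss recovers the common recency predictor; when all players update slowly, these predictions are accurate and regret accumulates more slowly than under the standard update.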
The Bayesian paradigm has the potential to solve core issues of deep neural networks such as poor calibration and data inefficiency. Alas, scaling Bayesian inference to large weight spaces often requires restrictive approximations. In this work, we show that it suffices to perform inference over a small subset of model weights in order to obtain accurate predictive posteriors; the remaining weights are kept as point estimates. This subnetwork inference framework enables us to use expressive, otherwise intractable, posterior approximations over such subsets. In particular, we implement subnetwork linearized Laplace: we first obtain a MAP estimate of all weights and then infer a full-covariance Gaussian posterior over a subnetwork. We propose a subnetwork selection strategy that aims to maximally preserve the model's predictive uncertainty. Empirically, our approach compares favorably to ensembles and to less expressive posterior approximations over full networks.
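The subnetwork Laplace step itself reduces to a small linear-algebra operation once a MAP estimate and a subnetwork are chosen. The sketch below assumes the loss Hessian at the MAP is available as a dense matrix (in practice a Gauss-Newton approximation, made feasible precisely because the subnetwork is small) and that the subnetwork indices are given; the paper's selection strategy is not reproduced here.

```python
import numpy as np

def subnetwork_laplace(loss_hessian: np.ndarray, w_map: np.ndarray,
                       subnet_idx: np.ndarray, prior_prec: float = 1.0):
    """Full-covariance Gaussian posterior over a subnetwork at the MAP.

    loss_hessian: Hessian of the training loss at w_map (d x d)
    subnet_idx:   indices of the subnetwork weights (assumed given)
    All other weights stay fixed at their MAP values."""
    H_sub = loss_hessian[np.ix_(subnet_idx, subnet_idx)]
    # Gaussian posterior: mean at the MAP, covariance from the
    # regularized sub-Hessian (prior precision acts as the regularizer).
    cov = np.linalg.inv(H_sub + prior_prec * np.eye(len(subnet_idx)))
    mean = w_map[subnet_idx]
    return mean, cov
```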