The tensor Ising model is a discrete exponential family used for modeling binary data on networks with not just pairwise, but higher-order dependencies. In this exponential family, the sufficient statistic is a multi-linear form of degree $p\ge 2$, designed to capture $p$-fold interactions between the binary variables sitting on the nodes of a network. A particularly useful class of tensor Ising models are the tensor Curie-Weiss models, where one assumes that all $p$-tuples of nodes interact with the same intensity. Computing the maximum likelihood estimator (MLE) is computationally cumbersome in this model, due to the presence of an inexplicit normalizing constant in the likelihood; the standard alternative is the maximum pseudolikelihood estimator (MPLE). Both the MLE and the MPLE are consistent estimators of the natural parameter, provided the latter lies strictly above a certain threshold, which is slightly below $\log 2$ and approaches $\log 2$ as $p$ increases. In this paper, we compute the Bahadur efficiencies of the MLE and the MPLE above the threshold, and derive the optimal sample size (number of nodes) needed for tests based on either of these estimators to achieve significance. We show that the optimal sample sizes for the MPLE and the MLE agree if either $p=2$ or the null parameter is greater than or equal to $\log 2$. On the other hand, if $p\ge 3$ and the null parameter lies strictly between the threshold and $\log 2$, then the two differ for sufficiently large values of the alternative. In particular, for every fixed alternative above the threshold, the Bahadur asymptotic relative efficiency of the MLE with respect to the MPLE goes to $\infty$ as the null parameter approaches the threshold. We also provide graphical presentations of the exact numerical values of the theoretical optimal sample sizes in different settings.
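As a reference point for the estimators compared above, here is a minimal sketch of how the MPLE can be computed in a $p$-tensor Curie-Weiss model. It assumes spins in $\{-1,+1\}$ and the Hamiltonian $\beta N \bar{x}^p$ with $\bar{x}$ the average spin; the paper's exact parametrization and thresholds may differ, and the search interval for $\beta$ is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def curie_weiss_mple(x, p, beta_max=5.0):
    """MPLE for a p-tensor Curie-Weiss model (illustrative sketch).

    Assumes spins x_i in {-1, +1} and Hamiltonian H(x) = beta * N * mean(x)**p;
    conventions in the paper may differ.
    """
    x = np.asarray(x, dtype=float)
    p = int(p)                       # integer interaction order
    N = x.size
    m_loo = (x.sum() - x) / N        # leave-one-out magnetizations m_{-i}

    def neg_log_pl(beta):
        h_plus = beta * N * (m_loo + 1.0 / N) ** p    # energy if x_i = +1
        h_minus = beta * N * (m_loo - 1.0 / N) ** p   # energy if x_i = -1
        h_obs = np.where(x > 0, h_plus, h_minus)
        log_z = np.logaddexp(h_plus, h_minus)         # conditional normalizer
        return -(h_obs - log_z).sum()

    res = minimize_scalar(neg_log_pl, bounds=(0.0, beta_max), method="bounded")
    return res.x
```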
A locally testable code is an error-correcting code that admits very efficient probabilistic tests of membership. Tensor codes provide a simple family of combinatorial constructions of locally testable codes that generalize the family of Reed-Muller codes. The natural test for tensor codes, the axis-parallel line vs. point test, plays an essential role in constructions of probabilistically checkable proofs. We analyze the axis-parallel line vs. point test as a two-prover game and show that the test is sound against quantum provers sharing entanglement. Our result implies the quantum-soundness of the low individual degree test, which is an essential component of the MIP* = RE theorem. Our proof also generalizes to the infinite-dimensional commuting-operator model of quantum provers.
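To make the test concrete, the following is a purely classical sketch of the axis-parallel line vs. point test, with a points table and a lines oracle standing in for the two provers; the entangled-prover analysis described above is of course not captured here, and the names `f` and `line_oracle` are illustrative choices.

```python
import random

def line_point_test(f, line_oracle, n, m, trials=100):
    """Classical sketch of the axis-parallel line vs. point test.

    f: dict mapping points of [n]^m (tuples) to symbol values ("points prover").
    line_oracle: function (axis, fixed_coords) -> list of n values along the
        axis-parallel line with the remaining coordinates fixed ("lines prover").
    Each trial picks a random point and a random axis through it, queries the
    line restriction, and checks consistency at that point.
    """
    for _ in range(trials):
        axis = random.randrange(m)
        point = tuple(random.randrange(n) for _ in range(m))
        fixed = point[:axis] + point[axis + 1:]   # coordinates off the chosen axis
        line_values = line_oracle(axis, fixed)
        if line_values[point[axis]] != f[point]:
            return False                          # inconsistency: reject
    return True
```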
Bayesian methods estimate a measure of uncertainty by using the posterior distribution. One source of difficulty in these methods is the computation of the normalizing constant: calculating the exact posterior is generally intractable, so it is usually approximated. Variational Inference (VI) methods approximate the posterior with a distribution, usually chosen from a simple family, via optimization. The main contribution of this work is a set of update rules for natural gradient variational inference with a mixture of Gaussians, which can be run independently for each of the mixture components, potentially in parallel.
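For a sense of what such an update looks like, below is a minimal sketch of a single natural-gradient step for one diagonal Gaussian component in the mean/precision parametrization, a standard form from the natural-gradient VI literature. It is not the paper's mixture-of-Gaussians rule; it only illustrates why each component's step depends on that component's own parameters, so the components can be updated in parallel.

```python
import numpy as np

def nat_grad_step(m, s, grad_logp, hess_diag_logp, rho=0.1, n_samples=8, rng=None):
    """One natural-gradient VI step for a diagonal Gaussian q = N(m, diag(1/s)).

    Illustrative sketch (not the paper's update rules): expectations under q are
    approximated with Monte Carlo samples, the precision s is moved toward the
    expected negative Hessian diagonal, and the mean takes a preconditioned step.
    In practice one must also keep s strictly positive.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, m.size))
    theta = m + eps / np.sqrt(s)                               # draws from q
    g = np.mean([grad_logp(t) for t in theta], axis=0)         # E_q[grad log p]
    h = np.mean([hess_diag_logp(t) for t in theta], axis=0)    # E_q[diag Hessian]
    s_new = (1.0 - rho) * s - rho * h                          # precision update
    m_new = m + rho * g / s_new                                # mean update
    return m_new, s_new
```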
The CP decomposition for high-dimensional non-orthogonal spiked tensors is an important problem with broad applications across many disciplines. However, previous works with theoretical guarantees typically assume restrictive incoherence conditions on the basis vectors for the CP components. In this paper, we propose new computationally efficient composite PCA and concurrent orthogonalization algorithms for tensor CP decomposition with theoretical guarantees under mild incoherence conditions. The composite PCA applies the principal component or singular value decomposition twice, first to a matrix unfolding of the tensor data to obtain singular vectors and then to the matrix folding of the singular vectors obtained in the first step. It can be used as an initialization for any iterative optimization scheme for the tensor CP decomposition. The concurrent orthogonalization algorithm iteratively estimates the basis vector in each mode of the tensor by simultaneously applying projections onto the orthogonal complements of the spaces generated by the other CP components in the other modes. It is designed to improve the alternating least squares estimator and other forms of the higher-order orthogonal iteration for tensors with low or moderately high CP ranks, and it is guaranteed to converge rapidly when the error of any given initial estimator is bounded by a small constant. Our theoretical investigation provides estimation accuracy and convergence rates for the two proposed algorithms. Our implementations on synthetic data demonstrate significant practical superiority of our approach over existing methods.
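As a rough illustration of the two-step structure just described, here is a minimal composite-PCA sketch for a third-order tensor, usable as a crude initialization: an SVD of the mode-1 unfolding, followed by an SVD of each folded right singular vector. It omits the paper's refinements and the concurrent orthogonalization stage entirely.

```python
import numpy as np

def composite_pca(T, r):
    """Composite PCA initialization sketch for a 3rd-order tensor of CP rank r.

    Step 1: SVD of the mode-1 unfolding of T (shape n1 x (n2*n3)).
    Step 2: fold each of the top-r right singular vectors into an n2 x n3 matrix
    and take its leading singular vectors as mode-2 / mode-3 estimates.
    """
    n1, n2, n3 = T.shape
    U, _, Vt = np.linalg.svd(T.reshape(n1, n2 * n3), full_matrices=False)
    A = U[:, :r]                              # mode-1 basis estimates
    B, C = np.empty((n2, r)), np.empty((n3, r))
    for k in range(r):
        M = Vt[k].reshape(n2, n3)             # fold the k-th right singular vector
        u, _, vt = np.linalg.svd(M, full_matrices=False)
        B[:, k], C[:, k] = u[:, 0], vt[0]     # mode-2 and mode-3 estimates
    return A, B, C
```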
We consider the problem of sampling from the ferromagnetic Potts and random-cluster models on a general family of random graphs via the Glauber dynamics for the random-cluster model. The random-cluster model is parametrized by an edge probability $p \in (0,1)$ and a cluster weight $q > 0$. We establish that for every $q\ge 1$, the random-cluster Glauber dynamics mixes in optimal $\Theta(n\log n)$ steps on $n$-vertex random graphs having a prescribed degree sequence with bounded average branching $\gamma$, throughout the full high-temperature uniqueness regime $p<p_u(q,\gamma)$. The family of random graph models we consider includes the Erd\H{o}s--R\'enyi random graph $G(n,\gamma/n)$, and so we provide the first polynomial-time sampling algorithm for the ferromagnetic Potts model on Erd\H{o}s--R\'enyi random graphs that works for all $q$ in the full uniqueness regime. We accompany our results with mixing time lower bounds (exponential in the maximum degree) for the Potts Glauber dynamics, in the same settings where our $\Theta(n \log n)$ bounds for the random-cluster Glauber dynamics apply. This reveals a significant computational advantage of random-cluster based algorithms for sampling from the Potts Gibbs distribution at high temperatures in the presence of high-degree vertices.
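For concreteness, a single heat-bath step of the random-cluster Glauber dynamics can be sketched as follows: a uniformly chosen edge is resampled conditionally on the rest, being made open with probability $p$ if its endpoints are already connected by other open edges and with probability $p/(p+q(1-p))$ otherwise. The data structures below (a set of open edges and a BFS connectivity check) are illustrative choices, not the implementation analyzed in the paper.

```python
import random
from collections import deque

def rc_glauber_step(open_edges, edges, adj, p, q, rng=random):
    """One heat-bath update of the random-cluster Glauber dynamics (sketch).

    open_edges: set of open edges stored as sorted tuples (u, v);
    edges: list of all edges of the base graph; adj: dict of adjacency lists.
    """
    u, v = rng.choice(edges)
    e = (min(u, v), max(u, v))
    open_edges.discard(e)
    # BFS through the remaining open edges: are u and v still connected?
    seen, queue, connected = {u}, deque([u]), False
    while queue and not connected:
        x = queue.popleft()
        for y in adj[x]:
            if (min(x, y), max(x, y)) in open_edges and y not in seen:
                if y == v:
                    connected = True
                    break
                seen.add(y)
                queue.append(y)
    prob_open = p if connected else p / (p + q * (1.0 - p))
    if rng.random() < prob_open:
        open_edges.add(e)
    return open_edges
```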
We study the well-known deconvolution problem of recovering a distribution function from independent replicates (signal) additively contaminated with random errors (noise), whose distribution is known. We investigate whether a Bayesian nonparametric approach for modelling the latent distribution of the signal can yield inferences with asymptotic frequentist validity under the $L^1$-Wasserstein metric. When the error density is ordinary smooth, we develop two inversion inequalities relating either the $L^1$ or the $L^1$-Wasserstein distance between two mixture densities (of the observations) to the $L^1$-Wasserstein distance between the corresponding distributions of the signal. This smoothing inequality improves on those in the literature. We apply this general result to a Bayesian approach based on a Dirichlet process mixture of normal distributions as a prior on the mixing distribution (or distribution of the signal), with Laplace or Linnik noise. In particular, we construct an \textit{adaptive} approximation of the density of the observations by the convolution of a Laplace (or Linnik) density with a well-chosen mixture of normal densities and show that the posterior concentrates at the minimax rate up to a logarithmic factor. The same prior law is shown to also adapt to the Sobolev regularity level of the mixing density, thus leading to a new Bayesian estimation method, relative to the Wasserstein distance, for distributions with smooth densities.
This paper considers inference for heterogeneous treatment effects in dynamic settings where covariates and treatments are longitudinal. We focus on high-dimensional cases where the sample size, $N$, is potentially much smaller than the covariate vector's dimension, $d$. Marginal structural mean models are considered. We propose a "sequential model doubly robust" estimator constructed from "moment targeted" nuisance estimators. Such nuisance estimators are carefully designed through non-standard loss functions, reducing the bias resulting from potential model misspecifications. We achieve $\sqrt N$-inference even when model misspecification occurs, requiring only one nuisance model to be correctly specified at each time point. Such model correctness conditions are weaker than those in all existing work, including the literature on low-dimensional settings.
A common approach to synthetic data is to sample from a fitted model. We show that, under general assumptions, this approach results in a sample whose summary statistics are inefficient estimators and whose joint distribution is inconsistent with the true distribution. Motivated by this, we propose a general method of producing synthetic data that is widely applicable for parametric models, has asymptotically efficient summary statistics, and is both easily implemented and highly computationally efficient. Our approach allows for the construction of both partially synthetic datasets, which preserve certain summary statistics, as well as fully synthetic data, which satisfy the strong guarantee of differential privacy (DP), both with the same asymptotic guarantees. We also provide theoretical and empirical evidence that the distribution from our procedure converges to the true distribution. Besides our focus on synthetic data, our procedure can also be used to perform approximate hypothesis tests in the presence of intractable likelihood functions.
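As a point of comparison for the baseline named in the first sentence, here is a minimal, hypothetical sketch of "sample from a fitted model" for a univariate normal: fit by maximum likelihood and release a fresh draw from the fit. This is the naive procedure whose inefficiency motivates the work, not the method proposed in it.

```python
import numpy as np

def naive_synthetic_normal(data, size=None, rng=None):
    """Naive synthetic data: sample from a model fitted to the original data.

    Univariate-normal toy example (hypothetical): estimate (mu, sigma) by maximum
    likelihood, then draw a synthetic sample from the fitted distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu_hat, sigma_hat = np.mean(data), np.std(data)   # MLE for a normal model
    size = len(data) if size is None else size
    return rng.normal(mu_hat, sigma_hat, size=size)
```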
The matrix normal model, the family of Gaussian matrix-variate distributions whose covariance matrix is the Kronecker product of two lower-dimensional factors, is frequently used to model matrix-variate data. The tensor normal model generalizes this family to Kronecker products of three or more factors. We study the estimation of the Kronecker factors of the covariance matrix in the matrix and tensor normal models. We show nonasymptotic bounds for the error achieved by the maximum likelihood estimator (MLE) in several natural metrics. In contrast to existing bounds, our results do not rely on the factors being well-conditioned or sparse. For the matrix normal model, all our bounds are minimax optimal up to logarithmic factors, and for the tensor normal model our bounds for the largest factor and for the overall covariance matrix are minimax optimal up to constant factors provided there are enough samples for any estimator to obtain constant Frobenius error. In the same regimes as our sample complexity bounds, we show that an iterative procedure to compute the MLE, known as the flip-flop algorithm, converges linearly with high probability. Our main tool is geodesic strong convexity in the geometry on positive-definite matrices induced by the Fisher information metric. This strong convexity is determined by the expansion of certain random quantum channels. We also provide numerical evidence that combining the flip-flop algorithm with a simple shrinkage estimator can improve performance in the undersampled regime.
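For reference, the flip-flop iteration mentioned above alternates two closed-form updates of the Kronecker factors in the matrix normal case. The following is a minimal sketch under one common convention (an m x n observation with an m x m row factor and an n x n column factor); normalization and identifiability conventions vary across references and may differ from the variant analyzed in the paper.

```python
import numpy as np

def flip_flop(X, n_iter=50, tol=1e-8):
    """Flip-flop iteration for the matrix normal MLE (illustrative sketch).

    X: array of shape (N, m, n) of centered matrix-variate samples. Alternates
    the closed-form updates of the row covariance Sigma_r and the column
    covariance Sigma_c until the factors stabilize.
    """
    N, m, n = X.shape
    Sr, Sc = np.eye(m), np.eye(n)
    for _ in range(n_iter):
        Sc_inv = np.linalg.inv(Sc)
        Sr_new = sum(x @ Sc_inv @ x.T for x in X) / (N * n)   # row-factor update
        Sr_inv = np.linalg.inv(Sr_new)
        Sc_new = sum(x.T @ Sr_inv @ x for x in X) / (N * m)   # column-factor update
        done = np.linalg.norm(Sr_new - Sr) + np.linalg.norm(Sc_new - Sc) < tol
        Sr, Sc = Sr_new, Sc_new
        if done:
            break
    return Sr, Sc
```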
Clustering is one of the most fundamental and widespread techniques in exploratory data analysis. Yet, the basic approach to clustering has not really changed: a practitioner hand-picks a task-specific clustering loss to optimize and fits the given data to reveal the underlying cluster structure. Some types of losses---such as k-means, its non-linear version kernelized k-means (centroid based), and DBSCAN (density based)---are popular choices due to their good empirical performance on a range of applications. However, every so often the clustering output using these standard losses fails to reveal the underlying structure, and the practitioner has to custom-design their own variation. In this work we take an intrinsically different approach to clustering: rather than fitting a dataset to a specific clustering loss, we train a recurrent model that learns how to cluster. The model is trained on pairs of example datasets (as input) and their corresponding cluster identities (as output). By providing multiple types of training datasets as inputs, our model has the ability to generalize well on unseen datasets (new clustering tasks). Our experiments reveal that by training on simple synthetically generated datasets or on existing real datasets, we can achieve better clustering performance on unseen real-world datasets compared with standard benchmark clustering techniques. Our meta clustering model works well even for small datasets, where the usual deep learning models tend to perform poorly.
We consider the task of learning the parameters of a {\em single} component of a mixture model when we are given {\em side information} about that component; we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than that of the original problem of learning the parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy and lower computational complexity than existing moment-based mixture model algorithms (e.g., tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms, showing significant improvements in runtime and accuracy.