Observational studies are needed when experiments are not possible. Within-study comparisons (WSC) compare observational and experimental estimates that test the same hypothesis using the same treatment group, outcome, and estimand. Meta-analyzing 39 of them, we compare mean bias and its variance for the eight observational designs that result from combining whether there is a pretest measure of the outcome or not, whether the comparison group is local to the treatment group or not, and whether there is a relatively rich set of other covariates or not. Of these eight designs, one combines all three design elements, another has none, and the remainder include any one or two. We found that both the mean and variance of bias decline as design elements are added, with the lowest mean and smallest variance in a design with all three elements. The probability of bias falling within 0.10 standard deviations of the experimental estimate varied from 59 to 83 percent in Bayesian analyses and from 86 to 100 percent in non-Bayesian ones, with the ranges depending on the level of data aggregation. However, confounding remains possible because each of the eight observational design cells includes a different set of WSC studies.
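As a rough illustration of the kind of probability statement reported above, the sketch below computes the chance that bias lies within 0.10 standard deviations of zero if bias is assumed to follow a normal distribution. The normality assumption, the function name `prob_bias_within`, and the example mean and spread are all hypothetical and are not taken from the meta-analysis.

```python
from statistics import NormalDist


def prob_bias_within(mean_bias, sd_bias, tolerance=0.10):
    """P(|bias| < tolerance) when bias ~ Normal(mean_bias, sd_bias).

    All quantities are in standard-deviation units of the outcome.
    The normal model is an illustrative assumption, not the paper's method.
    """
    if sd_bias <= 0:
        raise ValueError("sd_bias must be positive")
    dist = NormalDist(mu=mean_bias, sigma=sd_bias)
    return dist.cdf(tolerance) - dist.cdf(-tolerance)


# Hypothetical inputs: the same mean bias with two different spreads.
narrow = prob_bias_within(mean_bias=0.02, sd_bias=0.05)
wide = prob_bias_within(mean_bias=0.02, sd_bias=0.15)
print(f"narrow spread: {narrow:.3f}, wide spread: {wide:.3f}")
```

Under this simple model, a smaller variance of bias raises the probability of landing inside the tolerance band even when the mean bias is unchanged, which is why both the mean and the variance matter.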
A priori error bounds have been derived for different balancing-related model reduction methods. The most classical result is a bound for balanced truncation and singular perturbation approximation that is applicable to asymptotically stable linear time-invariant systems with homogeneous initial conditions. Recently, there have been a few attempts to generalize the balancing-related reduction methods to the case with inhomogeneous initial conditions, but the existing error bounds for these generalizations are quite restrictive. In particular, the initial conditions must be restricted to a low-dimensional subspace, which has to be chosen before the reduced model is constructed. In this paper, we propose an estimator that circumvents this hard constraint completely. Our estimator is applicable to a large class of reduction methods, whereas the former results were only derived for certain specific methods. Moreover, our approach yields significantly more effective error estimation, as will also be demonstrated numerically.
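For reference, the classical bound mentioned above is usually stated as follows. The notation is assumed here rather than taken from the paper: $G$ is the transfer function of the full model of order $n$, $G_r$ that of the reduced model of order $r$, $\sigma_1 \ge \dots \ge \sigma_n$ are the Hankel singular values, and $y$, $y_r$ are the outputs of the two models for the same input $u$.

```latex
\[
  \| G - G_r \|_{\mathcal{H}_\infty} \;\le\; 2 \sum_{i=r+1}^{n} \sigma_i ,
  \qquad
  \| y - y_r \|_{L^2} \;\le\; \| G - G_r \|_{\mathcal{H}_\infty} \, \| u \|_{L^2}
  \quad \text{(zero initial state)} .
\]
```

The second inequality relies on both models starting from a zero initial state, which is why inhomogeneous initial conditions call for a separate treatment.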
We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast the $Q$-function estimation into a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that under one mild condition the NPIV formulation of $Q$-function estimation is well-posed in the sense of the $L^2$-measure of ill-posedness with respect to the data generating distribution, bypassing a strong assumption on the discount factor $\gamma$ imposed in the recent literature for obtaining the $L^2$ convergence rates of various $Q$-function estimators. Thanks to this new well-posed property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of the $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for the classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on the well-posedness and the minimax lower bounds are of independent interest for studying not only other nonparametric estimators of the $Q$-function but also efficient estimation of the value of any target policy in off-policy settings.
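One common way to write the conditional moment restriction behind this recasting is sketched below. The notation is assumed for illustration: $(S, A, R, S')$ is an observed transition, $\pi$ is the target policy, and $Q^\pi$ is its $Q$-function; the paper's exact formulation may differ.

```latex
\[
  \mathbb{E}\!\left[\, R + \gamma \int Q^\pi(S', a')\, \pi(\mathrm{d}a' \mid S') - Q^\pi(S, A) \,\middle|\, S, A \right] = 0 .
\]
```

This is the Bellman equation for $Q^\pi$ written as a conditional moment restriction, with $(S, A)$ playing the role of the instruments and $Q^\pi$ the unknown structural function.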
One way to define a quantum R\'enyi $\alpha$-divergence of two quantum states is to optimize the classical R\'enyi $\alpha$-divergence of their post-measurement probability distributions over all possible measurements (measured R\'enyi divergence), and possibly to regularize these quantities over multiple copies of the two states (regularized measured R\'enyi $\alpha$-divergence). A key observation behind the theorem for the strong converse exponent of asymptotic binary quantum state discrimination is that the regularized measured R\'enyi $\alpha$-divergence coincides with the sandwiched R\'enyi $\alpha$-divergence when $\alpha>1$. Moreover, it also follows from the same theorem that to achieve this, it is sufficient to consider $2$-outcome measurements (tests) for any number of copies (this is somewhat surprising, as achieving the measured R\'enyi $\alpha$-divergence for $n$ copies might, in general, require a number of measurement outcomes that diverges in $n$). In view of this, it seems natural to expect the same when $\alpha<1$; however, we show that this is not the case. In fact, we show that even for commuting states (classical case) the regularized quantity attainable using $2$-outcome measurements is in general strictly smaller than the R\'enyi $\alpha$-divergence (which is unique in the classical case). In the general quantum case this shows that the above ``regularized test-measured'' R\'enyi $\alpha$-divergence is not even a quantum extension of the classical R\'enyi divergence when $\alpha<1$, in sharp contrast to the $\alpha>1$ case.
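To make the classical objects concrete, the sketch below evaluates the classical R\'enyi $\alpha$-divergence of two distributions and the best value reachable by a deterministic $2$-outcome test on a single copy. It is only a single-copy illustration under simplifying assumptions: the distributions are hypothetical and strictly positive, randomized tests are ignored, and the regularization over many copies studied in the paper is not computed. The function names are illustrative.

```python
import math
from itertools import combinations


def renyi_divergence(p, q, alpha):
    """Classical Renyi alpha-divergence (natural log), alpha > 0, alpha != 1.

    Assumes p and q are strictly positive probability vectors.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1)


def best_single_copy_test(p, q, alpha):
    """Largest divergence over deterministic 2-outcome tests on one copy.

    A test is a nonempty proper subset of outcomes that is 'accepted';
    it maps p and q to two binary distributions.
    """
    n = len(p)
    best = 0.0
    for size in range(1, n):
        for subset in combinations(range(n), size):
            p_acc = sum(p[i] for i in subset)
            q_acc = sum(q[i] for i in subset)
            value = renyi_divergence(
                [p_acc, 1 - p_acc], [q_acc, 1 - q_acc], alpha
            )
            best = max(best, value)
    return best


# Hypothetical three-outcome distributions.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
alpha = 0.5
full = renyi_divergence(p, q, alpha)
tested = best_single_copy_test(p, q, alpha)
print(f"full: {full:.4f}, best 2-outcome test: {tested:.4f}")
```

By the data-processing inequality, the test-measured value can never exceed the full divergence; the question addressed in the abstract is whether the gap closes after regularization.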
Hamilton and Moitra (2021) showed that, in certain regimes, it is not possible to accelerate Riemannian gradient descent in the hyperbolic plane if we restrict ourselves to algorithms which make queries in a (large) bounded domain and which receive gradients and function values corrupted by a (small) amount of noise. We show that acceleration remains unachievable for any deterministic algorithm which receives exact gradient and function-value information (unbounded queries, no noise). Our results hold for the classes of strongly and nonstrongly geodesically convex functions, and for a large class of Hadamard manifolds including hyperbolic spaces and the symmetric space $\mathrm{SL}(n) / \mathrm{SO}(n)$ of positive definite $n \times n$ matrices of determinant one. This cements a surprising gap between the complexity of convex optimization and geodesically convex optimization: for hyperbolic spaces, Riemannian gradient descent is optimal on the class of smooth and strongly geodesically convex functions, in the regime where the condition number scales with the radius of the optimization domain. The key idea for proving the lower bound consists of perturbing the hard functions of Hamilton and Moitra (2021) with sums of bump functions chosen by a resisting oracle.
Causality can be described in terms of a structural causal model (SCM) that carries information on the variables of interest and their mechanistic relations. For most processes of interest the underlying SCM will only be partially observable, so causal inference tries to leverage any exposed information. Graph neural networks (GNN), as universal approximators on structured input, are a viable candidate for causal learning, suggesting a tighter integration with SCM. To this end, we present a theoretical analysis from first principles that establishes a novel connection between GNN and SCM while providing an extended view on general neural-causal models. We then establish a new model class for GNN-based causal inference that is necessary and sufficient for causal effect identification. Our empirical illustration on simulations and standard benchmarks validates our theoretical proofs.
The focus of disentanglement approaches has been on identifying independent factors of variation in data. However, the causal variables underlying real-world observations are often not statistically independent. In this work, we bridge the gap to real-world scenarios by analyzing the behavior of the most prominent disentanglement approaches on correlated data in a large-scale empirical study (including 4260 models). We show and quantify that systematically induced correlations in the dataset are being learned and reflected in the latent representations, which has implications for downstream applications of disentanglement such as fairness. We also demonstrate how to resolve these latent correlations, either using weak supervision during training or by post-hoc correcting a pre-trained model with a small number of labels.
The aim of this paper is to offer the first systematic exploration and definition of equivalent causal models in the context where the two models are not made up of the same variables. The idea is that two models are equivalent when they agree on all "essential" causal information that can be expressed using their common variables. I do so by focussing on the two main features of causal models, namely their structural relations and their functional relations. In particular, I define several relations of causal ancestry and several relations of causal sufficiency, and require that the most general of these relations be preserved across equivalent models.
This paper seeks to develop a deeper understanding of the fundamental properties of neural text generation models. The study of artifacts that emerge in machine-generated text as a result of modeling choices is a nascent research area. Previously, the extent and degree to which these artifacts surface in generated text have not been well studied. In the spirit of better understanding generative text models and their artifacts, we propose the new task of distinguishing which of several variants of a given model generated a piece of text, and we conduct an extensive suite of diagnostic tests to observe whether modeling choices (e.g., sampling methods, top-$k$ probabilities, model architectures, etc.) leave detectable artifacts in the text they generate. Our key finding, which is backed by a rigorous set of experiments, is that such artifacts are present and that different modeling choices can be inferred by observing the generated text alone. This suggests that neural text generators may be more sensitive to various modeling choices than previously thought.
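As an example of one modeling choice named above, the sketch below shows a minimal version of top-$k$ sampling: only the $k$ most probable next tokens are kept and renormalized before drawing. The five-token distribution and the function names are hypothetical, and this is not the experimental setup of the paper.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens, renormalize, and zero out the rest."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(probs, k, rng):
    """Draw one token index from the top-k truncated distribution."""
    filtered = top_k_filter(probs, k)
    return rng.choices(range(len(probs)), weights=filtered, k=1)[0]


# Hypothetical next-token distribution over a five-token vocabulary.
next_token_probs = [0.40, 0.25, 0.20, 0.10, 0.05]
rng = random.Random(0)
samples = [sample_token(next_token_probs, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Because tokens outside the top $k$ can never be emitted, the value of $k$ changes which tokens appear in the output at all, which is one plausible way such a choice could leave a detectable trace.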
Both generative adversarial network models and variational autoencoders have been widely used to approximate probability distributions of datasets. Although they both use parametrized distributions to approximate the underlying data distribution, whose exact inference is intractable, their behaviors are very different. In this report, we summarize our experimental results that compare these two categories of models in terms of fidelity and mode collapse. We provide a hypothesis to explain their different behaviors and propose a new model based on this hypothesis. We further test our proposed model on the MNIST and CelebA datasets.
In this paper we study the frequentist convergence rate for the Latent Dirichlet Allocation (Blei et al., 2003) topic models. We show that the maximum likelihood estimator converges to one of the finitely many equivalent parameters in the Wasserstein distance at a rate of $n^{-1/4}$ without assuming separability or non-degeneracy of the underlying topics and/or the existence of more than three words per document, thus generalizing the previous works of Anandkumar et al. (2012, 2014) from an information-theoretical perspective. We also show that the $n^{-1/4}$ convergence rate is optimal in the worst case.