The common cause principle for two random variables $A$ and $B$ is examined in the case of causal insufficiency: their common cause $C$ is known to exist, but only the joint probability of $A$ and $B$ is observed, so $C$ cannot be uniquely identified (the latent confounder problem). We show that the generalized maximum likelihood method can be applied to this situation and allows identification of a $C$ that is consistent with the common cause principle; the method closely relates to the maximum entropy principle. Investigation of two symmetric binary variables reveals non-analytic behavior of the conditional probabilities, reminiscent of a second-order phase transition, which occurs at the transition from correlation to anti-correlation in the observed probability distribution. The relation between the generalized likelihood approach and alternative methods, such as predictive likelihood and minimum common cause entropy, is discussed. Considering the common cause of three observed variables (and one hidden cause) uncovers causal structures that defy representation through directed acyclic graphs with the Markov condition.
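To make the latent confounder problem concrete, here is a small illustrative Python sketch (not the paper's generalized-likelihood functional): it fits a two-state confounder $C$ to an observed $2\times 2$ joint under the factorization $p(a,b)=\sum_c p(c)p(a|c)p(b|c)$. Restarting the optimizer from random initial points typically yields several distinct parameterizations of $C$ with equally good fit, which is precisely the non-identifiability at issue.

```python
# Illustrative sketch only (not the paper's generalized-likelihood
# functional): fit a two-state confounder C to an observed 2x2 joint
# under p(a,b) = sum_c p(c) p(a|c) p(b|c). Distinct restarts reaching
# near-zero KL with different parameters exhibit the non-identifiability.
import numpy as np
from scipy.optimize import minimize

p_obs = np.array([[0.4, 0.1],
                  [0.1, 0.4]])  # observed joint p(A=a, B=b), correlated case

def model_joint(theta):
    pc, pa0, pa1, pb0, pb1 = theta  # p(C=1), p(A=1|C=c), p(B=1|C=c)
    pa = np.array([[1 - pa0, pa0], [1 - pa1, pa1]])   # pa[c, a]
    pb = np.array([[1 - pb0, pb0], [1 - pb1, pb1]])   # pb[c, b]
    return np.einsum("c,ca,cb->ab", np.array([1 - pc, pc]), pa, pb)

def kl_to_obs(theta):
    q = np.clip(model_joint(theta), 1e-12, None)
    return float(np.sum(p_obs * np.log(p_obs / q)))

rng = np.random.default_rng(0)
fits = [minimize(kl_to_obs, rng.uniform(0.05, 0.95, 5),
                 bounds=[(1e-6, 1 - 1e-6)] * 5) for _ in range(10)]
for f in sorted(fits, key=lambda f: f.fun)[:3]:
    print(f"KL={f.fun:.2e}  p(C=1)={f.x[0]:.3f}  "
          f"p(A=1|C=0)={f.x[1]:.3f}  p(A=1|C=1)={f.x[2]:.3f}")
```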
We consider testing invariance of a distribution under an algebraic group of transformations, which includes permutations. In this context, it is commonly believed that one should strive to construct a test based on the entire group. We find that one can sometimes obtain dramatically more power, at a much lower computational cost, by replacing the entire group with a tiny subgroup. We examine this finding in the popular group invariance-based Westfall & Young MaxT multiple testing method. Studying the relative efficiency in a Gaussian location model, we find the power gain to be largest in high-dimensional settings.
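As a concrete illustration, the following sketch runs a single-step Westfall & Young MaxT adjustment over a genuine subgroup of the sign-flipping group $\{-1,+1\}^n$, generated over GF(2) by a few random elements. The sample sizes, effect size, and subgroup size are arbitrary illustrative choices, not the paper's tuned recommendations.

```python
# Sketch: single-step Westfall & Young MaxT over a small *subgroup* of
# the sign-flipping group {-1,+1}^n. All sizes below (n, m, k, the
# effect 0.8) are arbitrary illustrative choices.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 200                       # n observations, m hypotheses
X = rng.normal(size=(n, m))
X[:, :10] += 0.8                     # first 10 hypotheses are non-null

k = 8                                # subgroup of size 2^k
gens = rng.integers(0, 2, size=(k, n))            # generators over GF(2)
coeffs = np.array(list(itertools.product([0, 1], repeat=k)))
flips = 1 - 2 * (coeffs @ gens % 2)               # all 2^k sign vectors

def tstats(Y):
    return Y.mean(0) / (Y.std(0, ddof=1) / np.sqrt(len(Y)))

t_obs = tstats(X)
max_null = np.array([np.abs(tstats(f[:, None] * X)).max() for f in flips])
p_adj = (max_null[:, None] >= np.abs(t_obs)[None, :]).mean(0)  # adjusted p
print("rejections at level 0.05:", int((p_adj <= 0.05).sum()))
```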
We study the equivalence testing problem, where the goal is to determine whether two given unknown distributions on $[n]$ are equal or $\epsilon$-far in total variation distance, in the conditional sampling model (CFGM, SICOMP 2016; CRS, SICOMP 2015), wherein a tester can draw a sample from the distribution conditioned on any subset. Equivalence testing is a central problem in distribution testing, and there has been a plethora of work on this topic in various sampling models. Despite significant efforts over the years, there remains a gap between the current best-known upper bound of $\tilde{O}(\log \log n)$ [FJOPS, COLT 2015] and lower bound of $\Omega(\sqrt{\log \log n})$ [ACK, RANDOM 2015; Theory of Computing 2018]. Closing this gap has been repeatedly posed as an open problem (listed as problems 66 and 87 at sublinear.info). In this paper, we completely resolve the query complexity of this problem by showing a lower bound of $\tilde{\Omega}(\log \log n)$. For that purpose, we develop a novel and generic proof technique that enables us to break the $\sqrt{\log \log n}$ barrier, not only for the equivalence testing problem but also for other distribution testing problems, such as the uniblock property.
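For readers unfamiliar with the access model, a toy conditional-sampling (COND) oracle looks as follows; the paper's contribution is a lower bound on how many such queries are needed, and this sketch only pins down the model.

```python
# A toy COND oracle: a draw from distribution D conditioned on the
# query set S. This pins down the access model, nothing more.
import numpy as np

def cond_oracle(D, S, rng=np.random.default_rng()):
    """D: probability vector over [n]; S: index array with D[S].sum() > 0."""
    p = D[S] / D[S].sum()
    return S[rng.choice(len(S), p=p)]

D = np.array([0.5, 0.2, 0.2, 0.1])
print(cond_oracle(D, np.array([1, 2, 3])))   # a sample from D restricted to S
```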
The paper studies the rewriting problem, that is, the decision problem of whether, for a given conjunctive query $Q$ and a set $\mathcal{V}$ of views, there is a conjunctive query $Q'$ over $\mathcal{V}$ that is equivalent to $Q$, in cases where the query, the views, and/or the desired rewriting are acyclic or even more restricted. It shows that, if $Q$ itself is acyclic, an acyclic rewriting exists whenever any rewriting exists. Analogous statements also hold for free-connex acyclic, hierarchical, and q-hierarchical queries. Regarding the complexity of the rewriting problem, the paper identifies a border between tractable and (presumably) intractable variants: for schemas of bounded arity, the acyclic rewriting problem is NP-hard, even if both $Q$ and the views in $\mathcal{V}$ are acyclic or hierarchical. However, it becomes tractable if the views are free-connex acyclic (i.e., in a nutshell, their body is (i) acyclic and (ii) remains acyclic if their head is added as an additional atom).
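Acyclicity of a conjunctive query can be checked with the classical GYO reduction on its hypergraph; the following sketch is standard material, not this paper's algorithm, and only makes the notion of acyclicity concrete.

```python
# Standard GYO reduction for alpha-acyclicity of a query's hypergraph:
# repeatedly delete vertices occurring in a single hyperedge and
# hyperedges contained in another; acyclic iff everything disappears.
def is_acyclic(hyperedges):
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        for e in edges:  # drop vertices occurring in exactly one edge
            lonely = {v for v in e if sum(v in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        for i, e in enumerate(edges):  # drop edges contained in another
            if any(i != j and e <= f for j, f in enumerate(edges)):
                edges.pop(i)
                changed = True
                break
    return all(not e for e in edges)

print(is_acyclic([{"a", "b"}, {"b", "c"}]))              # True  (path)
print(is_acyclic([{"a", "b"}, {"b", "c"}, {"c", "a"}]))  # False (triangle)
```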
We study the fair division of indivisible items with subsidies among $n$ agents, where the absolute marginal valuation of each item is at most one. Under monotone valuations (where each item is a good), Brustle et al. (2020) demonstrated that a maximum subsidy of $2(n-1)$ and a total subsidy of $2(n-1)^2$ are sufficient to guarantee the existence of an envy-freeable allocation. In this paper, we improve upon these bounds, even in a wider model. Namely, we show that, given an EF1 allocation, we can compute in polynomial time an envy-free allocation with a subsidy of at most $n-1$ per agent and a total subsidy of at most $n(n-1)/2$. Moreover, we present further improved bounds for monotone valuations.
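For intuition on how subsidies certify envy-freeness, the following sketch implements the standard path-based computation from this line of work (a baseline, not the paper's improved procedure): with additive valuations and envy weights $w(i,j)=v_i(A_j)-v_i(A_i)$, each agent is paid the maximum total weight of a path starting at her node in the envy graph, assuming the allocation is envy-freeable, i.e., the graph has no positive-weight cycle.

```python
# Sketch of the path-based subsidy computation (a baseline from this
# line of work, not the paper's improved procedure). Assumes additive
# valuations and an envy-freeable allocation (no positive-weight cycle
# in the envy graph).
import numpy as np

def subsidies(V, bundles):
    """V[i][g]: value of agent i for item g; bundles[j]: items of agent j."""
    n = len(bundles)
    val = np.array([[sum(V[i][g] for g in bundles[j]) for j in range(n)]
                    for i in range(n)])
    w = val - np.diag(val)[:, None]      # w[i, j] = envy of i toward j
    s = np.zeros(n)
    for _ in range(n - 1):               # longest path starting at each node
        s = np.maximum(0.0, (w + s[None, :]).max(axis=1))
    return s

V = [[1.0, 0.0, 0.6], [0.8, 0.7, 0.0]]   # 2 agents, 3 items
print(subsidies(V, bundles=[[0], [1, 2]]))   # pay agent 1 her envy of 0.1
```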
We introduce an efficient numerical implementation of a Markov Chain Monte Carlo method to sample a probability distribution on a manifold (introduced theoretically in Zappa, Holmes-Cerfon, and Goodman (2018)), where the manifold is defined by the level set of constraint functions and the probability distribution may involve the pseudodeterminant of the Jacobian of the constraints, as arises in physical sampling problems. The algorithm is easy to implement and scales well to problems with thousands of dimensions and complex sets of constraints, provided their Jacobian retains sparsity. The algorithm uses direct linear algebra and requires a single matrix factorization per proposal point; this enhances its efficiency over previously proposed methods but becomes the computational bottleneck in high dimensions. We test the algorithm on several examples inspired by soft-matter physics and materials science to study its complexity and properties.
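A heavily simplified sketch of one move of such a sampler, specialized to a single constraint $q(x)=\|x\|^2-1=0$ (the unit sphere) with a uniform target, is given below: tangent-space proposal, Newton projection along the constraint gradient, a reverse-projection check, and the Gaussian proposal-density ratio. The actual implementation handles many constraints, sparse Jacobian factorizations, and pseudodeterminant weights, none of which appear here.

```python
# Heavily simplified: one MCMC move on the unit sphere q(x)=|x|^2-1=0
# with a uniform target. The real sampler handles many constraints,
# sparse Jacobian factorizations, and pseudodeterminant weights.
import numpy as np

def q(x): return x @ x - 1.0
def grad_q(x): return 2.0 * x

def project(y, g, tol=1e-12, iters=50):
    """Newton-solve q(y + a*g) = 0 for the scalar a; None on failure."""
    a = 0.0
    for _ in range(iters):
        r = q(y + a * g)
        if abs(r) < tol:
            return y + a * g
        a -= r / (grad_q(y + a * g) @ g)
    return None

def step(x, sigma, rng):
    g = grad_q(x)
    v = rng.normal(scale=sigma, size=x.shape)
    v -= (v @ g) / (g @ g) * g                     # tangent proposal at x
    y = project(x + v, g)
    if y is None:
        return x                                   # forward projection failed
    gy = grad_q(y)
    w = (x - y) - ((x - y) @ gy) / (gy @ gy) * gy  # reverse tangent move
    xb = project(y + w, gy)
    if xb is None or not np.allclose(xb, x, atol=1e-8):
        return x                                   # reversibility check failed
    ratio = np.exp((v @ v - w @ w) / (2 * sigma**2))  # Gaussian step ratio
    return y if rng.random() < min(1.0, ratio) else x

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    x = step(x, sigma=0.3, rng=rng)
print(x, q(x))   # a point (numerically) on the sphere
```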
Let $G$ be a finite group given as input by its multiplication table. For a subset $S$ of $G$ and an element $g\in G$, the Cayley Group Membership problem (denoted CGM) is to check whether $g$ belongs to the subgroup generated by $S$. While this problem is easily seen to be in polynomial time, pinpointing its parallel complexity has been of research interest over the years. In this paper we further explore the parallel complexity of the abelian CGM problem, with a focus on the dynamic setting: the generating set $S$ changes via insertions and deletions, and the goal is to maintain a data structure that supports efficient membership queries to the subgroup $\langle S \rangle$. We obtain the following results: 1. We first consider the more general problem of monoid membership. When $G$ is a commutative monoid, we give a deterministic dynamic constant-time parallel algorithm for membership testing that supports $O(1)$ insertions and deletions in each step. 2. Building on the previous result, we show that there is a dynamic randomized constant-time parallel algorithm for abelian CGM that supports polylogarithmically many insertions/deletions to $S$ in each step. 3. If the number of insertions/deletions is at most $O(\log n/\log\log n)$, then we obtain a deterministic dynamic constant-time parallel algorithm for the problem. 4. We obtain analogous results for the dynamic abelian group isomorphism problem.
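As a baseline for the problem definition (not the dynamic parallel algorithms of the paper), membership in $\langle S \rangle$ can be decided by a breadth-first closure over the multiplication table: in a finite group, the set reachable from the identity by repeated multiplication with generators is exactly $\langle S \rangle$, since every generator has finite order.

```python
# Baseline only: BFS closure over the multiplication table. In a finite
# group, the set reachable from the identity by right-multiplication
# with generators is exactly the subgroup <S>.
def cgm(mult, S, g, identity=0):
    """mult[a][b] = a*b (indices into G); is g in the subgroup <S>?"""
    seen = {identity}
    frontier = [identity]
    while frontier:
        nxt = []
        for a in frontier:
            for s in S:
                b = mult[a][s]
                if b not in seen:
                    seen.add(b)
                    nxt.append(b)
        frontier = nxt
    return g in seen

# Z_6 with addition: is 4 in <{2}>? (Yes: <2> = {0, 2, 4}.)
mult = [[(a + b) % 6 for b in range(6)] for a in range(6)]
print(cgm(mult, S=[2], g=4))   # True
```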
Let the costs $C(i,j)$ for an instance of the asymmetric traveling salesperson problem be independent uniform $[0,1]$ random variables. We consider the efficiency of branch and bound algorithms that use the assignment relaxation as a lower bound. We show that w.h.p. the number of steps taken by any such branch and bound algorithm is $e^{\Omega(n^a)}$ for some small absolute constant $a>0$.
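The assignment relaxation in question is a minimum-cost perfect matching on the cost matrix, which relaxes the single-tour requirement of ATSP to an arbitrary cycle cover. A sketch of computing this lower bound on a random instance (with an arbitrary large penalty to forbid self-loops):

```python
# The assignment relaxation on a random instance: a minimum-cost
# perfect matching (cycle cover), computed with the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 100
C = rng.uniform(size=(n, n))
np.fill_diagonal(C, 1e9)                 # large penalty forbids self-loops
rows, cols = linear_sum_assignment(C)
print("assignment lower bound:", C[rows, cols].sum())
```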
Clustering is one of the most important tools for the analysis of large datasets, and perhaps the most popular clustering algorithm is Lloyd's iteration for $k$-means. This iteration takes $N$ vectors $v_1,\dots,v_N\in\mathbb{R}^d$ and outputs $k$ centroids $c_1,\dots,c_k\in\mathbb{R}^d$, which partition the vectors into clusters by assigning each vector to its closest centroid. We present an overall improved version of the "$q$-means" algorithm, the quantum algorithm originally proposed by Kerenidis, Landman, Luongo, and Prakash (2019), which performs $\varepsilon$-$k$-means, an approximate version of $k$-means clustering. Our algorithm does not rely on the quantum linear algebra primitives of prior work, instead using only QRAM to prepare and measure simple states based on the current iteration's clusters. Its time complexity is $O\big(\frac{k^{2}}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$, which maintains the polylogarithmic dependence on $N$ while improving the dependence on most of the other parameters. We also present a "dequantized" algorithm for $\varepsilon$-$k$-means that runs in $O\big(\frac{k^{2}}{\varepsilon^2}(kd + \log(Nd))\big)$ time. Notably, this classical algorithm matches the polylogarithmic dependence on $N$ attained by the quantum algorithms.
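For reference, the classical Lloyd's iteration that $q$-means approximates takes only a few lines of numpy; the initialization and iteration count below are arbitrary illustrative choices.

```python
# Classical Lloyd's iteration; initialization and iteration count are
# arbitrary illustrative choices.
import numpy as np

def lloyd(V, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    c = V[rng.choice(len(V), size=k, replace=False)]   # initial centroids
    for _ in range(iters):
        d = ((V[:, None, :] - c[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                           # nearest centroid
        c = np.array([V[labels == j].mean(0) if (labels == j).any() else c[j]
                      for j in range(k)])
    return c, labels

V = np.random.default_rng(1).normal(size=(500, 8))
centroids, labels = lloyd(V, k=4)
```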
The best approximation of a matrix by $r$ of its columns has, in the Frobenius norm, an error at most $\sqrt{r+1}$ times that of the truncated singular value decomposition. Reaching this bound in practice involves either expensive random volume sampling or at least $r$ executions of the singular value decomposition. In this paper it will be shown that the same column approximation bound can be reached with only a single SVD (which can also be replaced with an approximate SVD). As a corollary, it will be shown how to find, in just $O(Nr^2)$ operations, a highly nondegenerate $r \times r$ submatrix of a matrix with $r$ rows and $N$ columns, which largely shares the properties of the maximum volume submatrix.
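One natural single-SVD construction, sketched below, computes the top-$r$ right singular vectors and then selects columns via a maxvol-style search for a dominant $r\times r$ submatrix of $V_r$; whether this matches the paper's exact procedure and constants is not claimed here.

```python
# Sketch: one SVD, then maxvol-style column selection on the top-r
# right singular vectors V_r (whether this is the paper's exact
# procedure is not claimed). Swapping column cols[i] for column j
# multiplies |det V_r[:, cols]| by |B[i, j]|, so the loop grows the
# submatrix volume until it is dominant up to the factor tol.
import numpy as np
from scipy.linalg import qr

def select_columns(A, r, tol=1.01, iters=100):
    Vr = np.linalg.svd(A, full_matrices=False)[2][:r]   # r x N
    cols = list(qr(Vr, pivoting=True)[2][:r])           # pivoted-QR start
    for _ in range(iters):
        B = np.linalg.solve(Vr[:, cols], Vr)            # identity on cols
        i, j = np.unravel_index(np.abs(B).argmax(), B.shape)
        if abs(B[i, j]) <= tol:
            break
        cols[i] = j
    return cols

A = np.random.default_rng(0).normal(size=(60, 300))
cols = select_columns(A, r=10)
P = A[:, cols] @ np.linalg.pinv(A[:, cols]) @ A         # project onto columns
print("relative error:", np.linalg.norm(A - P) / np.linalg.norm(A))
```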
It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.
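The OE training objective for multiclass classification combines standard cross-entropy on in-distribution data with a term pushing the model's posterior toward the uniform distribution on auxiliary outliers; a minimal PyTorch sketch follows, with $\lambda = 0.5$ as a common choice in the paper's classification experiments.

```python
# Sketch of the OE objective for multiclass classification: standard
# cross-entropy on in-distribution batches plus lambda_oe times the
# cross-entropy from the model's outlier posteriors to the uniform
# distribution; lambda_oe = 0.5 is a common choice in the paper.
import torch
import torch.nn.functional as F

def oe_loss(logits_in, targets_in, logits_out, lambda_oe=0.5):
    ce = F.cross_entropy(logits_in, targets_in)
    # cross-entropy to uniform = mean over classes of -log softmax
    uniform_ce = -F.log_softmax(logits_out, dim=1).mean(dim=1).mean()
    return ce + lambda_oe * uniform_ce

logits_in = torch.randn(32, 10)                  # in-distribution batch
targets_in = torch.randint(0, 10, (32,))
logits_out = torch.randn(32, 10)                 # auxiliary outlier batch
print(oe_loss(logits_in, targets_in, logits_out))
```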