Triangle centrality is introduced for finding important vertices in a graph based on the concentration of triangles surrounding each vertex. An important vertex in triangle centrality is at the center of many triangles, and therefore it may be in many triangles or none at all. We give optimal algorithms that compute triangle centrality in $O(m\sqrt{m})$ time and $O(m+n)$ space. Using fast matrix multiplication it takes $n^{\omega+o(1)}$ time where $\omega$ is the matrix product exponent. On a Concurrent Read Exclusive Write (CREW) Parallel Random Access Memory (PRAM) machine, we give a near work-optimal algorithm that takes $O(\log n)$ time using $O(m\sqrt{m})$ CREW PRAM processors. In MapReduce, we show it takes four rounds using $O(m\sqrt{m})$ communication bits, and is therefore optimal. We also give a deterministic algorithm to find the triangle neighborhood and triangle count of each vertex in $O(m\sqrt{m})$ time and $O(m+n)$ space. Our empirical results demonstrate that triangle centrality uniquely identified central vertices thirty-percent of the time in comparison to five other well-known centrality measures, while being asymptotically faster to compute on sparse graphs than all but the most trivial of these other measures.
Matrix trace estimation is ubiquitous in machine learning applications and has traditionally relied on Hutchinson's method, which requires $O(\log(1/\delta)/\epsilon^2)$ matrix-vector product queries to achieve a $(1 \pm \epsilon)$-multiplicative approximation to $\text{tr}(A)$ with failure probability $\delta$ on positive-semidefinite input matrices $A$. Recently, the Hutch++ algorithm was proposed, which reduces the number of matrix-vector queries from $O(1/\epsilon^2)$ to the optimal $O(1/\epsilon)$, and the algorithm succeeds with constant probability. However, in the high probability setting, the non-adaptive Hutch++ algorithm suffers an extra $O(\sqrt{\log(1/\delta)})$ multiplicative factor in its query complexity. Non-adaptive methods are important, as they correspond to sketching algorithms, which are mergeable, highly parallelizable, and provide low-memory streaming algorithms as well as low-communication distributed protocols. In this work, we close the gap between non-adaptive and adaptive algorithms, showing that even non-adaptive algorithms can achieve $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ matrix-vector products. In addition, we prove matching lower bounds demonstrating that, up to a $\log \log(1/\delta)$ factor, no further improvement in the dependence on $\delta$ or $\epsilon$ is possible by any non-adaptive algorithm. Finally, our experiments demonstrate the superior performance of our sketch over the adaptive Hutch++ algorithm, which is less parallelizable, as well as over the non-adaptive Hutchinson's method.
Many big data algorithms executed on MapReduce-like systems have a shuffle phase that often dominates the overall job execution time. Recent work has demonstrated schemes where the communication load in the shuffle phase can be traded off for the computation load in the map phase. In this work, we focus on a class of distributed algorithms, broadly used in deep learning, where intermediate computations of the same task can be combined. Even though prior techniques reduce the communication load significantly, they require a number of jobs that grows exponentially in the system parameters. This limitation is crucial and may diminish the load gains as the algorithm scales. We propose a new scheme which achieves the same load as the state-of-the-art while ensuring that the number of jobs as well as the number of subfiles that the data set needs to be split into remain small.
We consider the problem of estimating a $d$-dimensional discrete distribution from its samples observed under a $b$-bit communication constraint. In contrast to most previous results that largely focus on the global minimax error, we study the local behavior of the estimation error and provide \emph{pointwise} bounds that depend on the target distribution $p$. In particular, we show that the $\ell_2$ error decays with $O\left(\frac{\lVert p\rVert_{1/2}}{n2^b}\vee \frac{1}{n}\right)$ (In this paper, we use $a\vee b$ and $a \wedge b$ to denote $\max(a, b)$ and $\min(a,b)$ respectively.) when $n$ is sufficiently large, hence it is governed by the \emph{half-norm} of $p$ instead of the ambient dimension $d$. For the achievability result, we propose a two-round sequentially interactive estimation scheme that achieves this error rate uniformly over all $p$. Our scheme is based on a novel local refinement idea, where we first use a standard global minimax scheme to localize $p$ and then use the remaining samples to locally refine our estimate. We also develop a new local minimax lower bound with (almost) matching $\ell_2$ error, showing that any interactive scheme must admit a $\Omega\left( \frac{\lVert p \rVert_{{(1+\delta)}/{2}}}{n2^b}\right)$ $\ell_2$ error for any $\delta > 0$. The lower bound is derived by first finding the best parametric sub-model containing $p$, and then upper bounding the quantized Fisher information under this model. Our upper and lower bounds together indicate that the $\mathcal{H}_{1/2}(p) = \log(\lVert p \rVert_{{1}/{2}})$ bits of communication is both sufficient and necessary to achieve the optimal (centralized) performance, where $\mathcal{H}_{{1}/{2}}(p)$ is the R\'enyi entropy of order $2$. Therefore, under the $\ell_2$ loss, the correct measure of the local communication complexity at $p$ is its R\'enyi entropy.
We develop methods for forming prediction sets in an online setting where the data generating distribution is allowed to vary over time in an unknown fashion. Our framework builds on ideas from conformal inference to provide a general wrapper that can be combined with any black box method that produces point predictions of the unseen label or estimated quantiles of its distribution. While previous conformal inference methods rely on the assumption that the data points are exchangeable, our adaptive approach provably achieves the desired coverage frequency over long-time intervals irrespective of the true data generating process. We accomplish this by modelling the distribution shift as a learning problem in a single parameter whose optimal value is varying over time and must be continuously re-estimated. We test our method, adaptive conformal inference, on two real world datasets and find that its predictions are robust to visible and significant distribution shifts.
We revisit the problem of finding optimal strategies for deterministic Markov Decision Processes (DMDPs), and a closely related problem of testing feasibility of systems of $m$ linear inequalities on $n$ real variables with at most two variables per inequality (2VPI). We give a randomized trade-off algorithm solving both problems and running in $\tilde{O}(nmh+(n/h)^3)$ time using $\tilde{O}(n^2/h+m)$ space for any parameter $h\in [1,n]$. In particular, using subquadratic space we get $\tilde{O}(nm+n^{3/2}m^{3/4})$ running time, which improves by a polynomial factor upon all the known upper bounds for non-dense instances with $m=O(n^{2-\epsilon})$. Moreover, using linear space we match the randomized $\tilde{O}(nm+n^3)$ time bound of Cohen and Megiddo [SICOMP'94] that required $\tilde{\Theta}(n^2+m)$ space. Additionally, we show a new algorithm for the Discounted All-Pairs Shortest Paths problem, introduced by Madani et al. [TALG'10], that extends the DMDPs with optional end vertices. For the case of uniform discount factors, we give a deterministic algorithm running in $\tilde{O}(n^{3/2}m^{3/4})$ time, which improves significantly upon the randomized bound $\tilde{O}(n^2\sqrt{m})$ of Madani et al.
Graphs (networks) are an important tool to model data in different domains. Real-world graphs are usually directed, where the edges have a direction and they are not symmetric. Betweenness centrality is an important index widely used to analyze networks. In this paper, first given a directed network $G$ and a vertex $r \in V(G)$, we propose an exact algorithm to compute betweenness score of $r$. Our algorithm pre-computes a set $\mathcal{RV}(r)$, which is used to prune a huge amount of computations that do not contribute to the betweenness score of $r$. Time complexity of our algorithm depends on $|\mathcal{RV}(r)|$ and it is respectively $\Theta(|\mathcal{RV}(r)|\cdot|E(G)|)$ and $\Theta(|\mathcal{RV}(r)|\cdot|E(G)|+|\mathcal{RV}(r)|\cdot|V(G)|\log |V(G)|)$ for unweighted graphs and weighted graphs with positive weights. $|\mathcal{RV}(r)|$ is bounded from above by $|V(G)|-1$ and in most cases, it is a small constant. Then, for the cases where $\mathcal{RV}(r)$ is large, we present a simple randomized algorithm that samples from $\mathcal{RV}(r)$ and performs computations for only the sampled elements. We show that this algorithm provides an $(\epsilon,\delta)$-approximation to the betweenness score of $r$. Finally, we perform extensive experiments over several real-world datasets from different domains for several randomly chosen vertices as well as for the vertices with the highest betweenness scores. Our experiments reveal that for estimating betweenness score of a single vertex, our algorithm significantly outperforms the most efficient existing randomized algorithms, in terms of both running time and accuracy. Our experiments also reveal that our algorithm improves the existing algorithms when someone is interested in computing betweenness values of the vertices in a set whose cardinality is very small.
A general quantum circuit can be simulated classically in exponential time. If it has a planar layout, then a tensor-network contraction algorithm due to Markov and Shi has a runtime exponential in the square root of its size, or more generally exponential in the treewidth of the underlying graph. Separately, Gottesman and Knill showed that if all gates are restricted to be Clifford, then there is a polynomial time simulation. We combine these two ideas and show that treewidth and planarity can be exploited to improve Clifford circuit simulation. Our main result is a classical algorithm with runtime scaling asymptotically as $n^{\omega/2}<n^{1.19}$ which samples from the output distribution obtained by measuring all $n$ qubits of a planar graph state in given Pauli bases. Here $\omega$ is the matrix multiplication exponent. We also provide a classical algorithm with the same asymptotic runtime which samples from the output distribution of any constant-depth Clifford circuit in a planar geometry. Our work improves known classical algorithms with cubic runtime. A key ingredient is a mapping which, given a tree decomposition of some graph $G$, produces a Clifford circuit with a structure that mirrors the tree decomposition and which emulates measurement of the corresponding graph state. We provide a classical simulation of this circuit with the runtime stated above for planar graphs and otherwise $nt^{\omega-1}$ where $t$ is the width of the tree decomposition. Our algorithm incorporates two subroutines which may be of independent interest. The first is a matrix-multiplication-time version of the Gottesman-Knill simulation of multi-qubit measurement on stabilizer states. The second is a new classical algorithm for solving symmetric linear systems over $\mathbb{F}_2$ in a planar geometry, extending previous works which only applied to non-singular linear systems in the analogous setting.
Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning. Consequently, there has been significant work on efficiently approximating matrix multiplies. We introduce a learning-based algorithm for this task that greatly outperforms existing methods. Experiments using hundreds of matrices from diverse domains show that it often runs $100\times$ faster than exact matrix products and $10\times$ faster than current approximate methods. In the common case that one matrix is known ahead of time, our method also has the interesting property that it requires zero multiply-adds. These results suggest that a mixture of hashing, averaging, and byte shuffling$-$the core operations of our method$-$could be a more promising building block for machine learning than the sparsified, factorized, and/or scalar quantized matrix products that have recently been the focus of substantial research and hardware investment.
Adversarial training is among the most effective techniques to improve the robustness of models against adversarial perturbations. However, the full effect of this approach on models is not well understood. For example, while adversarial training can reduce the adversarial risk (prediction error against an adversary), it sometimes increase standard risk (generalization error when there is no adversary). Even more, such behavior is impacted by various elements of the learning problem, including the size and quality of training data, specific forms of adversarial perturbations in the input, model overparameterization, and adversary's power, among others. In this paper, we focus on \emph{distribution perturbing} adversary framework wherein the adversary can change the test distribution within a neighborhood of the training data distribution. The neighborhood is defined via Wasserstein distance between distributions and the radius of the neighborhood is a measure of adversary's manipulative power. We study the tradeoff between standard risk and adversarial risk and derive the Pareto-optimal tradeoff, achievable over specific classes of models, in the infinite data limit with features dimension kept fixed. We consider three learning settings: 1) Regression with the class of linear models; 2) Binary classification under the Gaussian mixtures data model, with the class of linear classifiers; 3) Regression with the class of random features model (which can be equivalently represented as two-layer neural network with random first-layer weights). We show that a tradeoff between standard and adversarial risk is manifested in all three settings. We further characterize the Pareto-optimal tradeoff curves and discuss how a variety of factors, such as features correlation, adversary's power or the width of two-layer neural network would affect this tradeoff.
We show that for the problem of testing if a matrix $A \in F^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ queries. Our algorithm works for any field $F$. This improves upon the previous $O(d^2/\epsilon^2)$ bound (SODA'03), and bypasses an $\Omega(d^2/\epsilon^2)$ lower bound of (KDD'14) which holds if the algorithm is required to read a submatrix. Our algorithm is the first such algorithm which does not read a submatrix, and instead reads a carefully selected non-adaptive pattern of entries in rows and columns of $A$. We complement our algorithm with a matching query complexity lower bound for non-adaptive testers over any field. We also give tight bounds of $\widetilde{\Theta}(d^2)$ queries in the sensing model for which query access comes in the form of $\langle X_i, A\rangle:=tr(X_i^\top A)$; perhaps surprisingly these bounds do not depend on $\epsilon$. We next develop a novel property testing framework for testing numerical properties of a real-valued matrix $A$ more generally, which includes the stable rank, Schatten-$p$ norms, and SVD entropy. Specifically, we propose a bounded entry model, where $A$ is required to have entries bounded by $1$ in absolute value. We give upper and lower bounds for a wide range of problems in this model, and discuss connections to the sensing model above.