For sets $\mathcal Q$ and $\mathcal Y$, the generalized Fr\'echet mean $m \in \mathcal Q$ of a random variable $Y$, which takes values in $\mathcal Y$, is any minimizer of $q\mapsto \mathbb E[\mathfrak c(q,Y)]$, where $\mathfrak c \colon \mathcal Q \times \mathcal Y \to \mathbb R$ is a cost function. There are few restrictions on $\mathcal Q$ and $\mathcal Y$. In particular, $\mathcal Q$ can be a non-Euclidean metric space. We provide convergence rates for the empirical generalized Fr\'echet mean. Conditions for rates in probability and rates in expectation are given. In contrast to previous results on Fr\'echet means, we do not require $\mathcal Q$ or $\mathcal Y$ to have finite diameter. Instead, we assume an inequality, which we call the quadruple inequality. It generalizes an otherwise common Lipschitz condition on the cost function. This quadruple inequality is known to hold in Hadamard spaces. We show that it also holds in a suitable form for certain powers of a Hadamard metric.
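For illustration only, the empirical generalized Fr\'echet mean replaces the expectation above by a sample average over observations $y_1, \ldots, y_n$. The sketch below (hypothetical names; a finite candidate set stands in for a general $\mathcal Q$) shows this with the squared-distance cost on the real line, where the minimizer is simply the sample mean.

import numpy as np

def empirical_frechet_mean(candidates, ys, cost):
    """Return the candidate q minimizing the sample average of cost(q, y).

    candidates : iterable over (a discretization of) the set Q
    ys         : observed samples y_1, ..., y_n with values in Y
    cost       : cost function c(q, y) -> float
    """
    best_q, best_val = None, np.inf
    for q in candidates:
        val = np.mean([cost(q, y) for y in ys])
        if val < best_val:
            best_q, best_val = q, val
    return best_q

# With Q = Y = R and squared-distance cost, the empirical Frechet mean is the sample mean.
ys = np.array([0.2, 1.1, 0.7, 2.0])
q_hat = empirical_frechet_mean(np.linspace(0.0, 3.0, 301), ys, lambda q, y: (q - y) ** 2)
print(q_hat)   # -> 1.0, the sample mean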
The Cover Suffix Tree (CST) of a string $T$ is the suffix tree of $T$ with additional explicit nodes corresponding to halves of square substrings of $T$. In the CST, an explicit node corresponding to a substring $C$ of $T$ is annotated with two numbers: the number of non-overlapping consecutive occurrences of $C$ and the total number of positions in $T$ that are covered by occurrences of $C$ in $T$. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-$n$ string in $O(n \log n)$ time. We show how to compute the CST in $O(n)$ time, assuming that $T$ is over an integer alphabet. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that, knowing the CST of a length-$n$ string $T$, one can compute a linear-sized representation of all seeds of $T$ as well as all shortest $\alpha$-partial covers and seeds in $T$ for a given $\alpha$ in $O(n)$ time. Thus, our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012; ACM Trans. Algorithms, 2020). Kociumaka et al. (Algorithmica, 2015) proposed an $O(n \log n)$-time algorithm for computing a shortest $\alpha$-partial cover for each $\alpha=1,\ldots,n$; we improve this complexity to $O(n)$. Our results are based on a new characterization of consecutive overlapping occurrences of a substring $S$ of $T$ in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in $T$. This new insight also leads to an $O(n)$-sized index for reporting overlapping consecutive occurrences of a given pattern $P$ of length $m$ in $O(m+\mathit{output})$ time, where $\mathit{output}$ is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses $O(n \log n)$ space.
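For intuition about the coverage annotation, the brute-force sketch below (hypothetical helper names, quadratic time) enumerates the occurrences of a candidate substring $C$ and counts the text positions they cover; the CST obtains such quantities for all relevant substrings without enumerating occurrences explicitly.

def occurrences(T, C):
    """Starting positions of all occurrences of C in T (naive scan)."""
    m = len(C)
    return [i for i in range(len(T) - m + 1) if T[i:i + m] == C]

def covered_positions(T, C):
    """Number of positions of T covered by at least one occurrence of C."""
    covered = set()
    for i in occurrences(T, C):
        covered.update(range(i, i + len(C)))
    return len(covered)

# Example: C = "aba" occurs in T = "abaababa" at positions 0, 3, 5 and covers all 8 positions,
# so "aba" is a cover of T.
print(occurrences("abaababa", "aba"), covered_positions("abaababa", "aba"))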
We study the tolerant testing problem for high-dimensional samplers. Given as input two samplers $\mathcal{P}$ and $\mathcal{Q}$ over the $n$-dimensional space $\{0,1\}^n$, and two parameters $\varepsilon_2 > \varepsilon_1$, the goal of tolerant testing is to test whether the distributions generated by $\mathcal{P}$ and $\mathcal{Q}$ are $\varepsilon_1$-close or $\varepsilon_2$-far. Since exponential lower bounds (in $n$) are known for the problem in the standard sampling model, research has focused on models where one can draw \textit{conditional} samples. Among these models, \textit{subcube conditioning} ($\mathsf{SUBCOND}$), which allows conditioning on arbitrary subcubes of the domain, holds the promise of widespread adoption in practice owing to its ability to capture the natural behavior of samplers in constrained domains. To translate the promise into practice, we need to overcome two crucial roadblocks for tests based on $\mathsf{SUBCOND}$: the prohibitively large number of queries ($\tilde{\mathcal{O}}(n^5/\varepsilon_2^5)$) and limitation to non-tolerant testing (i.e., $\varepsilon_1 = 0$). The primary contribution of this work is to overcome the above challenges: we design a new tolerant testing methodology (i.e., $\varepsilon_1 \geq 0$) that allows us to significantly improve the upper bound to $\tilde{\mathcal{O}}(n^3/(\varepsilon_2-\varepsilon_1)^5)$.
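As a hedged illustration of the $\mathsf{SUBCOND}$ access model only (not the paper's tester), a subcube condition fixes a subset of coordinates to given bits and requests samples consistent with that partial assignment. The rejection-sampling wrapper below is a hypothetical stand-in for a sampler that supports such conditioning natively.

import random

def subcond_sample(sampler, restriction, max_tries=100000):
    """Draw one sample from `sampler` conditioned on a subcube.

    sampler     : zero-argument function returning a tuple in {0,1}^n
    restriction : dict {coordinate index: fixed bit} defining the subcube
    """
    for _ in range(max_tries):
        x = sampler()
        if all(x[i] == b for i, b in restriction.items()):
            return x
    raise RuntimeError("subcube has too little mass for rejection sampling")

# Example: a uniform sampler over {0,1}^4, conditioned on the subcube x0 = 1, x3 = 0.
uniform = lambda: tuple(random.randint(0, 1) for _ in range(4))
print(subcond_sample(uniform, {0: 1, 3: 0}))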
In the misspecified spectral algorithms problem, researchers usually assume that the underlying true function $f_{\rho}^{*}$ lies in $[\mathcal{H}]^{s}$, a less-smooth interpolation space of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, for some $s\in (0,1)$. The existing minimax optimal results require $\|f_{\rho}^{*}\|_{L^{\infty}}<\infty$, which implicitly requires $s > \alpha_{0}$, where $\alpha_{0}\in (0,1)$ is the embedding index, a constant depending on $\mathcal{H}$. Whether the spectral algorithms are optimal for all $s\in (0,1)$ has been an open problem for years. In this paper, we show that spectral algorithms are minimax optimal for any $\alpha_{0}-\frac{1}{\beta} < s < 1$, where $\beta$ is the eigenvalue decay rate of $\mathcal{H}$. We also give several classes of RKHSs whose embedding index satisfies $ \alpha_0 = \frac{1}{\beta} $. Thus, the spectral algorithms are minimax optimal for all $s\in (0,1)$ on these RKHSs.
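For orientation (a hedged restatement in the standard terms of this literature, not a formula quoted from the paper): when the eigenvalues of $\mathcal H$ decay as $\lambda_i \asymp i^{-\beta}$ and $f_\rho^* \in [\mathcal H]^s$, the minimax rate in question typically takes the form $$\inf_{\hat f} \sup_{f_{\rho}^{*} \in [\mathcal H]^{s}} \mathbb E\, \|\hat f - f_{\rho}^{*}\|_{L^{2}}^{2} \asymp n^{-\frac{s\beta}{s\beta + 1}},$$ so the result can be read as saying that spectral algorithms attain this rate for every $s$ with $\alpha_{0} - \frac{1}{\beta} < s < 1$, and hence for all $s \in (0,1)$ whenever $\alpha_{0} = \frac{1}{\beta}$.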
We introduce TeraHAC, a $(1+\epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach that combines the nearest-neighbor chain algorithm with the notion of $(1+\epsilon)$-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds than previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.
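To ground the terminology, the sketch below implements the classical nearest-neighbor chain algorithm for exact average-linkage HAC on a small dissimilarity matrix; it is not TeraHAC itself, and the helper names and the assumption of distinct dissimilarities are illustrative simplifications.

import numpy as np

def nn_chain_hac(D):
    """Exact average-linkage HAC via the nearest-neighbor chain algorithm.

    D is a symmetric dissimilarity matrix (assumed to have distinct off-diagonal
    entries for simplicity). Returns the merges as (cluster_i, cluster_j, distance).
    """
    D = D.astype(float).copy()
    n = D.shape[0]
    active = set(range(n))
    size = np.ones(n)
    merges, chain = [], []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))
        while True:
            a = chain[-1]
            b = min((c for c in active if c != a), key=lambda c: D[a, c])
            if len(chain) > 1 and b == chain[-2]:
                break                              # a and b are reciprocal nearest neighbors
            chain.append(b)
        a = chain.pop()
        b = chain.pop()
        merges.append((a, b, D[a, b]))
        # Lance-Williams update for average linkage; the merged cluster reuses index a.
        for c in active - {a, b}:
            D[a, c] = D[c, a] = (size[a] * D[a, c] + size[b] * D[b, c]) / (size[a] + size[b])
        size[a] += size[b]
        active.remove(b)
        # Average linkage is reducible, so the remaining chain stays valid after the merge.
    return merges

# Tiny example on four points.
D = np.array([[0, 1, 4, 5], [1, 0, 3, 6], [4, 3, 0, 2], [5, 6, 2, 0]])
print(nn_chain_hac(D))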
Let $ \bbB_n =\frac{1}{n}(\bbR_n + \bbT^{1/2}_n \bbX_n)(\bbR_n + \bbT^{1/2}_n \bbX_n)^* $, where $ \bbX_n $ is a $ p \times n $ matrix with independent standardized random variables, $ \bbR_n $ is a $ p \times n $ non-random matrix representing the information, and $ \bbT_{n} $ is a $ p \times p $ non-random nonnegative definite Hermitian matrix. Under some conditions on $ \bbR_n \bbR_n^* $ and $ \bbT_n $, it has been proved that, for any closed interval outside the support of the limiting spectral distribution, with probability one no eigenvalues fall in this interval for all $ p $ sufficiently large. The purpose of this paper is to continue the study of the support of the limiting spectral distribution: we show that an exact separation phenomenon holds, that is, with probability one, the proper number of eigenvalues lies on either side of each such interval.
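As a purely numerical illustration (the dimensions, distributions, and choices of $\bbR_n$ and $\bbT_n$ below are arbitrary, and the theorem concerns the limiting regime rather than any finite simulation), one can sample $\bbB_n$ and count how many eigenvalues fall on each side of a candidate gap in the spectrum.

import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 600                                   # dimensions chosen arbitrarily
X = rng.standard_normal((p, n))                   # independent standardized entries
R = np.zeros((p, n))
R[:5, :5] = 10.0                                  # an arbitrary low-rank "information" matrix
t = np.concatenate([np.ones(p // 2), 4.0 * np.ones(p - p // 2)])
T_half = np.diag(np.sqrt(t))                      # T_n^{1/2} for a diagonal T_n

M = R + T_half @ X                                # entries are real, so * reduces to transpose
B = (M @ M.T) / n
eigs = np.linalg.eigvalsh(B)

# Inspect a histogram of `eigs` to locate gaps in the spectrum, then count how many
# eigenvalues lie on each side of a chosen interval (a, b) inside such a gap.
a, b = 2.5, 3.2                                   # hypothetical interval; adjust after inspection
print((eigs < a).sum(), ((eigs >= a) & (eigs <= b)).sum(), (eigs > b).sum())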
A $c$-labeling $\phi: V(G) \rightarrow \{1, 2, \hdots, c \}$ of graph $G$ is distinguishing if, for every non-trivial automorphism $\pi$ of $G$, there is some vertex $v$ so that $\phi(v) \neq \phi(\pi(v))$. The distinguishing number of $G$, $D(G)$, is the smallest $c$ such that $G$ has a distinguishing $c$-labeling. We consider a compact version of Tyshkevich's graph decomposition theorem where trivial components are maximally combined to form a complete graph or a graph of isolated vertices. Suppose the compact canonical decomposition of $G$ is $G_{k} \circ G_{k-1} \circ \cdots \circ G_1 \circ G_0$. We prove that $\phi$ is a distinguishing labeling of $G$ if and only if $\phi$ is a distinguishing labeling of $G_i$ when restricted to $V(G_i)$ for $i = 0, \hdots, k$. Thus, $D(G) = \max \{D(G_i), i = 0, \hdots, k \}$. We then present an algorithm that computes the distinguishing number of a unigraph in linear time.
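For small graphs the definition can be checked directly. The brute-force sketch below (hypothetical helper names, exponential time) enumerates all automorphisms and tests whether a labeling is distinguishing; the paper's unigraph algorithm achieves its goal in linear time by entirely different means.

from itertools import permutations

def automorphisms(adj):
    """Yield all automorphisms of a graph given as a dict {vertex: set of neighbors}."""
    verts = sorted(adj)
    for perm in permutations(verts):
        pi = dict(zip(verts, perm))
        if all((pi[u] in adj[pi[v]]) == (u in adj[v]) for v in verts for u in verts if u != v):
            yield pi

def is_distinguishing(adj, phi):
    """True if every non-trivial automorphism moves some vertex to a differently labeled one."""
    return all(any(phi[v] != phi[pi[v]] for v in adj)
               for pi in automorphisms(adj) if any(pi[v] != v for v in adj))

# Example: the path a-b-c has one non-trivial automorphism (swapping a and c),
# so labeling the two endpoints differently is distinguishing, and D(P3) = 2.
P3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(is_distinguishing(P3, {"a": 1, "b": 1, "c": 2}))   # True
print(is_distinguishing(P3, {"a": 1, "b": 2, "c": 1}))   # False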
A function $f : U \to \{0,\ldots,n-1\}$ is a minimal perfect hash function for a set $S \subseteq U$ of size $n$ if $f$ bijectively maps $S$ onto the first $n$ natural numbers. These functions are important for many practical applications in computing, such as search engines, computer networks, and databases. Several algorithms have been proposed to build minimal perfect hash functions that scale well to large sets, retain fast evaluation time, and take very little space, e.g., 2--3 bits/key. PTHash is one such algorithm, achieving very fast evaluation in compressed space, typically several times faster than other techniques. In this work, we propose a new construction algorithm for PTHash enabling: (1) multi-threading, to either build functions more quickly or more space-efficiently, and (2) external-memory processing, to scale to inputs much larger than the available internal memory. Only a few other algorithms in the literature share these features, despite their big practical impact. We conduct an extensive experimental assessment on large real-world string collections and show that, with respect to other techniques, PTHash is competitive in construction time and space consumption while retaining 2--6$\times$ better lookup time.
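To make the contract concrete, here is a toy stand-in for an MPHF over a static key set (illustration only: unlike PTHash or any real MPHF, it stores the keys explicitly instead of using a few bits per key, and the class name is made up).

class ToyMPHF:
    """Illustrative minimal perfect hash: a bijection from a fixed key set S onto {0, ..., n-1}.

    A real MPHF provides the same contract without storing the keys, using only a few
    bits per key; this toy version exists purely to show the interface.
    """
    def __init__(self, keys):
        self._index = {k: i for i, k in enumerate(keys)}   # assign ranks 0..n-1

    def __call__(self, key):
        # As with a real MPHF, the result is meaningless for keys outside S.
        return self._index[key]

S = ["apple", "banana", "cherry", "date"]
f = ToyMPHF(S)
assert sorted(f(k) for k in S) == list(range(len(S)))      # bijective onto {0,...,n-1}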
A package query returns a package -- a multiset of tuples -- that maximizes or minimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Prior work has established the equivalence of package queries to Integer Linear Programs (ILPs) and developed the SketchRefine algorithm for package query processing. While this algorithm was an important first step toward supporting prescriptive analytics scalably inside a relational database, it struggles when the data size grows beyond a few hundred million tuples or when the constraints become very tight. In this paper, we present Progressive Shading, a novel algorithm for processing package queries that can scale efficiently to billions of tuples and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original tuples into homogeneous groups until the original relation is obtained. This strategy avoids the premature discarding of high-quality tuples that can occur with SketchRefine. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as the KD-tree. We further optimize our system by replacing our off-the-shelf optimization software with customized ILP and LP solvers, called Dual Reducer and Parallel Dual Simplex, respectively, which are highly accurate and orders of magnitude faster.
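As a hedged illustration of the ILP view (a generic example, not one taken from the paper), a package query that picks a multiset of tuples maximizing total value under a budget and a cardinality limit can be written as $$\max \sum_{i=1}^{N} v_i\, x_i \quad \text{subject to} \quad \sum_{i=1}^{N} w_i\, x_i \le B, \qquad \sum_{i=1}^{N} x_i \le K, \qquad x_i \in \mathbb{Z}_{\ge 0},$$ where $x_i$ counts how many copies of tuple $i$ enter the package and $v_i$, $w_i$, $B$, $K$ are hypothetical value, weight, budget, and size-limit parameters.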
We show that the minimax sample complexity for estimating the pseudo-spectral gap $\gamma_{\mathsf{ps}}$ of an ergodic Markov chain up to constant multiplicative error is of the order of $$\tilde{\Theta}\left( \frac{1}{\gamma_{\mathsf{ps}} \pi_{\star}} \right),$$ where $\pi_\star$ is the minimum stationary probability. This recovers the known bound in the reversible setting for estimating the absolute spectral gap [Hsu et al., 2019] and resolves an open problem of Wolfer and Kontorovich [2019]. Furthermore, we strengthen the known empirical procedure by making it fully adaptive to the data, narrowing the confidence intervals, and reducing the computational complexity. Along the way, we derive new properties of the pseudo-spectral gap and introduce the notion of a reversible dilation of a stochastic matrix.
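For reference, a standard definition (due to Paulin) is $\gamma_{\mathsf{ps}} = \max_{k \ge 1} \gamma\big((P^*)^k P^k\big)/k$, where $P^*$ is the time reversal of the transition matrix $P$ and $\gamma(\cdot)$ denotes the spectral gap. The sketch below is illustrative only: it truncates the maximum at a finite $k$ and assumes $P$ is known rather than estimated from sample paths, which is the setting of the paper.

import numpy as np

def pseudo_spectral_gap(P, k_max=50):
    """Approximate gamma_ps = max_{k>=1} gamma((P*)^k P^k) / k, truncating the max at k_max.

    P is a row-stochastic transition matrix of an ergodic chain; gamma(.) is the spectral
    gap 1 - lambda_2 of the (reversible) multiplicative reversiblization (P*)^k P^k.
    """
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1))])
    pi = pi / pi.sum()
    P_star = np.diag(1.0 / pi) @ P.T @ np.diag(pi)       # time reversal of P
    best = 0.0
    for k in range(1, k_max + 1):
        S = np.linalg.matrix_power(P_star, k) @ np.linalg.matrix_power(P, k)
        lam = np.sort(np.real(np.linalg.eigvals(S)))[::-1]
        best = max(best, (1.0 - lam[1]) / k)
    return best

# Example: a small lazy chain on three states (an assumed example, not from the paper).
P = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
print(pseudo_spectral_gap(P))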
Previous versions of sparse principal component analysis (PCA) have presumed that the eigen-basis (a $p \times k$ matrix) is approximately sparse. We propose a method that presumes the $p \times k$ matrix becomes approximately sparse after a $k \times k$ rotation. The simplest version of the algorithm initializes with the leading $k$ principal components. Then, the principal components are rotated with a $k \times k$ orthogonal rotation to make them approximately sparse. Finally, soft-thresholding is applied to the rotated principal components. This approach differs from prior approaches because it uses an orthogonal rotation to approximate a sparse basis. One consequence is that a sparse component need not be a leading eigenvector, but can instead be a mixture of them. In this way, we propose a new (rotated) basis for sparse PCA. In addition, our approach avoids "deflation" and the multiple tuning parameters it requires. Our sparse PCA framework is versatile; for example, it extends naturally to a two-way analysis of a data matrix for simultaneous dimensionality reduction of rows and columns. We provide evidence showing that, for the same level of sparsity, the proposed sparse PCA method is more stable and can explain more variance compared to alternative methods. Through three applications -- sparse coding of images, analysis of transcriptome sequencing data, and large-scale clustering of social networks -- we demonstrate the modern usefulness of sparse PCA in exploring multivariate data.
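A hedged sketch of the three steps just described (leading principal components, an orthogonal varimax-style rotation, then soft-thresholding): the threshold, the use of varimax as the rotation criterion, and the random data below are placeholder choices for illustration, not the paper's tuned procedure.

import numpy as np

def varimax(U, n_iter=100, tol=1e-8):
    """Orthogonal varimax rotation R (k x k) making the columns of U @ R approximately sparse."""
    p, k = U.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(n_iter):
        L = U @ R
        G = U.T @ (L ** 3 - L @ np.diag(np.mean(L ** 2, axis=0)))
        u, s, vt = np.linalg.svd(G)
        R = u @ vt
        if s.sum() < var_old * (1 + tol):
            break
        var_old = s.sum()
    return R

def soft_threshold(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def rotated_sparse_pca(X, k, t):
    """Leading k principal components -> varimax rotation -> soft-thresholding (illustrative)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:k].T                      # p x k matrix of leading loadings
    R = varimax(U)
    return soft_threshold(U @ R, t)   # approximately sparse p x k loadings

# Toy usage on random data (placeholder): 100 samples, 20 variables, 3 components.
X = np.random.default_rng(0).standard_normal((100, 20))
B = rotated_sparse_pca(X, k=3, t=0.1)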