Suffix trees are an important data structure at the core of optimal solutions to many fundamental string problems, such as exact pattern matching, longest common substring, matching statistics, and longest repeated substring. Recent lines of research focused on extending some of these problems to vertex-labeled graphs, although using ad-hoc approaches which in some cases do not generalize to all input graphs. In the absence of a ubiquitous tool like the suffix tree for labeled graphs, we introduce the labeled direct product of two graphs as a general tool for obtaining optimal algorithms: we obtain conceptually simpler algorithms for the quadratic problems of string matching (SMLG) and longest common substring (LCSP) in labeled graphs. Our algorithms are also more efficient, since they run in time linear in the size of the labeled product graph, which may be smaller than quadratic for some inputs, and their run-time is predictable, because the size of the labeled direct product graph can be precomputed efficiently. We also solve LCSP on graphs containing cycles, which was left as an open problem by Shimohira et al. in 2011. To show the power of the labeled product graph, we also apply it to solve the matching statistics (MSP) and the longest repeated string (LRSP) problems in labeled graphs. Moreover, we show that our (worst-case quadratic) algorithms are also optimal, conditioned on the Orthogonal Vectors Hypothesis. Finally, we complete the complexity picture around LRSP by studying it on undirected graphs.
We prove that the path-finding problem in $\ell$-isogeny graphs and the endomorphism ring problem for supersingular elliptic curves are equivalent under reductions of polynomial expected time, assuming the generalised Riemann hypothesis. The presumed hardness of these problems is foundational for isogeny-based cryptography. As an essential tool, we develop a rigorous algorithm for the quaternion analog of the path-finding problem, building upon the heuristic method of Kohel, Lauter, Petit and Tignol. This problem, and its (previously heuristic) resolution, are both a powerful cryptanalytic tool and a building-block for cryptosystems.
We overcome two major bottlenecks in the study of low rank approximation by assuming the low rank factors themselves are sparse. Specifically, (1) for low rank approximation with spectral norm error, we show how to improve the best known $\mathsf{nnz}(\mathbf A) k / \sqrt{\varepsilon}$ running time to $\mathsf{nnz}(\mathbf A)/\sqrt{\varepsilon}$ running time plus low order terms depending on the sparsity of the low rank factors, and (2) for streaming algorithms for Frobenius norm error, we show how to bypass the known $\Omega(nk/\varepsilon)$ memory lower bound and obtain an $s k (\log n)/ \mathrm{poly}(\varepsilon)$ memory bound, where $s$ is the number of non-zeros of each low rank factor. Although this algorithm is inefficient, as it must be under standard complexity theoretic assumptions, we also present polynomial time algorithms using $\mathrm{poly}(s,k,\log n,\varepsilon^{-1})$ memory that output rank $k$ approximations supported on a $O(sk/\varepsilon)\times O(sk/\varepsilon)$ submatrix. Both the prior $\mathsf{nnz}(\mathbf A) k / \sqrt{\varepsilon}$ running time and the $nk/\varepsilon$ memory for these problems were long-standing barriers; our results give a natural way of overcoming them assuming sparsity of the low rank factors.
The rate of convergence of weighted kernel herding (WKH) and sequential Bayesian quadrature (SBQ), two kernel-based sampling algorithms for estimating integrals with respect to some target probability measure, is investigated. Under verifiable conditions on the chosen kernel and target measure, we establish a near-geometric rate of convergence for target measures that are nearly atomic. Furthermore, we show these algorithms perform comparably to the theoretical best possible sampling algorithm under the maximum mean discrepancy. An analysis is also conducted in a distributed setting. Our theoretical developments are supported by empirical observations on simulated data as well as a real world application.
We study the imbalance problem on complete bipartite graphs. The imbalance problem is a graph layout problem and is known to be NP-complete. Graph layout problems find their applications in the optimization of networks for parallel computer architectures, VLSI circuit design, information retrieval, numerical analysis, computational biology, graph theory, scheduling and archaeology. In this paper, we give characterizations for the optimal solutions of the imbalance problem on complete bipartite graphs. Using the characterizations, we can solve the imbalance problem in $\mathcal{O}(\log(|V|) \cdot \log(\log(|V|)))$ time, when given the cardinalities of the parts of the graph, and verify whether a given solution is optimal in $O(|V|)$ time on complete bipartite graphs. We also introduce a restricted form of proper interval bipartite graphs on which the imbalance problem is solvable in $\mathcal{O}(c \cdot \log(|V|) \cdot \log(\log(|V|)))$ time, where $c = \mathcal{O}(|V|)$, by using the aforementioned characterizations.
In 1973, Lemmens and Seidel posed the problem of determining $N_\alpha(r)$, the maximum number of equiangular lines in $\mathbb{R}^r$ with common angle $\arccos(\alpha)$. Recently, this question has been almost completely settled in the case where $r$ is large relative to $1/\alpha$, with the approach relying on Ramsey's theorem. In this paper, we use orthogonal projections of matrices with respect to the Frobenius inner product in order to overcome this limitation, thereby obtaining upper bounds on $N_{\alpha}(r)$ which significantly improve on the only previously known universal bound of Glazyrin and Yu, as well as taking an important step towards determining $N_\alpha(r)$ exactly for all $r, \alpha$. In particular, our results imply that $N_\alpha(r) = \Theta(r)$ for $\alpha \geq \Omega(1/r^{1/5})$. Our arguments rely on a new geometric inequality for equiangular lines in $\mathbb{R}^r$ which is tight when the number of lines meets the absolute bound $\binom{r+1}{2}$. Moreover, using the connection to graphs, we obtain lower bounds on the second eigenvalue of the adjacency matrix of a regular graph which are tight for strongly regular graphs corresponding to $\binom{r+1}{2}$ equiangular lines in $\mathbb{R}^r$. Our results only require that the spectral gap is less than half the number of vertices and can therefore be seen as an extension of the Alon-Boppana theorem to dense graphs. Generalizing to $\mathbb{C}$, we also obtain the first universal bound on the maximum number of complex equiangular lines in $\mathbb{C}^r$ with common Hermitian angle $\arccos(\alpha)$. In particular, we prove an inequality for complex equiangular lines in $\mathbb{C}^r$ which is tight if the number of lines meets the absolute bound $r^2$ and may be of independent interest in quantum theory. Additionally, we use our projection method to obtain an improvement to Welch's bound.
In graph theory, as well as in 3-manifold topology, there exist several width-type parameters to describe how "simple" or "thin" a given graph or 3-manifold is. These parameters, such as pathwidth or treewidth for graphs, or the concept of thin position for 3-manifolds, play an important role when studying algorithmic problems; in particular, there is a variety of problems in computational 3-manifold topology - some of them known to be computationally hard in general - that become solvable in polynomial time as soon as the dual graph of the input triangulation has bounded treewidth. In view of these algorithmic results, it is natural to ask whether every 3-manifold admits a triangulation of bounded treewidth. We show that this is not the case, i.e., that there exists an infinite family of closed 3-manifolds not admitting triangulations of bounded pathwidth or treewidth (the latter implies the former, but we present two separate proofs). We derive these results from work of Agol, of Scharlemann and Thompson, and of Scharlemann, Schultens and Saito by exhibiting explicit connections between the topology of a 3-manifold M on the one hand and width-type parameters of the dual graphs of triangulations of M on the other hand, answering a question that had been raised repeatedly by researchers in computational 3-manifold topology. In particular, we show that if a closed, orientable, irreducible, non-Haken 3-manifold M has a triangulation of treewidth (resp. pathwidth) k then the Heegaard genus of M is at most 18(k+1) (resp. 4(3k+1)).
Entity alignment seeks to find entities in different knowledge graphs (KGs) that refer to the same real-world object. Recent advancement in KG embedding impels the advent of embedding-based entity alignment, which encodes entities in a continuous embedding space and measures entity similarities based on the learned embeddings. In this paper, we conduct a comprehensive experimental study of this emerging field. We survey 23 recent embedding-based entity alignment approaches and categorize them based on their techniques and characteristics. We also propose a new KG sampling algorithm, with which we generate a set of dedicated benchmark datasets with various heterogeneity and distributions for a realistic evaluation. We develop an open-source library including 12 representative embedding-based entity alignment approaches, and extensively evaluate these approaches, to understand their strengths and limitations. Additionally, for several directions that have not been explored in current approaches, we perform exploratory experiments and report our preliminary findings for future studies. The benchmark datasets, open-source library and experimental results are all accessible online and will be duly maintained.
The problem of Approximate Nearest Neighbor (ANN) search is fundamental in computer science and has benefited from significant progress in the past couple of decades. However, most work has been devoted to pointsets whereas complex shapes have not been sufficiently treated. Here, we focus on distance functions between discretized curves in Euclidean space: they appear in a wide range of applications, from road segments to time-series in general dimension. For $\ell_p$-products of Euclidean metrics, for any $p$, we design simple and efficient data structures for ANN, based on randomized projections, which are of independent interest. They serve to solve proximity problems under a notion of distance between discretized curves, which generalizes both discrete Fr\'echet and Dynamic Time Warping distances. These are the most popular and practical approaches to comparing such curves. We offer the first data structures and query algorithms for ANN with arbitrarily good approximation factor, at the expense of increasing space usage and preprocessing time over existing methods. Query time complexity is comparable or significantly improved by our algorithms, our algorithm is especially efficient when the length of the curves is bounded.
Real-world applications often combine learning and optimization problems on graphs. For instance, our objective may be to cluster the graph in order to detect meaningful communities (or solve other common graph optimization problems such as facility location, maxcut, and so on). However, graphs or related attributes are often only partially observed, introducing learning problems such as link prediction which must be solved prior to optimization. We propose an approach to integrate a differentiable proxy for common graph optimization problems into training of machine learning models for tasks such as link prediction. This allows the model to focus specifically on the downstream task that its predictions will be used for. Experimental results show that our end-to-end system obtains better performance on example optimization tasks than can be obtained by combining state of the art link prediction methods with expert-designed graph optimization algorithms.
Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to nd the top-k matches ac- cording to a ranking function over edge and node weights. For users, it is di cult to select value k . We therefore propose the novel notion of an any-k ranking algorithm: for a given time budget, re- turn as many of the top-ranked results as possible. Then, given additional time, produce the next lower-ranked results quickly as well. It can be stopped anytime, but may have to continues until all results are returned. This paper focuses on acyclic patterns over arbitrary labeled graphs. We are interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular selective constraints on labels, and (2) that the users often explore only a fraction of the top-ranked results. Our solution, KARPET, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong non-trivial time and space guarantees, which is generally considered very hard for this type of graph search problem. Through experimental studies we show that KARPET achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges.