In the \emph{$k$-Diameter-Optimally Augmenting Tree Problem} we are given a tree $T$ with $n$ vertices as input. The tree is embedded in an unknown \emph{metric} space, and we have unlimited access to an oracle that, given two distinct vertices $u$ and $v$ of $T$, reports the cost of the edge $(u,v)$ in constant time. We want to augment $T$ with $k$ shortcuts in order to minimize the diameter of the resulting graph. For $k=1$, $O(n \log n)$-time algorithms are known both for paths [Wang, CG 2018] and trees [Bil\`o, TCS 2022]. In this paper we investigate the case of multiple shortcuts. We show that no algorithm that performs $o(n^2)$ queries can compute a better-than-$10/9$-approximate solution for trees when $k\geq 3$. For any constant $\varepsilon > 0$, we instead design a linear-time $(1+\varepsilon)$-approximation algorithm for paths and $k = o(\sqrt{\log n})$, thus establishing a dichotomy between paths and trees for $k\geq 3$. We achieve the claimed running time by designing an ad-hoc data structure, which also serves as a key component of a linear-time $4$-approximation algorithm for trees and of an algorithm that computes the diameter of graphs with $n + k - 1$ edges in $O(n k \log n)$ time, even for non-metric graphs. Our data structure and the latter result are of independent interest.
We study the complexity of evaluating queries on probabilistic databases under bag semantics. We focus on self-join free conjunctive queries, and probabilistic databases where occurrences of different facts are independent, which is the natural generalization of tuple-independent probabilistic databases to the bag semantics setting. For set semantics, the data complexity of this problem is well understood, even for the more general class of unions of conjunctive queries: it is either in polynomial time, or #P-hard, depending on the query (Dalvi & Suciu, JACM 2012). A reasonably general model of bag probabilistic databases may have unbounded multiplicities. In this case, the probabilistic database is no longer finite, and a careful treatment of representation mechanisms is required. Moreover, the answer to a Boolean query is a probability distribution over (possibly all) non-negative integers, rather than a probability distribution over $\{\text{true}, \text{false}\}$. Therefore, we discuss two flavors of probabilistic query evaluation: computing expectations of answer tuple multiplicities, and computing the probability that a tuple is contained in the answer at most $k$ times for some parameter $k$. Subject to mild technical assumptions on the representation systems, it turns out that expectations are easy to compute, even for unions of conjunctive queries. For query answer probabilities, we obtain a dichotomy between solvability in polynomial time and #P-hardness for self-join free conjunctive queries.
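To illustrate why expectations are tractable in the self-join-free case, here is the standard linearity-of-expectation argument; the notation ($\#Q$, $m_{R_i}$, the sum over valuations) is ours and not taken from the abstract. Under bag semantics the Boolean answer multiplicity is a sum over valuations of products of fact multiplicities, and in a self-join-free query the facts appearing in a single product come from distinct relations and are therefore independent, so
\[
\mathbb{E}\big[\#Q(D)\big] \;=\; \mathbb{E}\Big[\sum_{v} \prod_{i=1}^{\ell} m_{R_i}\big(v(\bar{x}_i)\big)\Big] \;=\; \sum_{v} \prod_{i=1}^{\ell} \mathbb{E}\Big[m_{R_i}\big(v(\bar{x}_i)\big)\Big],
\]
where $Q$ has atoms $R_1(\bar{x}_1),\dots,R_\ell(\bar{x}_\ell)$, $v$ ranges over valuations of the query variables, and $m_{R_i}(t)$ is the random multiplicity of the fact $R_i(t)$.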
We consider the $(1+\varepsilon)$-Approximate Nearest Neighbour (ANN) Problem for polygonal curves in $d$-dimensional space under the Fr\'echet distance and ask to what extent known data structures for doubling spaces can be applied to this problem. Initially, this approach does not seem viable, since the doubling dimension of the target space is known to be unbounded -- even for well-behaved polygonal curves of constant complexity in one dimension. In order to overcome this, we identify a subspace of curves which has bounded doubling dimension and small Gromov-Hausdorff distance to the target space. We then apply state-of-the-art techniques for doubling spaces and show how to obtain a data structure for the $(1+\varepsilon)$-ANN problem for any set $S$ of parametrized polygonal curves. The expected preprocessing time needed to construct the data structure is $F(d,k,S,\varepsilon)n\log n$ and the space used is $F(d,k,S,\varepsilon)n$, with a query time of $F(d,k,S,\varepsilon)\log n + F(d,k,S,\varepsilon)^{-\log(\varepsilon)}$, where $F(d,k,S,\varepsilon)=O\left(2^{O(d)}k\Phi(S)\varepsilon^{-1}\right)^k$ and $\Phi(S)$ denotes the spread of the set of vertices and edges of the curves in $S$. We extend these results to the realistic class of $c$-packed curves and show improved bounds for small values of $c$.
We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database. Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For the latter it partially computes the result, while for the former it computes enough information to allow for on-the-fly enumeration. We define the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By appropriately choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries. We show that for a restricted class of hierarchical queries, our approach achieves worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.
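A minimal sketch (ours, not the paper's algorithm) of the degree-based heavy/light split that the computation and maintenance strategies hinge on; the relation contents and the threshold $\sqrt{N}$ are illustrative assumptions.
\begin{verbatim}
# Sketch: partition the values of a join variable into heavy (high-degree)
# and light (low-degree) parts using a threshold on their degree in R.
# The threshold len(R)**0.5 is purely illustrative.
from collections import Counter

def heavy_light_split(R, var_index, threshold):
    """R: list of tuples; var_index: position of the join variable."""
    degree = Counter(t[var_index] for t in R)
    heavy = {v for v, d in degree.items() if d > threshold}
    light = set(degree) - heavy
    return heavy, light

R = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (3, 'b')]
heavy, light = heavy_light_split(R, 0, threshold=len(R) ** 0.5)
# heavy == {1}, light == {2, 3}: results over light values could be
# materialized eagerly, while heavy values are handled at enumeration time.
\end{verbatim}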
We revisit the classic Pandora's Box (PB) problem under correlated distributions on the box values. Recent work of arXiv:1911.01632 obtained constant-factor approximation algorithms for a restricted class of policies for the problem that visit boxes in a fixed order. In this work, we study the complexity of approximating the optimal policy, which may adaptively choose which box to visit next based on the values seen so far. Our main result establishes an approximation-preserving equivalence of PB to the well-studied Uniform Decision Tree (UDT) problem from stochastic optimization and to a variant of the Min-Sum Set Cover ($\text{MSSC}_f$) problem. For distributions of support $m$, UDT admits a $\log m$ approximation, and while a constant-factor approximation in polynomial time is a long-standing open problem, constant-factor approximations are achievable in subexponential time (arXiv:1906.11385). Our main result implies that the same properties hold for PB and $\text{MSSC}_f$. We also study the case where the distribution over values is given more succinctly as a mixture of $m$ product distributions. This problem is again related to a noisy variant of the Optimal Decision Tree problem, which is significantly more challenging. We give a constant-factor approximation that runs in time $n^{ \tilde O( m^2/\varepsilon^2 ) }$ when the mixture components on every box are either identical or separated in TV distance by $\varepsilon$.
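For context, one standard way to state the cost-minimization Pandora's Box objective that such approximation guarantees refer to; the notation ($c_i$, $v_i$, $O_\pi$) is ours, and the paper's exact formulation may differ in details:
\[
\min_{\pi}\; \mathbb{E}\Big[\sum_{i \in O_\pi} c_i \;+\; \min_{i \in O_\pi} v_i\Big],
\]
where the policy $\pi$ adaptively decides which box to open next, $O_\pi$ is the (random) set of boxes it opens, $c_i$ is the known opening cost of box $i$, and $v_i$ is its value, drawn from the correlated distribution and revealed only upon opening.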
We present algorithms that compute the terminal configurations of sandpile instances in $O(n \log n)$ time on trees and $O(n)$ time on paths, where $n$ is the number of vertices. The Abelian Sandpile model is a well-known model used in exploring self-organized criticality. Despite a large amount of work on other aspects of sandpiles, there have been limited results on efficiently computing the terminal state, known as the sandpile prediction problem. Our algorithms improve the previous best runtimes of $O(n \log^5 n)$ on trees [Ramachandran-Schild SODA '17] and $O(n \log n)$ on paths [Moore-Nilsson '99]. To do so, we move beyond the simulation of individual events by directly computing the number of firings for each vertex. The computation is accelerated using splittable binary search trees. We also generalize our algorithm to handle at most three sink vertices, obtaining the first prediction algorithm faster than mere simulation on a sandpile model with sinks. We provide a general reduction that transforms the prediction problem on an arbitrary graph into problems on its subgraphs separated by any vertex set $P$. The reduction gives a time complexity of $O(\log^{|P|} n \cdot T)$, where $T$ denotes the total time for solving the problem on each subgraph. In addition, we give algorithms running in $O(n)$ time on cliques and $O(n \log^2 n)$ time on pseudotrees.
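For contrast, here is the naive event-by-event simulation of sandpile stabilization that direct firing-count computation avoids; this is the standard toppling rule, but the variable names and the small path example are ours, and the sketch assumes a stabilizing instance (e.g., sinks reachable from every vertex).
\begin{verbatim}
# Naive sandpile stabilization by repeated toppling. adj maps each vertex
# to its neighbours; chips maps each vertex to its chip count; sinks never
# topple. A vertex with at least deg(v) chips fires, sending one chip to
# each neighbour.
from collections import deque

def stabilize(adj, chips, sinks=frozenset()):
    chips = dict(chips)
    unstable = deque(v for v in adj
                     if v not in sinks and chips[v] >= len(adj[v]))
    while unstable:
        v = unstable.popleft()
        deg = len(adj[v])
        if v in sinks or chips[v] < deg:
            continue                      # stale queue entry
        fires = chips[v] // deg           # batch all firings of v at once
        chips[v] -= fires * deg
        for u in adj[v]:
            chips[u] += fires
            if u not in sinks and chips[u] >= len(adj[u]):
                unstable.append(u)
    return chips

# Path on 3 vertices, 4 chips in the middle, endpoints acting as sinks.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(stabilize(adj, {0: 0, 1: 4, 2: 0}, sinks={0, 2}))  # {0: 2, 1: 0, 2: 2}
\end{verbatim}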
Computing the diameter of a graph, i.e., the largest distance between any pair of vertices, is a fundamental problem that is central in fine-grained complexity. In undirected graphs, the Strong Exponential Time Hypothesis (SETH) yields a lower bound on the time vs. approximation trade-off that is quite close to the upper bounds. In \emph{directed} graphs, however, where only some of the upper bounds apply, much larger gaps remain. Since $d(u,v)$ may not be the same as $d(v,u)$, there are multiple ways to define the problem, the two most natural being the \emph{(one-way) diameter} ($\max_{u,v} d(u,v)$) and the \emph{roundtrip diameter} ($\max_{u,v} \left(d(u,v)+d(v,u)\right)$). In this paper we make progress on the outstanding open question for each of them. -- We design the first algorithm for diameter in sparse directed graphs to achieve $n^{1.5-\varepsilon}$ time with an approximation factor better than $2$. The new upper bound trade-off makes the directed case appear more similar to the undirected case. Notably, this is the first algorithm for diameter in sparse graphs that benefits from fast matrix multiplication. -- We design new hardness reductions separating roundtrip diameter from directed and undirected diameter. In particular, a $1.5$-approximation in subquadratic time would refute the All-Nodes $k$-Cycle hypothesis, and any $(2-\varepsilon)$-approximation would imply a breakthrough algorithm for approximate $\ell_{\infty}$-Closest-Pair. Notably, these are the first conditional lower bounds for diameter that are not based on SETH.
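A small self-contained illustration of the two definitions (not one of the paper's algorithms): compute all-pairs distances with Floyd-Warshall and take the two maxima; the example digraph is ours.
\begin{verbatim}
# One-way vs. roundtrip diameter of a weighted digraph via Floyd-Warshall
# (O(n^3); purely illustrative, not the paper's subquadratic machinery).
INF = float('inf')

def diameters(n, edges):
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    one_way = max(d[i][j] for i in range(n) for j in range(n))
    roundtrip = max(d[i][j] + d[j][i] for i in range(n) for j in range(n))
    return one_way, roundtrip

# Directed triangle with one expensive back edge.
print(diameters(3, [(0, 1, 1), (1, 2, 1), (2, 0, 5)]))  # (6, 7)
\end{verbatim}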
The fundamental theorem of Tur\'{a}n from Extremal Graph Theory determines the exact bound $t_r(n)$ on the number of edges in an $n$-vertex graph that does not contain a clique of size $r+1$. We establish an interesting link between Extremal Graph Theory and Algorithms by providing a simple compression algorithm that in linear time reduces the problem of finding a clique of size $\ell$ in an $n$-vertex graph $G$ with $m \ge t_r(n)-k$ edges, where $\ell\leq r+1$, to the problem of finding a maximum clique in a graph on at most $5k$ vertices. This also gives us an algorithm deciding in time $2.49^{k}\cdot(n + m)$ whether $G$ has a clique of size $\ell$. As a byproduct of the new compression algorithm, we give an algorithm that in time $2^{\mathcal{O}(td^2)} \cdot n^2$ decides whether a graph contains an independent set of size at least $n/(d+1) + t$, where $d$ is the average vertex degree of the graph $G$. A multivariate complexity analysis based on the ETH indicates that the asymptotic dependence on several parameters in the running times of our algorithms is tight.
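For reference, a standard fact not restated in the abstract: $t_r(n)$ is the number of edges of the Tur\'{a}n graph, the complete $r$-partite graph on $n$ vertices with parts as equal as possible. When $r$ divides $n$,
\[
t_r(n) \;=\; \binom{n}{2} - r\binom{n/r}{2} \;=\; \Big(1-\frac{1}{r}\Big)\frac{n^2}{2},
\]
so the condition $m \ge t_r(n)-k$ asks for a graph that is at most $k$ edges short of the extremal bound.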
We study the Feedback Vertex Set and the Vertex Cover problem in a natural variant of the classical online model that allows for delayed decisions and reservations. Both problems can be characterized by an obstruction set of subgraphs that the online graph needs to avoid. In the case of the Vertex Cover problem, the obstruction set consists of a single edge (i.e., the graph of two adjacent vertices), while for the Feedback Vertex Set problem, the obstruction set contains all cycles. In the delayed-decision model, an algorithm needs to maintain a valid partial solution after every request, thus allowing it to postpone decisions until the current partial solution is no longer valid for the current request. The reservation model grants an online algorithm the new and additional option to pay a so-called reservation cost for any given element in order to delay the decision of adding or rejecting it until the end of the instance. For the Feedback Vertex Set problem, we first analyze the variant with only delayed decisions, proving a lower bound of $4$ and an upper bound of $5$ on the competitive ratio. Then we look at the variant with both delayed decisions and reservations. We show that bounds on the competitive ratio of a problem with delayed decisions imply lower and upper bounds for the same problem when the option of reservations is added. This observation allows us to give a lower bound of $\min\{1+3\alpha,4\}$ and an upper bound of $\min\{1+5\alpha,5\}$ for the Feedback Vertex Set problem. Finally, we show that the online Vertex Cover problem, when both delayed decisions and reservations are allowed, is $\min\{1+2\alpha, 2\}$-competitive, where $\alpha \in \mathbb{R}_{\geq 0}$ is the reservation cost per reserved vertex.
Residual bootstrap is a classical method for statistical inference in regression settings. With massive data sets becoming increasingly common, there is a demand for computationally efficient alternatives to residual bootstrap. We propose a simple and versatile scalable algorithm called subsampled residual bootstrap (SRB) for generalized linear models (GLMs), a large class of regression models that includes the classical linear regression model as well as other widely used models such as logistic, Poisson and probit regression. We prove consistency and distributional results that establish that the SRB has the same theoretical guarantees under the GLM framework as the classical residual bootstrap, while being computationally much faster. We demonstrate the empirical performance of SRB via simulation studies and a real data analysis of the Forest Covertype data from the UCI Machine Learning Repository.
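For context, a minimal numpy sketch of the classical residual bootstrap for linear regression, the procedure that SRB accelerates; per the abstract, SRB is a scalable subsampled variant, which is not reproduced here, and the data and names below are illustrative.
\begin{verbatim}
# Classical residual bootstrap for linear regression (context for SRB).
import numpy as np

def residual_bootstrap(X, y, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # point estimate
    fitted = X @ beta_hat
    resid = y - fitted
    boot = np.empty((B, X.shape[1]))
    for b in range(B):
        # resample residuals with replacement and refit
        y_star = fitted + rng.choice(resid, size=len(y), replace=True)
        boot[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)
    return beta_hat, boot   # replicates approximate the sampling distribution

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)
beta_hat, boot = residual_bootstrap(X, y, B=500)
print(beta_hat, boot.std(axis=0))   # estimates and bootstrap standard errors
\end{verbatim}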
In this paper, we consider algorithms for edge-coloring multigraphs $G$ of bounded maximum degree, i.e., $\Delta(G) = O(1)$. Shannon's theorem states that any multigraph of maximum degree $\Delta$ can be properly edge-colored with $\lfloor 3\Delta/2\rfloor$ colors. Our main results include algorithms for computing such colorings. We design deterministic and randomized sequential algorithms with running time $O(n\log n)$ and $O(n)$, respectively. This is the first improvement since the $O(n^2)$ algorithm in Shannon's original paper, and our randomized algorithm is optimal up to constant factors. We also develop distributed algorithms in the $\mathsf{LOCAL}$ model of computation; namely, we design deterministic and randomized $\mathsf{LOCAL}$ algorithms with running time $\tilde O(\log^5 n)$ and $O(\log^2 n)$, respectively. The deterministic sequential algorithm is a simplified extension of earlier work of Gabow et al. on edge-coloring simple graphs. The other algorithms apply the entropy compression method in a way similar to recent work of the author and Bernshteyn, in which algorithms for Vizing's theorem on simple graphs are designed; we also extend those results to Vizing's theorem for multigraphs.