A nearest neighbor representation of a Boolean function $f$ is a set of vectors (anchors) labeled by $0$ or $1$ such that $f(\vec{x}) = 1$ if and only if the closest anchor to $\vec{x}$ is labeled by $1$. This model was introduced by Hajnal, Liu, and Tur\'an (2022), who studied bounds on the number of anchors required to represent Boolean functions under different choices of anchors (real vs. Boolean vectors) as well as the more expressive model of $k$-nearest neighbors. We initiate the study of the representational power of nearest and $k$-nearest neighbors through Boolean circuit complexity. To this end, we establish a connection between Boolean functions with polynomial nearest neighbor complexity and those that can be efficiently represented by classes based on linear inequalities -- min-plus polynomial threshold functions -- previously studied in relation to threshold circuits. This extends an observation of Hajnal et al. (2022). We obtain exponential lower bounds on the $k$-nearest neighbors complexity of explicit $n$-variate functions, assuming $k \leq n^{1-\epsilon}$. Previously, no superlinear lower bound was known for any $k>1$. Next, we further extend the connection between nearest neighbor representations and circuits to the $k$-nearest neighbors case. As a result, we show that proving superpolynomial lower bounds for the $k$-nearest neighbors complexity of an explicit function for arbitrary $k$ would require a breakthrough in circuit complexity. In addition, we prove an exponential separation between the nearest neighbor and $k$-nearest neighbors complexity (for unrestricted $k$) of an explicit function. These results address the questions raised by Hajnal et al. (2022) about proving strong lower bounds for $k$-nearest neighbors and understanding the role of the parameter $k$. Finally, we devise new bounds on the nearest neighbor complexity of several explicit functions.
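A minimal illustration of the model (a toy example of ours, not taken from the paper): the two-variable XOR function has a nearest neighbor representation with four Boolean anchors, one per input point; under $k$-nearest neighbors, the output would instead be the majority label among the $k$ closest anchors.

    import numpy as np

    # Toy nearest neighbor representation of two-variable XOR: each Boolean
    # point serves as its own anchor, labeled by XOR's value at that point.
    anchors = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])
    labels = np.array([0, 0, 1, 1])

    def nn_eval(x):
        """f(x) = label of the anchor closest to x in Euclidean distance."""
        return labels[np.argmin(np.linalg.norm(anchors - x, axis=1))]

    for a in (0, 1):
        for b in (0, 1):
            assert nn_eval(np.array([a, b], dtype=float)) == a ^ b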
In batch Kernel Density Estimation (KDE) for a kernel function $f$, we are given as input $2n$ points $x^{(1)}, \ldots, x^{(n)}, y^{(1)}, \ldots, y^{(n)}$ in dimension $m$, as well as a vector $w \in \mathbb{R}^n$. These inputs implicitly define the $n \times n$ kernel matrix $K$ given by $K[i,j] = f(x^{(i)}, y^{(j)})$. The goal is to compute a vector $v$ which approximates $Kw$, i.e., with $\| Kw - v\|_\infty < \varepsilon \|w\|_1$. A recent line of work has proved fine-grained lower bounds for this problem conditioned on the Strong Exponential Time Hypothesis (SETH). Backurs et al. first showed the hardness of KDE for Gaussian-like kernels with high dimension $m = \Omega(\log n)$ and large scale $B = \Omega(\log n)$. Alman et al. later developed new reductions in roughly this same parameter regime, leading to lower bounds for more general kernels, but only for very small error $\varepsilon < 2^{-\log^{\Omega(1)} (n)}$. In this paper, we refine the approach of Alman et al. to show new lower bounds in all parameter regimes, closing gaps between the known algorithms and lower bounds. In the setting where $m = C\log n$ and $B = o(\log n)$, we prove that Gaussian KDE requires $n^{2-o(1)}$ time to achieve additive error $\varepsilon < \Omega(m/B)^{-m}$, matching the performance of the polynomial method up to low-order terms. In the low-dimensional setting $m = o(\log n)$, we show that Gaussian KDE requires $n^{2-o(1)}$ time to achieve $\varepsilon$ such that $\log \log (\varepsilon^{-1}) > \tilde \Omega ((\log n)/m)$, matching the error bound achievable by the Fast Multipole Method (FMM) up to low-order terms. To our knowledge, no nontrivial lower bound was previously known in this regime. Our new lower bounds make use of an intricate analysis of a special case of the kernel matrix -- the `counting matrix'. As a key technical lemma, we give a novel approach to bounding the entries of its inverse by using Schur polynomials from algebraic combinatorics.
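To pin down the objective, here is the exact quadratic-time baseline that the conditional lower bounds show is essentially unbeatable in the stated regimes; this is a generic sketch for the kernel $f(x,y) = e^{-\|x-y\|_2^2}$, with the paper's scale parameter $B$ abstracted away.

    import numpy as np

    def gaussian_kde_matvec(X, Y, w):
        """Naive O(n^2 m) computation of Kw for the Gaussian kernel
        K[i, j] = exp(-||x_i - y_j||^2)."""
        sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=2)
        return np.exp(-sq) @ w

    n, m = 100, 5
    rng = np.random.default_rng(0)
    X, Y, w = rng.normal(size=(n, m)), rng.normal(size=(n, m)), rng.normal(size=n)
    v_exact = gaussian_kde_matvec(X, Y, w)
    # An output v_hat is acceptable if
    # np.max(np.abs(v_exact - v_hat)) < eps * np.sum(np.abs(w)).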
Let $\mathsf{TH}_k$ denote the $k$-out-of-$n$ threshold function: given $n$ input Boolean variables, the output is $1$ if and only if at least $k$ of the inputs are $1$. We consider the problem of computing the $\mathsf{TH}_k$ function using noisy readings of the Boolean variables, where each reading is incorrect with some fixed and known probability $p \in (0,1/2)$. As our main result, we show that it is sufficient to use $(1+o(1)) \frac{n\log \frac{m}{\delta}}{D_{\mathsf{KL}}(p \| 1-p)}$ queries in expectation to compute the $\mathsf{TH}_k$ function with a vanishing error probability $\delta = o(1)$, where $m\triangleq \min\{k,n-k\}$ and $D_{\mathsf{KL}}(p \| 1-p)$ denotes the Kullback-Leibler divergence between the $\mathsf{Bern}(p)$ and $\mathsf{Bern}(1-p)$ distributions. Conversely, we show that any algorithm achieving an error probability of $\delta = o(1)$ requires at least $(1-o(1))\frac{(n-m)\log\frac{m}{\delta}}{D_{\mathsf{KL}}(p \| 1-p)}$ queries in expectation. The upper and lower bounds are tight when $m=o(n)$, and are within a multiplicative factor of $\frac{n}{n-m}$ of each other when $m=\Theta(n)$. In particular, when $k=n/2$, the $\mathsf{TH}_k$ function corresponds to the $\mathsf{MAJORITY}$ function, in which case the upper and lower bounds are tight up to a multiplicative factor of two. Compared to previous work, our result tightens the dependence on $p$ in both the upper and lower bounds.
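As a point of reference (our own naive baseline, not the paper's adaptive scheme), one can estimate every bit by majority vote over repeated noisy readings and then threshold; this already exhibits the $1/D_{\mathsf{KL}}(p \| 1-p)$ scaling, but with a $\log(n/\delta)$ rather than $\log(m/\delta)$ dependence per bit.

    import math, random

    def d_kl(p):
        """KL divergence between Bern(p) and Bern(1-p), in nats."""
        return p * math.log(p / (1 - p)) + (1 - p) * math.log((1 - p) / p)

    def noisy_read(bit, p, rng):
        """One noisy query: the bit is flipped with probability p."""
        return bit ^ (rng.random() < p)

    def naive_th_k(bits, k, p, delta, rng):
        """Estimate each bit by majority vote over r noisy readings, with
        r ~ log(n/delta) / D_KL(p || 1-p), then threshold at k."""
        n = len(bits)
        r = max(1, math.ceil(math.log(n / delta) / d_kl(p)))
        est = [2 * sum(noisy_read(b, p, rng) for _ in range(r)) > r for b in bits]
        return sum(est) >= k

    rng = random.Random(1)
    bits = [1] * 30 + [0] * 70
    print(naive_th_k(bits, k=25, p=0.2, delta=0.01, rng=rng))  # True w.h.p.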
In the Directed Steiner Tree (DST) problem the input is a directed edge-weighted graph $G=(V,E)$, a root vertex $r$ and a set $S \subseteq V$ of $k$ terminals. The goal is to find a min-cost subgraph that connects $r$ to each of the terminals. DST admits an $O(\log^2 k/\log \log k)$-approximation in quasi-polynomial time, and an $O(k^{\epsilon})$-approximation for any fixed $\epsilon > 0$ in polynomial time. Resolving the existence of a polynomial-time poly-logarithmic approximation is a major open problem in approximation algorithms. In a recent work, Friggstad and Mousavi [ICALP 2023] obtained a simple and elegant polynomial-time $O(\log k)$-approximation for DST in planar digraphs via Thorup's shortest path separator theorem. We build on their work and obtain several new results on DST and related problems.
- We develop a tree embedding technique for rooted problems in planar digraphs via an interpretation of the recursion in Friggstad and Mousavi [ICALP 2023]. Using this we obtain polynomial-time poly-logarithmic approximations for the Group Steiner Tree, Covering Steiner Tree, and Polymatroid Steiner Tree problems in planar digraphs. All these problems are hard to approximate to within a factor of $\Omega(\log^2 n/\log \log n)$ even in trees.
- We prove that the natural cut-based LP relaxation for DST (written out below) has an integrality gap of $O(\log^2 k)$ in planar graphs. This is in contrast to general graphs, where the integrality gap of this LP is known to be $\Omega(k)$ and $\Omega(n^{\delta})$ for some fixed $\delta > 0$.
- We combine the preceding results with density-based arguments to obtain poly-logarithmic approximations for the multi-rooted versions of these problems in planar digraphs. For DST, our result improves the $O(R + \log k)$ approximation of Friggstad and Mousavi [ICALP 2023] when $R = \omega(\log^2 k)$.
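For reference, the cut-based LP relaxation appearing in the second item is the standard one below, writing $U$ for cut sets (to avoid clashing with the terminal set $S$) and $\delta^{-}(U)$ for the set of edges entering $U$:
\[
\begin{aligned}
\min\ & \sum_{e \in E} c_e x_e \\
\text{s.t.}\ & \sum_{e \in \delta^{-}(U)} x_e \geq 1 \quad \forall\, U \subseteq V \setminus \{r\} \text{ with } U \cap S \neq \emptyset, \\
& x_e \geq 0 \quad \forall\, e \in E.
\end{aligned}
\]
An integral solution must pick, for every such cut $U$, at least one edge crossing into $U$, which is exactly the requirement that $r$ can reach every terminal.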
We prove a lower bound on the communication complexity of computing the $n$-fold XOR of an arbitrary function $f$, in terms of the communication complexity and rank of $f$. We prove that $D(f^{\oplus n}) \geq n \cdot \Big(\frac{\Omega(D(f))}{\log \mathsf{rk}(f)} - \log \mathsf{rk}(f)\Big)$, where $D(f)$ and $D(f^{\oplus n})$ denote deterministic communication complexity, and $\mathsf{rk}(f)$ is the rank of the communication matrix of $f$. Our methods involve a new way to use information theory to reason about deterministic communication complexity.
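Here $f^{\oplus n}$ denotes the usual $n$-fold XOR composition in the two-party setting: Alice holds $(x_1,\ldots,x_n)$, Bob holds $(y_1,\ldots,y_n)$, and
\[
f^{\oplus n}\big((x_1,\ldots,x_n),(y_1,\ldots,y_n)\big) = f(x_1,y_1) \oplus \cdots \oplus f(x_n,y_n),
\]
so the bound above is a direct-sum-type statement: up to the rank-dependent loss, the $n$-fold XOR costs $n$ times the complexity of a single copy.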
Let $P$ be a set of $n$ points in $\mathbb{R}^d$, and let $\varepsilon,\psi \in (0,1)$ be parameters. Here, we consider the task of constructing a $(1+\varepsilon)$-spanner for $P$ in which every edge might fail (independently) with probability $1-\psi$. For example, for $\psi=0.1$, about $90\%$ of the edges of the graph fail. Nevertheless, we show how to construct a spanner that survives such a catastrophe with a near-linear number of edges. The measure of reliability of the constructed graph is how many pairs of vertices lose $(1+\varepsilon)$-connectivity. Surprisingly, despite the spanner constructed being of near-linear size, the number of failed pairs is close to the number of failed pairs if the underlying graph were a clique. Specifically, we show how to construct such an exact dependable spanner in one dimension of size $O(\tfrac{n}{\psi} \log n)$, which is optimal. Next, we build a $(1+\varepsilon)$-spanner for a set $P \subseteq \mathbb{R}^d$ of $n$ points, of size $O(C n \log n)$, where $C \approx 1/\bigl(\varepsilon^{d} \psi^{4/3}\bigr)$. Surprisingly, these new spanners also have the property that almost all pairs of vertices have a path between them with at most four hops that realizes this short distance.
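A small simulation of this reliability measure (an illustration of ours, using the clique as a trivial baseline spanner rather than the paper's construction):

    import itertools, random
    import networkx as nx

    def count_failed_pairs(edges, coords, psi, eps, rng):
        """Each edge survives independently with probability psi; count the
        pairs of points that lose (1+eps)-connectivity relative to their
        distance (1-D coordinates here, for simplicity)."""
        g = nx.Graph()
        g.add_nodes_from(range(len(coords)))
        g.add_weighted_edges_from((u, v, w) for u, v, w in edges
                                  if rng.random() < psi)
        dist = dict(nx.all_pairs_dijkstra_path_length(g, weight='weight'))
        return sum(1 for i, j in itertools.combinations(range(len(coords)), 2)
                   if dist[i].get(j, float('inf'))
                   > (1 + eps) * abs(coords[i] - coords[j]))

    rng = random.Random(0)
    coords = [float(i) for i in range(20)]
    clique = [(i, j, abs(coords[i] - coords[j]))
              for i, j in itertools.combinations(range(20), 2)]
    print(count_failed_pairs(clique, coords, psi=0.1, eps=0.1, rng=rng))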
Given a set $P$ of $n$ points and a set $S$ of $m$ disks in the plane, the disk hitting set problem asks for a smallest subset of $P$ such that every disk of $S$ contains at least one point in the subset. The problem is NP-hard. In this paper, we consider a line-constrained version in which all disks have their centers on a line. We present an $O(m\log^2n+(n+m)\log(n+m))$ time algorithm for the problem. This improves the previously best result of $O(m^2\log m+(n+m)\log(n+m))$ time for the weighted case of the problem, where every point of $P$ has a weight and the objective is to minimize the total weight of the hitting set. Our algorithm actually solves a more general line-separable problem with a single-intersection property: the points of $P$ and the disk centers are separated by a line $\ell$, and the boundaries of every two disks intersect at most once on the side of $\ell$ containing $P$.
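For concreteness, an exponential-time reference implementation of the objective (a checker for tiny instances only; it is unrelated to the algorithm above):

    import itertools

    def min_hitting_set(points, disks):
        """Smallest subset of points hitting every disk, by brute force.
        points: list of (x, y); disks: list of (cx, cy, r), with all centers
        on a common line in the line-constrained version."""
        def hits(p, d):
            return (p[0] - d[0]) ** 2 + (p[1] - d[1]) ** 2 <= d[2] ** 2
        for size in range(len(points) + 1):
            for subset in itertools.combinations(points, size):
                if all(any(hits(p, d) for p in subset) for d in disks):
                    return list(subset)
        return None  # some disk contains no point of P at all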
Given a set $P$ of $n$ points and a set $S$ of $n$ weighted disks in the plane, the disk coverage problem is to compute a subset of disks of smallest total weight such that the union of the disks in the subset covers all points of $P$. The problem is NP-hard. In this paper, we consider a line-separable unit-disk version of the problem where all disks have the same radius and their centers are separated from the points of $P$ by a line $\ell$. We present an $O(n^{3/2}\log^2 n)$ time algorithm for the problem. This improves the previously best result of $O(n^2\log n)$ time. Our result leads to an $O(n^{7/2}\log^2 n)$ time algorithm for the halfplane coverage problem (i.e., using $n$ weighted halfplanes to cover $n$ points), an improvement over the previous $O(n^4\log n)$ time solution. If all halfplanes are lower ones, our algorithm runs in $O(n^{3/2}\log^2 n)$ time, while the previous best algorithm takes $O(n^2\log n)$ time. Using duality, the hitting set problems under the same settings can be solved with similar time complexities.
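As a baseline for small instances (not the paper's method), the classic greedy set-cover heuristic specialized to weighted disks gives an $O(\log n)$-approximation:

    def greedy_disk_cover(points, disks):
        """Repeatedly pick the disk minimizing weight per newly covered point.
        points: list of (x, y); disks: list of (cx, cy, r, weight)."""
        def inside(p, d):
            return (p[0] - d[0]) ** 2 + (p[1] - d[1]) ** 2 <= d[2] ** 2
        uncovered, chosen = set(range(len(points))), []
        while uncovered:
            d, newly = min(
                ((d, {i for i in uncovered if inside(points[i], d)})
                 for d in disks),
                key=lambda t: t[0][3] / len(t[1]) if t[1] else float('inf'))
            if not newly:
                return None  # some point lies in no disk
            chosen.append(d)
            uncovered -= newly
        return chosen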
Given a (multi)graph $G$ which contains a bipartite subgraph with $\rho$ edges, what is the largest triangle-free subgraph of $G$ that can be found efficiently? We present an SDP-based algorithm that finds one with at least $0.8823 \rho$ edges, thus improving on the subgraph with $0.878 \rho$ edges obtained by the classic Max-Cut algorithm of Goemans and Williamson. On the other hand, by a reduction from H{\aa}stad's 3-bit PCP we show that it is NP-hard to find a triangle-free subgraph with $(25/26 + \epsilon) \rho \approx (0.961 + \epsilon) \rho$ edges. As an application, we classify the Maximum Promise Constraint Satisfaction Problem MaxPCSP($G$,$H$) for all bipartite $G$: given an input (multi)graph $X$ which admits a $G$-colouring satisfying $\rho$ edges, find an $H$-colouring of $X$ that satisfies $\rho$ edges. This problem is solvable in polynomial time, apart from trivial cases, if $H$ contains a triangle, and is NP-hard otherwise.
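For orientation, the rounding step shared by the classic Max-Cut algorithm and, suitably modified, the SDP algorithm above (the SDP solve that produces the unit vectors is omitted; this is the textbook Goemans-Williamson step, not this paper's new rounding):

    import numpy as np

    def hyperplane_round(vectors, edges, rng):
        """Random-hyperplane rounding: vectors[i] is the unit vector the SDP
        assigns to vertex i; returns the edges cut by a random hyperplane."""
        g = rng.normal(size=vectors.shape[1])
        side = vectors @ g > 0
        return [(u, v) for u, v in edges if side[u] != side[v]]

An edge whose endpoint vectors span angle $\theta$ is cut with probability $\theta/\pi \geq 0.878 \cdot (1-\cos\theta)/2$, which is the source of the $0.878$ factor mentioned above.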
Sorting has a natural generalization where the input consists of: (1) a ground set $X$ of size $n$, (2) a partial oracle $O_P$ specifying some fixed partial order $P$ on $X$, and (3) a linear oracle $O_L$ specifying a linear order $L$ that extends $P$. The goal is to recover the linear order $L$ on $X$ using the fewest number of linear oracle queries. In this problem, we measure algorithmic complexity through three metrics: oracle queries to $O_L$, oracle queries to $O_P$, and the time spent. Any algorithm requires worst-case $\log_2 e(P)$ linear oracle queries to recover the linear order on $X$, where $e(P)$ denotes the number of linear extensions of $P$. Kahn and Saks presented the first algorithm that uses $\Theta(\log e(P))$ linear oracle queries (using $O(n^2)$ partial oracle queries and exponential time). The state of the art for the general problem is by Cardinal, Fiorini, Joret, Jungers and Munro, who at STOC'10 managed to separate the linear and partial oracle queries into a preprocessing and a query phase. They can preprocess $P$ using $O(n^2)$ partial oracle queries and $O(n^{2.5})$ time. Then, given $O_L$, they uncover the linear order on $X$ in $\Theta(\log e(P))$ linear oracle queries and $O(n + \log e(P))$ time -- which is worst-case optimal in the number of linear oracle queries but not in the time spent. For $c \geq 1$, our algorithm can preprocess $O_P$ using $O(n^{1 + \frac{1}{c}})$ queries and time. Given $O_L$, we uncover $L$ using $\Theta(c \log e(P))$ queries and time. We show a matching lower bound: there exist positive constants $(\alpha, \beta)$ such that for any constant $c \geq 1$, any algorithm that uses at most $\alpha \cdot n^{1 + \frac{1}{c}}$ preprocessing queries must use, in the worst case, at least $\beta \cdot c \log e(P)$ linear oracle queries. Thus, we solve the problem of sorting under partial information with an algorithm that is asymptotically tight across all three metrics.
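To illustrate the query measure in the degenerate case $P = \emptyset$ (so $e(P) = n!$), binary-insertion sort recovers $L$ with roughly $\log_2 n! = \sum_i \log_2 i$ linear oracle queries, matching the information-theoretic bound; this toy sketch is ours, not the paper's algorithm.

    def sort_with_linear_oracle(xs, less):
        """Binary-insertion sort that touches the hidden linear order only
        through the oracle `less`; returns the sorted list and query count."""
        order, queries = [], 0
        for x in xs:
            lo, hi = 0, len(order)
            while lo < hi:  # binary search for the insertion point
                mid = (lo + hi) // 2
                queries += 1
                if less(order[mid], x):
                    lo = mid + 1
                else:
                    hi = mid
            order.insert(lo, x)
        return order, queries

    order, q = sort_with_linear_oracle([5, 3, 8, 1, 9, 2], lambda a, b: a < b)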
We consider the problem of sampling from the posterior distribution of a $d$-dimensional coefficient vector $\boldsymbol{\theta}$, given linear observations $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{\varepsilon}$. In general, such posteriors are multimodal, and therefore challenging to sample from. This observation has prompted the exploration of various heuristics that aim at approximating the posterior distribution. In this paper, we study a different approach based on decomposing the posterior distribution into a log-concave mixture of simple product measures. This decomposition allows us to reduce sampling from the multimodal distribution of interest to sampling from a log-concave one, which is tractable and has been investigated in detail. We prove that, under mild conditions on the prior, for random designs, such a measure decomposition is generally feasible when the number of samples per parameter $n/d$ exceeds a constant threshold. We thus obtain a provably efficient (polynomial-time) sampling algorithm in a regime where no such algorithm was previously known. Numerical simulations confirm that the algorithm is practical, and reveal that it has attractive statistical properties compared to state-of-the-art methods.
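The log-concave subproblems produced by such a decomposition can be handled by standard samplers; as a generic illustration (not the paper's algorithm), the unadjusted Langevin algorithm for a density $\pi$ with log-concave potential:

    import numpy as np

    def ula_sample(grad_log_pi, x0, step, n_iters, rng):
        """Unadjusted Langevin algorithm:
        x_{t+1} = x_t + step * grad log pi(x_t) + sqrt(2 * step) * N(0, I)."""
        x = x0.copy()
        for _ in range(n_iters):
            x += step * grad_log_pi(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
        return x

    # Example: approximate sampling from N(mu, I), where grad log pi(x) = mu - x.
    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    x = ula_sample(lambda x: mu - x, np.zeros(2), step=0.05, n_iters=2000, rng=rng)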