We consider the problem of privately estimating a parameter $\mathbb{E}[h(X_1,\dots,X_k)]$, where $X_1$, $X_2$, $\dots$, $X_k$ are i.i.d. data from some distribution and $h$ is a permutation-invariant function. Without privacy constraints, standard estimators are U-statistics, which commonly arise in a wide range of problems, including nonparametric signed rank tests, symmetry testing, uniformity testing, and subgraph counts in random networks, and can be shown to be minimum variance unbiased estimators under mild conditions. Despite the recent outpouring of interest in private mean estimation, privatizing U-statistics has received little attention. While existing private mean estimation algorithms can be applied to obtain confidence intervals, we show that they can lead to suboptimal private error, e.g., constant-factor inflation in the leading term, or even $\Theta(1/n)$ rather than $O(1/n^2)$ in degenerate settings. To remedy this, we propose a new thresholding-based approach using \emph{local H\'ajek projections} to reweight different subsets of the data. This leads to nearly optimal private error for non-degenerate U-statistics and a strong indication of near-optimality for degenerate U-statistics.
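To make the non-private baseline concrete, here is a minimal sketch of a U-statistic of order $k = 2$: the average of a symmetric kernel $h$ over all size-$k$ subsets of the sample. The Gini-mean-difference kernel is an illustrative choice of ours, not one taken from the paper, and the private reweighting mechanism itself is not shown.

```python
# A minimal sketch of the non-private estimator: a U-statistic of order
# k = 2 averaging a symmetric kernel h over all pairs of sample points.
from itertools import combinations
import numpy as np

def u_statistic(data, h, k=2):
    """Average of h over all size-k subsets of the sample."""
    subsets = list(combinations(data, k))
    return sum(h(*s) for s in subsets) / len(subsets)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
# Illustrative kernel: Gini mean difference, h(a, b) = |a - b|.
print(u_statistic(x, lambda a, b: abs(a - b)))
```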
Given a Boolean function $f:\{0,1\}^n\to\{0,1\}$, the goal in the usual query model is to compute $f$ on an unknown input $x \in \{0,1\}^n$ while minimizing the number of queries to $x$. One can also consider a "distinguishing" problem denoted by $f_{\mathsf{sab}}$: given an input $x \in f^{-1}(0)$ and an input $y \in f^{-1}(1)$, either all differing locations are replaced by a $*$, or all differing locations are replaced by a $\dagger$, and an algorithm's goal is to identify which of these is the case while minimizing the number of queries. Ben-David and Kothari [ToC'18] defined the randomized sabotage complexity of a Boolean function as the zero-error randomized query complexity of $f_{\mathsf{sab}}$. A natural follow-up question is to understand $\mathsf{Q}(f_{\mathsf{sab}})$, the quantum query complexity of $f_{\mathsf{sab}}$. In this paper, we initiate a systematic study of this question. The following are our main results: $\bullet\;\;$ If we have additional query access to $x$ and $y$, then $\mathsf{Q}(f_{\mathsf{sab}})=O(\min\{\mathsf{Q}(f),\sqrt{n}\})$. $\bullet\;\;$ If an algorithm is also required to output a differing index of a 0-input and a 1-input, then $\mathsf{Q}(f_{\mathsf{sab}})=O(\min\{\mathsf{Q}(f)^{1.5},\sqrt{n}\})$. $\bullet\;\;$ $\mathsf{Q}(f_{\mathsf{sab}}) = \Omega(\sqrt{\mathsf{fbs}(f)})$, where $\mathsf{fbs}(f)$ denotes the fractional block sensitivity of $f$. By known results, along with the results in the previous bullets, this implies that $\mathsf{Q}(f_{\mathsf{sab}})$ is polynomially related to $\mathsf{Q}(f)$. $\bullet\;\;$ The bound above is easily seen to be tight for standard functions such as And, Or, Majority, and Parity. We show that when $f$ is the Indexing function, $\mathsf{Q}(f_{\mathsf{sab}})=\Theta(\mathsf{fbs}(f))$, ruling out the possibility that $\mathsf{Q}(f_{\mathsf{sab}})=\Theta(\sqrt{\mathsf{fbs}(f)})$ for all $f$.
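The sabotaged input is easy to illustrate classically. The sketch below is our own illustration, not from the paper: every position where $x$ and $y$ differ is masked by the same symbol, so distinguishing $*$ from $\dagger$ forces any algorithm to query some differing index.

```python
# Constructing a sabotaged input: all positions where x and y differ are
# replaced by one common symbol; the task is to tell '*' from '†'.
def sabotage(x, y, symbol):
    assert len(x) == len(y) and symbol in "*†"
    return "".join(symbol if a != b else a for a, b in zip(x, y))

x, y = "0010", "0110"          # assume f(x) = 0 and f(y) = 1 for some f
print(sabotage(x, y, "*"))     # 0*10
print(sabotage(x, y, "†"))     # 0†10
```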
We provide a perfect sampling algorithm for the hard-sphere model on subsets of $\mathbb{R}^d$ with expected running time linear in the volume under the assumption of strong spatial mixing. Many perfect and approximate sampling algorithms have been devised for the hard-sphere model; ours is efficient for a range of parameters for which only efficient approximate samplers were previously known, and it is faster than these approximate approaches. Our methods also extend to the more general setting of Gibbs point processes interacting via finite-range, repulsive potentials.
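For contrast with the paper's approach, here is the naive perfect sampler: rejection from a Poisson point process. It produces exact samples, but its expected running time grows exponentially with the volume rather than linearly; the parameter names are our own.

```python
# Brute-force perfect sampling of the hard-sphere model on [0, L]^2 by
# rejection from a Poisson process: exact, but exponentially slow in the
# volume -- precisely what a strong-spatial-mixing approach avoids.
import numpy as np

def hard_sphere_rejection(lam, r, L, rng):
    while True:
        n = rng.poisson(lam * L * L)                 # Poisson(lambda * volume)
        pts = rng.uniform(0, L, size=(n, 2))
        d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)
        if (d2 >= (2 * r) ** 2).all():               # accept iff no overlap
            return pts

rng = np.random.default_rng(1)
print(len(hard_sphere_rejection(lam=0.5, r=0.05, L=1.0, rng=rng)))
```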
We give a simple algorithm for the dynamic approximate All-Pairs Shortest Paths (APSP) problem. Given a graph $G = (V, E, l)$ with polynomially bounded edge lengths, our data structure processes $|E|$ edge insertions and deletions in total time $|E|^{1 + o(1)}$ and provides query access to $|E|^{o(1)}$-approximate distances in time $\tilde{O}(1)$ per query. We produce a data structure that mimics Thorup-Zwick distance oracles [TZ'05], but is dynamic and deterministic. Our algorithm selects a small number of pivot vertices. Then, for every other vertex, it reduces distance computation to maintaining distances to a small neighborhood around that vertex and to the nearest pivot. We maintain distances between pivots efficiently by representing them in a smaller graph and recursing. We construct these smaller graphs by (a) reducing vertex count using the dynamic distance-preserving core graphs of Kyng-Meierhans-Probst Gutenberg [KMPG'24] in a black-box manner and (b) reducing edge count using a dynamic spanner akin to Chen-Kyng-Liu-Meierhans-Probst Gutenberg [CKL+'24]. Our dynamic spanner internally uses an APSP data structure. Choosing a large enough size reduction factor in the first step allows us to simultaneously bootstrap our spanner and a dynamic APSP data structure. Notably, our approach does not need expander graphs, an otherwise ubiquitous tool in derandomization.
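A static, heavily simplified sketch of the pivot idea follows; it is our illustration, not the paper's dynamic, deterministic, recursive structure. Pivots are sampled, each vertex records its nearest pivot, and a query is answered by routing through the two pivots, which upper-bounds the true distance.

```python
# Static pivot-based distance estimation on an unweighted graph:
# answer d(u, v) by d(u, p(u)) + d(p(u), p(v)) + d(p(v), v).
from collections import deque
import random

def bfs(adj, s):
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def build(adj, num_pivots, rng):
    pivots = rng.sample(list(adj), num_pivots)
    dist_from = {p: bfs(adj, p) for p in pivots}     # pivot -> all distances
    nearest = {v: min(pivots, key=lambda p: dist_from[p].get(v, float("inf")))
               for v in adj}
    return dist_from, nearest

def query(u, v, dist_from, nearest):
    pu, pv = nearest[u], nearest[v]
    return dist_from[pu][u] + dist_from[pu].get(pv, float("inf")) + dist_from[pv][v]

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}   # a path graph
dist_from, nearest = build(adj, 2, random.Random(0))
print(query(0, 4, dist_from, nearest))                     # upper bound on d(0, 4)
```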
In streaming PCA, we see a stream of vectors $x_1, \dotsc, x_n \in \mathbb{R}^d$ and want to estimate the top eigenvector of their covariance matrix. This is easier if the spectral ratio $R = \lambda_1 / \lambda_2$ is large. We ask: how large does $R$ need to be to solve streaming PCA in $\widetilde{O}(d)$ space? Existing algorithms require $R = \widetilde{\Omega}(d)$. We show: (1) For all mergeable summaries, $R = \widetilde{\Omega}(\sqrt{d})$ is necessary. (2) In the insertion-only model, a variant of Oja's algorithm gets $o(1)$ error for $R = O(\log n \log d)$. (3) No algorithm with $o(d^2)$ space gets $o(1)$ error for $R = O(1)$. Our analysis is the first application of Oja's algorithm to adversarial streams. It is also the first algorithm for adversarial streaming PCA that is designed for a spectral, rather than Frobenius, bound on the tail; and the bound it needs is exponentially better than is possible by adapting a Frobenius guarantee.
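For reference, a minimal version of Oja's algorithm: one pass, $O(d)$ memory, a single normalized iterate updated per stream vector. The fixed step size below is an illustrative choice of ours; the paper's variant and its adversarial-stream analysis are more delicate.

```python
# Oja's algorithm for streaming PCA: maintain one unit vector and take a
# stochastic power-iteration step per observed vector.
import numpy as np

def oja(stream, d, eta=0.01, rng=None):
    rng = rng or np.random.default_rng(0)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for x in stream:
        w += eta * x * (x @ w)        # step toward the top eigenvector
        w /= np.linalg.norm(w)
    return w

rng = np.random.default_rng(1)
cov = np.diag([10.0, 1.0, 1.0])       # spectral ratio R = 10
xs = rng.multivariate_normal(np.zeros(3), cov, size=5000)
w = oja(xs, d=3)
print(np.abs(w[0]))                   # should be close to 1: aligned with e_1
```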
We introduce the class $(E_\alpha)_{\alpha \in [-\infty,1)}$ of reverse map projection embeddings, each defining a new method of encoding classical data into quantum states. Inspired by well-known map projections from the unit sphere onto its tangent planes, used in practice in cartography, these embeddings address a common drawback of the amplitude embedding method, wherein scalar multiples of a data point are identified with one another and information about the norm of the data is lost. We show how reverse map projections can be utilised as equivariant embeddings for quantum machine learning. Using these methods, we can leverage symmetries in classical datasets to significantly strengthen performance on quantum machine learning tasks. Finally, we select four values of $\alpha$ with which to perform a simple classification task, taking $E_\alpha$ as the embedding and experimenting with both equivariant and non-equivariant setups. We compare their results with those of standard amplitude embedding.
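The norm-loss drawback is easy to see in code. The sketch below is our illustration, not the paper's $E_\alpha$ family: appending one extra "pole" coordinate before normalizing, in the spirit of an inverse map projection onto the sphere, keeps scalar multiples of a data point distinguishable.

```python
# Amplitude embedding identifies x with c*x; adding one component before
# normalizing (an illustrative norm-preserving variant) does not.
import numpy as np

def amplitude_embedding(x):
    return x / np.linalg.norm(x)                  # forgets ||x||

def norm_preserving_embedding(x, scale=1.0):
    v = np.append(x, scale)                       # extra "pole" coordinate
    return v / np.linalg.norm(v)                  # ||x|| survives in the ratio

x = np.array([3.0, 4.0])
print(amplitude_embedding(x), amplitude_embedding(2 * x))              # identical
print(norm_preserving_embedding(x), norm_preserving_embedding(2 * x))  # differ
```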
We study the problem of estimating the partition function $Z(\beta) = \sum_{x \in \Omega} \exp[-\beta \cdot H(x)]$ of a Gibbs distribution defined by a Hamiltonian $H(\cdot)$. It is well known that the partition function $Z(\beta)$ can be well approximated by the simulated annealing method, assuming a sampling oracle that can generate samples according to the Gibbs distribution of any given inverse temperature $\beta$. This method yields the most efficient reductions from counting to sampling, including: $\bullet$ classic non-adaptive (parallel) algorithms with sub-optimal cost [DFK89; Bez+08]; $\bullet$ adaptive (sequential) algorithms with near-optimal cost [SVV09; Hub15; Kol18; HK23]. In this paper, we give an algorithm that achieves efficiency in both parallelism and total work. Specifically, it provides a reduction from counting to sampling using near-optimal total work and logarithmic depth of computation. Consequently, it gives work-efficient parallel counting algorithms for several important models, including the hardcore and Ising models in the uniqueness regime.
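A minimal sequential (non-parallel) sketch of the underlying annealing identity, $Z(\beta) = Z(0) \cdot \prod_i \mathbb{E}_{x \sim \mu_{\beta_i}}[\exp(-(\beta_{i+1}-\beta_i) H(x))]$, is given below. The toy "sampling oracle" samples each Gibbs distribution exactly by enumeration, which is only feasible for tiny state spaces; the paper's contribution is achieving near-optimal total work at logarithmic depth.

```python
# Simulated-annealing estimation of Z(beta) as a telescoping product of
# ratios, each estimated with samples from the Gibbs oracle at beta_i.
import numpy as np

def anneal_estimate(H_vals, betas, samples, rng):
    Z = float(len(H_vals))                             # Z(0) = |Omega|
    for b0, b1 in zip(betas[:-1], betas[1:]):
        p = np.exp(-b0 * H_vals)
        p /= p.sum()                                   # Gibbs oracle at b0
        xs = rng.choice(len(H_vals), size=samples, p=p)
        Z *= np.exp(-(b1 - b0) * H_vals[xs]).mean()    # one telescoping ratio
    return Z

rng = np.random.default_rng(0)
H_vals = rng.integers(0, 5, size=64).astype(float)     # toy Hamiltonian
betas = np.linspace(0.0, 1.0, 21)
print(anneal_estimate(H_vals, betas, 2000, rng),
      np.exp(-1.0 * H_vals).sum())                     # estimate vs. exact Z(1)
```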
Complexity classes such as $\#\mathbf{P}$, $\oplus\mathbf{P}$, $\mathbf{GapP}$, $\mathbf{OptP}$, $\mathbf{NPMV}$, or the class of fuzzy languages realised by polynomial-time fuzzy nondeterministic Turing machines, can all be described in terms of a class $\mathbf{NP}[S]$ for a suitable semiring $S$, defined via weighted Turing machines over $S$ in the same way as $\mathbf{NP}$ is defined via classical nondeterministic Turing machines. Other complexity classes of decision problems can be lifted to the quantitative world using the same recipe as well, and the resulting classes relate to the original ones in the same way as weighted automata or logics relate to their unweighted counterparts. The article surveys these little-known connections between weighted automata theory and computational complexity theory implicit in the existing literature, suggests a systematic approach to the study of weighted complexity classes, and presents several new observations strengthening the relation between the two fields. In particular, it is proved that a natural extension of the Boolean satisfiability problem to weighted propositional logic is complete for the class $\mathbf{NP}[S]$ when $S$ is a finitely generated semiring. Moreover, a class of semiring-valued functions $\mathbf{FP}[S]$ is introduced for each semiring $S$ as a counterpart to the class $\mathbf{P}$, and the relations between $\mathbf{FP}[S]$ and $\mathbf{NP}[S]$ are considered.
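A toy illustration of the semiring viewpoint, ours rather than the article's: brute-force "weighted SAT", where each satisfying assignment contributes the semiring product of its variable weights, and contributions are combined with the semiring sum. The natural numbers with $(+, \times)$ recover #SAT; the Boolean semiring recovers SAT.

```python
# Weighted SAT over an arbitrary semiring (add, mul, zero, one): sum, in
# the semiring, of the weights of all satisfying assignments.
from itertools import product

def weighted_sat(clauses, weights, add, mul, zero, one):
    total = zero
    for assign in product([0, 1], repeat=len(weights)):
        if all(any(assign[v] == b for v, b in clause) for clause in clauses):
            w = one
            for v, b in enumerate(assign):
                w = mul(w, weights[v][b])
            total = add(total, w)
    return total

# (x1 or x2) and ((not x1) or x2); clauses as (variable, satisfying value) pairs
clauses = [[(0, 1), (1, 1)], [(0, 0), (1, 1)]]
print(weighted_sat(clauses, [(1, 1), (1, 1)],
                   int.__add__, int.__mul__, 0, 1))       # 2, i.e. #SAT
print(weighted_sat(clauses, [(True, True)] * 2,
                   bool.__or__, bool.__and__, False, True))  # True, i.e. SAT
```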
The orthogonality dimension of a graph over $\mathbb{R}$ is the smallest integer $d$ for which one can assign to every vertex a nonzero vector in $\mathbb{R}^d$ such that every two adjacent vertices receive orthogonal vectors. For an integer $d$, the $d$-Ortho-Dim$_\mathbb{R}$ problem asks to decide whether the orthogonality dimension of a given graph over $\mathbb{R}$ is at most $d$. We prove that for every integer $d \geq 3$, the $d$-Ortho-Dim$_\mathbb{R}$ problem parameterized by the vertex cover number $k$ admits a kernel with $O(k^{d-1})$ vertices and bit-size $O(k^{d-1} \cdot \log k)$. We complement this result by a nearly matching lower bound, showing that for any $\varepsilon > 0$, the problem admits no kernel of bit-size $O(k^{d-1-\varepsilon})$ unless $\mathsf{NP} \subseteq \mathsf{coNP/poly}$. We further study the kernelizability of orthogonality dimension problems in additional settings, including over general fields and under various structural parameterizations.
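For concreteness, here is a small checker, of our own devising, for the object being decided: an assignment of nonzero vectors in $\mathbb{R}^d$ in which adjacent vertices receive orthogonal vectors, witnessing orthogonality dimension at most $d$.

```python
# Verify an orthogonal representation: all vectors nonzero, and adjacent
# vertices orthogonal. The 4-cycle admits one in dimension 2.
import numpy as np

def is_orthogonal_representation(edges, vecs, tol=1e-9):
    if any(np.linalg.norm(v) < tol for v in vecs.values()):
        return False
    return all(abs(np.dot(vecs[u], vecs[v])) < tol for u, v in edges)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                 # the 4-cycle C_4
vecs = {0: np.array([1., 0.]), 1: np.array([0., 1.]),
        2: np.array([1., 0.]), 3: np.array([0., 1.])}
print(is_orthogonal_representation(edges, vecs))          # True: dimension <= 2
```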
We consider the problem of linearizing a pseudo-Boolean function $f : \{0,1\}^n \to \mathbb{R}$ by means of $k$ Boolean functions. Such a linearization yields an integer linear programming formulation with only $k$ auxiliary variables. This motivates defining the linearization complexity of $f$ as the minimum such $k$. Our theoretical contributions are a proof that random polynomials almost surely have high linearization complexity, and characterizations of its value depending on whether or not the set of admissible Boolean functions is restricted. We demonstrate the practical relevance by devising and evaluating integer linear programming models of two such linearizations for the low auto-correlation binary sequences problem. Still, many problems around this new concept remain open.
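For context, the textbook linearization that motivates the definition, not the paper's minimum-$k$ construction: each monomial $\prod_{i \in S} x_i$ gets one auxiliary binary variable $y_S$, with linear constraints forcing $y_S$ to equal the product.

```python
# Standard ILP linearization of a monomial: constraints (with y binary)
# forcing y = prod of the term's variables.
def linearize_monomial(term_vars, y):
    """Return ILP constraints (as strings) encoding y = prod of term_vars."""
    cons = [f"{y} <= {x}" for x in term_vars]          # y = 0 if any x_i = 0
    cons.append(f"{y} >= {' + '.join(term_vars)} - {len(term_vars) - 1}")
    return cons

# f(x) = 3*x1*x2*x3 - 2*x1*x3  becomes linear: 3*y123 - 2*y13, subject to:
for c in linearize_monomial(["x1", "x2", "x3"], "y123"):
    print(c)
for c in linearize_monomial(["x1", "x3"], "y13"):
    print(c)
```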
The problem of column subset selection asks for a subset of columns from an input matrix such that the matrix can be reconstructed as accurately as possible within the span of the selected columns. A natural extension is to consider a setting where the matrix rows are partitioned into two groups, and the goal is to choose a subset of columns that minimizes the maximum reconstruction error of the two groups, relative to their respective best rank-$k$ approximations. Extending the known results on column subset selection to this fair setting is not straightforward: in certain scenarios it is unavoidable to choose columns separately for each group, doubling the expected column count. We propose a deterministic leverage-score sampling strategy for the fair setting and show that sampling a column subset of minimum size becomes NP-hard in the presence of two groups. Despite these negative results, we give an approximation algorithm that guarantees a solution within 1.5 times the optimal solution size. We also present practical heuristic algorithms based on rank-revealing QR factorization. Finally, we validate our methods through an extensive set of experiments using real-world data.
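For background, a single-group deterministic selection sketch based on rank-$k$ leverage scores; this is our simplified stand-in, not the paper's fair strategy, which balances the error of two row groups (and whose heuristics use rank-revealing QR instead).

```python
# Deterministic column selection by rank-k leverage scores: score each
# column by its squared mass in the top-k right singular subspace and
# keep the c highest-scoring columns.
import numpy as np

def leverage_scores(A, k):
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return (Vt[:k] ** 2).sum(axis=0)           # rank-k score per column

def select_columns(A, k, c):
    idx = np.argsort(-leverage_scores(A, k))[:c]
    return np.sort(idx)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 8)) @ rng.normal(size=(8, 30))   # roughly rank 8
print(select_columns(A, k=4, c=6))
```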