Given a set of $n$ sites from $\mathbb{R}^d$, each having some positive weight factor, the Multiplicatively Weighted Voronoi Diagram is a subdivision of space that associates each cell to the site whose weighted Euclidean distance is minimal for all points in the cell. We give novel approximation algorithms that output a cube-based subdivision such that the weighted distance of a point with respect to the associated site is at most $(1+\varepsilon)$ times the minimum weighted distance, for any fixed parameter $\varepsilon \in (0,1)$. The diagram size is $O_d(n \log(1/\varepsilon)/\varepsilon^{d-1})$ and the construction time is within an $O_d(\log(n)/\varepsilon^{(d+5)/2})$-factor of the size bound. We also prove a matching lower bound for the size, showing that the proposed method is the first to achieve \emph{optimal size}, up to $\Theta(1)^d$-factors. In particular, the obscure $\log(1/\varepsilon)$ factor is unavoidable. As a by-product, we obtain a factor $d^{O(d)}$ improvement in size for the unweighted case and $O(d \log(n) + d^2 \log(1/\varepsilon))$ point-location time in the subdivision, improving the known query bound by one $d$-factor. The key ingredients of our approximation algorithms are the study of convex regions that we call cores, an adaptive refinement algorithm to obtain optimal size, and a novel notion of \emph{bisector coresets}, which may be of independent interest. In particular, we show that coresets with $O_d(1/\varepsilon^{(d+3)/2})$ worst-case size can be computed in near-linear time.
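To make the approximation guarantee concrete, the following is a minimal brute-force sketch of the exact query and of the $(1+\varepsilon)$ condition. It assumes the convention that the weighted distance from a point $p$ to a site $s$ with weight $w_s$ is $w_s \cdot \lVert p - s \rVert$ (the abstract does not fix the convention), and the function names are hypothetical. It is not the cube-based subdivision described above, only a reference check against which such a subdivision could be validated.

```python
from math import dist


def weighted_distance(p, site, weight):
    # Assumed convention: Euclidean distance scaled by the site's weight.
    return weight * dist(p, site)


def nearest_site(p, sites, weights):
    # Exact answer by brute force: index of the site of minimum weighted distance.
    return min(
        range(len(sites)),
        key=lambda i: weighted_distance(p, sites[i], weights[i]),
    )


def is_eps_approximate(p, i, sites, weights, eps):
    # True if site i is within a (1 + eps) factor of the minimum weighted distance.
    best = min(weighted_distance(p, s, w) for s, w in zip(sites, weights))
    return weighted_distance(p, sites[i], weights[i]) <= (1 + eps) * best
```

A subdivision cell that reports site `i` for every point `p` it contains is acceptable exactly when `is_eps_approximate` holds for all such `p`.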
Let $S$ be a set of $n$ sites in the plane, so that every site $s \in S$ has an associated radius $r_s > 0$. Let $\mathcal{D}(S)$ be the disk intersection graph defined by $S$, i.e., the graph with vertex set $S$ and an edge between two distinct sites $s, t \in S$ if and only if the disks with centers $s$, $t$ and radii $r_s$, $r_t$ intersect. Our goal is to design data structures that maintain the connectivity structure of $\mathcal{D}(S)$ as sites are inserted and/or deleted in $S$.
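The definition of $\mathcal{D}(S)$ can be illustrated with a static, quadratic-time sketch: two closed disks intersect exactly when the distance between their centers is at most $r_s + r_t$, and a union-find structure then yields the connected components. This is only a baseline for a fixed set of sites, not one of the dynamic data structures the abstract refers to; the function name and the `(x, y, r)` input format are assumptions.

```python
from itertools import combinations
from math import hypot


def disk_graph_components(sites):
    """sites: list of (x, y, r). Returns one component label per site."""
    parent = list(range(len(sites)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Test every pair: closed disks intersect iff center distance <= sum of radii.
    for i, j in combinations(range(len(sites)), 2):
        (xi, yi, ri), (xj, yj, rj) = sites[i], sites[j]
        if hypot(xi - xj, yi - yj) <= ri + rj:
            parent[find(i)] = find(j)

    return [find(i) for i in range(len(sites))]
```

Two sites are connected in $\mathcal{D}(S)$ exactly when they receive the same label.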
Posterior sampling, i.e., using the exponential mechanism to sample from the posterior distribution, provides $\varepsilon$-pure differential privacy (DP) guarantees and does not suffer from the potentially unbounded privacy breach introduced by $(\varepsilon,\delta)$-approximate DP. In practice, however, one needs to apply approximate sampling methods such as Markov chain Monte Carlo (MCMC), thus re-introducing the unappealing $\delta$-approximation error into the privacy guarantees. To bridge this gap, we propose the Approximate SAmple Perturbation (abbr. ASAP) algorithm which perturbs an MCMC sample with noise proportional to its Wasserstein-infinity ($W_\infty$) distance from a reference distribution that satisfies pure DP or pure Gaussian DP (i.e., $\delta=0$). We then leverage a Metropolis-Hastings algorithm to generate the sample and prove that the algorithm converges in $W_\infty$ distance. We show that by combining our new techniques with a localization step, we obtain the first nearly linear-time algorithm that achieves the optimal rates in the DP-ERM problem with strongly convex and smooth losses.
We study the complexity of a fundamental algorithm for fairly allocating indivisible items, the round-robin algorithm. For $n$ agents and $m$ items, we show that the algorithm can be implemented in time $O(nm\log(m/n))$ in the worst case. If the agents' preferences are uniformly random, we establish an improved (expected) running time of $O(nm + m\log m)$. On the other hand, assuming comparison queries between items, we prove that $\Omega(nm + m\log m)$ queries are necessary to implement the algorithm, even when randomization is allowed. We also derive bounds in noise models where the answers to queries are incorrect with some probability. Our proofs involve novel applications of tools from multi-armed bandits, information theory, as well as posets and linear extensions.
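For reference, the round-robin procedure itself can be sketched as follows: agents take turns in a fixed order, and each picks their most-preferred item among those still available. The sketch assumes each agent's full ranking is already given as a sorted list, which sidesteps the comparison-query model analyzed in the abstract; the function name and input format are hypothetical.

```python
def round_robin(rankings, m):
    """rankings[i] lists all items 0..m-1 in agent i's order, best first."""
    n = len(rankings)
    taken = [False] * m
    pointer = [0] * n  # position of each agent's next candidate in their ranking
    bundles = [[] for _ in range(n)]

    for turn in range(m):
        i = turn % n
        # Skip items that earlier picks have already removed.
        while taken[rankings[i][pointer[i]]]:
            pointer[i] += 1
        item = rankings[i][pointer[i]]
        taken[item] = True
        bundles[i].append(item)

    return bundles
```

Because each pointer only moves forward, the loop does $O(nm)$ work once the rankings are sorted; the cost of obtaining the needed order information is where the bounds stated above come in.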
In this paper, the key objects of interest are the sequential covariance matrices $\mathbf{S}_{n,t}$ and their largest eigenvalues. Here, the matrix $\mathbf{S}_{n,t}$ is computed as the empirical covariance associated with observations $\{\mathbf{x}_1,\ldots,\mathbf{x}_{ \lfloor nt \rfloor } \}$, for $t\in [0,1]$. The observations $\mathbf{x}_1,\ldots,\mathbf{x}_n$ are assumed to be i.i.d. $p$-dimensional vectors with zero mean, and a covariance matrix that is a fixed-rank perturbation of the identity matrix. Treating $\{ \mathbf{S}_{n,t}\}_{t \in [0,1]}$ as a matrix-valued stochastic process indexed by $t$, we study the behavior of the largest eigenvalues of $\mathbf{S}_{n,t}$, as $t$ varies, with $n$ and $p$ increasing simultaneously, so that $p/n \to y \in (0,1)$. As a key contribution of this work, we establish the weak convergence of the stochastic process corresponding to the sample spiked eigenvalues, if their population counterparts exceed the critical phase-transition threshold. Our analysis of the limiting process is comprehensive, revealing, in general, non-Gaussian limiting processes. As an application, we consider a class of change-point problems, where the interest is in detecting structural breaks in the covariance caused by a change in magnitude of the spiked eigenvalues. For this purpose, we propose two different maximal statistics corresponding to centered spiked eigenvalues of the sequential covariances. We show the existence of limiting null distributions for these statistics, and prove consistency of the test under fixed alternatives. Moreover, we compare the behavior of the proposed tests through a simulation study.
Symbolic Regression (SR) is a task which aims to extract the mathematical expression underlying a set of empirical observations. Transformer-based methods trained on SR datasets hold the current state of the art in this task, while the application of Large Language Models (LLMs) to SR remains unexplored. This work investigates the integration of pre-trained LLMs into the SR pipeline, utilizing an approach that iteratively refines a functional form based on the prediction error it achieves on the observation set, until it reaches convergence. Our method leverages LLMs to propose an initial set of possible functions based on the observations, exploiting their strong pre-training prior. These functions are then iteratively refined by the model itself and by an external optimizer for their coefficients. The process is repeated until the results are satisfactory. We then analyze Vision-Language Models in this context, exploring the inclusion of plots as visual inputs to aid the optimization process. Our findings reveal that LLMs are able to successfully recover good symbolic equations that fit the given data, outperforming SR baselines based on Genetic Programming, with the addition of images in the input showing promising results for the most complex benchmarks.
We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number.
A minimal perfect hash function (MPHF) maps a set of n keys to {1, ..., n} without collisions. Such functions find widespread application, e.g., in bioinformatics and databases. In this paper we revisit PTHash - a construction technique particularly designed for fast queries. PTHash distributes the input keys into small buckets and, for each bucket, it searches for a hash function seed that places its keys in the output domain without collisions. The collection of all seeds is then stored in a compressed way. Since the first buckets are easier to place, buckets are considered in non-increasing order of size. Additionally, PTHash heuristically produces an imbalanced distribution of bucket sizes by distributing 60% of the keys into 30% of the buckets. Our main contribution is to characterize, up to lower order terms, an optimal distribution of expected bucket sizes. We arrive at a simple, closed form solution which improves construction throughput for space efficient configurations in practice. Our second contribution is a novel encoding scheme for the seeds. We split the keys into partitions. Within each partition, we run the bucket distribution and search step. We then store the seeds in an interleaved way by consecutively placing the seeds for the i-th buckets from all partitions. The seeds for the i-th bucket of each partition follow the same statistical distribution. This allows us to tune a compressor for each bucket. Hence, we call our technique PHOBIC - Perfect Hashing with Optimized Bucket sizes and Interleaved Coding. Compared to PTHash, PHOBIC is 0.17 bits/key more space efficient for the same query time and construction throughput. We also contribute a GPU implementation to further accelerate MPHF construction. For a configuration with fast queries, PHOBIC-GPU can construct a perfect hash function at 2.17 bits/key in 28 ns per key, which can be queried in 37 ns on the CPU.
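The bucket-and-seed search described above can be illustrated with a toy sketch. It is not the PTHash or PHOBIC implementation: it uses uniform bucket assignment rather than the imbalanced or optimized distributions, stores seeds uncompressed, maps to {0, ..., n-1} instead of {1, ..., n}, and relies on a hypothetical helper `h` built on `hashlib.blake2b`. It only shows the idea of processing buckets in non-increasing order of size and brute-forcing a collision-free seed for each.

```python
import hashlib
from itertools import count


def h(key, seed):
    # Hypothetical seeded hash for illustration; any good seeded hash would do.
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build(keys, avg_bucket_size=3):
    """Return one seed per bucket such that all keys land in distinct slots."""
    n = len(keys)
    num_buckets = max(1, n // avg_bucket_size)
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[h(k, "bucket") % num_buckets].append(k)

    taken = [False] * n
    seeds = [0] * num_buckets
    # Largest buckets first, while many output slots are still free.
    for b in sorted(range(num_buckets), key=lambda b: -len(buckets[b])):
        for seed in count():
            slots = [h(k, seed) % n for k in buckets[b]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
        for s in slots:
            taken[s] = True
        seeds[b] = seed
    return seeds


def query(key, seeds, n):
    # Look up the key's bucket seed, then hash the key with that seed.
    return h(key, seeds[h(key, "bucket") % len(seeds)]) % n
```

A query costs two hash evaluations and one seed lookup, which is the property that makes this family of constructions attractive for fast queries; the space cost is dominated by how compactly the `seeds` array can be encoded.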
We construct $n$-node graphs on which any $O(n)$-size spanner has additive error at least $+\Omega(n^{3/17})$, improving on the previous best lower bound of $\Omega(n^{1/7})$ [Bodwin-Hoppenworth FOCS '22]. Our construction completes the first two steps of a particular three-step research program, introduced in prior work and overviewed here, aimed at producing tight bounds for the problem by aligning aspects of the upper and lower bound constructions. More specifically, we develop techniques that enable the use of inner graphs in the lower bound framework whose technical properties are provably tight with the corresponding assumptions made in the upper bounds. As an additional application of our techniques, we improve the corresponding lower bound for $O(n)$-size additive emulators to $+\Omega(n^{1/14})$.
We consider the task of locally correcting, and locally list-correcting, multivariate linear functions over the domain $\{0,1\}^n$ over arbitrary fields and more generally Abelian groups. Such functions form error-correcting codes of relative distance $1/2$ and we give local-correction algorithms correcting up to nearly $1/4$-fraction errors making $\widetilde{\mathcal{O}}(\log n)$ queries. This query complexity is optimal up to $\mathrm{poly}(\log\log n)$ factors. We also give local list-correcting algorithms correcting $(1/2 - \varepsilon)$-fraction errors with $\widetilde{\mathcal{O}}_{\varepsilon}(\log n)$ queries. These results may be viewed as natural generalizations of the classical work of Goldreich and Levin whose work addresses the special case where the underlying group is $\mathbb{Z}_2$. By extending to the case where the underlying group is, say, the reals, we give the first non-trivial locally correctable codes (LCCs) over the reals (with query complexity being sublinear in the dimension (also known as message length)). The central challenge in constructing the local corrector is constructing "nearly balanced vectors" over $\{-1,1\}^n$ that span $1^n$ -- we show how to construct $\mathcal{O}(\log n)$ vectors that do so, with entries in each vector summing to $\pm1$. The challenge to the local-list-correction algorithms, given the local corrector, is principally combinatorial, i.e., in proving that the number of linear functions within any Hamming ball of radius $(1/2-\varepsilon)$ is $\mathcal{O}_{\varepsilon}(1)$. Getting this general result covering every Abelian group requires integrating a variety of known methods with some new combinatorial ingredients analyzing the structural properties of codewords that lie within small Hamming balls.
We propose an exact algorithm for the Graph Burning Problem ($\texttt{GBP}$), an NP-hard optimization problem that models the spread of influence on social networks. Given a graph $G$ with vertex set $V$, the objective is to find a sequence of $k$ vertices in $V$, namely, $v_1, v_2, \dots, v_k$, such that $k$ is minimum and $\bigcup_{i = 1}^{k} \{u\! \in\! V\! : d(u, v_i) \leq k - i\} = V$, where $d(u,v)$ denotes the distance between $u$ and $v$. We formulate the problem as a set covering integer programming model and design a row generation algorithm for the $\texttt{GBP}$. Our method exploits the fact that a very small number of covering constraints is often sufficient for solving the integer model, allowing the corresponding rows to be generated on demand. To date, the most efficient exact algorithm for the $\texttt{GBP}$, denoted here by $\texttt{GDCA}$, is able to obtain optimal solutions for graphs with up to 14,000 vertices within two hours of execution. In comparison, our algorithm finds provably optimal solutions approximately 236 times faster, on average, than $\texttt{GDCA}$. For larger graphs, memory space becomes a limiting factor for $\texttt{GDCA}$. Our algorithm, however, solves real-world instances with almost 200,000 vertices in less than 35 seconds, increasing the size of graphs for which optimal solutions are known by a factor of 14.
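The covering condition in the problem definition can be checked directly. The sketch below verifies whether a given sequence $v_1, \dots, v_k$ satisfies $\bigcup_{i=1}^{k} \{u \in V : d(u, v_i) \leq k - i\} = V$ by running a depth-limited breadth-first search from each $v_i$. It is only a feasibility check under an assumed adjacency-dictionary input, not the row generation algorithm or $\texttt{GDCA}$; the function name is hypothetical.

```python
from collections import deque


def is_burning_sequence(adj, seq):
    """adj: dict mapping each vertex to an iterable of neighbors (undirected)."""
    k = len(seq)
    burned = set()
    for i, v in enumerate(seq, start=1):
        radius = k - i  # v_i covers every vertex within distance k - i
        dist = {v: 0}
        queue = deque([v])
        while queue:
            u = queue.popleft()
            if dist[u] == radius:
                continue  # do not expand beyond the allowed radius
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        burned.update(dist)
    return burned == set(adj)
```

For example, on the path with vertices 0, 1, 2, 3, the sequence `[1, 3]` is valid (vertex 1 covers 0, 1, 2 with radius 1 and vertex 3 covers itself), while `[0, 3]` leaves vertex 2 uncovered.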