We study the discrete bin covering problem, where a multiset of items with sizes from a fixed set $S \subseteq (0,1]$ must be split into disjoint subsets so as to maximize the number of subsets whose contents sum to at least $1$. We focus on the online discrete variant, where $S$ is finite and items arrive sequentially. In the purely online setting, we show that the competitive ratios of the best deterministic (and randomized) algorithms converge to $\frac{1}{2}$ as $S$ grows, as in the continuous setting. We therefore consider the problem in a prediction setting, where algorithms may access a vector predicting the frequency of items of each size in the instance. In this setting, we introduce a family of online algorithms that perform near-optimally when the predictions are correct. Further, we introduce a second, more robust family of algorithms that offers a tradeoff between the performance guarantees when the predictions are perfect and when they are adversarial. Finally, we consider a stochastic setting where items are drawn independently from a fixed but unknown distribution over $S$. Using results on the PAC-learnability of probabilities in discrete distributions, we also introduce a purely online algorithm whose average-case performance is near-optimal with high probability for every finite set $S$ and every distribution over $S$.
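As a concrete reference point for the purely online setting, the classical greedy strategy (often called Dual Next Fit), which keeps a single bin open and closes it as soon as its content reaches $1$, is the standard $\frac{1}{2}$-competitive baseline for bin covering. The sketch below illustrates that baseline only; it is not one of the prediction-based algorithms introduced in the paper, and the example item sizes are illustrative.

```python
def greedy_online_bin_covering(items):
    """Classical purely online baseline for bin covering (Dual Next Fit):
    keep adding arriving items to one open bin and close the bin as soon
    as its content sums to at least 1."""
    covered = 0      # number of bins covered (content >= 1) so far
    current = 0.0    # content of the currently open bin
    for x in items:  # items arrive one by one, each x in (0, 1]
        current += x
        if current >= 1:
            covered += 1
            current = 0.0
    return covered

# Example: item sizes drawn from a small discrete set S = {1/3, 1/2}
print(greedy_online_bin_covering([1/3, 1/2, 1/2, 1/3, 1/3, 1/2]))  # -> 2
```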
We consider the differentially private (DP) facility location problem in the so-called super-set output setting proposed by Gupta et al. [SODA 2010]. The current best known expected approximation ratio for an $\epsilon$-DP algorithm is $O\left(\frac{\log n}{\sqrt{\epsilon}}\right)$, due to Cohen-Addad et al. [AISTATS 2022], where $n$ denotes the size of the metric space, while the best known lower bound is $\Omega(1/\sqrt{\epsilon})$ [NeurIPS 2019]. In this short note, we give a lower bound of $\tilde{\Omega}\left(\min\left\{\log n, \sqrt{\frac{\log n}{\epsilon}}\right\}\right)$ on the expected approximation ratio of any $\epsilon$-DP algorithm, which is the first evidence that the approximation ratio must grow with the size of the metric space.
We consider the Distinct Shortest Walks problem. Given two vertices $s$ and $t$ of a graph database $\mathcal{D}$ and a regular path query, enumerate all walks of minimal length from $s$ to $t$ that carry a label that conforms to the query. Usual theoretical solutions turn out to be inefficient when applied to graph models that are closer to real-life systems, in particular because edges may carry multiple labels. Indeed, known algorithms may repeat the same answer exponentially many times. We propose an efficient algorithm for multi-labelled graph databases. The preprocessing runs in $O(|\mathcal{D}|\times|\mathcal{A}|)$ time and the delay between two consecutive outputs is in $O(\lambda\times|\mathcal{A}|)$, where $\mathcal{A}$ is a nondeterministic automaton representing the query and $\lambda$ is the minimal length. The algorithm can handle $\varepsilon$-transitions in $\mathcal{A}$ or queries given as regular expressions at no additional cost.
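The abstract does not spell out the preprocessing itself; the stated $O(|\mathcal{D}|\times|\mathcal{A}|)$ bound is consistent with a breadth-first search over the product of the database and the automaton, which already yields the minimal length $\lambda$. The sketch below shows only that product BFS under assumed encodings (dictionaries of labelled edges and transitions) and omits the enumeration phase, so it should be read as an illustration rather than the paper's algorithm.

```python
from collections import deque

def product_bfs(db_edges, source, auto_trans, auto_init, auto_final, target):
    """Illustrative product-graph BFS: returns the minimal length lambda of a
    walk from source to target whose label conforms to the NFA, or None.

    db_edges:   dict vertex -> list of (label, neighbour) pairs (multi-labelled)
    auto_trans: dict state  -> list of (label, state) pairs
    """
    start = {(source, q) for q in auto_init}
    dist = {p: 0 for p in start}
    queue = deque(start)
    while queue:
        v, q = queue.popleft()
        if v == target and q in auto_final:
            return dist[(v, q)]  # BFS order guarantees minimality
        for label, w in db_edges.get(v, []):
            for a, q2 in auto_trans.get(q, []):
                if a == label and (w, q2) not in dist:
                    dist[(w, q2)] = dist[(v, q)] + 1
                    queue.append((w, q2))
    return None  # no conforming walk from source to target
```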
We give an isomorphism test that runs in time $n^{\operatorname{polylog}(h)}$ on all $n$-vertex graphs excluding some $h$-vertex graph as a topological subgraph. Previous results state that isomorphism for such graphs can be tested in time $n^{\operatorname{polylog}(n)}$ (Babai, STOC 2016) and $n^{f(h)}$ for some function $f$ (Grohe and Marx, SIAM J. Comp., 2015). Our result also unifies and extends previous isomorphism tests for graphs of maximum degree $d$ running in time $n^{\operatorname{polylog}(d)}$ (SIAM J. Comp., 2023) and for graphs of Hadwiger number $h$ running in time $n^{\operatorname{polylog}(h)}$ (SIAM J. Comp., 2023).
The efficacy of machine learning has traditionally relied on the availability of increasingly larger datasets. However, large datasets pose storage challenges and contain non-influential samples that could be ignored during training without impacting the final accuracy of the model. In response to these limitations, the concept of distilling the information in a dataset into a condensed set of (synthetic) samples, namely a distilled dataset, has emerged. One crucial aspect is the architecture (usually a ConvNet) selected to link the original and synthetic datasets. However, the final accuracy drops if the architecture used at evaluation time differs from the one used during distillation. Another challenge is the generation of high-resolution images, e.g., 128x128 and higher. In this paper, we propose Latent Dataset Distillation with Diffusion Models (LD3M), which combines diffusion in latent space with dataset distillation to tackle both challenges. LD3M incorporates a novel diffusion process tailored for dataset distillation, which improves the gradient norms for learning synthetic images. By adjusting the number of diffusion steps, LD3M also offers a straightforward way of controlling the trade-off between speed and accuracy. We evaluate our approach on several ImageNet subsets and for high-resolution images (128x128 and 256x256). LD3M consistently outperforms state-of-the-art distillation techniques by up to 4.8 p.p. and 4.2 p.p. for 1 and 10 images per class, respectively.
We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\Vert w^*\Vert^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^*\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{n\epsilon}\right\}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{n\epsilon}},\frac{\sqrt{\text{rank}}\Vert w^*\Vert}{n\epsilon}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a priori knowledge of $\Vert w^*\Vert$.
Consider the Telephone Broadcast problem, in which the input is a connected graph $G$ on $n$ vertices, a source vertex $s \in V(G)$, and a positive integer $t$. The objective is to decide whether there is a broadcast protocol from $s$ that ensures that all the vertices of $G$ receive the message in at most $t$ rounds. We consider the broadcast protocol where, in a round, any node aware of the message can forward it to at most one of its neighbors. As the number of nodes aware of the message can at most double in each round, for a non-trivial instance we have $n \le 2^t$. Hence, the brute-force algorithm that checks all permutations of the vertices runs in time $2^{2^{\calO(t)}} \cdot n^{\calO(1)}$. As our first result, we prove that this simple algorithm is essentially the best possible: Telephone Broadcast does not admit an algorithm running in time $2^{2^{o(t)}} \cdot n^{\calO(1)}$, unless the \ETH\ fails. To the best of our knowledge, this is only the fourth example of an \NP-complete problem that admits a double-exponential lower bound when parameterized by the solution size. It also resolves a question posed by Fomin, Fraigniaud, and Golovach [WG 2023]. In the same article, the authors asked whether the problem is \FPT\ when parameterized by the feedback vertex set number of the graph. We answer this question in the negative: Telephone Broadcast, restricted to graphs of feedback vertex number one, and hence treewidth two, remains \NP-complete. We find this a relatively rare example of a problem that admits a polynomial-time algorithm on trees but is \NP-complete on graphs of treewidth two.
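For intuition about the problem statement (rather than the lower-bound constructions of the paper), tiny instances can be decided by exhaustively trying, in each round, every assignment of informed vertices to at most one uninformed neighbor. The sketch below is purely illustrative; the adjacency-dictionary encoding and the example graph are assumptions.

```python
from itertools import product

def can_broadcast(adj, s, t):
    """Brute-force check for tiny Telephone Broadcast instances: can all
    vertices be informed within t rounds if, per round, each informed
    vertex forwards the message to at most one neighbour?

    adj: dict mapping each vertex to the set of its neighbours.
    """
    def search(informed, rounds_left):
        if len(informed) == len(adj):
            return True
        if rounds_left == 0:
            return False
        senders = sorted(informed)
        # each informed vertex stays silent (None) or calls one uninformed neighbour
        options = [[None] + sorted(adj[v] - informed) for v in senders]
        for choice in product(*options):
            newly = {w for w in choice if w is not None}
            if newly and search(informed | newly, rounds_left - 1):
                return True
        return False
    return search({s}, t)

# Path a - b - c - d: broadcasting from a needs 3 rounds
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(can_broadcast(path, "a", 2), can_broadcast(path, "a", 3))  # False True
```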
The $(k, z)$-Clustering problem in Euclidean space $\mathbb{R}^d$ has been extensively studied. Given the scale of the data involved, compression methods for the Euclidean $(k, z)$-Clustering problem, such as coresets and dimension reduction, have received significant attention in the literature. However, the space complexity of the clustering problem, specifically the number of bits required to compress the cost function to within a multiplicative error of $\varepsilon$, remains unclear in the existing literature. This paper initiates the study of the space complexity of Euclidean $(k, z)$-Clustering and offers both upper and lower bounds. Our space bounds are nearly tight when $k$ is constant, indicating that storing a coreset, a well-known data compression approach, serves as the optimal compression scheme. Furthermore, our lower bound for $(k, z)$-Clustering establishes a tight space bound of $\Theta(nd)$ for terminal embedding, where $n$ denotes the dataset size. Our technical approach leverages new geometric insights for principal angles and discrepancy methods, which may be of independent interest.
A subset $S$ of vertices in a graph $G$ is a secure total dominating set of $G$ if $S$ is a total dominating set of $G$ and, for each vertex $u \not\in S$, there is a vertex $v \in S$ such that $uv$ is an edge and $(S \setminus \{v\}) \cup \{u\}$ is also a total dominating set of $G$. We show that if $G$ is a maximal outerplanar graph of order $n$, then $G$ has a secure total dominating set of size at most $\lfloor 2n/3 \rfloor$. Moreover, if $G$ is an outerplanar graph of order $n$, then every secure total dominating set of $G$ has at least $\lceil (n+2)/3 \rceil$ vertices. We show that these bounds are best possible.
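The definition above can be verified directly on small graphs; the following sketch checks whether a given vertex set is a secure total dominating set (the adjacency-dictionary encoding and the example 4-cycle are illustrative assumptions, unrelated to the outerplanar bounds).

```python
def is_total_dominating(adj, S):
    """Every vertex of the graph (including those in S) has a neighbour in S."""
    return all(adj[v] & S for v in adj)

def is_secure_total_dominating(adj, S):
    """Direct check of the definition: S is a total dominating set and every
    u outside S has a neighbour v in S such that (S - {v}) | {u} is again a
    total dominating set.  adj: dict vertex -> set of neighbours."""
    S = set(S)
    if not is_total_dominating(adj, S):
        return False
    for u in adj:
        if u in S:
            continue
        if not any(is_total_dominating(adj, (S - {v}) | {u})
                   for v in adj[u] & S):
            return False
    return True

# Example: in the 4-cycle a-b-c-d-a, {a, b, c} is a secure total dominating set
c4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
print(is_secure_total_dominating(c4, {"a", "b", "c"}))  # True
```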
We study the problem of maintaining a lightweight bounded-degree $(1+\varepsilon)$-spanner of a dynamic point set in a $d$-dimensional Euclidean space, where $\varepsilon>0$ and $d$ are arbitrary constants. In our fully-dynamic setting, points are allowed to be inserted as well as deleted, and our objective is to maintain a $(1+\varepsilon)$-spanner that has constant bounds on its maximum degree and its lightness (the ratio of its weight to that of the minimum spanning tree), while minimizing the recourse, which is the number of edges added or removed by each point insertion or deletion. We present a fully-dynamic algorithm that handles point insertion with amortized constant recourse and point deletion with amortized $O(\log\Delta)$ recourse, where $\Delta$ is the aspect ratio of the point set.
It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.
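For multi-class classification, a common instantiation of the Outlier Exposure idea adds a term that pushes the model's softmax outputs on auxiliary outliers towards the uniform distribution. The sketch below shows only this loss form under assumptions (a PyTorch classifier producing logits; the weight `lam` is chosen for illustration) and is not a complete training recipe.

```python
import torch
import torch.nn.functional as F

def oe_training_loss(logits_in, labels_in, logits_out, lam=0.5):
    """Sketch of an Outlier Exposure style objective for classification:
    standard cross-entropy on in-distribution data plus a term that pulls
    the softmax on auxiliary outliers towards the uniform distribution."""
    ce_in = F.cross_entropy(logits_in, labels_in)
    log_probs_out = F.log_softmax(logits_out, dim=1)
    # cross-entropy to the uniform distribution over the K classes:
    # -(1/K) * sum_k log p_k, averaged over the outlier batch
    ce_uniform = -log_probs_out.mean(dim=1).mean()
    return ce_in + lam * ce_uniform
```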