By prior work, it is known that any distributed graph algorithm that finds a maximal matching requires $\Omega(\log^* n)$ communication rounds, while a maximal fractional matching can be found in $O(1)$ rounds in bounded-degree graphs. However, all prior $O(1)$-round algorithms for maximal fractional matching use arbitrarily fine-grained fractional values. In particular, none of them can find a half-integral solution, i.e., one using only values from $\{0, \frac12, 1\}$. We show that the use of fine-grained fractional values is necessary, and moreover we give a complete characterization of exactly how fine-grained they have to be: for maximal fractional matching in graphs of maximum degree $\Delta = 2d$, any distributed graph algorithm whose round complexity $T(\Delta)$ depends only on $\Delta$ and is independent of $n$ has to use fractional values with a denominator of at least $2^d$. We give a new algorithm showing that this is also sufficient.
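For reference, a minimal formalization of the objects above (standard definitions, not spelled out in the abstract): a fractional matching of $G=(V,E)$ is an assignment
\[
y : E \to [0,1] \quad \text{with} \quad \sum_{e \ni v} y_e \le 1 \ \text{ for every } v \in V,
\]
and it is maximal when no single $y_e$ can be increased, i.e., every edge $\{u,v\}$ has a saturated endpoint with $\sum_{e \ni u} y_e = 1$ or $\sum_{e \ni v} y_e = 1$. A half-integral solution is one with $y_e \in \{0, \frac12, 1\}$ for all $e$.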
Recent works have demonstrated that the sample complexity of gradient-based learning of single index models, i.e., functions that depend on a one-dimensional projection of the input data, is governed by their information exponent. However, these results consider only isotropic data, while in practice the input often contains additional structure that can implicitly guide the algorithm. In this work, we investigate the effect of a spiked covariance structure and reveal several interesting phenomena. First, we show that in the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction, even when the spike is perfectly aligned with the target direction. Next, we show that appropriate weight normalization, reminiscent of batch normalization, can alleviate this issue. Further, by exploiting the alignment between the (spiked) input covariance and the target, we obtain improved sample complexity compared to the isotropic case. In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent while also outperforming lower bounds for rotationally invariant kernel methods.
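One common formalization of this setup (our gloss; the abstract does not fix notation): a single index model is $y = \sigma(\langle w_\star, x\rangle)$ with $\|w_\star\| = 1$, the information exponent of the link function $\sigma$ is
\[
k_\star = \min\{k \ge 1 : \mathbb{E}_{Z \sim \mathcal{N}(0,1)}[\sigma(Z)\,\mathrm{He}_k(Z)] \neq 0\},
\]
where $\mathrm{He}_k$ denotes the $k$-th Hermite polynomial, and a spiked covariance takes the form $\Sigma = I_d + \theta v v^\top$ for a spike direction $v$ and spike strength $\theta$.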
Current machine learning methods struggle to solve Bongard problems, which are a type of IQ test that requires deriving an abstract "concept" from a set of positive and negative "support" images, and then classifying whether or not a new query image depicts the key concept. On Bongard-HOI, a benchmark for natural-image Bongard problems, existing methods have only reached 66% accuracy (where chance is 50%). This low accuracy is often attributed to neural networks' inability to discover human-like symbolic rules. In this work, we point out that many existing methods are forfeiting accuracy due to a much simpler problem: they do not incorporate information contained in the support set as a whole, relying instead on information extracted from individual supports. This is a critical issue, because unlike in few-shot learning tasks concerning object classification, the "key concept" in a typical Bongard problem can only be distinguished using multiple positives and multiple negatives. We explore a variety of simple methods to take this cross-image context into account, and demonstrate substantial gains over prior methods, leading to new state-of-the-art performance on Bongard-LOGO (75.3%) and Bongard-HOI (72.45%) and strong performance on the original Bongard problem set (60.84%).
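A minimal illustration of cross-image context (a didactic baseline with hypothetical feature vectors, not the paper's method): instead of scoring the query against each support independently, one can contrast the pooled positive and negative sets.

```python
import numpy as np

def classify_query(pos_feats, neg_feats, query_feat):
    """Toy cross-image-context classifier for a Bongard problem.

    pos_feats, neg_feats: (k, d) arrays of support-image features.
    The concept direction is estimated from the *whole* support set,
    not from any single support image.
    """
    mu_pos, mu_neg = pos_feats.mean(axis=0), neg_feats.mean(axis=0)
    concept = mu_pos - mu_neg                       # estimated concept direction
    threshold = concept @ (mu_pos + mu_neg) / 2     # midpoint between the two sets
    return float(query_feat @ concept) > threshold

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=(6, 16))  # positives shifted along the concept
neg = rng.normal(0.0, 1.0, size=(6, 16))
print(classify_query(pos, neg, rng.normal(1.0, 1.0, size=16)))
```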
Given a finite alphabet $A$, a quasi-metric $d$ over $A^*$, and a non-negative integer $k$, we introduce the relation $\tau_{d,k}\subseteq A^*\times A^*$ such that $(x,y)\in\tau_{d,k}$ holds whenever $d(x,y)\le k$. The error-detection capability of variable-length codes is expressed in terms of conditions over $\tau_{d,k}$. With respect to the prefix metric, the factor metric, and any quasi-metric associated with a free-monoid (anti-)automorphism, we prove that one can decide whether a given regular variable-length code satisfies any of these error-detection constraints.
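A small sketch of the relation for the prefix case, assuming the standard prefix distance $d(x,y) = |x| + |y| - 2\,|\mathrm{lcp}(x,y)|$ (the abstract does not spell out a formula):

```python
import os

def prefix_distance(x: str, y: str) -> int:
    """Prefix distance: the symbols left over in x and y after
    removing their longest common prefix."""
    lcp = len(os.path.commonprefix([x, y]))
    return (len(x) - lcp) + (len(y) - lcp)

def tau(x: str, y: str, k: int) -> bool:
    """(x, y) is in tau_{d,k} iff d(x, y) <= k."""
    return prefix_distance(x, y) <= k

print(tau("abra", "abc", 3))  # True: lcp is "ab", so d = 2 + 1 = 3
print(tau("abra", "abc", 2))  # False
```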
We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1 .. m]$ on a large repetitive text collection $T[1 .. n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^\epsilon n)$, for any constant $\epsilon > 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(\delta\log\frac{n}{\delta})$, the time decreases to $O(m\log m(\log m + \log^\epsilon n))$. The value $\delta$ is a function of the substring complexity of $T$, and $\Omega(\delta\log\frac{n}{\delta})$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $\delta$. We extend our results to several related problems, such as finding $k$-MEMs, MUMs, and rare MEMs, and we discuss applications.
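To pin down the object being computed, here is a naive quadratic baseline (the paper instead works on a grammar-compressed $T$; this sketch only illustrates the standard MEM definition):

```python
def mems(P: str, T: str):
    """Naive O(|P| * |T|) enumeration of Maximal Exact Matches:
    triples (i, j, l) with P[i:i+l] == T[j:j+l] that cannot be
    extended to the left or to the right."""
    out = []
    for i in range(len(P)):
        for j in range(len(T)):
            if P[i] != T[j]:
                continue
            if i > 0 and j > 0 and P[i - 1] == T[j - 1]:
                continue  # not left-maximal: extendable to the left
            l = 0
            while i + l < len(P) and j + l < len(T) and P[i + l] == T[j + l]:
                l += 1    # extend to the right as far as possible
            out.append((i, j, l))
    return out

print(mems("aba", "ababa"))  # [(0, 0, 3), (0, 2, 3), (0, 4, 1), (2, 0, 1)]
```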
In this work, we consider the list-decodability and list-recoverability of codes in the zero-rate regime. Briefly, a code $\mathcal{C} \subseteq [q]^n$ is $(p,\ell,L)$-list-recoverable if for all tuples of input lists $(Y_1,\dots,Y_n)$ with each $Y_i \subseteq [q]$ and $|Y_i|=\ell$, the number of codewords $c \in \mathcal{C}$ such that $c_i \notin Y_i$ for at most $pn$ choices of $i \in [n]$ is less than $L$; list-decoding is the special case of $\ell=1$. In recent work, Resch, Yuan and Zhang~(ICALP~2023) determined the zero-rate threshold for list-recovery for all parameters: that is, they explicitly compute $p_*:=p_*(q,\ell,L)$ with the property that for all $\epsilon>0$ (a) there exist infinite families of positive-rate $(p_*-\epsilon,\ell,L)$-list-recoverable codes, and (b) any $(p_*+\epsilon,\ell,L)$-list-recoverable code has rate $0$. In fact, in the latter case the code has constant size, independent of $n$. However, the constant size in their work is quite large in $1/\epsilon$, at least $|\mathcal{C}|\geq (\frac{1}{\epsilon})^{O(q^L)}$. Our contribution in this work is to show that for all choices of $q,\ell$ and $L$ with $q \geq 3$, any $(p_*+\epsilon,\ell,L)$-list-recoverable code must have size $O_{q,\ell,L}(1/\epsilon)$, and furthermore this upper bound is complemented by a matching lower bound $\Omega_{q,\ell,L}(1/\epsilon)$. This greatly generalizes work by Alon, Bukh and Polyanskiy~(IEEE Trans.\ Inf.\ Theory~2018), which focused only on the case of a binary alphabet (and thus necessarily only list-decoding). We remark that we can in fact recover the same result for $q=2$ and even $L$, as obtained by Alon, Bukh and Polyanskiy: we thus strictly generalize their work.
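The definition above can be made concrete with a brute-force checker (our illustration; only feasible for tiny parameters, since it enumerates all list tuples):

```python
from itertools import combinations, product

def is_list_recoverable(C, q, n, p, ell, L):
    """Check that code C (a set of length-n tuples over range(q)) is
    (p, ell, L)-list-recoverable: for every tuple of input lists
    (Y_1, ..., Y_n) with |Y_i| = ell, fewer than L codewords disagree
    with the lists in at most p*n positions."""
    lists = list(combinations(range(q), ell))
    for Y in product(lists, repeat=n):
        agreeing = sum(
            1 for c in C
            if sum(c[i] not in Y[i] for i in range(n)) <= p * n
        )
        if agreeing >= L:
            return False
    return True

# Tiny example: the repetition code {000, 111, 222} over q = 3, n = 3.
C = {(0, 0, 0), (1, 1, 1), (2, 2, 2)}
print(is_list_recoverable(C, q=3, n=3, p=0.0, ell=1, L=2))   # True
print(is_list_recoverable(C, q=3, n=3, p=2/3, ell=1, L=2))   # False
```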
Broadcast protocols enable a set of $n$ parties to agree on the input of a designated sender, even facing attacks by malicious parties. In the honest-majority setting, randomization and cryptography were harnessed to achieve low-communication broadcast with sub-quadratic total communication and balanced sub-linear cost per party. However, comparatively little is known in the dishonest-majority setting. Here, the most communication-efficient constructions are based on Dolev and Strong (SICOMP '83), and sub-quadratic broadcast has not been achieved. On the other hand, the only nontrivial $\omega(n)$ communication lower bounds are restricted to deterministic protocols, or against strong adaptive adversaries that can perform "after the fact" removal of messages. We provide new communication lower bounds in this space, which hold against arbitrary cryptography and setup assumptions, as well as a simple protocol showing near tightness of our first bound. 1) We demonstrate a tradeoff between resiliency and communication for protocols secure against $n-o(n)$ static corruptions. For example, $\Omega(n\cdot {\sf polylog}(n))$ messages are needed when the number of honest parties is $n/{\sf polylog}(n)$; $\Omega(n\sqrt{n})$ messages are needed for $O(\sqrt{n})$ honest parties; and $\Omega(n^2)$ messages are needed for $O(1)$ honest parties. Complementarily, we demonstrate broadcast with $O(n\cdot{\sf polylog}(n))$ total communication facing any constant fraction of static corruptions. 2) Our second bound considers $n/2 + k$ corruptions and a weakly adaptive adversary that cannot remove messages "after the fact." We show that any broadcast protocol within this setting can be attacked to force an arbitrary party to send messages to $k$ other parties. This rules out, for example, broadcast facing 51% corruptions in which all non-sender parties have sublinear communication locality.
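A hedged reading of the first bound: writing $h$ for the number of honest parties, the three listed instances are all consistent with a single tradeoff of
\[
\Omega\!\left(\frac{n^2}{h}\right)
\]
messages ($h = n/{\sf polylog}(n)$, $h = \Theta(\sqrt{n})$, and $h = O(1)$ recover the three cases); the abstract states only the individual instances, so this closed form is an extrapolation offered for orientation.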
A pair $\langle G_0, G_1 \rangle$ of graphs admits a mutual witness proximity drawing $\langle \Gamma_0, \Gamma_1 \rangle$ when: (i) $\Gamma_i$ represents $G_i$, and (ii) there is an edge $(u,v)$ in $\Gamma_i$ if and only if there is no vertex $w$ in $\Gamma_{1-i}$ that is ``too close'' to both $u$ and $v$ ($i=0,1$). In this paper, we consider infinitely many definitions of closeness by adopting the $\beta$-proximity rule for any $\beta \in [1,\infty]$ and study pairs of isomorphic trees that admit a mutual witness $\beta$-proximity drawing. Specifically, we show that every two isomorphic trees admit a mutual witness $\beta$-proximity drawing for any $\beta \in [1,\infty]$. The constructive technique can be made ``robust'': For some tree pairs we can suitably prune linearly many leaves from one of the two trees and still retain their mutual witness $\beta$-proximity drawability. Notably, in the special case of isomorphic caterpillars and $\beta=1$, we construct linearly separable mutual witness Gabriel drawings.
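For concreteness, a small sketch of the $\beta=1$ (Gabriel) witness rule behind the last result, on hypothetical point sets: $w$ lies in the $\beta=1$ proximity region of $u,v$ exactly when it lies in the closed disk with diameter $\overline{uv}$, i.e., when $(u-w)\cdot(v-w) \le 0$.

```python
def in_gabriel_region(u, v, w) -> bool:
    """True iff w lies in the closed disk with diameter uv, i.e. the
    angle u-w-v is at least 90 degrees: (u - w) . (v - w) <= 0."""
    return (u[0] - w[0]) * (v[0] - w[0]) + (u[1] - w[1]) * (v[1] - w[1]) <= 0

def witness_gabriel_edges(points, witnesses):
    """Edges of the witness Gabriel graph of `points`: (a, b) is an
    edge iff no witness point lies in the disk with diameter ab."""
    edges = []
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            if not any(in_gabriel_region(points[a], points[b], w)
                       for w in witnesses):
                edges.append((a, b))
    return edges

pts = [(0, 0), (4, 0), (2, 3)]
print(witness_gabriel_edges(pts, witnesses=[(2, -0.5)]))  # [(0, 2), (1, 2)]
```

In a mutual witness drawing, each drawing plays both roles at once: the vertices of $\Gamma_0$ are the witnesses for $\Gamma_1$ and vice versa.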
Composed image retrieval (CIR) is a new and flexible image retrieval paradigm that retrieves the target image for a multimodal query consisting of a reference image and its corresponding modification text. Although existing efforts have achieved compelling success, they overlook two issues: modeling the conflict relationship between the reference image and the modification text, which would improve multimodal query composition; and modeling adaptive matching degrees, which would improve the ranking of candidate images that match the given query to different extents. To address these two limitations, we propose a Target-Guided Composed Image Retrieval network (TG-CIR). In particular, TG-CIR first extracts unified global and local attribute features for the reference/target image and the modification text with the contrastive language-image pre-training model (CLIP) as the backbone, where an orthogonal regularization is introduced to promote independence among the attribute features. TG-CIR then designs a target-query relationship-guided multimodal query composition module, comprising a target-free student composition branch and a target-based teacher composition branch, where the target-query relationship is injected into the teacher branch to guide the conflict relationship modeling of the student branch. Finally, apart from the conventional batch-based classification loss, TG-CIR introduces a batch-based target similarity-guided matching degree regularization to promote the metric learning process. Extensive experiments on three benchmark datasets demonstrate the superiority of our proposed method.
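One ingredient above has a standard form worth sketching: the orthogonal regularization promoting independence among attribute features (the feature matrix here is hypothetical, and the full TG-CIR pipeline is not reproduced):

```python
import numpy as np

def orthogonal_regularization(F: np.ndarray) -> float:
    """Penalize pairwise correlation among k attribute features.

    F: (k, d) matrix whose rows are attribute feature vectors.
    Returns ||F F^T - I||_F^2 after row normalization, which is 0
    iff the features are mutually orthogonal.
    """
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    gram = F @ F.T
    return float(np.sum((gram - np.eye(F.shape[0])) ** 2))

feats = np.random.default_rng(0).normal(size=(4, 32))
print(orthogonal_regularization(feats))  # added to the training loss
```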
We consider the problem of dynamically maintaining the convex hull of a set $S$ of points in the plane under the following special sequence of insertions and deletions (called {\em window-sliding updates}): insert a point to the right of all points of $S$ and delete the leftmost point of $S$. We propose an $O(|S|)$-space data structure that can handle each update in $O(1)$ amortized time, such that standard binary-search-based queries on the convex hull of $S$ can be answered in $O(\log h)$ time, where $h$ is the number of vertices of the convex hull of $S$, and the convex hull itself can be output in $O(h)$ time.
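The insertion half of this update pattern is easiest to see in isolation: appending a point to the right of an upper hull is the classic monotone-stack step, amortized $O(1)$. The sketch below shows only this half; the paper's structure additionally supports leftmost deletion and $O(\log h)$ queries.

```python
def cross(o, a, b):
    """Cross product of OA x OB; >= 0 means a non-clockwise turn at a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def insert_right(upper, p):
    """Append p (to the right of all current points) to the upper hull.
    Each point is pushed and popped at most once over the whole
    sequence, so the amortized cost per insertion is O(1)."""
    while len(upper) >= 2 and cross(upper[-2], upper[-1], p) >= 0:
        upper.pop()  # upper[-1] is no longer on the upper hull
    upper.append(p)

upper = []
for p in [(0, 0), (1, 2), (2, 1), (3, 3), (4, 0)]:
    insert_right(upper, p)
print(upper)  # [(0, 0), (1, 2), (3, 3), (4, 0)]
```

The lower hull is maintained symmetrically with the opposite turn test.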
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design three pre-training tasks: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Region Modeling (MRM, with three variants). Different from concurrent work on multimodal pre-training that applies joint random masking to both modalities, we use conditioned masking in our pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text). Comprehensive analysis shows that conditioned masking yields better performance than unconditioned masking. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2.
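A schematic of the conditioned-masking idea (the shapes and the 15% rate are illustrative; the actual UNITER models operate on detected region features and WordPiece tokens): each pre-training instance masks only one modality, keeping the other fully observed.

```python
import random

def conditioned_masking(text_tokens, image_regions, mask_rate=0.15):
    """Mask exactly one modality per instance, conditioning on the
    full observation of the other (vs. joint random masking of both)."""
    task = random.choice(["mlm", "mrm"])
    if task == "mlm":   # mask text; all image regions stay visible
        masked = [t if random.random() > mask_rate else "[MASK]"
                  for t in text_tokens]
        return masked, image_regions
    else:               # mask regions; all text tokens stay visible
        masked = [r if random.random() > mask_rate else None
                  for r in image_regions]
        return text_tokens, masked

random.seed(0)
print(conditioned_masking(["a", "dog", "on", "grass"], ["r1", "r2", "r3"]))
```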