The Gromov-Wasserstein (GW) transport problem is a relaxation of classic optimal transport, which seeks a transport between two measures while preserving their internal geometry. Due to meeting this theoretical underpinning, it is a valuable tool for the analysis of objects that do not possess a natural embedding or should be studied independently of it. Prime applications can thus be found in e.g. shape matching, classification and interpolation tasks. To tackle the latter, one theoretically justified approach is the employment of multi-marginal GW transport and GW barycenters, which are Fr\'echet means with respect to the GW distance. However, because the computation of GW itself already poses a quadratic and non-convex optimization problem, the determination of GW barycenters is a hard task and algorithms for their computation are scarce. In this paper, we revisit a known procedure for the determination of Fr\'echet means in Riemannian manifolds via tangential approximations in the context of GW. We provide a characterization of barycenters in the GW tangent space, which ultimately gives rise to a fixpoint iteration for approximating GW barycenters using multi-marginal plans. We propose a relaxation of this fixpoint iteration and show that it monotonously decreases the barycenter loss. In certain cases our proposed method naturally provides us with barycentric embeddings. The resulting algorithm is capable of producing qualitative shape interpolations between multiple 3d shapes with support sizes of over thousands of points in reasonable time. In addition, we verify our method on shape classification and multi-graph matching tasks.
Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing properties of retrieval heads:(1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5\%) of the attention heads are retrieval. (3) intrinsic: retrieval heads already exist in models pretrained with short context. When extending the context length by continual pretraining, it is still the same set of heads that perform information retrieval. (4) dynamically activated: take Llama-2 7B for example, 12 retrieval heads always attend to the required information no matter how the context is changed. The rest of the retrieval heads are activated in different contexts. (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back the question and previously-generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
Qini curves have emerged as an attractive and popular approach for evaluating the benefit of data-driven targeting rules for treatment allocation. We propose a generalization of the Qini curve to multiple costly treatment arms, that quantifies the value of optimally selecting among both units and treatment arms at different budget levels. We develop an efficient algorithm for computing these curves and propose bootstrap-based confidence intervals that are exact in large samples for any point on the curve. These confidence intervals can be used to conduct hypothesis tests comparing the value of treatment targeting using an optimal combination of arms with using just a subset of arms, or with a non-targeting assignment rule ignoring covariates, at different budget levels. We demonstrate the statistical performance in a simulation experiment and an application to treatment targeting for election turnout.
Regular Path Queries (RPQs), which are essentially regular expressions to be matched against the labels of paths in labeled graphs, are at the core of graph database query languages like SPARQL. A way to solve RPQs is to translate them into a sequence of operations on the adjacency matrices of each label. We design and implement a Boolean algebra on sparse matrix representations and, as an application, use them to handle RPQs. Our baseline representation uses the same space as the previously most compact index for RPQs and outperforms it on the hardest types of queries -- those where both RPQ endpoints are unspecified. Our more succinct structure, based on $k^2$-trees, is 4 times smaller than any existing representation that handles RPQs, and still solves complex RPQs in a few seconds. Our new sparse-matrix-based representations dominate a good portion of the space/time tradeoff map, being outperformed only by representations that use much more space. They are also of independent interest beyond solving RPQs.
The constraint satisfaction problem (CSP) on a finite relational structure B is to decide, given a set of constraints on variables where the relations come from B, whether or not there is a assignment to the variables satisfying all of the constraints; the surjective CSP is the variant where one decides the existence of a surjective satisfying assignment onto the universe of B. We present an algebraic framework for proving hardness results on surjective CSPs; essentially, this framework computes global gadgetry that permits one to present a reduction from a classical CSP to a surjective CSP. We show how to derive a number of hardness results for surjective CSP in this framework, including the hardness of the disconnected cut problem, of the no-rainbow 3-coloring problem, and of the surjective CSP on all 2-element structures known to be intractable (in this setting). Our framework thus allows us to unify these hardness results, and reveal common structure among them; we believe that our hardness proof for the disconnected cut problem is more succinct than the original. In our view, the framework also makes very transparent a way in which classical CSPs can be reduced to surjective CSPs.
We develop the notion of a locally homomorphic channel and prove an approximate equivalence between those and codes for computing functions. Further, we derive decomposition properties of locally homomorphic channels which we use to analyze and construct codes where two messages must be encoded independently. This leads to new results for identification and K-identification when all messages are sent over multiple-access channels, which yield surprising rate improvements compared to naive code constructions. In particular, we demonstrate that for the example of identification with deterministic encoders, both encoders can be constructed independently.
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
Federated Learning (FL) is a decentralized machine-learning paradigm, in which a global server iteratively averages the model parameters of local users without accessing their data. User heterogeneity has imposed significant challenges to FL, which can incur drifted global models that are slow to converge. Knowledge Distillation has recently emerged to tackle this issue, by refining the server model using aggregated knowledge from heterogeneous users, other than directly averaging their model parameters. This approach, however, depends on a proxy dataset, making it impractical unless such a prerequisite is satisfied. Moreover, the ensemble knowledge is not fully utilized to guide local model learning, which may in turn affect the quality of the aggregated model. Inspired by the prior art, we propose a data-free knowledge distillation} approach to address heterogeneous FL, where the server learns a lightweight generator to ensemble user information in a data-free manner, which is then broadcasted to users, regulating local training using the learned knowledge as an inductive bias. Empirical studies powered by theoretical implications show that, our approach facilitates FL with better generalization performance using fewer communication rounds, compared with the state-of-the-art.
Few-shot Knowledge Graph (KG) completion is a focus of current research, where each task aims at querying unseen facts of a relation given its few-shot reference entity pairs. Recent attempts solve this problem by learning static representations of entities and references, ignoring their dynamic properties, i.e., entities may exhibit diverse roles within task relations, and references may make different contributions to queries. This work proposes an adaptive attentional network for few-shot KG completion by learning adaptive entity and reference representations. Specifically, entities are modeled by an adaptive neighbor encoder to discern their task-oriented roles, while references are modeled by an adaptive query-aware aggregator to differentiate their contributions. Through the attention mechanism, both entities and references can capture their fine-grained semantic meanings, and thus render more expressive representations. This will be more predictive for knowledge acquisition in the few-shot scenario. Evaluation in link prediction on two public datasets shows that our approach achieves new state-of-the-art results with different few-shot sizes.
Domain shift is a fundamental problem in visual recognition which typically arises when the source and target data follow different distributions. The existing domain adaptation approaches which tackle this problem work in the closed-set setting with the assumption that the source and the target data share exactly the same classes of objects. In this paper, we tackle a more realistic problem of open-set domain shift where the target data contains additional classes that are not present in the source data. More specifically, we introduce an end-to-end Progressive Graph Learning (PGL) framework where a graph neural network with episodic training is integrated to suppress underlying conditional shift and adversarial learning is adopted to close the gap between the source and target distributions. Compared to the existing open-set adaptation approaches, our approach guarantees to achieve a tighter upper bound of the target error. Extensive experiments on three standard open-set benchmarks evidence that our approach significantly outperforms the state-of-the-arts in open-set domain adaptation.
Multi-relation Question Answering is a challenging task, due to the requirement of elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis.