Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.
Modern SAT or QBF solvers are expected to produce correctness certificates. However, certificates have worst-case exponential size (unless $\textsf{NP}=\textsf{coNP}$), and at recent SAT competitions the largest certificates of unsatisfiability are starting to reach terabyte size. Recently, Couillard, Czerner, Esparza, and Majumdar have suggested to replace certificates with interactive proof systems based on the $\textsf{IP}=\textsf{PSPACE}$ theorem. They have presented an interactive protocol between a prover and a verifier for an extension of QBF. The overall running time of the protocol is linear in the time needed by a standard BDD-based algorithm, and the time invested by the verifier is polynomial in the size of the formula. (So, in particular, the verifier never has to read or process exponentially long certificates). We call such an interactive protocol competitive with the BDD algorithm for solving QBF. While BDD-algorithms are state-of-the-art for certain classes of QBF instances, no modern (UN)SAT solver is based on BDDs. For this reason, we initiate the study of interactive certification for more practical SAT algorithms. In particular, we address the question whether interactive protocols can be competitive with some variant of resolution. We present two contributions. First, we prove a theorem that reduces the problem of finding competitive interactive protocols to finding an arithmetisation of formulas satisfying certain commutativity properties. (Arithmetisation is the fundamental technique underlying the $\textsf{IP}=\textsf{PSPACE}$ theorem.) Then, we apply the theorem to give the first interactive protocol for the Davis-Putnam resolution procedure.
Isocontouring is one of the most widely used visualization techniques. However, many popular contouring algorithms were created prior to the advent of ubiquitous parallel approaches, such as multi-core, shared memory computing systems. With increasing data sizes and computational loads, it is essential to reimagine such algorithms to leverage the increased computing capabilities available today. To this end we have redesigned the SurfaceNets algorithm, a powerful technique which is often employed to isocontour non-continuous, discrete, volumetric scalar fields such as segmentation label maps. Label maps are ubiquitous to medical computing and biological analysis, used in applications ranging from anatomical atlas creation to brain connectomics. This novel Parallel SurfaceNets algorithm has been redesigned using concepts from the high-performance Flying Edges continuous isocontouring algorrithm. It consists of two basic steps, surface extraction followed by constrained smoothing, parallelized over volume edges and employing a double-buffering smoothing approach to guarantee determinism. The algorithm can extract and smooth multiple segmented objects in a single execution, producing a polygonal (triangular/quadrilateral) mesh with points and polygons fully shared between neighboring objects. Performance is typically one to two orders of magnitude faster than the current sequential algorithms for discrete isosurface extraction on small core-count commodity CPU hardware. We demonstrate the effectiveness of the algorithm on five different datasets including human torso and brain atlases, mouse brain segmentation, and electron microscopy connectomics. The software is currently available under a permissive, open source license in the VTK visualization system.
This work proposes a challenging yet more realistic setting for zero-shot cross-task generalization: zero-shot instruction following, presuming the existence of a paragraph-style task definition while no demonstrations exist. To better learn the task supervision from the definition, we propose two strategies: first, to automatically find out the critical sentences in the definition; second, a ranking objective to force the model to generate the gold outputs with higher probabilities when those critical parts are highlighted in the definition. The joint efforts of the two strategies yield state-of-the-art performance on the Super-NaturalInstructions. Our code is available on GitHub.
As one of the most popular dynamic languages, Python experiences a decrease in readability and maintainability when code smells are present. Recent advancements in Large Language Models have sparked growing interest in AI-enabled tools for both code generation and refactoring. GitHub Copilot is one such tool that has gained widespread usage. Copilot Chat, released on September 2023, functions as an interactive tool aims at facilitating natural language-powered coding. However, limited attention has been given to understanding code smells in Copilot-generated Python code and Copilot's ability to fix the code smells it generates. To this end, we built a dataset comprising 102 code smells in Copilot-generated Python code. Our aim is to first explore the occurrence of code smells in Copilot-generated Python code and then evaluate the effectiveness of Copilot in fixing these code smells employing different prompts. The results show that 8 out of 10 types of Python smells can be detected in Copilot-generated Python code, among which Multiply-Nested Container is the most common one. For these code smells, Copilot Chat achieves a highest fixing rate of 87.1%, showing promise in fixing Python code smells generated by Copilot itself. Besides, the effectiveness of Copilot Chat in fixing these smells can be improved with the provision of more detailed prompts. However, using Copilot Chat to fix these smells might introduce new code smells.
Recent advancement in computer vision has significantly lowered the barriers to artistic creation. Exemplar-based image translation methods have attracted much attention due to flexibility and controllability. However, these methods hold assumptions regarding semantics or require semantic information as the input, while accurate semantics is not easy to obtain in artistic images. Besides, these methods suffer from cross-domain artifacts due to training data prior and generate imprecise structure due to feature compression in the spatial domain. In this paper, we propose an arbitrary Style Image Manipulation Network (SIM-Net), which leverages semantic-free information as guidance and a region transportation strategy in a self-supervised manner for image generation. Our method balances computational efficiency and high resolution to a certain extent. Moreover, our method facilitates zero-shot style image manipulation. Both qualitative and quantitative experiments demonstrate the superiority of our method over state-of-the-art methods.Code is available at //github.com/SnailForce/SIM-Net.
Efficient implementation of massive multiple-input-multiple-output (MIMO) transceivers is essential for the next-generation wireless networks. To reduce the high computational complexity of the massive MIMO transceiver, in this paper, we propose a new massive MIMO architecture using finite-precision arithmetic. First, we conduct the rounding error analysis and derive the lower bound of the achievable rate for single-input-multiple-output (SIMO) using maximal ratio combining (MRC) and multiple-input-single-output (MISO) systems using maximal ratio transmission (MRT) with finite-precision arithmetic. Then, considering the multi-user scenario, the rounding error analysis of zero-forcing (ZF) detection and precoding is derived by using the normal equations (NE) method. The corresponding lower bounds of the achievable sum rate are also derived and asymptotic analyses are presented. Built upon insights from these analyses and lower bounds, we propose a mixed-precision architecture for massive MIMO systems to offset performance gaps due to finite-precision arithmetic. The corresponding analysis of rounding errors and computational costs is obtained. Simulation results validate the derived bounds and underscore the superiority of the proposed mixed-precision architecture to the conventional structure.
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships. Due to the lack of a diverse and generalisable VMR dataset to facilitate learning scalable moment-text associations, existing methods resort to joint training on both source and target domain videos for cross-domain applications. Meanwhile, recent developments in vision-language multimodal models pre-trained on large-scale image-text and/or video-text pairs are only based on coarse associations (weakly labelled). They are inadequate to provide fine-grained moment-text correlations required for cross-domain VMR. In this work, we solve the problem of unseen cross-domain VMR, where certain visual and textual concepts do not overlap across domains, by only utilising target domain sentences (text prompts) without accessing their videos. To that end, we explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences, enabling us to simulate target domain videos. We address two problems in video editing for optimising unseen domain VMR: (1) generation of high-quality simulation videos of different moments with subtle distinctions, (2) selection of simulation videos that complement existing source training videos without introducing harmful noise or unnecessary repetitions. On the first problem, we formulate a two-stage video diffusion generation controlled simultaneously by (1) the original video structure of a source video, (2) subject specifics, and (3) a target sentence prompt. This ensures fine-grained variations between video moments. On the second problem, we introduce a hybrid selection mechanism that combines two quantitative metrics for noise filtering and one qualitative metric for leveraging VMR prediction on simulation video selection.
Common crowdsourcing systems average estimates of a latent quantity of interest provided by many crowdworkers to produce a group estimate. We develop a new approach -- just-predict-others -- that leverages self-supervised learning and a novel aggregation scheme. This approach adapts weights assigned to crowdworkers based on estimates they provided for previous quantities. When skills vary across crowdworkers or their estimates correlate, the weighted sum offers a more accurate group estimate than the average. Existing algorithms such as expectation maximization can, at least in principle, produce similarly accurate group estimates. However, their computational requirements become onerous when complex models, such as neural networks, are required to express relationships among crowdworkers. Just-predict-others accommodates such complexity as well as many other practical challenges. We analyze the efficacy of just-predict-others through theoretical and computational studies. Among other things, we establish asymptotic optimality as the number of engagements per crowdworker grows.
Learning with limited data is a key challenge for visual recognition. Few-shot learning methods address this challenge by learning an instance embedding function from seen classes and apply the function to instances from unseen classes with limited labels. This style of transfer learning is task-agnostic: the embedding function is not learned optimally discriminative with respect to the unseen classes, where discerning among them is the target task. In this paper, we propose a novel approach to adapt the embedding model to the target classification task, yielding embeddings that are task-specific and are discriminative. To this end, we employ a type of self-attention mechanism called Transformer to transform the embeddings from task-agnostic to task-specific by focusing on relating instances from the test instances to the training instances in both seen and unseen classes. Our approach also extends to both transductive and generalized few-shot classification, two important settings that have essential use cases. We verify the effectiveness of our model on two standard benchmark few-shot classification datasets --- MiniImageNet and CUB, where our approach demonstrates state-of-the-art empirical performance.
Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.