In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the image patches that compose an image or video are organised into a fully connected graph, where the edge between each pair of patches is labeled with a similarity score between patches using features learned by the transformer. Detection and segmentation of salient objects is then formulated as a graph-cut problem and solved using the classical Normalized Cut algorithm. Despite the simplicity of this approach, it achieves state-of-the-art results on several common image and video detection and segmentation tasks. For unsupervised object discovery, this approach outperforms the competing approaches by a margin of 6.1%, 5.7%, and 2.6%, respectively, when tested with the VOC07, VOC12, and COCO20K datasets. For the unsupervised saliency detection task in images, this method improves the score for Intersection over Union (IoU) by 4.4%, 5.6% and 5.2%. When tested with the ECSSD, DUTS, and DUT-OMRON datasets, respectively, compared to current state-of-the-art techniques. This method also achieves competitive results for unsupervised video object segmentation tasks with the DAVIS, SegTV2, and FBMS datasets.
There is a growing consensus in the research community that the optimization of low-light image enhancement approaches should be guided by the visual quality perceived by end users. Despite the substantial efforts invested in the design of low-light enhancement algorithms, there has been comparatively limited focus on assessing subjective and objective quality systematically. To mitigate this gap and provide a clear path towards optimizing low-light image enhancement for better visual quality, we propose a gap-closing framework. In particular, our gap-closing framework starts with the creation of a large-scale dataset for Subjective QUality Assessment of REconstructed LOw-Light Images (SQUARE-LOL). This database serves as the foundation for studying the quality of enhanced images and conducting a comprehensive subjective user study. Subsequently, we propose an objective quality assessment measure that plays a critical role in bridging the gap between visual quality and enhancement. Finally, we demonstrate that our proposed objective quality measure can be incorporated into the process of optimizing the learning of the enhancement model toward perceptual optimality. We validate the effectiveness of our proposed framework through both the accuracy of quality prediction and the perceptual quality of image enhancement. Our database and code will be made publicly available to facilitate further research in this area.
In this work, we introduce a novel approach to programming education - in-IDE courses implemented for IntelliJ-based IDEs via the JetBrains Academy Plugin. The primary objective of this approach is to address the challenge of familiarizing students with industrial technologies by moving all theory and practical materials to a professional IDE. This approach allows students to immediately use modern industrial tools as they are fully integrated into the learning process. We have already applied this approach in over 40 courses, and it successfully educates students across diverse topics such as Plugin Development, Algorithms, Data Analysis, and Language mastery in various programming languages, including Kotlin, Java, C++, and Python. Along with the paper, we are providing the community not only with a new way of learning and a set of ready-made courses but also a collection of helpful resources to assist educators in getting started with the plugin. Finally, we describe in detail an IDE plugin development course that demonstrates how the in-IDE approach covers complex topics easily.
The ability to fine-tune generative models for text-to-image generation tasks is crucial, particularly facing the complexity involved in accurately interpreting and visualizing textual inputs. While LoRA is efficient for language model adaptation, it often falls short in text-to-image tasks due to the intricate demands of image generation, such as accommodating a broad spectrum of styles and nuances. To bridge this gap, we introduce StyleInject, a specialized fine-tuning approach tailored for text-to-image models. StyleInject comprises multiple parallel low-rank parameter matrices, maintaining the diversity of visual features. It dynamically adapts to varying styles by adjusting the variance of visual features based on the characteristics of the input signal. This approach significantly minimizes the impact on the original model's text-image alignment capabilities while adeptly adapting to various styles in transfer learning. StyleInject proves particularly effective in learning from and enhancing a range of advanced, community-fine-tuned generative models. Our comprehensive experiments, including both small-sample and large-scale data fine-tuning as well as base model distillation, show that StyleInject surpasses traditional LoRA in both text-image semantic consistency and human preference evaluation, all while ensuring greater parameter efficiency.
In this paper, we propose an online-matching-based model to tackle the two fundamental issues, matching and pricing, existing in a wide range of real-world gig platforms, including ride-hailing (matching riders and drivers), crowdsourcing markets (pairing workers and tasks), and online recommendations (offering items to customers). Our model assumes the arriving distributions of dynamic agents (e.g., riders, workers, and buyers) are accessible in advance, and they can change over time, which is referred to as \emph{Known Heterogeneous Distributions} (KHD). In this paper, we initiate variance analysis for online matching algorithms under KHD. Unlike the popular competitive-ratio (CR) metric, the variance of online algorithms' performance is rarely studied due to inherent technical challenges, though it is well linked to robustness. We focus on two natural parameterized sampling policies, denoted by $\mathsf{ATT}(\gamma)$ and $\mathsf{SAMP}(\gamma)$, which appear as foundational bedrock in online algorithm design. We offer rigorous competitive ratio (CR) and variance analyses for both policies. Specifically, we show that $\mathsf{ATT}(\gamma)$ with $\gamma \in [0,1/2]$ achieves a CR of $\gamma$ and a variance of $\gamma \cdot (1-\gamma) \cdot B$ on the total number of matches with $B$ being the total matching capacity. In contrast, $\mathsf{SAMP}(\gamma)$ with $\gamma \in [0,1]$ accomplishes a CR of $\gamma (1-\gamma)$ and a variance of $\bar{\gamma} (1-\bar{\gamma})\cdot B$ with $\bar{\gamma}=\min(\gamma,1/2)$. All CR and variance analyses are tight and unconditional of any benchmark. As a byproduct, we prove that $\mathsf{ATT}(\gamma=1/2)$ achieves an optimal CR of $1/2$.
In this paper, we present a novel deep image clustering approach termed PICI, which enforces the partial information discrimination and the cross-level interaction in a joint learning framework. In particular, we leverage a Transformer encoder as the backbone, through which the masked image modeling with two paralleled augmented views is formulated. After deriving the class tokens from the masked images by the Transformer encoder, three partial information learning modules are further incorporated, including the PISD module for training the auto-encoder via masked image reconstruction, the PICD module for employing two levels of contrastive learning, and the CLI module for mutual interaction between the instance-level and cluster-level subspaces. Extensive experiments have been conducted on six real-world image datasets, which demononstrate the superior clustering performance of the proposed PICI approach over the state-of-the-art deep clustering approaches. The source code is available at //github.com/Regan-Zhang/PICI.
Prompt design and engineering has become an important discipline in just the past few months. In this paper, we provide an introduction to the main concepts as well as review basic and more advanced approaches to prompt design and engineering.
Structural extraction of events within discourse is critical since it avails a deeper understanding of communication patterns and behavior trends. Event argument extraction (EAE), at the core of event-centric understanding, is the task of identifying role-specific text spans (i.e., arguments) for a given event. Document-level EAE (DocEAE) focuses on arguments that are scattered across an entire document. In this work, we explore the capabilities of open source Large Language Models (LLMs), i.e., Flan-UL2, for the DocEAE task. To this end, we propose ULTRA, a hierarchical framework that extracts event arguments more cost-effectively -- the method needs as few as 50 annotations and doesn't require hitting costly API endpoints. Further, it alleviates the positional bias issue intrinsic to LLMs. ULTRA first sequentially reads text chunks of a document to generate a candidate argument set, upon which ULTRA learns to drop non-pertinent candidates through self-refinement. We further introduce LEAFER to address the challenge LLMs face in locating the exact boundary of an argument span. ULTRA outperforms strong baselines, which include strong supervised models and ChatGPT, by 9.8% when evaluated by the exact match (EM) metric.
Hierarchical structures are popular in recent vision transformers, however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires minor code changes upon the original vision transformer. The benefits of the proposed judiciously-selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8$\times$ faster than previous transformer-based generators; and (3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available //github.com/google-research/nested-transformer.
Knowledge graphs are important resources for many artificial intelligence tasks but often suffer from incompleteness. In this work, we propose to use pre-trained language models for knowledge graph completion. We treat triples in knowledge graphs as textual sequences and propose a novel framework named Knowledge Graph Bidirectional Encoder Representations from Transformer (KG-BERT) to model these triples. Our method takes entity and relation descriptions of a triple as input and computes scoring function of the triple with the KG-BERT language model. Experimental results on multiple benchmark knowledge graphs show that our method can achieve state-of-the-art performance in triple classification, link prediction and relation prediction tasks.
Many tasks in natural language processing can be viewed as multi-label classification problems. However, most of the existing models are trained with the standard cross-entropy loss function and use a fixed prediction policy (e.g., a threshold of 0.5) for all the labels, which completely ignores the complexity and dependencies among different labels. In this paper, we propose a meta-learning method to capture these complex label dependencies. More specifically, our method utilizes a meta-learner to jointly learn the training policies and prediction policies for different labels. The training policies are then used to train the classifier with the cross-entropy loss function, and the prediction policies are further implemented for prediction. Experimental results on fine-grained entity typing and text classification demonstrate that our proposed method can obtain more accurate multi-label classification results.