Differentiable Architecture Search (DARTS) is a simple yet efficient Neural Architecture Search (NAS) method. During the search stage, DARTS trains a supernet by jointly optimizing architecture parameters and network parameters. During the evaluation stage, DARTS discretizes the supernet to derive the optimal architecture based on architecture parameters. However, recent research has shown that during the training process, the supernet tends to converge towards sharp minima rather than flat minima. This is evidenced by the higher sharpness of the loss landscape of the supernet, which ultimately leads to a performance gap between the supernet and the optimal architecture. In this paper, we propose Self-Distillation Differentiable Neural Architecture Search (SD-DARTS) to alleviate the discretization gap. We utilize self-distillation to distill knowledge from previous steps of the supernet to guide its training in the current step, effectively reducing the sharpness of the supernet's loss and bridging the performance gap between the supernet and the optimal architecture. Furthermore, we introduce the concept of voting teachers, where multiple previous supernets are selected as teachers, and their output probabilities are aggregated through voting to obtain the final teacher prediction. Experimental results on real datasets demonstrate the advantages of our novel self-distillation-based NAS method compared to state-of-the-art alternatives.
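As a rough illustration of the mechanism described above, the following PyTorch sketch shows self-distillation with voting teachers: snapshots of the supernet from previous steps are kept, their temperature-softened predictions are averaged ("voted"), and the result guides the current step through a KL term. The class name, snapshot cadence, and hyperparameters (`num_teachers`, `temperature`, `alpha`) are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of self-distillation with voting teachers for a DARTS-style
# supernet (illustrative; not the authors' reference implementation).
import copy
from collections import deque

import torch
import torch.nn.functional as F


class VotingSelfDistiller:
    """Keeps snapshots of the supernet from previous steps and distills
    their averaged ("voted") soft predictions into the current supernet."""

    def __init__(self, num_teachers=3, temperature=4.0, alpha=0.5):
        self.teachers = deque(maxlen=num_teachers)  # previous supernet snapshots
        self.temperature = temperature              # softening temperature
        self.alpha = alpha                          # weight of the distillation term

    def update_teachers(self, supernet):
        teacher = copy.deepcopy(supernet).eval()
        for p in teacher.parameters():
            p.requires_grad_(False)
        self.teachers.append(teacher)

    def loss(self, supernet, x, y):
        logits = supernet(x)
        ce = F.cross_entropy(logits, y)
        if not self.teachers:
            return ce
        with torch.no_grad():
            # "Voting": average the temperature-softened teacher probabilities.
            probs = torch.stack(
                [F.softmax(t(x) / self.temperature, dim=-1) for t in self.teachers]
            ).mean(dim=0)
        kd = F.kl_div(
            F.log_softmax(logits / self.temperature, dim=-1),
            probs,
            reduction="batchmean",
        ) * self.temperature ** 2
        return (1 - self.alpha) * ce + self.alpha * kd
```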
Although face anti-spoofing (FAS) methods have achieved remarkable performance on specific domains or attack types, few studies have focused on the simultaneous presence of domain changes and unknown attacks, which is closer to real application scenarios. To handle domain-generalized unknown attacks, we introduce a new method, DGUA-FAS, which consists of a Transformer-based feature extractor and a synthetic unknown attack sample generator (SUASG). The SUASG network simulates unknown attack samples to assist the training of the feature extractor. Experimental results show that our method achieves superior performance on domain generalization FAS with known or unknown attacks.
In recent years, Video Object Segmentation (VOS) has emerged as a complementary method to Video Object Tracking (VOT). VOS focuses on classifying all the pixels around the target, allowing for precise shape labeling, while VOT primarily focuses on the approximate region where the target might be. However, traditional segmentation modules usually classify pixels frame by frame, disregarding information between adjacent frames. In this paper, we propose a new algorithm that addresses this limitation by analyzing the motion pattern using the inherent tensor structure. The tensor structure, obtained through Tucker2 tensor decomposition, proves to be effective in describing the target's motion. By incorporating this information, we achieve results competitive with the state of the art on four benchmarks: LaSOT\cite{fan2019lasot}, AVisT\cite{noman2022avist}, OTB100\cite{7001050}, and GOT-10k\cite{huang2019got}. Furthermore, the proposed tracker is capable of real-time operation, which adds to its practical value.
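To make the Tucker2 step concrete, here is a minimal NumPy sketch of a Tucker2 decomposition (truncated HOSVD over the two spatial modes, leaving the temporal mode intact) applied to a stack of target response maps. The tensor shape, the ranks, and the use of response maps are assumptions for illustration; the tracker's actual pipeline is not specified here.

```python
# Illustrative Tucker2 decomposition of a (height, width, frames) tensor of
# target response maps via truncated HOSVD; ranks and shapes are assumptions.
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def tucker2(tensor, rank1, rank2):
    """Decompose T ~= core x_1 U1 x_2 U2, leaving the temporal mode intact."""
    U1 = np.linalg.svd(unfold(tensor, 0), full_matrices=False)[0][:, :rank1]
    U2 = np.linalg.svd(unfold(tensor, 1), full_matrices=False)[0][:, :rank2]
    # Project the spatial modes onto their leading subspaces to get the core.
    core = np.einsum("ijk,ir,js->rsk", tensor, U1, U2)
    return core, U1, U2


# Example: 64x64 response maps over 10 consecutive frames.
maps = np.random.rand(64, 64, 10)
core, U1, U2 = tucker2(maps, rank1=8, rank2=8)
print(core.shape)  # (8, 8, 10): a compact descriptor of the motion pattern
```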
Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks.
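The recasting of meta-learning as sequence modeling can be sketched as follows: features from a frozen extractor are combined with label embeddings to form a sequence of labeled support tokens plus one query token marked "unknown", and a transformer predicts the query's label. The transformer head, label-embedding scheme, and dimensions below are illustrative assumptions; the abstract does not specify the exact architecture.

```python
# Sketch: few-shot classification as sequence modeling over labeled support
# features plus one unlabeled query feature (architecture details assumed).
import torch
import torch.nn as nn


class InContextClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=5, depth=4, heads=8):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes + 1, feat_dim)  # +1 = "unknown"
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, support_feats, support_labels, query_feat):
        # support_feats: (N, D), support_labels: (N,), query_feat: (D,)
        unknown = torch.full((1,), self.label_embed.num_embeddings - 1,
                             dtype=torch.long, device=query_feat.device)
        tokens = torch.cat([support_feats + self.label_embed(support_labels),
                            query_feat.unsqueeze(0) + self.label_embed(unknown)])
        out = self.encoder(tokens.unsqueeze(0))      # (1, N+1, D)
        return self.head(out[0, -1])                 # prediction for the query


# Usage with features from a frozen pre-trained extractor (e.g. a ViT):
model = InContextClassifier()
support = torch.randn(25, 512)                  # 5-way 5-shot support features
labels = torch.randint(0, 5, (25,))
query = torch.randn(512)
logits = model(support, labels, query)          # (5,) class scores
```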
We investigate auction mechanisms to support the emerging format of AI-generated content. In particular, we study how to aggregate several LLMs in an incentive-compatible manner. In this problem, the preferences of each agent over stochastically generated content are described/encoded as an LLM. A key motivation is to design an auction format for AI-generated ad creatives that combines inputs from different advertisers. We argue that this problem, while generally falling under the umbrella of mechanism design, has several unique features. We propose a general formalism -- the token auction model -- for studying this problem. A key feature of this model is that it acts on a token-by-token basis and lets LLM agents influence generated content through single-dimensional bids. We first explore a robust auction design approach, in which all we assume is that agent preferences entail partial orders over outcome distributions. We formulate two natural incentive properties and show that these are equivalent to a monotonicity condition on distribution aggregation. We also show that for such aggregation functions, it is possible to design a second-price auction, despite the absence of bidder valuation functions. We then move to designing concrete aggregation functions by focusing on specific valuation forms based on KL-divergence, a loss function commonly used in LLM training. The welfare-maximizing aggregation rules turn out to be weighted (log-space) convex combinations of the target distributions of all participants. We conclude with experimental results in support of the token auction formulation.
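A weighted log-space convex combination of token distributions can be sketched in a few lines: take the weighted average of the log-probabilities from each agent's LLM and renormalize. Deriving the weights from bids, as done below, is an illustrative assumption rather than the paper's exact pricing rule.

```python
# Sketch: aggregate per-agent next-token distributions by a weighted
# convex combination in log space (bid-derived weights are an assumption).
import numpy as np


def aggregate_token_distributions(dists, weights, eps=1e-12):
    """dists: (num_agents, vocab) rows summing to 1; weights sum to 1.
    Returns the normalized exponential of the weighted log-probabilities."""
    log_probs = np.log(np.asarray(dists) + eps)
    mixed = np.exp(np.average(log_probs, axis=0, weights=weights))
    return mixed / mixed.sum()


# Two LLM agents with different preferences over a 4-token vocabulary.
p1 = np.array([0.70, 0.10, 0.10, 0.10])
p2 = np.array([0.10, 0.70, 0.10, 0.10])
bids = np.array([3.0, 1.0])
weights = bids / bids.sum()
print(aggregate_token_distributions([p1, p2], weights))
```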
Due to their powerful edge-preserving ability and low computational complexity, the guided image filter (GIF) and its improved versions have been widely applied in computer vision and image processing. However, all of them suffer from halo artifacts to some degree as the regularization parameter increases. When the structures of the guidance image and the input image are inconsistent, the edge-preserving ability also degrades. In this paper, a novel guided image filter is proposed by integrating an explicit first-order edge-protection constraint and an explicit residual constraint, which improves the edge-preserving ability in both cases. To illustrate the efficiency of the proposed filter, its performance is demonstrated on several typical applications: single-image detail enhancement, multi-scale exposure fusion, and hyperspectral image classification. Both theoretical analysis and experimental results demonstrate the powerful edge-preserving ability of the proposed filter.
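Since the abstract does not give the closed form of the proposed constraints, the sketch below shows only the baseline guided image filter (He et al.) that such methods extend; it also makes visible where the regularization parameter enters. The box-filter radius and epsilon are illustrative.

```python
# Baseline guided image filter (He et al.) as a reference point for the
# proposed edge-protection/residual constraints; parameters are illustrative.
import numpy as np
from scipy.ndimage import uniform_filter


def guided_filter(I, p, radius=8, eps=1e-2):
    """I: guidance image, p: input image, both float arrays in [0, 1]."""
    size = 2 * radius + 1
    mean = lambda x: uniform_filter(x, size=size)

    mean_I, mean_p = mean(I), mean(p)
    corr_Ip, corr_II = mean(I * p), mean(I * I)
    var_I = corr_II - mean_I * mean_I
    cov_Ip = corr_Ip - mean_I * mean_p

    a = cov_Ip / (var_I + eps)        # larger eps -> stronger smoothing, more halos
    b = mean_p - a * mean_I
    return mean(a) * I + mean(b)      # q = mean_a * I + mean_b


# Self-guided smoothing of a noisy image:
img = np.clip(np.random.rand(128, 128), 0, 1)
smoothed = guided_filter(img, img, radius=8, eps=1e-2)
```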
This paper aims to explore the potential of combining Deep Reinforcement Learning (DRL) with Knowledge Distillation (KD) by distilling various DRL algorithms and studying their distillation effects. By doing so, the computational burden of deep models could be reduced while maintaining the performance. The primary objective is to provide a benchmark for evaluating the performance of different DRL algorithms that have been refined using KD techniques. By distilling these algorithms, the goal is to develop efficient and fast DRL models. This research is expected to provide valuable insights that can facilitate further advancements in this promising direction. By exploring the combination of DRL and KD, this work aims to promote the development of models that require fewer GPU resources, learn more quickly, and make faster decisions in complex environments. The results of this research have the capacity to significantly advance the field of DRL and pave the way for the future deployment of resource-efficient, decision-making intelligent systems.
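The kind of distillation such a benchmark would evaluate can be sketched with standard policy distillation: a small student network is trained to match the softened action distribution of a larger, frozen teacher. The temperature, network sizes, and environment dimensions below are generic assumptions, not tied to any specific algorithm from the paper.

```python
# Sketch: distill a trained DRL teacher policy into a smaller student by
# matching softened action distributions (standard policy distillation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over action distributions, temperature-softened."""
    t = F.softmax(teacher_logits / temperature, dim=-1)
    log_s = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_s, t, reduction="batchmean") * temperature ** 2


# Small student vs. larger (frozen) teacher over a 6-action environment.
teacher = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 6)).eval()
student = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 6))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

states = torch.randn(64, 8)                 # a batch of observed states
with torch.no_grad():
    teacher_logits = teacher(states)
optimizer.zero_grad()
loss = distillation_loss(student(states), teacher_logits)
loss.backward()
optimizer.step()
```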
Diffusion models (DMs) have demonstrated advantageous potential on generative tasks. Widespread interest exists in incorporating DMs into downstream applications, such as producing or editing photorealistic images. However, practical deployment and unprecedented power of DMs raise legal issues, including copyright protection and monitoring of generated content. In this regard, watermarking has been a proven solution for copyright protection and content monitoring, but it is underexplored in the DMs literature. Specifically, DMs generate samples from longer tracks and may have newly designed multimodal structures, necessitating the modification of conventional watermarking pipelines. To this end, we conduct comprehensive analyses and derive a recipe for efficiently watermarking state-of-the-art DMs (e.g., Stable Diffusion), via training from scratch or finetuning. Our recipe is straightforward but involves empirically ablated implementation details, providing a foundation for future research on watermarking DMs. The code is available at //github.com/yunqing-me/WatermarkDM.
Large Language Models (LLMs) possess the capability to engage in In-context Learning (ICL) by leveraging a few demonstrations pertaining to a new downstream task as conditions. However, this particular learning paradigm suffers from high instability stemming from substantial variance induced by factors such as the input distribution of selected examples, their ordering, and prompt formats. In this work, we demonstrate that even when all these factors are held constant, the random selection of examples still results in high variance. Consequently, we aim to explore the informative ability of data examples by quantifying the Information Gain (IG) obtained in prediction after observing a given example candidate. We then propose to sample those with maximum IG. Additionally, we identify the presence of template bias, which can lead to unfair evaluations of IG during the sampling process. To mitigate this bias, we introduce a Calibration Before Sampling strategy. The experimental results illustrate that our proposed method yields an average relative improvement of 14.3% across six classification tasks using three LLMs.
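One way to read the IG-plus-calibration idea is sketched below: a candidate demonstration is scored by how much it reduces the entropy of the model's label distribution, after a contextual-calibration-style correction with a content-free ("N/A") input removes the template bias. The specific scoring and calibration formulas here are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: rank candidate demonstrations by the entropy drop (information gain)
# they induce in the model's label distribution, after calibrating out the
# template bias estimated from a content-free "N/A" input (details assumed).
import numpy as np


def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))


def calibrate(probs, content_free_probs):
    """Divide by the template bias estimated from a content-free input,
    then renormalize (contextual-calibration style correction)."""
    corrected = probs / (content_free_probs + 1e-12)
    return corrected / corrected.sum()


def information_gain(label_probs_without, label_probs_with, content_free_probs):
    before = calibrate(label_probs_without, content_free_probs)
    after = calibrate(label_probs_with, content_free_probs)
    return entropy(before) - entropy(after)


# label_probs_* would come from the LLM's label-token probabilities given the
# template, with and without the candidate example; the values are dummies.
p_without = np.array([0.40, 0.35, 0.25])
p_with = np.array([0.70, 0.20, 0.10])
p_na = np.array([0.50, 0.30, 0.20])   # label probabilities for the "N/A" input
print(information_gain(p_without, p_with, p_na))
```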
Label Propagation (LPA) and Graph Convolutional Neural Networks (GCN) are both message-passing algorithms on graphs. Both solve the task of node classification, but LPA propagates node label information across the edges of the graph, while GCN propagates and transforms node feature information. However, while conceptually similar, the theoretical relation between LPA and GCN has not yet been investigated. Here we study the relationship between LPA and GCN in terms of two aspects: (1) feature/label smoothing, where we analyze how the feature/label of one node is spread over its neighbors; and (2) feature/label influence, i.e., how much the initial feature/label of one node influences the final feature/label of another node. Based on our theoretical analysis, we propose an end-to-end model that unifies GCN and LPA for node classification. In our unified model, edge weights are learnable, and LPA serves as regularization to assist the GCN in learning proper edge weights that lead to improved classification performance. Our model can also be seen as learning attention weights based on node labels, which is more task-oriented than existing feature-based attention models. In a number of experiments on real-world graphs, our model shows superiority over state-of-the-art GCN-based methods in terms of node classification accuracy.
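A minimal sketch of the unified idea is given below: a GCN layer whose adjacency carries learnable edge weights, with label propagation run over the same weighted edges so that an LPA term can regularize those weights. The dense adjacency, single-layer GCN, and loss weighting are simplifying assumptions for illustration, not the paper's reference implementation.

```python
# Sketch: one GCN layer with learnable edge weights, regularized by label
# propagation over the same weighted adjacency (shapes/weights are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedGCNLPA(nn.Module):
    def __init__(self, adj_mask, in_dim, num_classes, lpa_iters=3):
        super().__init__()
        self.adj_mask = adj_mask                                  # (N, N) 0/1 float adjacency
        self.edge_logits = nn.Parameter(torch.zeros(adj_mask.shape))  # learnable edge weights
        self.lin = nn.Linear(in_dim, num_classes)
        self.lpa_iters = lpa_iters

    def _norm_adj(self):
        w = torch.sigmoid(self.edge_logits) * self.adj_mask
        return w / w.sum(dim=1, keepdim=True).clamp_min(1e-12)   # row-normalize

    def forward(self, x, y_onehot):
        A = self._norm_adj()
        logits = A @ self.lin(x)                 # one graph-convolution step
        labels = y_onehot
        for _ in range(self.lpa_iters):          # label propagation on the same edges
            labels = A @ labels
        return logits, labels


# Joint objective in the spirit of the unified model: cross-entropy on the GCN
# output plus an LPA term that rewards edge weights under which labels propagate
# correctly, e.g.
#   loss = F.cross_entropy(logits[train_idx], y[train_idx]) \
#          + lam * F.nll_loss(labels[train_idx].clamp_min(1e-12).log(), y[train_idx])
```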
We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector by a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.
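For intuition, a global activation module can be sketched as channel-wise recalibration of a detection feature map: global pooling produces per-channel weights that reweight the features. The squeeze-and-excitation-style design, reduction ratio, and feature-map shape below are assumptions; the exact DES module is not specified in the abstract.

```python
# Sketch of a global activation module as channel-wise recalibration of a
# detection feature map (the exact DES design is not given in the abstract).
import torch
import torch.nn as nn


class GlobalActivation(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, feat):                      # feat: (B, C, H, W)
        weights = self.fc(feat.mean(dim=(2, 3)))  # global pooling -> channel weights
        return feat * weights[:, :, None, None]   # reweight detection features


feat = torch.randn(2, 512, 38, 38)                # e.g. a VGG16 conv4_3 feature map
enriched = GlobalActivation(512)(feat)
print(enriched.shape)                             # torch.Size([2, 512, 38, 38])
```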