When deploying pre-trained video object detectors in real-world scenarios, the domain gap between training and testing data caused by adverse image conditions often leads to performance degradation. Addressing this issue becomes particularly challenging when only the pre-trained model and degraded videos are available. Although various source-free domain adaptation (SFDA) methods have been proposed for single-frame object detectors, SFDA for video object detection (VOD) remains unexplored. Moreover, most unsupervised domain adaptation works for object detection rely on two-stage detectors, while SFDA for one-stage detectors, which are more vulnerable to fine-tuning, is not well addressed in the literature. In this paper, we propose Spatial-Temporal Alternate Refinement with Mean Teacher (STAR-MT), a simple yet effective SFDA method for VOD. Specifically, we aim to improve the performance of the one-stage VOD method, YOLOV, under adverse image conditions, including noise, air turbulence, and haze. Extensive experiments on the ImageNetVOD dataset and its degraded versions demonstrate that our method consistently improves video object detection performance in challenging imaging conditions, showcasing its potential for real-world applications.
With the advancement of AIGC, video frame interpolation (VFI) has become a crucial component in existing video generation frameworks, attracting widespread research interest. For the VFI task, the motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods always struggle to accurately predict the motion information between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), which is specifically designed for the VFI task. By incorporating motion priors between the conditional neighboring frames with the target interpolated frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in generating both visually smooth and realistic results. Extensive experiments conducted on benchmark datasets demonstrate that our method achieves state-of-the-art performance significantly outperforming existing approaches, especially under challenging scenarios involving dynamic textures with complex motion.
It is challenging for the mobile robot to achieve autonomous and mapless navigation in the unknown environment with uneven terrain. In this study, we present a layered and systematic pipeline. At the local level, we maintain a tree structure that is dynamically extended with the navigation. This structure unifies the planning with the terrain identification. Besides, it contributes to explicitly identifying the hazardous areas on uneven terrain. In particular, certain nodes of the tree are consistently kept to form a sparse graph at the global level, which records the history of the exploration. A series of subgoals that can be obtained in the tree and the graph are utilized for leading the navigation. To determine a subgoal, we develop an evaluation method whose input elements can be efficiently obtained on the layered structure. We conduct both simulation and real-world experiments to evaluate the developed method and its key modules. The experimental results demonstrate the effectiveness and efficiency of our method. The robot can travel through the unknown uneven region safely and reach the target rapidly without a preconstructed map.
Time series data in real-world scenarios contain a substantial amount of nonlinear information, which significantly interferes with the training process of models, leading to decreased prediction performance. Therefore, during the time series forecasting process, extracting the local and global time series patterns and understanding the potential nonlinear features among different time observations are highly significant. To address this challenge, we introduce multi-resolution convolution and deformable convolution operations. By enlarging the receptive field using convolution kernels with different dilation factors to capture temporal correlation information at different resolutions, and adaptively adjusting the sampling positions through additional offset vectors, we enhance the network's ability to capture potential nonlinear features among time observations. Building upon this, we propose ACNet, an adaptive convolutional network designed to effectively model the local and global temporal dependencies and the nonlinear features between observations in multivariate time series. Specifically, by extracting and fusing time series features at different resolutions, we capture both local contextual information and global patterns in the time series. The designed nonlinear feature adaptive extraction module captures the nonlinear features among different time observations in the time series. We evaluated the performance of ACNet across twelve real-world datasets. The results indicate that ACNet consistently achieves state-of-the-art performance in both short-term and long-term forecasting tasks with favorable runtime efficiency.
Machine unlearning strives to uphold the data owners' right to be forgotten by enabling models to selectively forget specific data. Recent advances suggest precomputing and storing statistics extracted from second-order information and implementing unlearning through Newton-style updates. However, the theoretical analysis of these works often depends on restrictive assumptions of convexity and smoothness, and those mentioned operations on Hessian matrix are extremely costly. As a result, applying these works to high-dimensional models becomes challenging. In this paper, we propose an efficient Hessian-free certified unlearning. We propose to maintain a statistical vector for each data, computed through affine stochastic recursion approximation of the difference between retrained and learned models. Our analysis does not involve inverting Hessian and thus can be extended to non-convex non-smooth objectives. Under same assumptions, we demonstrate advancements of proposed method beyond the state-of-the-art theoretical studies, in terms of generalization, unlearning guarantee, deletion capacity, and computation/storage complexity, and we show that the unlearned model of our proposed approach is close to or same as the retrained model. Based on the strategy of recollecting statistics for forgetting data, we develop an algorithm that achieves near-instantaneous unlearning as it only requires a vector addition operation. Experiments demonstrate that the proposed scheme surpasses existing results by orders of magnitude in terms of time/storage costs, while also enhancing accuracy.
Although recent generative image compression methods have demonstrated impressive potential in optimizing the rate-distortion-perception trade-off, they still face the critical challenge of flexible rate adaption to diverse compression necessities and scenarios. To overcome this challenge, this paper proposes a Controllable Generative Image Compression framework, Control-GIC, the first capable of fine-grained bitrate adaption across a broad spectrum while ensuring high-fidelity and generality compression. We base Control-GIC on a VQGAN framework representing an image as a sequence of variable-length codes (i.e. VQ-indices), which can be losslessly compressed and exhibits a direct positive correlation with the bitrates. Therefore, drawing inspiration from the classical coding principle, we naturally correlate the information density of local image patches with their granular representations, to achieve dynamic adjustment of the code quantity following different granularity decisions. This implies we can flexibly determine a proper allocation of granularity for the patches to acquire desirable compression rates. We further develop a probabilistic conditional decoder that can trace back to historic encoded multi-granularity representations according to transmitted codes, and then reconstruct hierarchical granular features in the formalization of conditional probability, enabling more informative aggregation to improve reconstruction realism. Our experiments show that Control-GIC allows highly flexible and controllable bitrate adaption and even once compression on an entire dataset to fulfill constrained bitrate conditions. Experimental results demonstrate its superior performance over recent state-of-the-art methods.
Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.
In semi-supervised domain adaptation, a few labeled samples per class in the target domain guide features of the remaining target samples to aggregate around them. However, the trained model cannot produce a highly discriminative feature representation for the target domain because the training data is dominated by labeled samples from the source domain. This could lead to disconnection between the labeled and unlabeled target samples as well as misalignment between unlabeled target samples and the source domain. In this paper, we propose a novel approach called Cross-domain Adaptive Clustering to address this problem. To achieve both inter-domain and intra-domain adaptation, we first introduce an adversarial adaptive clustering loss to group features of unlabeled target data into clusters and perform cluster-wise feature alignment across the source and target domains. We further apply pseudo labeling to unlabeled samples in the target domain and retain pseudo-labels with high confidence. Pseudo labeling expands the number of ``labeled" samples in each class in the target domain, and thus produces a more robust and powerful cluster core for each class to facilitate adversarial learning. Extensive experiments on benchmark datasets, including DomainNet, Office-Home and Office, demonstrate that our proposed approach achieves the state-of-the-art performance in semi-supervised domain adaptation.
Neural network models usually suffer from the challenge of incorporating commonsense knowledge into the open-domain dialogue systems. In this paper, we propose a novel knowledge-aware dialogue generation model (called TransDG), which transfers question representation and knowledge matching abilities from knowledge base question answering (KBQA) task to facilitate the utterance understanding and factual knowledge selection for dialogue generation. In addition, we propose a response guiding attention and a multi-step decoding strategy to steer our model to focus on relevant features for response generation. Experiments on two benchmark datasets demonstrate that our model has robust superiority over compared methods in generating informative and fluent dialogues. Our code is available at //github.com/siat-nlp/TransDG.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.
In this paper, we propose the joint learning attention and recurrent neural network (RNN) models for multi-label classification. While approaches based on the use of either model exist (e.g., for the task of image captioning), training such existing network architectures typically require pre-defined label sequences. For multi-label classification, it would be desirable to have a robust inference process, so that the prediction error would not propagate and thus affect the performance. Our proposed model uniquely integrates attention and Long Short Term Memory (LSTM) models, which not only addresses the above problem but also allows one to identify visual objects of interests with varying sizes without the prior knowledge of particular label ordering. More importantly, label co-occurrence information can be jointly exploited by our LSTM model. Finally, by advancing the technique of beam search, prediction of multiple labels can be efficiently achieved by our proposed network model.