The inherent challenge of image fusion lies in capturing the correlation of multi-source images and comprehensively integrating effective information from different sources. Most existing techniques fail to perform dynamic image fusion while notably lacking theoretical guarantees, leading to potential deployment risks in this field. Is it possible to conduct dynamic image fusion with a clear theoretical justification? In this paper, we give our solution from a generalization perspective. We proceed to reveal the generalized form of image fusion and derive a new test-time dynamic image fusion paradigm. It provably reduces the upper bound of generalization error. Specifically, we decompose the fused image into multiple components corresponding to its source data. The decomposed components represent the effective information from the source data, thus the gap between them reflects the Relative Dominability (RD) of the uni-source data in constructing the fusion image. Theoretically, we prove that the key to reducing generalization error hinges on the negative correlation between the RD-based fusion weight and the uni-source reconstruction loss. Intuitively, RD dynamically highlights the dominant regions of each source and can be naturally converted to the corresponding fusion weight, achieving robust results. Extensive experiments and discussions with in-depth analysis on multiple benchmarks confirm our findings and superiority. Our code is available at //github.com/Yinan-Xia/TTD.
Machine unlearning without access to real data distribution is challenging. The existing method based on data-free distillation achieved unlearning by filtering out synthetic samples containing forgetting information but struggled to distill the retaining-related knowledge efficiently. In this work, we analyze that such a problem is due to over-filtering, which reduces the synthesized retaining-related information. We propose a novel method, Inhibited Synthetic PostFilter (ISPF), to tackle this challenge from two perspectives: First, the Inhibited Synthetic, by reducing the synthesized forgetting information; Second, the PostFilter, by fully utilizing the retaining-related information in synthesized samples. Experimental results demonstrate that the proposed ISPF effectively tackles the challenge and outperforms existing methods.
With the rising imaging resolution of handheld devices, existing multi-exposure image fusion algorithms struggle to generate a high dynamic range image with ultra-high resolution in real-time. Apart from that, there is a trend to design a manageable and editable algorithm as the different needs of real application scenarios. To tackle these issues, we introduce 3D LUT technology, which can enhance images with ultra-high-definition (UHD) resolution in real time on resource-constrained devices. However, since the fusion of information from multiple images with different exposure rates is uncertain, and this uncertainty significantly trials the generalization power of the 3D LUT grid. To address this issue and ensure a robust learning space for the model, we propose using a teacher-student network to model the uncertainty on the 3D LUT grid.Furthermore, we provide an editable mode for the multi-exposure image fusion algorithm by using the implicit representation function to match the requirements in different scenarios. Extensive experiments demonstrate that our proposed method is highly competitive in efficiency and accuracy.
Novel data sources bring new opportunities to improve the quality of recommender systems and serve as a catalyst for the creation of new paradigms on personalized recommendations. Impressions are a novel data source containing the items shown to users on their screens. Past research focused on providing personalized recommendations using interactions, and occasionally using impressions when such a data source was available. Interest in impressions has increased due to their potential to provide more accurate recommendations. Despite this increased interest, research in recommender systems using impressions is still dispersed. Many works have distinct interpretations of impressions and use impressions in recommender systems in numerous different manners. To unify those interpretations into a single framework, we present a systematic literature review on recommender systems using impressions, focusing on three fundamental perspectives: recommendation models, datasets, and evaluation methodologies. We define a theoretical framework to delimit recommender systems using impressions and a novel paradigm for personalized recommendations, called impression-aware recommender systems. We propose a classification system for recommenders in this paradigm, which we use to categorize the recommendation models, datasets, and evaluation methodologies used in past research. Lastly, we identify open questions and future directions, highlighting missing aspects in the reviewed literature.
Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale generative models like diffusion models, challenges with controllability and identity consistency persist, making current solutions unsuitable for industrial application.To address this, we propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel robust and generalizable Retrieval Augmented Colorization pipeline for colorizing images with relevant color references. Our pipeline also features a dual-branch design: one branch for color identity extraction and the other for colorization, leveraging the strengths of diffusion models. We utilize the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our codes and models on our project page: //zhuang2002.github.io/ColorFlow/.
High-efficient image compression is a critical requirement. In several scenarios where multiple modalities of data are captured by different sensors, the auxiliary information from other modalities are not fully leveraged by existing image-only codecs, leading to suboptimal compression efficiency. In this paper, we increase image compression performance with the assistance of point cloud, which is widely adopted in the area of autonomous driving. We first unify the data representation for both modalities to facilitate data processing. Then, we propose the point cloud-assisted neural image codec (PCA-NIC) to enhance the preservation of image texture and structure by utilizing the high-dimensional point cloud information. We further introduce a multi-modal feature fusion transform module (MMFFT) to capture more representative image features, remove redundant information between channels and modalities that are not relevant to the image content. Our work is the first to improve image compression performance using point cloud and achieves state-of-the-art performance.
Photometric stereo (PS) endeavors to ascertain surface normals using shading clues from photometric images under various illuminations. Recent deep learning-based PS methods often overlook the complexity of object surfaces. These neural network models, which exclusively rely on photometric images for training, often produce blurred results in high-frequency regions characterized by local discontinuities, such as wrinkles and edges with significant gradient changes. To address this, we propose the Image Gradient-Aided Photometric Stereo Network (IGA-PSN), a dual-branch framework extracting features from both photometric images and their gradients. Furthermore, we incorporate an hourglass regression network along with supervision to regularize normal regression. Experiments on DiLiGenT benchmarks show that IGA-PSN outperforms previous methods in surface normal estimation, achieving a mean angular error of 6.46 while preserving textures and geometric shapes in complex regions.
The rapid development of image generation models has facilitated the widespread dissemination of generated images on social networks, creating favorable conditions for provably secure image steganography. However, existing methods face issues such as low quality of generated images and lack of semantic control in the generation process. To leverage provably secure steganography with more effective and high-performance image generation models, and to ensure that stego images can accurately extract secret messages even after being uploaded to social networks and subjected to lossy processing such as JPEG compression, we propose a high-quality, provably secure, and robust image steganography method based on state-of-the-art autoregressive (AR) image generation models using Vector-Quantized (VQ) tokenizers. Additionally, we employ a cross-modal error-correction framework that generates stego text from stego images to aid in restoring lossy images, ultimately enabling the extraction of secret messages embedded within the images. Extensive experiments have demonstrated that the proposed method provides advantages in stego quality, embedding capacity, and robustness, while ensuring provable undetectability.
This paper presents a novel approach to enhance communication efficiency in federated learning through clipped uniform quantization. By leveraging optimal clipping thresholds and client-specific adaptive quantization schemes, the proposed method significantly reduces bandwidth and memory requirements for model weight transmission between clients and the server while maintaining competitive accuracy. We investigate the effects of symmetric clipping and uniform quantization on model performance, emphasizing the role of stochastic quantization in mitigating artifacts and improving robustness. Extensive simulations demonstrate that the method achieves near-full-precision performance with substantial communication savings. Moreover, the proposed approach facilitates efficient weight averaging based on the inverse of the mean squared quantization errors, effectively balancing the trade-off between communication efficiency and model accuracy. Moreover, in contrast to federated averaging, this design obviates the need to disclose client-specific data volumes to the server, thereby enhancing client privacy. Comparative analysis with conventional quantization methods further confirms the efficacy of the proposed scheme.
Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.
The low resolution of objects of interest in aerial images makes pedestrian detection and action detection extremely challenging tasks. Furthermore, using deep convolutional neural networks to process large images can be demanding in terms of computational requirements. In order to alleviate these challenges, we propose a two-step, yes and no question answering framework to find specific individuals doing one or multiple specific actions in aerial images. First, a deep object detector, Single Shot Multibox Detector (SSD), is used to generate object proposals from small aerial images. Second, another deep network, is used to learn a latent common sub-space which associates the high resolution aerial imagery and the pedestrian action labels that are provided by the human-based sources