Multi-modal image fusion (MMIF) integrates valuable information from images of different modalities into a single fused image. However, fusing multiple visible images with different focal regions together with infrared images is an unprecedented challenge in real MMIF applications. This is because the limited depth of focus of visible optical lenses impedes the simultaneous capture of all focal information within the same scene. To address this issue, we propose an MMIF framework for joint focused integration and modality information extraction. Specifically, a semi-sparsity-based smoothing filter is introduced to decompose the images into structure and texture components. Subsequently, a novel multi-scale operator is proposed to fuse the texture components; it detects significant information by considering the pixel focus attributes and relevant data from the various modal images. Additionally, to capture scene luminance effectively and maintain reasonable contrast, we consider the distribution of energy information in the structural components in terms of multi-directional frequency variance and information entropy. Extensive experiments on existing MMIF datasets, as well as on object detection and depth estimation tasks, consistently demonstrate that the proposed algorithm surpasses state-of-the-art methods in visual perception and quantitative evaluation. The code is available at //github.com/ixilai/MFIF-MMIF.
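The structure/texture decomposition described above can be illustrated with a minimal sketch. The semi-sparsity filter itself is not reproduced here: a plain box (mean) filter stands in for it purely as an assumption, and the helper names `box_smooth`, `decompose`, and `entropy` are hypothetical. The sketch only shows that the texture component is the residual left after smoothing, and that Shannon entropy is one way to quantify the information carried by a layer.

```python
import math
from collections import Counter


def box_smooth(img, radius=1):
    """Mean filter with edge clamping. Assumption: a simple stand-in for
    the semi-sparsity smoothing filter, which is not implemented here."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def decompose(img, radius=1):
    """Split an image into a smooth structure layer and a residual texture layer."""
    structure = box_smooth(img, radius)
    texture = [
        [img[y][x] - structure[y][x] for x in range(len(img[0]))]
        for y in range(len(img))
    ]
    return structure, texture


def entropy(img, levels=256):
    """Shannon entropy (in bits) of the intensity histogram, after
    rounding values to integer levels in [0, levels - 1]."""
    counts = Counter(
        min(max(int(round(v)), 0), levels - 1) for row in img for v in row
    )
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

By construction, adding the two layers returned by `decompose` recovers the input image, so no information is lost by the split regardless of which smoothing filter is used.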
Recently, diffusion models have achieved great success in image synthesis. However, in layout-to-image generation, where an image often contains a complex scene of multiple objects, exerting strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that obtains higher generation quality and greater controllability than previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout that is fused with the normal layout in a unified form. Moreover, a Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects; they are designed to be object-aware and position-sensitive, allowing precise control of spatially related information. Extensive experiments show that LayoutDiffusion outperforms the previous SOTA methods on FID and CAS by a relative 46.35% and 26.70% on COCO-stuff and by 44.29% and 41.82% on VG. Code is available at //github.com/ZGCTroy/LayoutDiffusion.
Self-supervised Learning (SSL) has been widely applied to learn image representations by exploiting unlabeled images. However, it has not been fully explored in the medical image analysis field. In this work, the Saliency-guided Self-Supervised image Transformer (SSiT) is proposed for Diabetic Retinopathy (DR) grading from fundus images. As a novel contribution, we introduce saliency maps into SSL, with the goal of guiding self-supervised pre-training with domain-specific prior knowledge. Specifically, two saliency-guided learning tasks are employed in SSiT: (1) Saliency-guided contrastive learning is conducted based on momentum contrast, wherein the saliency maps of fundus images are used to remove trivial patches from the input sequences of the momentum-updated key encoder. Thus, the key encoder is constrained to provide target representations focusing on salient regions, guiding the query encoder to capture salient features. (2) The query encoder is trained to predict the saliency segmentation, encouraging the preservation of fine-grained information in the learned representations. To assess the proposed method, four publicly accessible fundus image datasets are adopted. One dataset is employed for pre-training, while the other three are used to evaluate the pre-trained models' performance on downstream DR grading. The proposed SSiT significantly outperforms other representative state-of-the-art SSL methods on all downstream datasets and under various evaluation settings. For example, SSiT achieves a Kappa score of 81.88% on the DDR dataset under fine-tuning evaluation, outperforming all other ViT-based SSL methods by at least 9.48%.
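The two mechanisms named in task (1) can be sketched in a few lines. The momentum update below is the standard exponential-moving-average rule used in momentum contrast. The patch filter is only an assumption about how saliency might be used to drop trivial patches: the threshold criterion and both function names are hypothetical, not taken from the SSiT implementation.

```python
def momentum_update(query_params, key_params, m=0.99):
    """Momentum-contrast update of the key encoder:
    key <- m * key + (1 - m) * query, applied parameter by parameter."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def keep_salient_patches(patches, saliency, threshold=0.5):
    """Keep only patches whose saliency score reaches the threshold.
    Assumption: a simple thresholding rule chosen for illustration."""
    return [p for p, s in zip(patches, saliency) if s >= threshold]
```

With `m` close to 1 the key encoder changes slowly, which keeps its target representations stable while the query encoder is trained by gradient descent.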
UNet and its variants have been widely used in medical image segmentation. However, these models, especially those based on Transformer architectures, pose challenges due to their large number of parameters and heavy computational loads, making them unsuitable for mobile health applications. Recently, State Space Models (SSMs), exemplified by Mamba, have emerged as competitive alternatives to CNN and Transformer architectures. Building upon this, we employ Mamba as a lightweight substitute for CNN and Transformer within UNet, aiming to tackle challenges stemming from computational resource limitations in real medical settings. To this end, we introduce the Lightweight Mamba UNet (LightM-UNet), which integrates Mamba and UNet in a lightweight framework. Specifically, LightM-UNet leverages the Residual Vision Mamba Layer in a pure Mamba fashion to extract deep semantic features and model long-range spatial dependencies with linear computational complexity. Extensive experiments conducted on two real-world 2D/3D datasets demonstrate that LightM-UNet surpasses existing state-of-the-art methods. Notably, compared to the renowned nnU-Net, LightM-UNet achieves superior segmentation performance while drastically reducing parameter and computation costs by 116x and 21x, respectively. This highlights the potential of Mamba in facilitating model lightweighting. Our code is publicly available at //github.com/MrBlankness/LightM-UNet.
Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to the insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method that elevates the performance of T2V using the superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inversion to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We have conducted experiments on extensive prompts under combinations of various T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at //github.com/YBYBZhang/VideoElevator.
Single-frame infrared small target (SIRST) detection aims to recognize small targets against cluttered backgrounds. Recently, convolutional neural networks have achieved significant advantages in general object detection. With the development of the Transformer, the scale of SIRST models is constantly increasing. However, due to the limited number of training samples, performance has not improved accordingly. The quality, quantity, and diversity of the infrared dataset are critical to the detection of small targets. To address this issue, we propose a negative sample augmentation method in this paper. Specifically, a negative augmentation approach is proposed to generate massive numbers of negatives for self-supervised learning. First, we apply a sequential noise modeling technique to generate realistic infrared data. Second, we fuse the extracted noise with the original data to improve the diversity and fidelity of the generated data. Lastly, we propose a negative augmentation strategy to enrich diversity while maintaining semantic invariance. The proposed algorithm produces a synthetic SIRST-5K dataset, which contains massive pseudo-data and corresponding labels. With a rich diversity of infrared small target data, our algorithm significantly improves model performance and convergence speed. Compared with other state-of-the-art (SOTA) methods, our method achieves outstanding performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection over union (IoU).
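Two of the evaluation metrics named above can be sketched on binary masks. This is a minimal illustration under stated assumptions: IoU is computed at pixel level, and Fa is taken as the fraction of all pixels that are falsely predicted as target, which is one common convention. Pd is omitted because it is a target-level metric that needs a matching rule between predicted and true targets, and the text does not specify one. The function names are hypothetical.

```python
def iou(pred, gt):
    """Pixel-level intersection over union of two binary masks
    given as equally sized lists of rows containing 0/1 values."""
    inter = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p and g)
    union = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p or g)
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return inter / union if union else 1.0


def false_alarm_rate(pred, gt):
    """Fraction of all pixels predicted as target where the ground truth
    is background. Assumption: pixel-level definition of Fa."""
    false_pos = sum(
        1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p and not g
    )
    total = sum(len(row) for row in gt)
    return false_pos / total
```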
Using Non-negative Matrix Factorization (NMF), the observed matrix can be approximated by the product of the basis and coefficient matrices. Moreover, if the coefficient vectors are explained by the covariates for each individual, the coefficient matrix can be written as the product of the parameter matrix and the covariate matrix, and the model can then be described in the framework of Non-negative Matrix tri-Factorization (tri-NMF) with covariates. Consequently, this is equal to the mean structure of the Growth Curve Model (GCM). The difference is that the basis matrix for GCM is given by the analyst, whereas that for NMF with covariates is unknown and optimized. In this study, we applied NMF with covariates to longitudinal data and compared it with GCM. We have also published an R package that implements this method, and we show how to use it through examples of data analyses including longitudinal measurements, spatiotemporal data, and text data. In particular, we demonstrate the usefulness of Gaussian kernel functions as covariates.
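The chain of factorizations described above can be written out explicitly. The symbols are assumptions introduced for illustration and do not come from the source: $Y$ is the $P \times N$ observed matrix (one column per individual), $X$ the $P \times Q$ basis matrix, $B$ the $Q \times N$ coefficient matrix, $\Theta$ the $Q \times R$ parameter matrix, and $A$ the $R \times N$ covariate matrix.

```latex
\begin{aligned}
Y &\approx X B, \qquad X \ge 0,\; B \ge 0
  &&\text{(NMF)} \\
B &= \Theta A
  &&\text{(coefficients explained by covariates)} \\
Y &\approx X \Theta A
  &&\text{(tri-NMF with covariates)}
\end{aligned}
```

The product $X \Theta A$ has the same form as the GCM mean structure. In GCM the matrix playing the role of $X$ is fixed in advance by the analyst, whereas here it is estimated from the data.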
Vanilla text-to-image diffusion models struggle to generate accurate human images, commonly producing imperfect anatomies such as unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or by adding further controls (human-centric priors such as pose or depth maps) during the image generation phase. This paper explores the integration of these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, based on an in-depth analysis of the cross-attention layer. Extensive experiments show that our method substantially improves over state-of-the-art text-to-image models in synthesizing high-quality human images from user-written prompts. Project page: //hcplayercvpr2024.github.io.
As a scene graph compactly summarizes the high-level content of an image in a structured and symbolic manner, the similarity between the scene graphs of two images reflects the relevance of their contents. Based on this idea, we propose a novel approach for image-to-image retrieval using scene graph similarity measured by graph neural networks. In our approach, graph neural networks are trained to predict a proxy image relevance measure, computed from human-annotated captions using a pre-trained sentence similarity model. We collect and publish a dataset of image relevance measured by human annotators to evaluate retrieval algorithms. The collected dataset shows that our method agrees better with the human perception of image similarity than other competitive baselines.
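The retrieval step can be illustrated with a small sketch. It assumes that each image has already been mapped to a fixed-length embedding vector (for example by a graph neural network applied to its scene graph) and that relevance is scored by cosine similarity. Both the scoring choice and the function names are assumptions for illustration, not the method's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_relevance(query_vec, candidates):
    """Return candidate names sorted from most to least similar to the query.
    `candidates` maps a hypothetical image name to its embedding vector."""
    return sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )
```

The same ranking function applies to the proxy relevance measure if caption embeddings from a sentence similarity model are used in place of scene graph embeddings.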
Within the rapidly developing Internet of Things (IoT), numerous and diverse physical devices, Edge devices, Cloud infrastructure, and their quality of service (QoS) requirements need to be represented within a unified specification in order to enable rapid IoT application development, monitoring, and dynamic reconfiguration. However, heterogeneities among different configuration knowledge representation models limit the acquisition, discovery, and curation of configuration knowledge for coordinated IoT applications. This paper proposes a unified data model to represent IoT resource configuration knowledge artifacts. It also proposes IoT-CANE (Context-Aware recommendatioN systEm) to facilitate incremental knowledge acquisition and declarative, context-driven knowledge recommendation.
Image segmentation is considered to be one of the critical tasks in hyperspectral remote sensing image processing. Recently, the convolutional neural network (CNN) has established itself as a powerful model for segmentation and classification by demonstrating excellent performance. The use of a graphical model such as a conditional random field (CRF) contributes further by capturing contextual information and thus improving segmentation performance. In this paper, we propose a method to segment hyperspectral images by considering both spectral and spatial information via a combined framework consisting of a CNN and a CRF. We use multiple spectral cubes to learn deep features with the CNN, and then formulate a deep CRF with CNN-based unary and pairwise potential functions to effectively extract the semantic correlations between patches consisting of three-dimensional data cubes. Effective piecewise training is applied in order to avoid the computationally expensive iterative CRF inference. Furthermore, we introduce a deep deconvolution network that improves the segmentation masks. We also introduce a new dataset and evaluate the proposed method on it, along with several widely adopted benchmark datasets. By comparing our results with those of several state-of-the-art models, we show the promising potential of our method.