Recently, many self-supervised learning methods for image reconstruction have been proposed that can learn from noisy data alone, bypassing the need for ground-truth references. Most existing methods cluster around two classes: i) Noise2Self and similar cross-validation methods that require very mild knowledge about the noise distribution, and ii) Stein's Unbiased Risk Estimator (SURE) and similar approaches that assume full knowledge of the distribution. The first class of methods is often suboptimal compared to supervised learning, and the second class tends to be impractical, as the noise level is often unknown in real-world applications. In this paper, we provide a theoretical framework that characterizes this expressivity-robustness trade-off and propose a new approach based on SURE, but unlike the standard SURE, does not require knowledge about the noise level. Throughout a series of experiments, we show that the proposed estimator outperforms other existing self-supervised methods on various imaging inverse problems
Visual imitation learning methods demonstrate strong performance, yet they lack generalization when faced with visual input perturbations, including variations in lighting and textures, impeding their real-world application. We propose Stem-OB that utilizes pretrained image diffusion models to suppress low-level visual differences while maintaining high-level scene structures. This image inversion process is akin to transforming the observation into a shared representation, from which other observations stem, with extraneous details removed. Stem-OB contrasts with data-augmentation approaches as it is robust to various unspecified appearance changes without the need for additional training. Our method is a simple yet highly effective plug-and-play solution. Empirical results confirm the effectiveness of our approach in simulated tasks and show an exceptionally significant improvement in real-world applications, with an average increase of 22.2% in success rates compared to the best baseline. See //hukz18.github.io/Stem-Ob/ for more info.
CNNs exhibit inherent equivariance to image translation, leading to efficient parameter and data usage, faster learning, and improved robustness. The concept of translation equivariant networks has been successfully extended to rotation transformation using group convolution for discrete rotation groups and harmonic functions for the continuous rotation group encompassing $360^\circ$. We explore the compatibility of the SA mechanism with full rotation equivariance, in contrast to previous studies that focused on discrete rotation. We introduce the Harmformer, a harmonic transformer with a convolutional stem that achieves equivariance for both translation and continuous rotation. Accompanied by an end-to-end equivariance proof, the Harmformer not only outperforms previous equivariant transformers, but also demonstrates inherent stability under any continuous rotation, even without seeing rotated samples during training.
Multi-modal deep metric learning is crucial for effectively capturing diverse representations in tasks such as face verification, fine-grained object recognition, and product search. Traditional approaches to metric learning, whether based on distance or margin metrics, primarily emphasize class separation, often overlooking the intra-class distribution essential for multi-modal feature learning. In this context, we propose a novel loss function called Density-Aware Adaptive Margin Loss(DAAL), which preserves the density distribution of embeddings while encouraging the formation of adaptive sub-clusters within each class. By employing an adaptive line strategy, DAAL not only enhances intra-class variance but also ensures robust inter-class separation, facilitating effective multi-modal representation. Comprehensive experiments on benchmark fine-grained datasets demonstrate the superior performance of DAAL, underscoring its potential in advancing retrieval applications and multi-modal deep metric learning.
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
Multimodality Representation Learning, as a technique of learning to embed information from different modalities and their correlations, has achieved remarkable success on a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). Among these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task, e.g., understand, recognize, retrieve, or generate optimally. Researchers have proposed diverse methods to address these tasks. The different variants of transformer-based architectures performed extraordinarily on multiple modalities. This survey presents the comprehensive literature on the evolution and enhancement of deep learning multimodal architectures to deal with textual, visual and audio features for diverse cross-modal and modern multimodal tasks. This study summarizes the (i) recent task-specific deep learning methodologies, (ii) the pretraining types and multimodal pretraining objectives, (iii) from state-of-the-art pretrained multimodal approaches to unifying architectures, and (iv) multimodal task categories and possible future improvements that can be devised for better multimodal learning. Moreover, we prepare a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A constantly-updated paperlist related to our survey is maintained at //github.com/marslanm/multimodality-representation-learning.
Deep learning-based algorithms have seen a massive popularity in different areas of remote sensing image analysis over the past decade. Recently, transformers-based architectures, originally introduced in natural language processing, have pervaded computer vision field where the self-attention mechanism has been utilized as a replacement to the popular convolution operator for capturing long-range dependencies. Inspired by recent advances in computer vision, remote sensing community has also witnessed an increased exploration of vision transformers for a diverse set of tasks. Although a number of surveys have focused on transformers in computer vision in general, to the best of our knowledge we are the first to present a systematic review of recent advances based on transformers in remote sensing. Our survey covers more than 60 recent transformers-based methods for different remote sensing problems in sub-areas of remote sensing: very high-resolution (VHR), hyperspectral (HSI) and synthetic aperture radar (SAR) imagery. We conclude the survey by discussing different challenges and open issues of transformers in remote sensing. Additionally, we intend to frequently update and maintain the latest transformers in remote sensing papers with their respective code at: //github.com/VIROBO-15/Transformer-in-Remote-Sensing
With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has been inescapably influenced by this wave of revolution, consequentially entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, approach and performance. This survey is aimed at summarizing and analyzing the major changes and significant progresses of scene text detection and recognition in the deep learning era. Through this article, we devote to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we will emphasize the dramatic differences brought by deep learning and the grand challenges still remained. We expect that this review paper would serve as a reference book for researchers in this field. Related resources are also collected and compiled in our Github repository: //github.com/Jyouhou/SceneTextPapers.
The design of deep graph models still remains to be investigated and the crucial part is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of network; and the proposed graph convolutional network called AdaGCN~(AdaBoosting Graph Convolutional Network) has the ability to efficiently extract knowledge from high-order neighbors and integrate knowledge from different hops of neighbors into the network in an AdaBoost way. We also present the architectural difference between AdaGCN and existing graph convolutional methods to show the benefits of our proposal. Finally, extensive experiments demonstrate the state-of-the-art prediction performance and the computational advantage of our approach AdaGCN.
The cross-domain recommendation technique is an effective way of alleviating the data sparsity in recommender systems by leveraging the knowledge from relevant domains. Transfer learning is a class of algorithms underlying these techniques. In this paper, we propose a novel transfer learning approach for cross-domain recommendation by using neural networks as the base model. We assume that hidden layers in two base networks are connected by cross mappings, leading to the collaborative cross networks (CoNet). CoNet enables dual knowledge transfer across domains by introducing cross connections from one base network to another and vice versa. CoNet is achieved in multi-layer feedforward networks by adding dual connections and joint loss functions, which can be trained efficiently by back-propagation. The proposed model is evaluated on two real-world datasets and it outperforms baseline models by relative improvements of 3.56\% in MRR and 8.94\% in NDCG, respectively.
Convolutional Neural Networks (CNNs) have gained significant traction in the field of machine learning, particularly due to their high accuracy in visual recognition. Recent works have pushed the performance of GPU implementations of CNNs to significantly improve their classification and training times. With these improvements, many frameworks have become available for implementing CNNs on both CPUs and GPUs, with no support for FPGA implementations. In this work we present a modified version of the popular CNN framework Caffe, with FPGA support. This allows for classification using CNN models and specialized FPGA implementations with the flexibility of reprogramming the device when necessary, seamless memory transactions between host and device, simple-to-use test benches, and the ability to create pipelined layer implementations. To validate the framework, we use the Xilinx SDAccel environment to implement an FPGA-based Winograd convolution engine and show that the FPGA layer can be used alongside other layers running on a host processor to run several popular CNNs (AlexNet, GoogleNet, VGG A, Overfeat). The results show that our framework achieves 50 GFLOPS across 3x3 convolutions in the benchmarks. This is achieved within a practical framework, which will aid in future development of FPGA-based CNNs.