We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: //groma-mllm.github.io/.
The `Jacobi prior' is an alternative Bayesian method for predictive models. It performs better than well-known methods such as Lasso, Ridge, Elastic Net, and MCMC-based Horse-Shoe Prior, particularly in terms of prediction accuracy and run-time. This method is implemented for Gaussian process classification, adeptly handling a nonlinear decision boundary. The Jacobi prior demonstrates its capability to manage partitioned data across global servers, making it highly useful in distributed computing environments. Additionally, we show that the Jacobi prior is more than a hundred times faster than these methods while maintaining similar predictive accuracy. As the method is both fast and accurate, it is advantageous for organisations looking to reduce their environmental impact and meet ESG standards. To demonstrate the effectiveness of the Jacobi prior, we conducted a detailed simulation study with four experiments focusing on statistical consistency, accuracy, and speed. We also present two empirical studies: the first evaluates credit risk by analysing default probability using data from the U.S. Small Business Administration (SBA), and the second uses the Jacobi prior for classifying stars, quasars, and galaxies in a three-class problem using multinomial logit regression on data from the Sloan Digital Sky Survey. Different filters were used as features in this study. All codes and datasets for this paper are available in the following GitHub repository : //github.com/sourish-cmi/Jacobi-Prior/
Large language models (LLMs) have shown amazing capabilities in knowledge memorization and present. However, when it comes to domain-specific knowledge and downstream tasks like medical, general LLMs are often unable to give precise answers. In addition, when people want LLMs to answer classification questions, they usually go through instruction tuning first, however, LLMs do not always give a direct index of the categorization after instruction tuning. In this paper, we proposed LlamaCare, a fine-tuned medical language model, and Extended Classification Integration(ECI), a module to handle classification problems of LLMs. Our contributions are : (i) We fine-tuned a large language model of medical knowledge with very low carbon emissions and achieved similar performance with ChatGPT by a 24G GPU. (ii) We solved the problem of redundant categorical answers and improved the performance of LLMs by proposing a new module called Extended Classification Integration. (iii) We released our processed data for one-shot and few-shot training for some benchmarks such as PubMedQA and USMLE 1-3 step. Our method achieves a close effect with the state-of-the-art model in benchmarks while costing lower GPU resources compared to LLMs with the same quantity of parameters. Our models, codes, and datasets can be found in //github.com/Stephen-SMJ/LLamaCare
In a real-world infrared imaging system, effectively learning a consistent stripe noise removal model is essential. Most existing destriping methods cannot precisely reconstruct images due to cross-level semantic gaps and insufficient characterization of the global column features. To tackle this problem, we propose a novel infrared image destriping method, called Asymmetric Sampling Correction Network (ASCNet), that can effectively capture global column relationships and embed them into a U-shaped framework, providing comprehensive discriminative representation and seamless semantic connectivity. Our ASCNet consists of three core elements: Residual Haar Discrete Wavelet Transform (RHDWT), Pixel Shuffle (PS), and Column Non-uniformity Correction Module (CNCM). Specifically, RHDWT is a novel downsampler that employs double-branch modeling to effectively integrate stripe-directional prior knowledge and data-driven semantic interaction to enrich the feature representation. Observing the semantic patterns crosstalk of stripe noise, PS is introduced as an upsampler to prevent excessive apriori decoding and performing semantic-bias-free image reconstruction. After each sampling, CNCM captures the column relationships in long-range dependencies. By incorporating column, spatial, and self-dependence information, CNCM well establishes a global context to distinguish stripes from the scene's vertical structures. Extensive experiments on synthetic data, real data, and infrared small target detection tasks demonstrate that the proposed method outperforms state-of-the-art single-image destriping methods both visually and quantitatively. Our code will be made publicly available at //github.com/xdFai/ASCNet.
We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task. This task requires accurate prediction of each part's pose and shape in sequential steps, and as the number of parts increases, the possible assembly combinations increase exponentially, leading to a combinatorial explosion that severely hinders the efficacy of 3D-PA. SPAFormer addresses this problem by leveraging weak constraints from assembly sequences, effectively reducing the solution space's complexity. Since assembly part sequences convey construction rules similar to sentences being structured through words, our model explores both parallel and autoregressive generation. It further enhances assembly through knowledge enhancement strategies that utilize the attributes of parts and their sequence information, enabling it to capture the inherent assembly pattern and relationships among sequentially ordered parts. We also construct a more challenging benchmark named PartNet-Assembly covering 21 varied categories to more comprehensively validate the effectiveness of SPAFormer. Extensive experiments demonstrate the superior generalization capabilities of SPAFormer, particularly with multi-tasking and in scenarios requiring long-horizon assembly. Codes and model weights will be released at //github.com/xuboshen/SPAFormer.
Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to video modality, which are termed Video-LLMs. However, existing Video-LLMs can only capture the coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.
Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as similar 2D projections can correspond to multiple 3D interpretations. Nevertheless, most HMR methods overlook this ambiguity and make a single prediction without accounting for the associated uncertainty. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. MEGA enables us to propose multiple outputs and to evaluate the uncertainty of the predictions. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.
Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of deep generative models, achieving remarkable performance in image synthesis tasks. However, these models face challenges in terms of widespread adoption due to their reliance on sequential denoising steps during sample generation. This dependence leads to substantial computational requirements, making them unsuitable for resource-constrained or real-time processing systems. To address these challenges, we propose a novel method that integrates denoising phases directly into the model's architecture, thereby reducing the need for resource-intensive computations. Our approach combines diffusion models with generative adversarial networks (GANs) through knowledge distillation, enabling more efficient training and evaluation. By utilizing a pre-trained diffusion model as a teacher model, we train a student model through adversarial learning, employing layerwise transformations for denoising and submodules for predicting the teacher model's output at various points in time. This integration significantly reduces the number of parameters and denoising steps required, leading to improved sampling speed at test time. We validate our method with extensive experiments, demonstrating comparable performance with reduced computational requirements compared to existing approaches. By enabling the deployment of diffusion models on resource-constrained devices, our research mitigates their computational burden and paves the way for wider accessibility and practical use across the research community and end-users. Our code is publicly available at //github.com/kidist-amde/Adv-KD
Data plays a fundamental role in the training of Large Language Models (LLMs). Effective data management, particularly in the formulation of a well-suited training dataset, holds significance for enhancing model performance and improving training efficiency during pretraining and supervised fine-tuning phases. Despite the considerable importance of data management, the current research community still falls short in providing a systematic analysis of the rationale behind management strategy selection, its consequential effects, methodologies for evaluating curated datasets, and the ongoing pursuit of improved strategies. Consequently, the exploration of data management has attracted more and more attention among the research community. This survey provides a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various noteworthy aspects of data management strategy design: data quantity, data quality, domain/task composition, etc. Looking toward the future, we extrapolate existing challenges and outline promising directions for development in this field. Therefore, this survey serves as a guiding resource for practitioners aspiring to construct powerful LLMs through effective data management practices. The collection of the latest papers is available at //github.com/ZigeW/data_management_LLM.
Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both image global structures and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output. During the above multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can be also applied for conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditioned and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we achieved state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.
ASR (automatic speech recognition) systems like Siri, Alexa, Google Voice or Cortana has become quite popular recently. One of the key techniques enabling the practical use of such systems in people's daily life is deep learning. Though deep learning in computer vision is known to be vulnerable to adversarial perturbations, little is known whether such perturbations are still valid on the practical speech recognition. In this paper, we not only demonstrate such attacks can happen in reality, but also show that the attacks can be systematically conducted. To minimize users' attention, we choose to embed the voice commands into a song, called CommandSong. In this way, the song carrying the command can spread through radio, TV or even any media player installed in the portable devices like smartphones, potentially impacting millions of users in long distance. In particular, we overcome two major challenges: minimizing the revision of a song in the process of embedding commands, and letting the CommandSong spread through the air without losing the voice "command". Our evaluation demonstrates that we can craft random songs to "carry" any commands and the modify is extremely difficult to be noticed. Specially, the physical attack that we play the CommandSongs over the air and record them can success with 94 percentage.