We propose Point'n Move, a method that achieves interactive scene object manipulation with exposed region inpainting. Interactivity here further comes from intuitive object selection and real-time editing. To achieve this, we adopt Gaussian Splatting Radiance Field as the scene representation and fully leverage its explicit nature and speed advantage. Its explicit representation formulation allows us to devise a 2D prompt points to 3D mask dual-stage self-prompting segmentation algorithm, perform mask refinement and merging, minimize change as well as provide good initialization for scene inpainting and perform editing in real-time without per-editing training, all leads to superior quality and performance. We test our method by performing editing on both forward-facing and 360 scenes. We also compare our method against existing scene object removal methods, showing superior quality despite being more capable and having a speed advantage.
Recently, Transformers have been introduced into the field of acoustics recognition. They are pre-trained on large-scale datasets using methods such as supervised learning and semi-supervised learning, demonstrating robust generality--It fine-tunes easily to downstream tasks and shows more robust performance. However, the predominant fine-tuning method currently used is still full fine-tuning, which involves updating all parameters during training. This not only incurs significant memory usage and time costs but also compromises the model's generality. Other fine-tuning methods either struggle to address this issue or fail to achieve matching performance. Therefore, we conducted a comprehensive analysis of existing fine-tuning methods and proposed an efficient fine-tuning approach based on Adapter tuning, namely AAT. The core idea is to freeze the audio Transformer model and insert extra learnable Adapters, efficiently acquiring downstream task knowledge without compromising the model's original generality. Extensive experiments have shown that our method achieves performance comparable to or even superior to full fine-tuning while optimizing only 7.118% of the parameters. It also demonstrates superiority over other fine-tuning methods.
Scene Text Recognition (STR) is a challenging task that involves recognizing text within images of natural scenes. Although current state-of-the-art models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures comprised of visual encoders and sequence decoders. In this work, we propose the VIsion Permutable extractor for fast and efficient scene Text Recognition (VIPTR), which achieves an impressive balance between high performance and rapid inference speeds in the domain of STR. Specifically, VIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by multiple self-attention layers, while eschewing the traditional sequence decoder. This design choice results in a lightweight and efficient model capable of handling inputs of varying sizes. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of VIPTR. Notably, the VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the VIPTR-L (Large) variant attains greater recognition accuracy, while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which blends high accuracy with efficiency and greatly benefits real-world applications requiring fast and reliable text recognition. The code is publicly available at //github.com/cxfyxl/VIPTR.
Arbitrary shape scene text detection is of great importance in scene understanding tasks. Due to the complexity and diversity of text in natural scenes, existing scene text algorithms have limited accuracy for detecting arbitrary shape text. In this paper, we propose a novel arbitrary shape scene text detector through boundary points dynamic optimization(BPDO). The proposed model is designed with a text aware module (TAM) and a boundary point dynamic optimization module (DOM). Specifically, the model designs a text aware module based on segmentation to obtain boundary points describing the central region of the text by extracting a priori information about the text region. Then, based on the idea of deformable attention, it proposes a dynamic optimization model for boundary points, which gradually optimizes the exact position of the boundary points based on the information of the adjacent region of each boundary point. Experiments on CTW-1500, Total-Text, and MSRA-TD500 datasets show that the model proposed in this paper achieves a performance that is better than or comparable to the state-of-the-art algorithm, proving the effectiveness of the model.
Vision Transformer (ViT) has performed remarkably in various computer vision tasks. Nonetheless, affected by the massive amount of parameters, ViT usually suffers from serious overfitting problems with a relatively limited number of training samples. In addition, ViT generally demands heavy computing resources, which limit its deployment on resource-constrained devices. As a type of model-compression method, model binarization is potentially a good choice to solve the above problems. Compared with the full-precision one, the model with the binarization method replaces complex tensor multiplication with simple bit-wise binary operations and represents full-precision model parameters and activations with only 1-bit ones, which potentially solves the problem of model size and computational complexity, respectively. In this paper, we investigate a binarized ViT model. Empirically, we observe that the existing binarization technology designed for Convolutional Neural Networks (CNN) cannot migrate well to a ViT's binarization task. We also find that the decline of the accuracy of the binary ViT model is mainly due to the information loss of the Attention module and the Value vector. Therefore, we propose a novel model binarization technique, called Group Superposition Binarization (GSB), to deal with these issues. Furthermore, in order to further improve the performance of the binarization model, we have investigated the gradient calculation procedure in the binarization process and derived more proper gradient calculation equations for GSB to reduce the influence of gradient mismatch. Then, the knowledge distillation technique is introduced to alleviate the performance degradation caused by model binarization. Analytically, model binarization can limit the parameters search space during parameter updates while training a model....
Artificial Intelligence (AI), particularly through the advent of large-scale generative AI (GenAI) models such as Large Language Models (LLMs), has become a transformative element in contemporary technology. While these models have unlocked new possibilities, they simultaneously present significant challenges, such as concerns over data privacy and the propensity to generate misleading or fabricated content. Current frameworks for Responsible AI (RAI) often fall short in providing the granular guidance necessary for tangible application, especially for Accountability-a principle that is pivotal for ensuring transparent and auditable decision-making, bolstering public trust, and meeting increasing regulatory expectations. This study bridges the accountability gap by introducing our effort towards a comprehensive metrics catalogue, formulated through a systematic multivocal literature review (MLR) that integrates findings from both academic and grey literature. Our catalogue delineates process metrics that underpin procedural integrity, resource metrics that provide necessary tools and frameworks, and product metrics that reflect the outputs of AI systems. This tripartite framework is designed to operationalize Accountability in AI, with a special emphasis on addressing the intricacies of GenAI.
Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both image global structures and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output. During the above multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can be also applied for conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditioned and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we achieved state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.
Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we will comprehensively review the development of AICA in the recent two decades, especially focusing on the state-of-the-art methods with respect to three main challenges -- the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and description of available datasets for performing evaluation with quantitative comparison of label noise and dataset bias. We then summarize and compare the representative approaches on (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods on dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA based applications. Finally, we discuss some challenges and promising research directions in the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
Most existing event extraction (EE) methods merely extract event arguments within the sentence scope. However, such sentence-level EE methods struggle to handle soaring amounts of documents from emerging applications, such as finance, legislation, health, etc., where event arguments always scatter across different sentences, and even multiple such event mentions frequently co-exist in the same document. To address these challenges, we propose a novel end-to-end model, Doc2EDAG, which can generate an entity-based directed acyclic graph to fulfill the document-level EE (DEE) effectively. Moreover, we reformalize a DEE task with the no-trigger-words design to ease the document-level event labeling. To demonstrate the effectiveness of Doc2EDAG, we build a large-scale real-world dataset consisting of Chinese financial announcements with the challenges mentioned above. Extensive experiments with comprehensive analyses illustrate the superiority of Doc2EDAG over state-of-the-art methods. Data and codes can be found at //github.com/dolphin-zs/Doc2EDAG.
Aspect level sentiment classification aims to identify the sentiment expressed towards an aspect given a context sentence. Previous neural network based methods largely ignore the syntax structure in one sentence. In this paper, we propose a novel target-dependent graph attention network (TD-GAT) for aspect level sentiment classification, which explicitly utilizes the dependency relationship among words. Using the dependency graph, it propagates sentiment features directly from the syntactic context of an aspect target. In our experiments, we show our method outperforms multiple baselines with GloVe embeddings. We also demonstrate that using BERT representations further substantially boosts the performance.
Distant supervision can effectively label data for relation extraction, but suffers from the noise labeling problem. Recent works mainly perform soft bag-level noise reduction strategies to find the relatively better samples in a sentence bag, which is suboptimal compared with making a hard decision of false positive samples in sentence level. In this paper, we introduce an adversarial learning framework, which we named DSGAN, to learn a sentence-level true-positive generator. Inspired by Generative Adversarial Networks, we regard the positive samples generated by the generator as the negative samples to train the discriminator. The optimal generator is obtained until the discrimination ability of the discriminator has the greatest decline. We adopt the generator to filter distant supervision training dataset and redistribute the false positive instances into the negative set, in which way to provide a cleaned dataset for relation classification. The experimental results show that the proposed strategy significantly improves the performance of distant supervision relation extraction comparing to state-of-the-art systems.