Current multilingual semantic parsing (MSP) datasets are almost all collected by translating the utterances in the existing datasets from the resource-rich language to the target language. However, manual translation is costly. To reduce the translation effort, this paper proposes the first active learning procedure for MSP (AL-MSP). AL-MSP selects only a subset from the existing datasets to be translated. We also propose a novel selection method that prioritizes the examples diversifying the logical form structures with more lexical choices, and a novel hyperparameter tuning method that needs no extra annotation cost. Our experiments show that AL-MSP significantly reduces translation costs with ideal selection methods. Our selection method with proper hyperparameters yields better parsing performance than the other baselines on two multilingual datasets.
Label distribution (LD) uses the description degree to describe instances, which provides more fine-grained supervision information when learning with label ambiguity. Nevertheless, LD is unavailable in many real-world applications. To obtain LD, label enhancement (LE) has emerged to recover LD from logical labels. Existing LE approaches have the following problems: (\textbf{i}) They use logical labels to train mappings to LD, but the supervision information is too loose, which can lead to inaccurate model predictions; (\textbf{ii}) They ignore feature redundancy and use the collected features directly. To solve (\textbf{i}), we use the topology of the feature space to generate more accurate label-confidence. To solve (\textbf{ii}), we propose a novel supervised LE dimensionality reduction approach, which projects the original data into a lower-dimensional feature space. Combining the two, we obtain the augmented data for LE. Further, we propose a novel nonlinear LE model based on the label-confidence and reduced features. Extensive experiments on 12 real-world datasets are conducted, and the results show that our method consistently outperforms the other five comparison approaches.
Whilst the availability of 3D LiDAR point cloud data has significantly grown in recent years, annotation remains expensive and time-consuming, leading to a demand for semi-supervised semantic segmentation methods with application domains such as autonomous driving. Existing work very often employs relatively large segmentation backbone networks to improve segmentation accuracy, at the expense of computational costs. In addition, many use uniform sampling to reduce the amount of ground-truth data required for learning, often resulting in sub-optimal performance. To address these issues, we propose a new pipeline that employs a smaller architecture, requiring fewer ground-truth annotations to achieve superior segmentation accuracy compared to contemporary approaches. This is facilitated via a novel Sparse Depthwise Separable Convolution module that significantly reduces the network parameter count while retaining overall task performance. To effectively sub-sample our training data, we propose a new Spatio-Temporal Redundant Frame Downsampling (ST-RFD) method that leverages knowledge of sensor motion within the environment to extract a more diverse subset of training data frame samples. To make the most of the limited annotated data samples, we further propose a soft pseudo-label method informed by LiDAR reflectivity. Our method outperforms contemporary semi-supervised work in terms of mIoU, using less labeled data, on the SemanticKITTI (59.5@5%) and ScribbleKITTI (58.1@5%) benchmark datasets, based on a 2.3x reduction in model parameters and 641x fewer multiply-add operations whilst also demonstrating significant performance improvement on limited training data (i.e., Less is More).
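To illustrate why a depthwise separable factorization reduces the parameter count, the sketch below compares weight counts for a generic dense 3D convolution and its depthwise separable counterpart. It is not the Sparse Depthwise Separable Convolution module from the abstract above, whose details are not given here; the function names, channel sizes, and kernel size are assumptions chosen for illustration, and biases are omitted.

```python
def conv3d_params(c_in, c_out, k):
    """Weights in a standard dense 3D convolution with a k*k*k kernel (no bias)."""
    return k ** 3 * c_in * c_out


def depthwise_separable_conv3d_params(c_in, c_out, k):
    """Weights in a depthwise k*k*k convolution followed by a 1x1x1 pointwise one."""
    depthwise = k ** 3 * c_in   # one k*k*k filter per input channel
    pointwise = c_in * c_out    # 1x1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    dense = conv3d_params(64, 128, 3)
    separable = depthwise_separable_conv3d_params(64, 128, 3)
    print(dense, separable, round(dense / separable, 1))
```

For this hypothetical layer, the factorized form needs roughly 22x fewer weights. The overall reduction reported in the abstract (2.3x) refers to the whole network, not to a single layer.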
Video action segmentation and recognition tasks have been widely applied in many fields. Most previous studies employ large-scale, computationally expensive visual models to understand videos comprehensively. However, few studies directly employ a graph model to reason about the video. The graph model provides the benefits of fewer parameters, low computational cost, a large receptive field, and flexible neighborhood message aggregation. In this paper, we present a graph-based method named Semantic2Graph, which turns the video action segmentation and recognition problem into node classification on graphs. To preserve fine-grained relations in videos, we construct the graph structure of videos at the frame level and design three types of edges: temporal, semantic, and self-loop. We combine visual, structural, and semantic features as node attributes. Semantic edges are used to model long-term spatio-temporal relations, while the semantic features are the embeddings of the label text based on the textual prompt. A graph neural network (GNN) model is used to learn multi-modal feature fusion. Experimental results show that Semantic2Graph achieves improvement on GTEA and 50Salads, compared to the state-of-the-art results. Multiple ablation experiments further confirm the effectiveness of semantic features in improving model performance, and show that semantic edges enable Semantic2Graph to capture long-term dependencies at a low cost.
Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText - a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map where each region of interest is annotated by a free-form natural language description. Due to the lack of large-scale datasets that have a detailed textual description for each region in the image, we choose to leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, and show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.
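The abstract above does not state how classifier-free guidance is extended to several conditions. One natural form, shown here purely as an assumption, adds one scaled guidance term per condition to the unconditional noise prediction; with a single condition it reduces to standard classifier-free guidance. The function name, argument layout, and use of plain Python lists are illustrative choices, not the authors' implementation.

```python
def multi_conditional_guidance(eps_uncond, eps_conds, scales):
    """Combine noise predictions as eps_uncond + sum_i s_i * (eps_i - eps_uncond).

    eps_uncond: unconditional noise prediction (list of floats).
    eps_conds:  one noise prediction per condition, each conditioned on that
                condition alone (assumed form, not taken from the abstract).
    scales:     one guidance scale per condition.
    """
    if len(eps_conds) != len(scales):
        raise ValueError("need exactly one guidance scale per condition")
    guided = list(eps_uncond)
    for eps_c, s in zip(eps_conds, scales):
        for k, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            guided[k] += s * (c - u)
    return guided


if __name__ == "__main__":
    # Toy two-element "noise predictions" for two hypothetical conditions,
    # e.g. a global text prompt and a spatio-textual map.
    eps_uncond = [0.0, 1.0]
    eps_conds = [[1.0, 1.0], [0.0, 3.0]]
    print(multi_conditional_guidance(eps_uncond, eps_conds, [2.0, 0.5]))
```

Separate scales let each condition be weighted independently at inference time, at the cost of one extra model evaluation per condition.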
Event extraction (EE) plays an important role in many industrial application scenarios, and high-quality EE methods require a large amount of manually annotated data to train supervised learning models. However, the cost of obtaining annotated data is very high, especially for the annotation of domain events, which requires the participation of experts from the corresponding domain. We therefore introduce active learning (AL) to reduce the cost of event annotation. However, existing AL methods have two main problems that make them poorly suited to event extraction. First, the existing pool-based selection strategies have limitations in terms of computational cost and sample validity. Second, the existing evaluation of sample importance lacks the use of local sample information. In this paper, we present a novel deep AL method for EE. We propose a batch-based selection strategy and a Memory-Based Loss Prediction model (MBLP) to select unlabeled samples efficiently. During the selection process, we use an internal-external sample loss ranking method to evaluate the sample importance by using local information. Finally, we propose a delayed training strategy to train the MBLP model. Extensive experiments are performed on three domain datasets, and our method outperforms other state-of-the-art methods.
Autoencoding, which aims to reconstruct the input images through a bottleneck latent representation, is one of the classic feature representation learning strategies. It has been shown to be effective as an auxiliary task for semi-supervised learning but has become less popular as more sophisticated methods have been proposed in recent years. In this paper, we revisit the idea of using image reconstruction as the auxiliary task and incorporate it into a modern semi-supervised semantic segmentation framework. Surprisingly, we discover that such an old idea in semi-supervised learning can produce results competitive with state-of-the-art semantic segmentation algorithms. By visualizing the intermediate layer activations of the image reconstruction module, we show that the feature map channels can correlate well with semantic concepts, which explains why joint training with the reconstruction task is helpful for the segmentation task. Motivated by this observation, we further propose a modification to the image reconstruction task, aiming to further disentangle the object clue from the background patterns. Through experimental evaluation, we show that using reconstruction as an auxiliary loss can lead to consistent improvements across various datasets and methods. The proposed method can further lead to significant improvement in object-centric segmentation tasks.
Recently, graph neural networks (GNNs) have been widely used for document classification. However, most existing methods are based on static word co-occurrence graphs without sentence-level information, which poses three challenges: (1) word ambiguity, (2) word synonymity, and (3) dynamic contextual dependency. To address these challenges, we propose a novel GNN-based sparse structure learning model for inductive document classification. Specifically, a document-level graph is initially generated by a disjoint union of sentence-level word co-occurrence graphs. Our model collects a set of trainable edges connecting disjoint words between sentences and employs structure learning to sparsely select edges with dynamic contextual dependencies. Graphs with sparse structures can jointly exploit local and global contextual information in documents through GNNs. For inductive learning, the refined document graph is further fed into a general readout function for graph-level classification and optimization in an end-to-end manner. Extensive experiments on several real-world datasets demonstrate that the proposed model outperforms most state-of-the-art results, and reveal the necessity of learning sparse structures for each document.
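The graph construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sliding-window definition of co-occurrence, the `window` parameter, the function names, and the choice to treat every cross-sentence node pair as a candidate edge are all hypothetical, and the structure learning step that sparsely selects among the candidates is not shown.

```python
from itertools import combinations


def sentence_graph(tokens, window=2):
    """Undirected co-occurrence edges between distinct words that appear
    at most `window` positions apart in one sentence."""
    edges = set()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                edges.add(frozenset((w, tokens[j])))
    return edges


def document_graph(sentences, window=2):
    """Disjoint union of sentence graphs plus candidate cross-sentence edges.

    Nodes are (sentence_index, word) pairs, so the same word in two
    sentences yields two separate nodes (the disjoint union).
    """
    nodes, intra = set(), set()
    for s_idx, tokens in enumerate(sentences):
        nodes.update((s_idx, w) for w in tokens)
        for edge in sentence_graph(tokens, window):
            a, b = tuple(edge)
            intra.add(frozenset(((s_idx, a), (s_idx, b))))
    # Candidate edges between sentences; a learned model would keep only a
    # sparse subset of these.
    candidates = {
        frozenset((u, v))
        for u, v in combinations(sorted(nodes), 2)
        if u[0] != v[0]
    }
    return nodes, intra, candidates


if __name__ == "__main__":
    doc = [["bank", "river"], ["bank", "loan"]]
    nodes, intra, candidates = document_graph(doc)
    print(len(nodes), len(intra), len(candidates))
```

Keeping "bank" as two separate nodes lets each occurrence take its meaning from its own sentence, which is one way to read the word-ambiguity challenge named in the abstract.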
The recent proliferation of knowledge graphs (KGs) coupled with incomplete or partial information, in the form of missing relations (links) between entities, has fueled a lot of research on knowledge base completion (also known as relation prediction). Several recent works suggest that convolutional neural network (CNN) based models generate richer and more expressive feature embeddings and hence also perform well on relation prediction. However, we observe that these KG embeddings treat triples independently and thus fail to cover the complex and hidden information that is inherently implicit in the local neighborhood surrounding a triple. To this effect, our paper proposes a novel attention based feature embedding that captures both entity and relation features in any given entity's neighborhood. Additionally, we also encapsulate relation clusters and multihop relations in our model. Our empirical study offers insights into the efficacy of our attention based model and we show marked performance gains in comparison to state of the art methods on all datasets.
Generic object detection, aiming at locating object instances from a large number of predefined categories in natural images, is one of the most fundamental and challenging problems in computer vision. Deep learning techniques have emerged in recent years as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of generic object detection. Given this time of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought by deep learning techniques. More than 250 key contributions are included in this survey, covering many aspects of generic object detection research: leading detection frameworks and fundamental subproblems including object feature representation, object proposal generation, context information modeling and training strategies; evaluation issues, specifically benchmark datasets, evaluation metrics, and state of the art performance. We finish by identifying promising directions for future research.
Salient object detection is a fundamental problem and has received a great deal of attention in computer vision. Recently, deep learning models have become a powerful tool for image feature extraction. In this paper, we propose a multi-scale deep neural network (MSDNN) for salient object detection. The proposed model first extracts global high-level features and context information over the whole source image with a recurrent convolutional neural network (RCNN). Then several stacked deconvolutional layers are adopted to get the multi-scale feature representation and obtain a series of saliency maps. Finally, we investigate a fusion convolution module (FCM) to build a final pixel-level saliency map. The proposed model is extensively evaluated on four salient object detection benchmark datasets. Results show that our deep model significantly outperforms 12 other state-of-the-art approaches.