3D stylization, which entails the application of specific styles to three-dimensional objects, holds significant commercial potential as it enables the creation of diverse 3D objects with distinct moods and styles, tailored to specific demands of different scenes. With recent advancements in text-driven methods and artificial intelligence, the stylization process is increasingly intuitive and automated, thereby diminishing the reliance on manual labor and expertise. However, existing methods have predominantly focused on holistic stylization, thereby leaving the application of styles to individual components of a 3D object unexplored. In response, we introduce 3DStyleGLIP, a novel framework specifically designed for text-driven, part-tailored 3D stylization. Given a 3D mesh and a text prompt, 3DStyleGLIP leverages the vision-language embedding space of the Grounded Language-Image Pre-training (GLIP) model to localize the individual parts of the 3D mesh and modify their colors and local geometries to align them with the desired styles specified in the text prompt. 3DStyleGLIP is effectively trained for 3D stylization tasks through a part-level style loss working in GLIP's embedding space, supplemented by two complementary learning techniques. Extensive experimental validation confirms that our method achieves significant part-wise stylization capabilities, demonstrating promising potential in advancing the field of 3D stylization.
Legged robots navigating cluttered environments must be jointly agile for efficient task execution and safe to avoid collisions with obstacles or humans. Existing studies either develop conservative controllers (< 1.0 m/s) to ensure safety, or focus on agility without considering potentially fatal collisions. This paper introduces Agile But Safe (ABS), a learning-based control framework that enables agile and collision-free locomotion for quadrupedal robots. ABS involves an agile policy to execute agile motor skills amidst obstacles and a recovery policy to prevent failures, collaboratively achieving high-speed and collision-free navigation. The policy switch in ABS is governed by a learned control-theoretic reach-avoid value network, which also guides the recovery policy as an objective function, thereby safeguarding the robot in a closed loop. The training process involves the learning of the agile policy, the reach-avoid value network, the recovery policy, and an exteroception representation network, all in simulation. These trained modules can be directly deployed in the real world with onboard sensing and computation, leading to high-speed and collision-free navigation in confined indoor and outdoor spaces with both static and dynamic obstacles.
Urban spatio-temporal prediction is crucial for informed decision-making, such as transportation management, resource optimization, and urban planning. Although pretrained foundation models for natural languages have experienced remarkable breakthroughs, wherein one general-purpose model can tackle multiple tasks across various domains, urban spatio-temporal modeling lags behind. Existing approaches for urban prediction are usually tailored for specific spatio-temporal scenarios, requiring task-specific model designs and extensive in-domain training data. In this work, we propose a universal model, UniST, for urban spatio-temporal prediction. Drawing inspiration from large language models, UniST achieves success through: (i) flexibility towards diverse spatio-temporal data characteristics, (ii) effective generative pre-training with elaborated masking strategies to capture complex spatio-temporal relationships, (iii) spatio-temporal knowledge-guided prompts that align and leverage intrinsic and shared knowledge across scenarios. These designs together unlock the potential of a one-for-all model for spatio-temporal prediction with powerful generalization capability. Extensive experiments on 15 cities and 6 domains demonstrate the universality of UniST in advancing state-of-the-art prediction performance, especially in few-shot and zero-shot scenarios.
Human annotation is a time-consuming task that requires a significant amount of effort. To address this issue, interactive data annotation utilizes an annotation model to provide suggestions for humans to approve or correct. However, annotation models trained with limited labeled data are prone to generating incorrect suggestions, leading to extra human correction effort. To tackle this challenge, we propose Araida, an analogical reasoning-based approach that enhances automatic annotation accuracy in the interactive data annotation setting and reduces the need for human corrections. Araida involves an error-aware integration strategy that dynamically coordinates an annotation model and a k-nearest neighbors (KNN) model, giving more importance to KNN's predictions when predictions from the annotation model are deemed inaccurate. Empirical studies demonstrate that Araida is adaptable to different annotation tasks and models. On average, it reduces human correction labor by 11.02% compared to vanilla interactive data annotation methods.
Learning representations through self-supervision on unlabeled data has proven highly effective for understanding diverse images. However, remote sensing images often have complex and densely populated scenes with multiple land objects and no clear foreground objects. This intrinsic property generates high object density, resulting in false positive pairs or missing contextual information in self-supervised learning. To address these problems, we propose a context-enhanced masked image modeling method (CtxMIM), a simple yet efficient MIM-based self-supervised learning for remote sensing image understanding. CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. A context-enhanced generative branch is introduced to provide contextual information through context consistency constraints in the reconstruction. With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset without specific temporal or geographical constraints. Finally, extensive experiments show that features learned by CtxMIM outperform fully supervised and state-of-the-art self-supervised learning methods on various downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results demonstrate that CtxMIM learns impressive remote sensing representations with high generalization and transferability. Code and data will be made public available.
This paper pursues the insight that language models naturally enable an intelligent variation operator similar in spirit to evolutionary crossover. In particular, language models of sufficient scale demonstrate in-context learning, i.e. they can learn from associations between a small number of input patterns to generate outputs incorporating such associations (also called few-shot prompting). This ability can be leveraged to form a simple but powerful variation operator, i.e. to prompt a language model with a few text-based genotypes (such as code, plain-text sentences, or equations), and to parse its corresponding output as those genotypes' offspring. The promise of such language model crossover (which is simple to implement and can leverage many different open-source language models) is that it enables a simple mechanism to evolve semantically-rich text representations (with few domain-specific tweaks), and naturally benefits from current progress in language models. Experiments in this paper highlight the versatility of language-model crossover, through evolving binary bit-strings, sentences, equations, text-to-image prompts, and Python code. The conclusion is that language model crossover is a promising method for evolving genomes representable as text.
We present cVIL, a class-centric approach to visual interactive labeling, which facilitates human annotation of large and complex image data sets. cVIL uses different property measures to support instance labeling for labeling difficult instances and batch labeling to quickly label easy instances. Simulated experiments reveal that cVIL with batch labeling can outperform traditional labeling approaches based on active learning. In a user study, cVIL led to better accuracy and higher user preference compared to a traditional instance-based visual interactive labeling approach based on 2D scatterplots.
Within recent times, cybercriminals have curated a variety of organised and resolute cyber attacks within a range of cyber systems, leading to consequential ramifications to private and governmental institutions. Current security-based automation and orchestrations focus on automating fixed purpose and hard-coded solutions, which are easily surpassed by modern-day cyber attacks. Research within Automated Cyber Defence will allow the development and enabling intelligence response by autonomously defending networked systems through sequential decision-making agents. This article comprehensively elaborates the developments within Automated Cyber Defence through a requirement analysis divided into two sub-areas, namely, automated defence and attack agents and Autonomous Cyber Operation (ACO) Gyms. The requirement analysis allows the comparison of automated agents and highlights the importance of ACO Gyms for their continual development. The requirement analysis is also used to critique ACO Gyms with an overall aim to develop them for deploying automated agents within real-world networked systems. Relevant future challenges were addressed from the overall analysis to accelerate development within the area of Automated Cyber Defence.
Generative commonsense reasoning which aims to empower machines to generate sentences with the capacity of reasoning over a set of concepts is a critical bottleneck for text generation. Even the state-of-the-art pre-trained language generation models struggle at this task and often produce implausible and anomalous sentences. One reason is that they rarely consider incorporating the knowledge graph which can provide rich relational information among the commonsense concepts. To promote the ability of commonsense reasoning for text generation, we propose a novel knowledge graph augmented pre-trained language generation model KG-BART, which encompasses the complex relations of concepts through the knowledge graph and produces more logical and natural sentences as output. Moreover, KG-BART can leverage the graph attention to aggregate the rich concept semantics that enhances the model generalization on unseen concept sets. Experiments on benchmark CommonGen dataset verify the effectiveness of our proposed approach by comparing with several strong pre-trained language generation models, particularly KG-BART outperforms BART by 5.80, 4.60, in terms of BLEU-3, 4. Moreover, we also show that the generated context by our model can work as background scenarios to benefit downstream commonsense QA tasks.
To provide more accurate, diverse, and explainable recommendation, it is compulsory to go beyond modeling user-item interactions and take side information into account. Traditional methods like factorization machine (FM) cast it as a supervised learning problem, which assumes each interaction as an independent instance with side information encoded. Due to the overlook of the relations among instances or items (e.g., the director of a movie is also an actor of another movie), these methods are insufficient to distill the collaborative signal from the collective behaviors of users. In this work, we investigate the utility of knowledge graph (KG), which breaks down the independent interaction assumption by linking items with their attributes. We argue that in such a hybrid structure of KG and user-item graph, high-order relations --- which connect two items with one or multiple linked attributes --- are an essential factor for successful recommendation. We propose a new method named Knowledge Graph Attention Network (KGAT) which explicitly models the high-order connectivities in KG in an end-to-end fashion. It recursively propagates the embeddings from a node's neighbors (which can be users, items, or attributes) to refine the node's embedding, and employs an attention mechanism to discriminate the importance of the neighbors. Our KGAT is conceptually advantageous to existing KG-based recommendation methods, which either exploit high-order relations by extracting paths or implicitly modeling them with regularization. Empirical results on three public benchmarks show that KGAT significantly outperforms state-of-the-art methods like Neural FM and RippleNet. Further studies verify the efficacy of embedding propagation for high-order relation modeling and the interpretability benefits brought by the attention mechanism.
Recurrent neural nets (RNN) and convolutional neural nets (CNN) are widely used on NLP tasks to capture the long-term and local dependencies, respectively. Attention mechanisms have recently attracted enormous interest due to their highly parallelizable computation, significantly less training time, and flexibility in modeling dependencies. We propose a novel attention mechanism in which the attention between elements from input sequence(s) is directional and multi-dimensional (i.e., feature-wise). A light-weight neural net, "Directional Self-Attention Network (DiSAN)", is then proposed to learn sentence embedding, based solely on the proposed attention without any RNN/CNN structure. DiSAN is only composed of a directional self-attention with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation. Despite its simple form, DiSAN outperforms complicated RNN models on both prediction quality and time efficiency. It achieves the best test accuracy among all sentence encoding methods and improves the most recent best result by 1.02% on the Stanford Natural Language Inference (SNLI) dataset, and shows state-of-the-art test accuracy on the Stanford Sentiment Treebank (SST), Multi-Genre natural language inference (MultiNLI), Sentences Involving Compositional Knowledge (SICK), Customer Review, MPQA, TREC question-type classification and Subjectivity (SUBJ) datasets.