In this paper, we propose a new distillation method for extracting knowledge from Large Foundation Models (LFMs) into lightweight models, introducing a novel supervision mode that does not require manually annotated data. While LFMs exhibit exceptional zero-shot classification abilities across datasets, relying solely on LFM-generated embeddings for distillation poses two main challenges: the LFM's task-irrelevant knowledge and the high density of its features. The transfer of task-irrelevant knowledge can compromise the student model's discriminative capabilities, and the high density of features within target domains obstructs the extraction of discriminative knowledge essential for the task. To address these issues, we introduce the Proxy Relational Graph (PRG) method. We first extract task-relevant knowledge from LFMs by calculating a weighted average of logits obtained through text prompt embeddings. We then construct sample-class proxy graphs for the LFM and the student model, respectively, to model the correlation between samples and class proxies. Finally, we achieve selective knowledge distillation by aligning the relational graphs produced by the LFM and the student model. Specifically, the distillation from the LFM to the student model is achieved through two types of alignment: 1) aligning the sample nodes produced by the student model with those produced by the LFM, and 2) aligning the edge relationships in the student model's graph with those in the LFM's graph. Our experimental results validate the effectiveness of PRG, demonstrating its ability to leverage the extensive knowledge base of LFMs while skillfully circumventing their inherent limitations in focused learning scenarios. Notably, in our annotation-free framework, PRG achieves an accuracy of 76.23% (T: 77.9%) on CIFAR-100 and 72.44% (T: 75.3%) on ImageNet-1K.
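To make the two alignments concrete, below is a minimal sketch of what such a graph-alignment objective could look like, assuming the student's features have already been projected into the LFM's embedding space; the function and variable names (edge_weights, prg_loss, the proxy tensors) are hypothetical illustrations, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def edge_weights(nodes, proxies, tau=0.1):
    # Graph edges: temperature-scaled cosine similarity between each
    # sample node and every class-proxy node, normalized per sample.
    sims = F.normalize(nodes, dim=1) @ F.normalize(proxies, dim=1).T
    return F.softmax(sims / tau, dim=1)

def prg_loss(student_feats, lfm_feats, student_proxies, lfm_proxies):
    # 1) Node alignment: pull the student's sample nodes toward the LFM's.
    node_loss = 1.0 - F.cosine_similarity(student_feats, lfm_feats, dim=1).mean()
    # 2) Edge alignment: match the student's sample-to-proxy relational
    #    structure to the LFM's via a KL divergence over edge weights.
    edge_loss = F.kl_div(
        edge_weights(student_feats, student_proxies).log(),
        edge_weights(lfm_feats, lfm_proxies),
        reduction="batchmean",
    )
    return node_loss + edge_loss
```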
We introduce a novel anchor-free contrastive learning (AFCL) method leveraging our proposed Similarity-Orthogonality (SimO) loss. Our approach minimizes a semi-metric discriminative loss function that simultaneously optimizes two key objectives: reducing the distance and orthogonality between embeddings of similar inputs while maximizing these metrics for dissimilar inputs, facilitating more fine-grained contrastive learning. The AFCL method, powered by SimO loss, creates a fiber bundle topological structure in the embedding space, forming class-specific, internally cohesive yet orthogonal neighborhoods. We validate the efficacy of our method on the CIFAR-10 dataset, providing visualizations that demonstrate the impact of SimO loss on the embedding space. Our results illustrate the formation of distinct, orthogonal class neighborhoods, showcasing the method's ability to create well-structured embeddings that balance class separation with intra-class variability. This work opens new avenues for understanding and leveraging the geometric properties of learned representations in various machine learning tasks.
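As an illustration of the intuition rather than the paper's exact formulation, the sketch below combines a squared-distance term with a squared-cosine orthogonality term: same-class pairs are pulled close and into alignment, while different-class pairs are pushed apart and toward orthogonality. The margin value and the equal weighting of the two terms are assumptions.

```python
import torch
import torch.nn.functional as F

def simo_loss(z, labels, margin=2.0):
    # z: (N, d) embeddings; labels: (N,) integer class ids.
    d2 = torch.cdist(z, z).pow(2)                  # pairwise squared distances
    zn = F.normalize(z, dim=1)
    cos2 = (zn @ zn.T).pow(2)                      # squared cosine (alignment)
    same = labels[:, None].eq(labels[None, :]).float()
    off_diag = 1.0 - torch.eye(len(z), device=z.device)
    # Similar pairs: small distance AND low orthogonality (cos^2 -> 1).
    pos = same * off_diag * (d2 + (1.0 - cos2))
    # Dissimilar pairs: pushed apart (margin on distance) AND toward
    # orthogonality (cos^2 -> 0).
    neg = (1.0 - same) * (F.relu(margin - d2) + cos2)
    return (pos + neg).sum() / off_diag.sum()
```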
In this paper, we propose HE-Drive: the first human-like-centric end-to-end autonomous driving system that generates trajectories which are both temporally consistent and comfortable. Recent studies have shown that imitation learning-based planners and learning-based trajectory scorers can effectively generate and select accurate trajectories that closely mimic expert demonstrations. However, such trajectory planners and scorers face the dilemma of producing temporally inconsistent and uncomfortable trajectories. To solve these problems, HE-Drive first extracts key 3D spatial representations through sparse perception, which then serve as conditional inputs for a Conditional Denoising Diffusion Probabilistic Model (DDPM)-based motion planner that generates temporally consistent multi-modal trajectories. A Vision-Language Model (VLM)-guided trajectory scorer subsequently selects the most comfortable trajectory from these candidates to control the vehicle, ensuring human-like end-to-end driving. Experiments show that HE-Drive not only achieves state-of-the-art performance (i.e., reduces the average collision rate by 71% compared to VAD) and efficiency (i.e., 1.9X faster than SparseDrive) on the challenging nuScenes and OpenScene datasets, but also provides the most comfortable driving experience on real-world data. For more information, visit the project website: //jmwang0117.github.io/HE-Drive/.
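For readers unfamiliar with diffusion-based planning, the sketch below shows the general shape of a conditional DDPM reverse process that could emit multi-modal waypoint trajectories from a perception-derived condition. The denoiser interface, the step count, and the linear noise schedule are illustrative assumptions, not HE-Drive's actual implementation.

```python
import torch

@torch.no_grad()
def sample_trajectories(denoiser, cond, n_modes=6, horizon=8, steps=50):
    # Conditional DDPM reverse process over (x, y) waypoints; `denoiser`
    # is assumed to predict the noise given the noisy trajectory, the
    # timestep, and the sparse-perception condition `cond`.
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(n_modes, horizon, 2)          # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.full((n_modes,), t), cond)
        # Standard DDPM posterior mean for step t.
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x  # multi-modal candidates handed to the VLM-guided scorer
```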
In this paper, we present noise-domain non-orthogonal multiple access (ND-NOMA), an innovative communication scheme that conveys information by modulating the mean and variance of artificial noise. Distinct from traditional methods such as power-domain non-orthogonal multiple access (PD-NOMA), which relies heavily on successive interference cancellation (SIC), ND-NOMA operates in the noise domain, considerably reducing power consumption and system complexity. Inspired by noise modulation, ND-NOMA enhances energy efficiency and provides a lower bit error probability (BEP), making it highly suitable for next-generation Internet of Things (IoT) networks. Our theoretical analyses and computer simulations reveal that ND-NOMA can achieve exceptionally low bit error rates in both uplink and downlink scenarios in the presence of Rician fading channels. The proposed multi-user system is supported by a minimum distance detector for mean detection and a threshold-based detector for variance detection, ensuring robust communication in low-power environments. By leveraging the inherent properties of noise, ND-NOMA offers a promising platform for long-term deployments of low-cost and low-complexity devices.
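As a toy illustration of the detection principle (AWGN only; the Rician fading and the paper's exact constellation are omitted), one user's bit could be carried by the noise mean and another's by its variance, recovered as sketched below. The mean and variance levels, the block length, and the midpoint threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
MEANS = np.array([-1.0, 1.0])   # hypothetical mean levels for user 1's bit
VARS = np.array([0.5, 2.0])     # hypothetical variance levels for user 2's bit

def transmit(b1, b2, n=128):
    # One block of artificial-noise samples carrying both users' bits.
    return rng.normal(MEANS[b1], np.sqrt(VARS[b2]), size=n)

def detect(y):
    # Minimum-distance detector for the mean (user 1).
    m_hat = int(np.argmin(np.abs(y.mean() - MEANS)))
    # Threshold-based detector for the variance (user 2): sample variance
    # around the detected mean is compared to the midpoint of the levels.
    v_hat = int(np.mean((y - MEANS[m_hat]) ** 2) > VARS.mean())
    return m_hat, v_hat

# Quick sanity check of the toy detectors over all four bit combinations.
hits = sum(detect(transmit(b1, b2)) == (b1, b2)
           for b1 in (0, 1) for b2 in (0, 1))
```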
Programming instructors often conduct collaborative learning activities, such as Peer Instruction, to foster deeper understanding in students and enhance their engagement with learning. These activities, however, may not always yield productive outcomes due to the diversity of students' mental models and their ineffective collaboration. In this work, we introduce VizGroup, an AI-assisted system that enables programming instructors to easily oversee students' real-time collaborative learning behaviors during large programming courses. VizGroup leverages Large Language Models (LLMs) to recommend event specifications for instructors so that they can simultaneously track and receive alerts about key correlation patterns between various collaboration metrics and ongoing coding tasks. We evaluated VizGroup with 12 instructors in a comparison study using a dataset collected from a Peer Instruction activity conducted in a large programming lecture. The results showed that VizGroup helped instructors effectively gain an overview of, narrow down, and track nuances in students' behaviors.
In this paper, we introduce CalliffusionV2, a novel system designed to produce natural Chinese calligraphy with flexible multi-modal control. Unlike previous approaches that rely solely on image or text inputs and lack fine-grained control, our system leverages both images, to guide generation at a fine-grained level, and natural language text, to describe the desired features of the output. CalliffusionV2 excels at creating a broad range of characters and can quickly learn new styles through a few-shot learning approach. It is also capable of generating non-Chinese characters without prior training. Comprehensive tests confirm that our system produces calligraphy that is both stylistically accurate and recognizable by neural network classifiers and human evaluators.
We present CoDEx, a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false. To characterize CoDEx, we contribute thorough empirical analyses and benchmarking experiments. First, we analyze each CoDEx dataset in terms of logical relation patterns. Next, we report baseline link prediction and triple classification results on CoDEx for five extensively tuned embedding models. Finally, we differentiate CoDEx from the popular FB15K-237 knowledge graph completion dataset by showing that CoDEx covers more diverse and interpretable content, and is a more difficult link prediction benchmark. Data, code, and pretrained models are available at //bit.ly/2EPbrJs.
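The sketch below shows one way the verified hard negatives could be used for triple classification evaluation: a triple is predicted true when its model score clears a per-relation threshold tuned on validation data. The score_fn interface, the threshold tensor, and the (head, relation, tail) column layout are assumptions rather than the benchmark's released code.

```python
import torch

def triple_classification_accuracy(score_fn, pos_triples, neg_triples, thresholds):
    # pos_triples / neg_triples: (N, 3) long tensors of (head, rel, tail) ids;
    # thresholds: (num_relations,) tensor of per-relation decision thresholds.
    correct, total = 0, 0
    for triples, label in ((pos_triples, 1), (neg_triples, 0)):
        scores = score_fn(triples[:, 0], triples[:, 1], triples[:, 2])
        preds = (scores >= thresholds[triples[:, 1]]).long()
        correct += (preds == label).sum().item()
        total += len(triples)
    return correct / total
```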
We propose a knowledge-enhanced approach, ERNIE-ViL, to learn joint representations of vision and language. ERNIE-ViL constructs detailed semantic connections (objects, attributes of objects, and relationships between objects in visual scenes) across vision and language, which are essential to vision-language cross-modal tasks. Incorporating knowledge from scene graphs, ERNIE-ViL introduces Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction, in the pre-training phase. More specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can model the joint representation characterizing the alignment of detailed semantics across vision and language. Pre-trained on two large image-text alignment datasets (Conceptual Captions and SBU), ERNIE-ViL learns better and more robust joint representations. After fine-tuning, it achieves state-of-the-art performance on 5 vision-language downstream tasks. Furthermore, it ranked first on the VCR leaderboard with an absolute improvement of 3.7%.
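The toy sketch below illustrates the flavor of the Scene Graph Prediction objectives: tokens that a scene-graph parser tags as objects, attributes, or relations are masked for the cross-modal encoder to recover. The parser output format, the masking probability, and the function name are assumptions; ERNIE-ViL's actual pipeline differs in detail.

```python
import random

def mask_scene_graph_tokens(tokens, node_types, mask_token="[MASK]", p=0.3):
    # Only tokens tagged as scene-graph nodes are masking candidates,
    # giving one prediction task per node type (object/attribute/relation).
    masked, targets = list(tokens), {}
    for i, t in enumerate(node_types):
        if t in ("object", "attribute", "relation") and random.random() < p:
            targets[i] = (tokens[i], t)   # (gold token, prediction task)
            masked[i] = mask_token
    return masked, targets

tokens = ["a", "brown", "dog", "chasing", "a", "white", "cat"]
types  = [None, "attribute", "object", "relation", None, "attribute", "object"]
masked, targets = mask_scene_graph_tokens(tokens, types)
```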
In this paper, we consider an interesting problem: salient instance segmentation. Beyond producing bounding boxes, our network also outputs high-quality instance-level segments. Taking into account the category-independent property of each target, we design a single-stage salient instance segmentation framework with a novel segmentation branch. Our new branch considers not only the local context inside each detection window but also its surrounding context, enabling us to distinguish instances in the same scope even under occlusion. Our network is end-to-end trainable and runs at a fast speed (40 fps when processing a 320x320 image). We evaluate our approach on a publicly available benchmark and show that it outperforms alternative solutions. We also provide a thorough analysis of the design choices to help readers better understand the function of each part of our network. The source code can be found at \url{//github.com/RuochenFan/S4Net}.
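The core idea of pooling both the detection window and its surrounding context can be sketched with an enlarged RoI, as below; the expansion factor, feature stride, and fusion by concatenation are illustrative guesses, not S4Net's exact branch design.

```python
import torch
from torchvision.ops import roi_align

def context_aware_roi_features(feat, boxes, out_size=14, expand=2.0, scale=1/8):
    # feat: (1, C, H, W) feature map; boxes: (K, 4) xyxy detections in
    # image coordinates. Pool each window plus an enlarged copy of it so
    # the segmentation head sees the surrounding context as well.
    cx = boxes[:, [0, 2]].mean(dim=1, keepdim=True)
    cy = boxes[:, [1, 3]].mean(dim=1, keepdim=True)
    w = (boxes[:, 2] - boxes[:, 0]).unsqueeze(1) * expand
    h = (boxes[:, 3] - boxes[:, 1]).unsqueeze(1) * expand
    big = torch.cat([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    idx = torch.zeros(len(boxes), 1)  # batch indices for a single image
    inner = roi_align(feat, torch.cat([idx, boxes], dim=1), out_size, scale)
    outer = roi_align(feat, torch.cat([idx, big], dim=1), out_size, scale)
    return torch.cat([inner, outer], dim=1)  # fused local + surrounding context
```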
We present MMKG, a collection of three knowledge graphs that contain both numerical features and (links to) images for all entities, as well as entity alignments between pairs of KGs. The multi-relational link prediction and entity matching communities can therefore benefit from this resource. We believe this dataset has the potential to facilitate the development of novel multi-modal learning approaches for knowledge graphs. We validate the utility of MMKG in the sameAs link prediction task with an extensive set of experiments. These experiments show that the task at hand benefits from learning multiple feature types.
We study the problem of learning to reason in large-scale knowledge graphs (KGs). More specifically, we describe a novel reinforcement learning framework for learning multi-hop relational paths: we use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path. In contrast to prior work, our approach includes a reward function that takes accuracy, diversity, and efficiency into consideration. Experimentally, we show that our proposed method outperforms a path-ranking-based algorithm and knowledge graph embedding methods on the Freebase and Never-Ending Language Learning datasets.
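A sketch of how such a composite reward could be assembled is given below, assuming cosine similarity between path embeddings as the diversity measure; the weights and the exact form of each term follow the abstract's description rather than the paper's precise definitions.

```python
import numpy as np

def path_reward(reached_target, path_embedding, found_paths, length,
                w_acc=1.0, w_eff=0.2, w_div=0.2):
    # Accuracy: did the sampled relational path reach the target entity?
    r_acc = 1.0 if reached_target else -1.0
    # Efficiency: shorter multi-hop paths earn a larger reward.
    r_eff = 1.0 / length
    # Diversity: penalize paths similar to those already discovered.
    if found_paths:
        sims = [np.dot(path_embedding, p) /
                (np.linalg.norm(path_embedding) * np.linalg.norm(p))
                for p in found_paths]
        r_div = -float(np.mean(sims))
    else:
        r_div = 0.0
    return w_acc * r_acc + w_eff * r_eff + w_div * r_div
```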