ChatGPT has piqued the interest of many fields, particularly in the academic community. GPT-4, the latest version, starts supporting multimodal input and output. This study examines social media posts to analyze how the Chinese public perceives the potential of ChatGPT for educational and general purposes. The study also serves as the first effort to investigate the changes in public opinion since the release of GPT-4. According to the analysis results, prior to GPT-4, although some social media users believed that AI advancements would benefit education and society, some believed that advanced AI, such as ChatGPT, would make humans feel inferior and lead to problems such as cheating and a decline in moral principles, while the majority remain neutral. Interestingly, public attitudes have tended to shift in a positive direction since the release of GPT-4. We present a thorough analysis of the trending shift and a roadmap to ensure the ethical application of ChatGPT-like models in education and beyond.
Large vision-language models (VLMs) are widely getting adopted in industry and academia. In this work we build a unified framework to systematically evaluate gender-profession bias in VLMs. Our evaluation encompasses all supported inference modes of the recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. We construct a synthetic, high-quality dataset of text and images that blurs gender distinctions across professional actions to benchmark gender bias. In our benchmarking of recent vision-language models (VLMs), we observe that different input-output modalities result in distinct bias magnitudes and directions. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.
Over the past two decades, research in the field of Simultaneous Localization and Mapping (SLAM) has undergone a significant evolution, highlighting its critical role in enabling autonomous exploration of unknown environments. This evolution ranges from hand-crafted methods, through the era of deep learning, to more recent developments focused on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) representations. Recognizing the growing body of research and the absence of a comprehensive survey on the topic, this paper aims to provide the first comprehensive overview of SLAM progress through the lens of the latest advancements in radiance fields. It sheds light on the background, evolutionary path, inherent strengths and limitations, and serves as a fundamental reference to highlight the dynamic progress and specific challenges.
The subjective perception of emotion leads to inconsistent labels from human annotators. Typically, utterances lacking majority-agreed labels are excluded when training an emotion classifier, which cause problems when encountering ambiguous emotional expressions during testing. This paper investigates three methods to handle ambiguous emotion. First, we show that incorporating utterances without majority-agreed labels as an additional class in the classifier reduces the classification performance of the other emotion classes. Then, we propose detecting utterances with ambiguous emotions as out-of-domain samples by quantifying the uncertainty in emotion classification using evidential deep learning. This approach retains the classification accuracy while effectively detects ambiguous emotion expressions. Furthermore, to obtain fine-grained distinctions among ambiguous emotions, we propose representing emotion as a distribution instead of a single class label. The task is thus re-framed from classification to distribution estimation where every individual annotation is taken into account, not just the majority opinion. The evidential uncertainty measure is extended to quantify the uncertainty in emotion distribution estimation. Experimental results on the IEMOCAP and CREMA-D datasets demonstrate the superior capability of the proposed method in terms of majority class prediction, emotion distribution estimation, and uncertainty estimation.
The adoption of Artificial Intelligence (AI) based Virtual Network Functions (VNFs) has witnessed significant growth, posing a critical challenge in orchestrating AI models within next-generation 6G networks. Finding optimal AI model placement is significantly more challenging than placing traditional software-based VNFs, due to the introduction of numerous uncertain factors by AI models, such as varying computing resource consumption, dynamic storage requirements, and changing model performance. To address the AI model placement problem under uncertainties, this paper presents a novel approach employing a sequence-to-sequence (S2S) neural network which considers uncertainty estimations. The S2S model, characterized by its encoding-decoding architecture, is designed to take the service chain with a number of AI models as input and produce the corresponding placement of each AI model. To address the introduced uncertainties, our methodology incorporates the orthonormal certificate module for uncertainty estimation and utilizes fuzzy logic for uncertainty representation, thereby enhancing the capabilities of the S2S model. Experiments demonstrate that the proposed method achieves competitive results across diverse AI model profiles, network environments, and service chain requests.
Recently, Foundation Models (FMs), with their extensive knowledge bases and complex architectures, have offered unique opportunities within the realm of recommender systems (RSs). In this paper, we attempt to thoroughly examine FM-based recommendation systems (FM4RecSys). We start by reviewing the research background of FM4RecSys. Then, we provide a systematic taxonomy of existing FM4RecSys research works, which can be divided into four different parts including data characteristics, representation learning, model type, and downstream tasks. Within each part, we review the key recent research developments, outlining the representative models and discussing their characteristics. Moreover, we elaborate on the open problems and opportunities of FM4RecSys aiming to shed light on future research directions in this area. In conclusion, we recap our findings and discuss the emerging trends in this field.
Code-trained language models have proven to be highly effective for various code intelligence tasks. However, they can be challenging to train and deploy for many software engineering applications due to computational bottlenecks and memory constraints. Implementing effective strategies to address these issues requires a better understanding of these 'black box' models. In this paper, we perform the first neuron-level analysis for source code models to identify \textit{important} neurons within latent representations. We achieve this by eliminating neurons that are highly similar or irrelevant to the given task. This approach helps us understand which neurons and layers can be eliminated (redundancy analysis) and where important code properties are located within the network (concept analysis). Using redundancy analysis, we make observations relevant to knowledge transfer and model optimization applications. We find that over 95\% of the neurons are redundant with respect to our code intelligence tasks and can be eliminated without significant loss in accuracy. We also discover several subsets of neurons that can make predictions with baseline accuracy. Through concept analysis, we explore the traceability and distribution of human-recognizable concepts within latent code representations which could be used to influence model predictions. We trace individual and subsets of important neurons to specific code properties and identify 'number' neurons, 'string' neurons, and higher-level 'text' neurons for token-level tasks and higher-level concepts important for sentence-level downstream tasks. This also helps us understand how decomposable and transferable task-related features are and can help devise better techniques for transfer learning, model compression, and the decomposition of deep neural networks into modules.
Humans learn language via multi-modal knowledge. However, due to the text-only pre-training scheme, most existing pre-trained language models (PLMs) are hindered from the multi-modal information. To inject visual knowledge into PLMs, existing methods incorporate either the text or image encoder of vision-language models (VLMs) to encode the visual information and update all the original parameters of PLMs for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, to flexibly leverage the aligned visual and textual knowledge learned in pre-trained VLMs and efficiently inject them into PLMs. Specifically, we insert X-adapters into PLMs, and only the added parameters are updated during adaptation. To fully exploit the potential in VLMs, X-adapters consist of two sub-modules, V-expert and T-expert, to fuse VLMs' image and text representations, respectively. We can opt for activating different sub-modules depending on the downstream tasks. Experimental results show that our method can significantly improve the performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
Recently, Mutual Information (MI) has attracted attention in bounding the generalization error of Deep Neural Networks (DNNs). However, it is intractable to accurately estimate the MI in DNNs, thus most previous works have to relax the MI bound, which in turn weakens the information theoretic explanation for generalization. To address the limitation, this paper introduces a probabilistic representation of DNNs for accurately estimating the MI. Leveraging the proposed MI estimator, we validate the information theoretic explanation for generalization, and derive a tighter generalization bound than the state-of-the-art relaxations.
Object detectors usually achieve promising results with the supervision of complete instance annotations. However, their performance is far from satisfactory with sparse instance annotations. Most existing methods for sparsely annotated object detection either re-weight the loss of hard negative samples or convert the unlabeled instances into ignored regions to reduce the interference of false negatives. We argue that these strategies are insufficient since they can at most alleviate the negative effect caused by missing annotations. In this paper, we propose a simple but effective mechanism, called Co-mining, for sparsely annotated object detection. In our Co-mining, two branches of a Siamese network predict the pseudo-label sets for each other. To enhance multi-view learning and better mine unlabeled instances, the original image and corresponding augmented image are used as the inputs of two branches of the Siamese network, respectively. Co-mining can serve as a general training mechanism applied to most of modern object detectors. Experiments are performed on MS COCO dataset with three different sparsely annotated settings using two typical frameworks: anchor-based detector RetinaNet and anchor-free detector FCOS. Experimental results show that our Co-mining with RetinaNet achieves 1.4%~2.1% improvements compared with different baselines and surpasses existing methods under the same sparsely annotated setting.
Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy, computation and memory intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically in regards to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. We analyze the accuracy, advantages, disadvantages, and potential solutions to the problems with the techniques in each category. We also discuss new evaluation metrics as a guideline for future research.