In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, with fewer challenging MCQA datasets than in English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has focused on the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly focused on how ChatGPT solves the VNHSGE step by step. We aim to create a novel and high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is typed in a strict LaTeX style. We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising results on the MCSB ability of LLMs for Vietnamese. The dataset is available for research purposes only.
The in-context learning ability of large language models (LLMs) enables them to generalize to novel downstream tasks with relatively few labeled examples. However, they require enormous computational resources to be deployed. Alternatively, smaller models can solve specific tasks if fine-tuned with enough labeled examples. These examples, however, are expensive to obtain. In pursuit of the best of both worlds, we study synthetic data generation of fine-tuning training data via fine-tuned teacher LLMs to improve the downstream performance of much smaller models. In four text classification and two text generation tasks, we find that both data generation and annotation dramatically improve the respective downstream model's performance, occasionally necessitating only a minor fraction of the original training dataset.
In this paper, we provide a strategy to determine the eigenvalue decay rate (EDR) of a large class of kernel functions defined on a general domain rather than $\mathbb S^{d}$. This class of kernel functions include but are not limited to the neural tangent kernel associated with neural networks with different depths and various activation functions. After proving that the dynamics of training the wide neural networks uniformly approximated that of the neural tangent kernel regression on general domains, we can further illustrate the minimax optimality of the wide neural network provided that the underground truth function $f\in [\mathcal H_{\mathrm{NTK}}]^{s}$, an interpolation space associated with the RKHS $\mathcal{H}_{\mathrm{NTK}}$ of NTK. We also showed that the overfitted neural network can not generalize well. We believe our approach for determining the EDR of kernels might be also of independent interests.
General large language models (LLMs), represented by ChatGPT, have demonstrated significant potential in tasks such as code generation in software engineering. This has led to the development of specialized LLMs for software engineering, known as Code LLMs. A considerable portion of Code LLMs is derived from general LLMs through model fine-tuning. As a result, Code LLMs are often updated frequently and their performance can be influenced by the base LLMs. However, there is currently a lack of systematic investigation into Code LLMs and their performance. In this study, we conduct a comprehensive survey and analysis of the types of Code LLMs and their differences in performance compared to general LLMs. We aim to address three questions: (1) What LLMs are specifically designed for software engineering tasks, and what is the relationship between these Code LLMs? (2) Do Code LLMs really outperform general LLMs in software engineering tasks? (3) Which LLMs are more proficient in different software engineering tasks? To answer these questions, we first collect relevant literature and work from five major databases and open-source communities, resulting in 134 works for analysis. Next, we categorize the Code LLMs based on their publishers and examine their relationships with general LLMs and among themselves. Furthermore, we investigate the performance differences between general LLMs and Code LLMs in various software engineering tasks to demonstrate the impact of base models and Code LLMs. Finally, we comprehensively maintained the performance of LLMs across multiple mainstream benchmarks to identify the best-performing LLMs for each software engineering task. Our research not only assists developers of Code LLMs in choosing base models for the development of more advanced LLMs but also provides insights for practitioners to better understand key improvement directions for Code LLMs.
In this paper, we explore a continuous modeling approach for deep-learning-based speech enhancement, focusing on the denoising process. We use a state variable to indicate the denoising process. The starting state is noisy speech and the ending state is clean speech. The noise component in the state variable decreases with the change of the state index until the noise component is 0. During training, a UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process. In testing, we introduce a controlling factor as an embedding, ranging from zero to one, to the neural network, allowing us to control the level of noise reduction. This approach enables controllable speech enhancement and is adaptable to various application scenarios. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement, as evidenced by improvements in both objective speech measures and automatic speech recognition performance.
In the era of large language models (LLMs), hallucination (i.e., the tendency to generate factually incorrect content) poses great challenge to trustworthy and reliable deployment of LLMs in real-world applications. To tackle the LLM hallucination, three key questions should be well studied: how to detect hallucinations (detection), why do LLMs hallucinate (source), and what can be done to mitigate them (mitigation). To address these challenges, this work presents a systematic empirical study on LLM hallucination, focused on the the three aspects of hallucination detection, source and mitigation. Specially, we construct a new hallucination benchmark HaluEval 2.0, and designs a simple yet effective detection method for LLM hallucination. Furthermore, we zoom into the different training or utilization stages of LLMs and extensively analyze the potential factors that lead to the LLM hallucination. Finally, we implement and examine a series of widely used techniques to mitigate the hallucinations in LLMs. Our work has led to several important findings to understand the hallucination origin and mitigate the hallucinations in LLMs. Our code and data can be accessed at //github.com/RUCAIBox/HaluEval-2.0.
In this paper, we introduce a new family of descriptors for persistence diagrams. Our approach transforms these diagrams into elements of a finite-dimensional vector space using functionals based on the discrete measures they induce. While our focus is primarily on identity and frequency-based transformations, we do not restrict our approach exclusively to this types of techniques. We term this family of transformations as LITE (Lattice Integrated Topological Embedding) and prove stability for some members of this family against the 1-$Kantorovitch$-$Rubinstein$ metric, ensuring its responsiveness to subtle data variations. Extensive comparative analysis reveals that our descriptor performs competitively with the current state-of-art from the topological data analysis literature, and often surpasses, the existing methods. This research not only introduces an innovative perspective for data scientists but also critiques the current trajectory of literature on methodologies for vectorizing diagrams. It establishes a foundation for future progress in applying persistence diagrams to data analysis and machine learning under a more simple and effective lens.
In this paper, we conducted a comparative evaluation of three RGB-D SLAM (Simultaneous Localization and Mapping) algorithms: RTAB-Map, ORB-SLAM3, and OpenVSLAM for SURENA-V humanoid robot localization and mapping. Our test involves the robot to follow a full circular pattern, with an Intel RealSense D435 RGB-D camera installed on its head. In assessing localization accuracy, ORB-SLAM3 outperformed the others with an ATE of 0.1073, followed by RTAB-Map at 0.1641 and OpenVSLAM at 0.1847. However, it should be noted that both ORB-SLAM3 and OpenVSLAM faced challenges in maintaining accurate odometry when the robot encountered a wall with limited feature points. Nevertheless, OpenVSLAM demonstrated the ability to detect loop closures and successfully relocalize itself within the map when the robot approached its initial location. The investigation also extended to mapping capabilities, where RTAB-Map excelled by offering diverse mapping outputs, including dense, OctoMap, and occupancy grid maps. In contrast, both ORB-SLAM3 and OpenVSLAM provided only sparse maps.
In this paper, we propose an information geometry approach (IGA) for signal detection (SD) in ultra-massive multiple-input multiple-output (MIMO) systems. We formulate the signal detection as obtaining the marginals of the a posteriori probability distribution of the transmitted symbol vector. Then, a maximization of the a posteriori marginals (MPM) for signal detection can be performed. With the information geometry theory, we calculate the approximations of the a posteriori marginals. It is formulated as an iterative m-projection process between submanifolds with different constraints. We then apply the central-limit-theorem (CLT) to simplify the calculation of the m-projection since the direct calculation of the m-projection is of exponential-complexity. With the CLT, we obtain an approximate solution of the m-projection, which is asymptotically accurate. Simulation results demonstrate that the proposed IGA-SD emerges as a promising and efficient method to implement the signal detector in ultra-massive MIMO systems.
In Part II of this two-part paper, we prove the convergence of the simplified information geometry approach (SIGA) proposed in Part I. For a general Bayesian inference problem, we first show that the iteration of the common second-order natural parameter (SONP) is separated from that of the common first-order natural parameter (FONP). Hence, the convergence of the common SONP can be checked independently. We show that with the initialization satisfying a specific but large range, the common SONP is convergent regardless of the value of the damping factor. For the common FONP, we establish a sufficient condition of its convergence and prove that the convergence of the common FONP relies on the spectral radius of a particular matrix related to the damping factor. We give the range of the damping factor that guarantees the convergence in the worst case. Further, we determine the range of the damping factor for massive MIMO-OFDM channel estimation by using the specific properties of the measurement matrices. Simulation results are provided to confirm the theoretical results.
While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.