Open-ended questions in surveys are valuable because they do not constrain the respondent's answer, thereby avoiding biases. However, answers to open-ended questions are text data which are harder to analyze. Traditionally, answers were manually classified as specified in the coding manual. Most of the effort to automate coding has gone into the easier problem of single label prediction, where answers are classified into a single code. However, open-ends that require multi-label classification, i.e., that are assigned multiple codes, occur frequently. This paper focuses on multi-label classification of text answers to open-ended survey questions in social science surveys. We evaluate the performance of the transformer-based architecture BERT for the German language in comparison to traditional multi-label algorithms (Binary Relevance, Label Powerset, ECC) in a German social science survey, the GLES Panel (N=17,584, 55 labels). We find that classification with BERT (forcing at least one label) has the smallest 0/1 loss (13.1%) among methods considered (18.9%-21.6%). As expected, it is much easier to correctly predict answer texts that correspond to a single label (7.1% loss) than those that correspond to multiple labels ($\sim$50% loss). Because BERT predicts zero labels for only 1.5% of the answers, forcing at least one label, while recommended, ultimately does not lower the 0/1 loss by much. Our work has important implications for social scientists: 1) We have shown multi-label classification with BERT works in the German language for open-ends. 2) For mildly multi-label classification tasks, the loss now appears small enough to allow for fully automatic classification (as compared to semi-automatic approaches). 3) Multi-label classification with BERT requires only a single model. The leading competitor, ECC, iterates through individual single label predictions.
Recently, with the chain of thought (CoT) prompting, large language models (LLMs), e.g., GPT-3, have shown strong reasoning ability in several natural language processing tasks such as arithmetic, commonsense, and logical reasoning. However, LLMs with CoT require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes and vulnerable to error accumulation. The above issues make the LLMs need the ability to verify the answers. In fact, after inferring conclusions in some thinking decision tasks, people often check them by re-verifying steps to avoid some mistakes. In this paper, we propose and prove that LLMs also have similar self-verification abilities. We take the conclusion obtained by CoT as one of the conditions for solving the original problem. By taking turns masking the original conditions and predicting their results, we calculate an explainable answer verification score based on whether the re-predicted conditions are correct. Experimental results demonstrate that the proposed method can improve the reasoning performance on various arithmetic, commonsense, and logical reasoning datasets. Our code is publicly available at: //github.com/WENGSYX/Self-Verification.
Scientific literature understanding tasks have gained significant attention due to their potential to accelerate scientific discovery. Pre-trained language models (LMs) have shown effectiveness in these tasks, especially when tuned via contrastive learning. However, jointly utilizing pre-training data across multiple heterogeneous tasks (e.g., extreme classification, citation prediction, and literature search) remains largely unexplored. To bridge this gap, we propose a multi-task contrastive learning framework, SciMult, with a focus on facilitating common knowledge sharing across different scientific literature understanding tasks while preventing task-specific skills from interfering with each other. To be specific, we explore two techniques -- task-aware specialization and instruction tuning. The former adopts a Mixture-of-Experts Transformer architecture with task-aware sub-layers; the latter prepends task-specific instructions to the input text so as to produce task-aware outputs. Extensive experiments on a comprehensive collection of benchmark datasets verify the effectiveness of our task-aware specialization strategy in various tasks, where we outperform state-of-the-art scientific LMs.
Scientific fact-checking is crucial for ensuring the accuracy, reliability, and trustworthiness of scientific claims. However, existing benchmarks are limited in terms of their claim diversity, reliance on text-based evidence, and oversimplification of scientific reasoning. To address these gaps, we introduce SCITAB, a novel dataset comprising 1,225 challenging scientific claims requiring compositional reasoning with scientific tables. The claims in SCITAB are derived from the actual scientific statements, and the evidence is presented as tables, closely mirroring real-world fact-checking scenarios. We establish benchmarks on SCITAB using state-of-the-art models, revealing its inherent difficulty and highlighting limitations in existing prompting methods. Our error analysis identifies unique challenges, including ambiguous expressions and irrelevant claims, suggesting future research directions. The code and the data are publicly available at //github.com/XinyuanLu00/SciTab.
Multi-label text classification (MLC) is a challenging task in settings of large label sets, where label support follows a Zipfian distribution. In this paper, we address this problem through retrieval augmentation, aiming to improve the sample efficiency of classification models. Our approach closely follows the standard MLC architecture of a Transformer-based encoder paired with a set of classification heads. In our case, however, the input document representation is augmented through cross-attention to similar documents retrieved from the training set and represented in a task-specific manner. We evaluate this approach on four datasets from the legal and biomedical domains, all of which feature highly skewed label distributions. Our experiments show that retrieval augmentation substantially improves model performance on the long tail of infrequent labels especially so for lower-resource training scenarios and more challenging long-document data scenarios.
With contrastive pre-training, sentence encoders are generally optimized to locate semantically similar samples closer to each other in their embedding spaces. In this work, we focus on the potential of their embedding spaces to be readily adapted to zero-shot text classification, as semantically distinct samples are already well-separated. Our framework, RaLP (Retrieval augmented Label Prompts for sentence encoder), encodes prompted label candidates with a sentence encoder, then assigns the label whose prompt embedding has the highest similarity with the input text embedding. In order to compensate for the potentially poorly descriptive labels in their original format, RaLP retrieves sentences that are semantically similar to the original label prompt from external corpora and use them as additional pseudo-label prompts. RaLP achieves competitive or stronger performance than much larger baselines on various closed-set classification and multiple-choice QA datasets under zero-shot settings. We show that the retrieval component plays a pivotal role in RaLP's success, and its results are robustly attained regardless of verbalizer variations.
The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has, without exaggeration, revolutionized the fields of natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage ranking architectures and learned dense representations that attempt to perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond the typical sentence-by-sentence processing approaches used in NLP, and techniques for addressing the tradeoff between effectiveness (result quality) and efficiency (query latency). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.
Extreme multi-label text classification (XMC) aims to tag each input text with the most relevant labels from an extremely large label set, such as those that arise in product categorization and e-commerce recommendation. Recently, pretrained language representation models such as BERT achieve remarkable state-of-the-art performance across a wide range of NLP tasks including sentence classification among small label sets (typically fewer than thousands). Indeed, there are several challenges in applying BERT to the XMC problem. The main challenges are: (i) the difficulty of capturing dependencies and correlations among labels, whose features may come from heterogeneous sources, and (ii) the tractability to scale to the extreme label setting as the model size can be very large and scale linearly with the size of the output space. To overcome these challenges, we propose X-BERT, the first feasible attempt to finetune BERT models for a scalable solution to the XMC problem. Specifically, X-BERT leverages both the label and document text to build label representations, which induces semantic label clusters in order to better model label dependencies. At the heart of X-BERT is finetuning BERT models to capture the contextual relations between input text and the induced label clusters. Finally, an ensemble of the different BERT models trained on heterogeneous label clusters leads to our best final model. Empirically, on a Wiki dataset with around 0.5 million labels, X-BERT achieves new state-of-the-art results where the precision@1 reaches 67:80%, a substantial improvement over 32.58%/60.91% of deep learning baseline fastText and competing XMC approach Parabel, respectively. This amounts to a 11.31% relative improvement over Parabel, which is indeed significant since the recent approach SLICE only has 5.53% relative improvement.
In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluations methods. Finally, the limitations of each technique and their application in the real-world problem are discussed.
Language model pre-training has proven to be useful in learning universal language representations. As a state-of-the-art language model pre-training model, BERT (Bidirectional Encoder Representations from Transformers) has achieved amazing results in many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on text classification task and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely-studied text classification datasets.
In this paper, we propose the joint learning attention and recurrent neural network (RNN) models for multi-label classification. While approaches based on the use of either model exist (e.g., for the task of image captioning), training such existing network architectures typically require pre-defined label sequences. For multi-label classification, it would be desirable to have a robust inference process, so that the prediction error would not propagate and thus affect the performance. Our proposed model uniquely integrates attention and Long Short Term Memory (LSTM) models, which not only addresses the above problem but also allows one to identify visual objects of interests with varying sizes without the prior knowledge of particular label ordering. More importantly, label co-occurrence information can be jointly exploited by our LSTM model. Finally, by advancing the technique of beam search, prediction of multiple labels can be efficiently achieved by our proposed network model.