Open-domain dialogue systems aim to converse with humans through text, and research on them has relied heavily on benchmark datasets. In this work, we first identify an overlap problem in DailyDialog and OpenSubtitles, two popular open-domain dialogue benchmark datasets. Our systematic analysis then shows that this overlap can be exploited to obtain fake state-of-the-art performance. Finally, we address the issue by cleaning these datasets and establishing a proper data processing procedure for future research.
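As a concrete illustration of how such overlap can be detected, the sketch below normalizes dialogues and checks test dialogues against the training split; the function names and toy data are illustrative assumptions, not the paper's cleaning procedure.

```python
# A minimal sketch of detecting train/test overlap in a dialogue corpus:
# normalize each dialogue and compare exact matches across splits.
def normalize(dialogue):
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(" ".join(turn.lower().split()) for turn in dialogue)

def overlap_ratio(train, test):
    train_keys = {normalize(d) for d in train}
    hits = sum(normalize(d) in train_keys for d in test)
    return hits / len(test)

train = [["Hi there!", "Hello."], ["How are you?", "Fine, thanks."]]
test = [["hi there!", "hello."], ["What's up?", "Not much."]]
print(overlap_ratio(train, test))  # 0.5: the first test dialogue leaks from train
```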
The response selector is an essential component of generation-based dialogue systems: it aims to pick the optimal response from a candidate pool to continue the dialogue. The current state-of-the-art methods mainly follow the encoding paradigm called Cross-Encoder, which encodes each context-response pair separately and ranks the responses by their fitness scores. However, a Cross-Encoder repeatedly encodes the same lengthy context for every response, resulting in high computational cost. Moreover, without considering the relationships among the candidates, it is difficult to determine which candidate is the best response based purely on each candidate's individual fitness score. We address these problems through a new paradigm called Panoramic-Encoder. The proposed method encodes all candidates and the context at once and realizes their mutual interaction through a tailored candidate attention mechanism (CAM). It also enables effective training techniques, such as in-batch negative training, that cannot be used with Cross-Encoders. Extensive experiments across four benchmark datasets show that our method significantly outperforms the current state-of-the-art while requiring lower computational cost.
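As a concrete illustration of this single-pass idea, here is a minimal sketch of a panoramic-style scorer. It is not the authors' released code: the module sizes, pooling, and the exact candidate attention mask are assumptions. The context and all candidates are packed into one sequence, and a block attention mask lets candidate tokens attend to the context and to one another, so every fitness score comes from a single forward pass.

```python
import torch
import torch.nn as nn

class PanoramicScorer(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score = nn.Linear(d_model, 1)  # fitness score per candidate

    def forward(self, context_ids, cand_ids):
        # context_ids: (B, Lc); cand_ids: (B, N, Lr).
        # Positional embeddings are omitted for brevity.
        B, N, Lr = cand_ids.shape
        Lc = context_ids.size(1)
        tokens = torch.cat([context_ids, cand_ids.reshape(B, N * Lr)], dim=1)
        x = self.embed(tokens)                       # (B, Lc + N*Lr, d)

        # Block mask (True = blocked): context tokens see only the context;
        # candidate tokens see the context and every candidate, giving the
        # mutual candidate interaction. One plausible reading of the CAM.
        S = Lc + N * Lr
        mask = torch.zeros(S, S, dtype=torch.bool)
        mask[:Lc, Lc:] = True                        # context cannot peek at candidates

        h = self.encoder(x, mask=mask)               # one pass for all candidates
        cand_h = h[:, Lc:].reshape(B, N, Lr, -1).mean(dim=2)  # pool each candidate
        return self.score(cand_h).squeeze(-1)        # (B, N) fitness scores

scores = PanoramicScorer()(torch.randint(0, 100, (2, 16)),
                           torch.randint(0, 100, (2, 8, 12)))
print(scores.shape)  # torch.Size([2, 8])
```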
To diversify and enrich generated dialogue responses, knowledge-grounded dialogue has been investigated in recent years. Existing methods tackle the knowledge grounding challenge by retrieving relevant sentences from a large corpus and augmenting the dialogues with explicit extra information. Despite their success, however, these approaches suffer from low inference efficiency. This paper proposes KnowExpert, an end-to-end framework that bypasses the explicit retrieval process, injecting knowledge into pre-trained language models through lightweight adapters and adapting them to the knowledge-grounded dialogue task. To the best of our knowledge, this is the first attempt to tackle this challenge without retrieval in an open-domain chit-chat scenario. Experimental results show that KnowExpert performs comparably with some retrieval-based baselines while being more time-efficient at inference, demonstrating the potential of the proposed direction.
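For reference, a minimal bottleneck adapter in the spirit of commonly used adapter designs is sketched below; the abstract does not specify KnowExpert's exact adapter architecture, so the bottleneck size and placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, non-linearity, up-project, residual connection."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))

# During knowledge injection, only the adapter weights would be trained
# while the pre-trained language model stays frozen, which is what keeps
# inference lightweight relative to explicit retrieval.
h = torch.randn(2, 10, 768)
print(Adapter()(h).shape)  # torch.Size([2, 10, 768])
```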
Cross-Modal Retrieval (CMR) is an important research topic spanning multimodal computing and information retrieval; it takes one type of data as the query to retrieve relevant data of another type, and has been widely used in many real-world applications. Recently, vision-language pre-trained models, represented by CLIP, have demonstrated their superiority in learning visual and textual representations and achieved impressive performance on various vision-and-language tasks. Although CLIP, like earlier pre-trained models, has shown great performance improvements in unsupervised CMR, the performance and impact of such pre-trained models on supervised CMR have rarely been explored, due to the lack of common representations for multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model and conduct a comprehensive empirical study, evaluating its performance and impact on supervised CMR and attempting to answer several key research questions. To this end, we first propose a novel model, CLIP4CMR (CLIP enhanced network for Cross-Modal Retrieval), that employs pre-trained CLIP as the backbone network to perform supervised CMR. Using the CLIP4CMR framework, we then revisit the design of different learning objectives in current CMR methods to provide new insights for model design. Moreover, we investigate the aspects of most practical concern in applying CMR, namely robustness to modality imbalance and sensitivity to hyper-parameters, to provide new perspectives for real-world applications. Through extensive experiments, we show that CLIP4CMR achieves state-of-the-art results with prominent improvements on the benchmark datasets, and can serve as a fundamental framework for empirically studying key research issues in supervised CMR, with significant implications for model design and practical considerations.
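As a rough sketch of the backbone idea, the snippet below freezes a pre-trained CLIP model and attaches small projection and classification heads for class-level supervision; the head design and objective shown here are common choices in supervised CMR, not necessarily CLIP4CMR's exact configuration.

```python
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone, preprocess = clip.load("ViT-B/32", device=device)
for p in backbone.parameters():
    p.requires_grad_(False)  # keep the pre-trained backbone frozen

class CMRHead(nn.Module):
    """Maps CLIP features into a shared space with class-level supervision."""
    def __init__(self, in_dim=512, shared_dim=256, n_classes=10):
        super().__init__()
        self.proj_img = nn.Linear(in_dim, shared_dim)
        self.proj_txt = nn.Linear(in_dim, shared_dim)
        self.classifier = nn.Linear(shared_dim, n_classes)

    def forward(self, img_feat, txt_feat):
        zi, zt = self.proj_img(img_feat), self.proj_txt(txt_feat)
        # Class logits on both modalities encourage class-level alignment
        # across modalities in the shared space.
        return zi, zt, self.classifier(zi), self.classifier(zt)

head = CMRHead()
img_feat = torch.randn(4, 512)  # stand-in for backbone.encode_image(...)
txt_feat = torch.randn(4, 512)  # stand-in for backbone.encode_text(...)
zi, zt, logits_i, logits_t = head(img_feat, txt_feat)
print(zi.shape, logits_i.shape)  # torch.Size([4, 256]) torch.Size([4, 10])
```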
Crowd-sourcing is a powerful solution for finding correct answers to expensive and otherwise unanswerable queries in databases, including those with uncertain and incomplete data. Exploiting human abilities through crowd-sourcing to process such expensive queries has helped provide accurate results by utilising the knowledge available in the crowd. Crowd-sourcing database systems (CSDBs) combine the knowledge of the crowd with a relational database, typically by extending a relational database with minor changes. This paper surveys the leading studies on query processing, for both traditional and preference queries, in CSDBs, focusing on the strengths and weaknesses of each approach. A detailed discussion of current and future research trends associated with query processing in CSDBs is also presented.
Medical Visual Question Answering (VQA) is a combination of medical artificial intelligence and the popular VQA challenge. Given a medical image and a clinically relevant question in natural language, a medical VQA system is expected to predict a plausible and convincing answer. Although general-domain VQA has been extensively studied, medical VQA still requires specific investigation and exploration due to its task characteristics. In the first part of this survey, we cover and discuss the publicly available medical VQA datasets to date, in terms of data source, data quantity, and task features. In the second part, we review the approaches used in medical VQA tasks. In the last part, we analyze some medical-specific challenges for the field and discuss future research directions.
Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer- and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends by showing promising results on the ImageNet classification task. In this paper, we conduct empirical studies of these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at moderate scale; however, they demonstrate distinctive behaviors as network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPs, already on par with SOTA models of sophisticated design. The code and models will be made publicly available.
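To illustrate the separation of spatial and channel processing at the heart of SPACH, here is a single mixing block in which the spatial module mixes information across token positions and the channel module mixes within each token. The abstract does not give SPACH's module definitions, so this MLP-based instantiation is just one of the possible structures such a framework can compare (alongside convolution and self-attention).

```python
import torch
import torch.nn as nn

class SpatialChannelBlock(nn.Module):
    def __init__(self, n_tokens=196, d_model=384, hidden=768):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Spatial processing: mixes information across token positions.
        self.spatial = nn.Sequential(nn.Linear(n_tokens, n_tokens), nn.GELU())
        self.norm2 = nn.LayerNorm(d_model)
        # Channel processing: mixes information within each token.
        self.channel = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model))

    def forward(self, x):  # x: (B, n_tokens, d_model)
        # Transpose so the spatial MLP acts over the token dimension.
        x = x + self.spatial(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel(self.norm2(x))

print(SpatialChannelBlock()(torch.randn(2, 196, 384)).shape)
```

Swapping the spatial module for a convolution or a self-attention layer while keeping the channel module fixed is exactly the kind of controlled comparison a unified framework like SPACH enables.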
Human doctors with well-structured medical knowledge can diagnose a disease after only a few conversations with a patient about symptoms. In contrast, existing knowledge-grounded dialogue systems often require a large number of dialogue instances to learn, as they fail to capture the correlations between different diseases and neglect the diagnostic experience shared among them. To address this issue, we propose a more natural and practical paradigm, i.e., low-resource medical dialogue generation, which transfers diagnostic experience from source diseases to target ones using only a handful of data for adaptation. It capitalizes on a commonsense knowledge graph to characterize prior disease-symptom relations. In addition, we develop a Graph-Evolving Meta-Learning (GEML) framework that learns to evolve the commonsense graph for reasoning about disease-symptom correlations in a new disease, which effectively alleviates the need for a large number of dialogues. More importantly, by dynamically evolving disease-symptom graphs, GEML also addresses the real-world challenge that the disease-symptom correlations of each disease may vary or evolve as more diagnostic cases are observed. Extensive experimental results on the CMDD dataset and our newly collected Chunyu dataset attest to the superiority of our approach over state-of-the-art approaches. Furthermore, GEML can generate an enriched, dialogue-sensitive knowledge graph in an online manner, which could benefit other tasks grounded on knowledge graphs.
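As a toy illustration of the graph-evolving idea, the sketch below updates disease-symptom edge weights as new diagnostic cases arrive; GEML's actual meta-learned graph evolution is far more sophisticated than this running co-occurrence estimate, and all names here are hypothetical.

```python
from collections import defaultdict

class DiseaseSymptomGraph:
    def __init__(self):
        self.cases = defaultdict(int)  # disease -> number of cases seen
        self.cooc = defaultdict(int)   # (disease, symptom) -> co-occurrence count

    def observe(self, disease, symptoms):
        """Evolve the graph online with one new diagnostic case."""
        self.cases[disease] += 1
        for s in symptoms:
            self.cooc[(disease, s)] += 1

    def edge_weight(self, disease, symptom):
        # Empirical P(symptom | disease); edges strengthen or fade as
        # more cases accumulate, mimicking evolving correlations.
        return self.cooc[(disease, symptom)] / max(self.cases[disease], 1)

g = DiseaseSymptomGraph()
g.observe("flu", ["fever", "cough"])
g.observe("flu", ["fever", "headache"])
print(g.edge_weight("flu", "fever"))  # 1.0
print(g.edge_weight("flu", "cough"))  # 0.5
```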
User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement, using heuristically constructed features such as the number of turns and the total duration of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, {\em predictive engagement}, for the automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators show high agreement when assessing utterance-level engagement scores, and (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that utterance-level engagement scores can be learned from data and can improve automatic evaluation metrics for open-domain dialogue systems, as shown by correlation with human judgements. This suggests that predictive engagement can be used as real-time feedback for training better dialogue models.
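The aggregation step can be made concrete with a small example: conversation-level engagement is predicted from utterance-level scores through an aggregator, with mean pooling being one simple choice of "proper aggregation"; the scores below are made-up numbers for illustration.

```python
# Predict conversation-level engagement from per-utterance scores.
def conversation_engagement(utterance_scores,
                            aggregate=lambda xs: sum(xs) / len(xs)):
    return aggregate(utterance_scores)

# Hypothetical learned utterance-level scores for a four-turn conversation.
print(conversation_engagement([0.2, 0.7, 0.9, 0.6]))  # 0.6
```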
In this paper, we propose Latent Relation Language Models (LRLMs), a class of language models that parameterize the joint distribution over the words in a document and the entities that occur therein via knowledge graph relations. This model has a number of attractive properties: it not only improves language modeling performance, but is also able to annotate entity spans in a given text with their posterior probabilities under the relations. Experiments demonstrate empirical improvements over both a word-based baseline language model and a previous approach that incorporates knowledge graph information. Qualitative analysis further demonstrates the proposed model's ability to learn to predict appropriate relations in context.
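Schematically, such a latent-variable formulation factors the joint probability of the text X and latent generation decisions Z, where each decision selects between generating from the word vocabulary and producing an entity span via a knowledge-graph relation; the notation below is an assumed illustration rather than the paper's exact parameterization.

```latex
% Schematic joint distribution over text X and latent span decisions Z;
% an assumed illustration, not LRLM's exact factorization.
\[
  p(X, Z) \;=\; \prod_{t} p(z_t \mid x_{<t}) \, p(x_t \mid z_t, x_{<t}),
  \qquad
  p(Z \mid X) \;=\; \frac{p(X, Z)}{\sum_{Z'} p(X, Z')}
\]
% Marginalizing over Z gives the language-model probability p(X);
% the posterior p(Z | X) yields the entity-span annotations.
```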
Commonsense knowledge and commonsense reasoning are some of the main bottlenecks in machine intelligence. In the NLP community, many benchmark datasets and tasks have been created to address commonsense reasoning for language understanding. These tasks are designed to assess machines' ability to acquire and learn commonsense knowledge in order to reason about and understand natural language text. As these tasks have become instrumental to and a driving force for commonsense research, this paper aims to provide an overview of existing tasks and benchmarks, knowledge resources, and learning and inference approaches toward commonsense reasoning for natural language understanding. Through this, our goal is to support a better understanding of the state of the art, its limitations, and future challenges.