The deluge of new papers has significantly hindered academic progress, largely because author-level and publication-level evaluation metrics focus only on quantity. These metrics have caused several severe problems: they burden scholars who commit to important research directions over a long time, and they even encourage an impetuous academic atmosphere. To address these problems, we propose Phocus, a novel academic evaluation mechanism for authors and papers. Phocus analyzes the sentence containing a citation and its context to predict the sentiment towards the corresponding reference. Combining this with other factors, Phocus classifies citations coarsely, ranks all references within a paper, and uses the results of the classifier and the ranking model to obtain the local influential factor of a reference to the citing paper. The global influential factor of the reference to the citing paper is the product of the local influential factor and the total influential factor of the citing paper. Consequently, an author's academic influential factor is the sum of their contributions to each paper they co-author.
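The two aggregation rules stated at the end of this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the local influential factors and the total influential factor of each paper are taken as given inputs, and the per-author contribution shares (the `shares` argument) are a hypothetical input, since the abstract does not say how an author's contribution to a paper is measured. The function names are also hypothetical.

```python
from collections import defaultdict


def global_factors(local, total):
    """Global factor of a reference = local factor * total factor of the citing paper.

    local: {(citing_paper, reference): local influential factor}
    total: {citing_paper: total influential factor of that paper}
    """
    return {(citing, ref): lf * total[citing] for (citing, ref), lf in local.items()}


def author_factors(paper_total, shares):
    """Sum each author's contribution over the papers they co-author.

    shares: {paper: {author: share}} is an ASSUMED input; the abstract does
    not specify how contributions are split among co-authors.
    """
    scores = defaultdict(float)
    for paper, per_author in shares.items():
        for author, share in per_author.items():
            scores[author] += share * paper_total[paper]
    return dict(scores)
```

For example, calling `author_factors` with an equal split per paper would be one possible choice of shares; any other split can be passed in the same way.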
Current practice in metric evaluation focuses on a single dataset, e.g., the Newstest dataset in each year's WMT Metrics Shared Task. In this paper, however, we show qualitatively and quantitatively that the performance of metrics is sensitive to the data: the ranking of metrics varies when the evaluation is conducted on different datasets. We then investigate two potential hypotheses, i.e., insignificant data points and deviation from the independent and identically distributed (i.i.d.) assumption, which may be responsible for this data variance. Our findings suggest that, when evaluating automatic translation metrics, researchers should take data variance into account and be cautious about claiming a result based on a single dataset, because it may be inconsistent with the results on most other datasets.
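One way to quantify how much a metric ranking changes between two datasets is a rank correlation between the two rankings. The sketch below uses Kendall's tau (tau-a, assuming no ties); this is a common choice but an assumption here, as the abstract does not state which statistic the authors use. The function names and the input format (a score per metric per dataset) are hypothetical.

```python
from itertools import combinations


def rank_metrics(scores):
    """Order metric names from best to worst by their score on one dataset."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between metric rankings on two datasets (assumes no ties).

    scores_a, scores_b: {metric_name: score on that dataset}.
    Returns 1.0 for identical rankings and -1.0 for fully reversed ones.
    """
    common = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m, n in combinations(common, 2):
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(common) * (len(common) - 1) // 2
    return (concordant - discordant) / pairs
```

A value well below 1 for a pair of datasets would indicate that conclusions drawn from one of them do not transfer to the other.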
With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data that is mined from software projects, however, requires extensive processing and needs to be handled with utmost care to ensure valid conclusions. Since the software development practices and tools have changed over two decades, we aim to understand the state-of-the-art research workflows and to highlight potential challenges. We employ a systematic literature review by sampling over one thousand papers from leading conferences and by analyzing the 286 most relevant papers from the perspective of data workflows, methodologies, reproducibility, and tools. We found that an important part of the research workflow involving dataset selection was particularly problematic, which raises questions about the generality of the results in existing literature. Furthermore, we found a considerable number of papers provide little or no reproducibility instructions -- a substantial deficiency for a data-intensive field. In fact, 33% of papers provide no information on how their data was retrieved. Based on these findings, we propose ways to address these shortcomings via existing tools and also provide recommendations to improve research workflows and the reproducibility of research.
In this paper, we propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones that contain step-level transformations but still conduct the same task. This challenging task resides in an open-set setting and does not rely on prior action detection or segmentation, which would require event-level or even frame-level annotations. To that end, we carefully reorganize two publicly available action-related datasets with a step-procedure-task structure. To fully investigate the effectiveness of any method, we collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments. In addition, a novel evaluation metric, Weighted Distance Ratio, is introduced to ensure equivalence for different step-level transformations during evaluation. Finally, a simple but effective baseline based on the transformer encoder with a novel sequence alignment loss is introduced to better characterize long-term dependency between steps; it outperforms other action recognition methods. Code and data will be released.
We review the scholarly contributions that utilise Natural Language Processing (NLP) techniques to support the design process. Using a heuristic approach, we gathered 223 articles that are published in 32 journals within the period 1991-present. We present state-of-the-art NLP in-and-for design research by reviewing these articles according to the type of natural language text sources: internal reports, design concepts, discourse transcripts, technical publications, consumer opinions, and others. Upon summarizing and identifying the gaps in these contributions, we utilise an existing design innovation framework to identify the applications that are currently being supported by NLP. We then propose a few methodological and theoretical directions for future NLP in-and-for design research.
Biomedical Question Answering (BQA) has attracted increasing attention in recent years due to its promising application prospects. It is a challenging task because biomedical questions are specialized and vary widely. Existing question answering methods answer all questions with a homogeneous model, so different types of questions compete for the shared parameters, which confuses the model's decision for each individual question type. In this paper, to alleviate this parameter competition problem, we propose a Mixture-of-Experts (MoE) based question answering method called MoEBQA that decouples the computation for different types of questions by sparse routing. Specifically, we split a pretrained Transformer model into bottom and top blocks. The bottom blocks are shared by all examples and aim to capture general features. The top blocks are extended to an MoE version that consists of a series of independent experts, where each example is assigned to a few experts according to its underlying question type. MoEBQA automatically learns the routing strategy in an end-to-end manner so that each expert tends to deal with the question types it specializes in. We evaluate MoEBQA on three BQA datasets constructed from real examinations. The results show that our MoE extension significantly boosts the performance of question answering models and achieves new state-of-the-art performance. In addition, we analyze our MoE modules in detail to reveal how MoEBQA works and find that it can automatically group the questions into human-readable clusters.
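The sparse routing idea described in this abstract, where each example is sent to only a few experts, can be illustrated with a generic top-k gate. The sketch below is not MoEBQA's router: the gate logits are assumed to be given (in a real model they would come from a learned gating network), top-k selection with renormalized softmax weights is a common MoE convention rather than something the abstract confirms, and the experts are plain Python callables standing in for Transformer blocks.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Pick the k experts with the highest gate probability.

    Returns (expert_index, weight) pairs; weights are renormalized so that
    they sum to 1 over the selected experts.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(gate_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

Because only `k` experts are evaluated per example, the remaining experts' parameters are untouched by that example, which is the mechanism that reduces competition for shared parameters.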
After nine years of work in quantum computing, and given the timeline still ahead for developing good-quality quantum processing units, it is time to rethink how to advance quantum computing research. Rather than waiting for quantum hardware technologies to mature, we need to start assessing, in tandem, the impact of quantum computing on various scientific fields. For this purpose, however, we need a complementary but quite different approach from the one proposed by the NISQ vision, which is heavily focused on and burdened by engineering challenges. That is why we propose and advocate the PISQ approach: Perfect Intermediate-Scale Quantum computing, based on the already known concept of perfect qubits. This will allow researchers to focus much more on the development of new applications by defining algorithms in terms of perfect qubits and evaluating them on quantum computing simulators that run on supercomputers. It is not a long-term solution, but it will allow universities to develop research on quantum logic and algorithms now, and companies to start building their internal know-how on quantum solutions.
A proactive dialogue system can lead the conversation to a goal topic and has promising potential in bargaining, persuasion, and negotiation. However, the current corpus-based learning paradigm limits its practical application in real-world scenarios. To this end, we advance the study of proactive dialogue policy to a more natural and challenging setting, i.e., interacting dynamically with users. Further, we call attention to non-cooperative user behavior: the user talks about off-path topics when he/she is not satisfied with the previous topics introduced by the agent. We argue that the targets of reaching the goal topic quickly and maintaining high user satisfaction do not always converge, because the topics close to the goal and the topics the user prefers may not be the same. To address this issue, we propose a new solution named I-Pro that can learn a Proactive policy in the Interactive setting. Specifically, we learn the trade-off via a learned goal weight, which consists of four factors (dialogue turn, goal completion difficulty, user satisfaction estimation, and cooperative degree). The experimental results demonstrate that I-Pro significantly outperforms baselines in terms of effectiveness and interpretability.
Correlation plays a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple way of fusing features that considers the similarity between the template and the search region. However, the correlation operation itself is a local linear matching process, which tends to lose semantic information and fall into local optima easily, and this may be the bottleneck in designing high-accuracy tracking algorithms. Is there a better feature fusion method than correlation? To address this issue, inspired by the Transformer, this work presents a novel attention-based feature fusion network, which effectively combines the template and search region features using attention alone. Specifically, the proposed method includes an ego-context augment module based on self-attention and a cross-feature augment module based on cross-attention. Finally, we present a Transformer tracking method (named TransT) based on a Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and a classification and regression head. Experiments show that TransT achieves very promising results on six challenging datasets, especially on the large-scale LaSOT, TrackingNet, and GOT-10k benchmarks. Our tracker runs at approximately 50 fps on GPU. Code and models are available at //github.com/chenxin-dlut/TransT.
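To make the contrast with correlation concrete, the following is a minimal single-head scaled dot-product cross-attention in plain Python, where each search-region feature (query) attends over all template features (keys and values). It is a generic sketch, not the TransT modules: learned projections, multiple heads, positional encodings, residual connections, and normalization are all omitted, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: search-region feature vectors (list of lists of floats)
    keys, values: template feature vectors; keys[i] pairs with values[i]
    Returns one fused vector per query: a softmax-weighted mix of the values.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused
```

Unlike a sliding correlation, the weights here are normalized over every template position for each query, so each output can draw on the whole template rather than on a local window.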
Few sample learning (FSL) is significant and challenging in the field of machine learning. The capability of learning and generalizing successfully from very few samples is a notable demarcation between artificial intelligence and human intelligence: humans can readily establish their cognition of novelty from just a single example or a handful of examples, whereas machine learning algorithms typically require hundreds or thousands of supervised samples to guarantee generalization ability. Despite a long history dating back to the early 2000s and widespread attention in recent years with booming deep learning technologies, few surveys or reviews of FSL have been available until now. In this context, we extensively review 200+ FSL papers spanning from the 2000s to 2019 and provide a timely and comprehensive survey of FSL. In this survey, we review the evolution as well as the current progress of FSL, categorize FSL approaches in principle into generative model based and discriminative model based kinds, and place particular emphasis on meta learning based FSL approaches. We also summarize several recently emerging extensional topics of FSL and review the latest advances on these topics. Furthermore, we highlight important FSL applications covering many research hotspots in computer vision, natural language processing, audio and speech, reinforcement learning and robotics, data analysis, etc. Finally, we conclude the survey with a discussion of promising trends, in the hope of providing guidance and insights for follow-up research.
User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement by using heuristically constructed features such as the number of turns and the total time of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators have high agreement on assessing utterance-level engagement scores; (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that the utterance-level engagement scores can be learned from data. These scores can improve automatic evaluation metrics for open-domain dialogue systems, as shown by correlation with human judgements. This suggests that predictive engagement can be used as real-time feedback for training better dialogue models.
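The aggregation step in finding (2) and the correlation check against human judgements can be sketched as follows. The mean is used as the aggregation function purely as an assumption, since the abstract says only that utterance-level scores are "properly aggregated", and Pearson correlation is likewise one plausible choice of agreement measure. Function names are hypothetical.

```python
import math
from statistics import mean


def conversation_engagement(utterance_scores):
    """Aggregate utterance-level scores into one conversation-level score.

    Uses the mean as an ASSUMED aggregation; the source does not specify one.
    """
    return mean(utterance_scores)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In use, one would aggregate the predicted utterance scores of each conversation and then pass the resulting list, together with the human conversation-level ratings, to `pearson`.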