Requirements engineering (RE) plays a crucial role in developing software systems by bridging the gap between stakeholders' needs and system specifications. However, effective communication and elicitation of stakeholder requirements can be challenging, as traditional RE methods often overlook emotional cues. This paper introduces a multi-modal emotion recognition platform (MEmoRE) to enhance the requirements engineering process by capturing and analyzing the emotional cues of stakeholders in real-time. MEmoRE leverages state-of-the-art emotion recognition techniques, integrating facial expression, vocal intonation, and textual sentiment analysis to comprehensively understand stakeholder emotions. This multi-modal approach ensures the accurate and timely detection of emotional cues, enabling requirements engineers to tailor their elicitation strategies and improve overall communication with stakeholders. We further intend to employ our platform for later RE stages, such as requirements reviews and usability testing. By integrating multi-modal emotion recognition into requirements engineering, we aim to pave the way for more empathetic, effective, and successful software development processes. We performed a preliminary evaluation of our platform. This paper reports on the platform design, preliminary evaluation, and future development plan as an ongoing project.
The most prominent subtask in emotion analysis is emotion classification; to assign a category to a textual unit, for instance a social media post. Many research questions from the social sciences do, however, not only require the detection of the emotion of an author of a post but to understand who is ascribed an emotion in text. This task is tackled by emotion role labeling which aims at extracting who is described in text to experience an emotion, why, and towards whom. This could, however, be considered overly sophisticated if the main question to answer is who feels which emotion. A targeted approach for such setup is to classify emotion experiencer mentions (aka "emoters") regarding the emotion they presumably perceive. This task is similar to named entity recognition of person names with the difference that not every mentioned entity name is an emoter. While, very recently, data with emoter annotations has been made available, no experiments have yet been performed to detect such mentions. With this paper, we provide baseline experiments to understand how challenging the task is. We further evaluate the impact on experiencer-specific emotion categorization and appraisal detection in a pipeline, when gold mentions are not available. We show that experiencer detection in text is a challenging task, with a precision of .82 and a recall of .56 (F1 =.66). These results motivate future work of jointly modeling emoter spans and emotion/appraisal predictions.
Weak alignment of requirements engineering (RE) with verification and validation (VV) may lead to problems in delivering the required products in time with the right quality. For example, weak communication of requirements changes to testers may result in lack of verification of new requirements and incorrect verification of old invalid requirements, leading to software quality problems, wasted effort and delays. However, despite the serious implications of weak alignment research and practice both tend to focus on one or the other of RE or VV rather than on the alignment of the two. We have performed a multi-unit case study to gain insight into issues around aligning RE and VV by interviewing 30 practitioners from 6 software developing companies, involving 10 researchers in a flexible research process for case studies. The results describe current industry challenges and practices in aligning RE with VV, ranging from quality of the individual RE and VV activities, through tracing and tools, to change control and sharing a common understanding at strategy, goal and design level. The study identified that human aspects are central, i.e. cooperation and communication, and that requirements engineering practices are a critical basis for alignment. Further, the size of an organisation and its motivation for applying alignment practices, e.g. external enforcement of traceability, are variation factors that play a key role in achieving alignment. Our results provide a strategic roadmap for practitioners improvement work to address alignment challenges. Furthermore, the study provides a foundation for continued research to improve the alignment of RE with VV.
Requirements Engineering and Software Testing are mature areas and have seen a lot of research. Nevertheless, their interactions have been sparsely explored beyond the concept of traceability. To fill this gap, we propose a definition of requirements engineering and software test (REST) alignment, a taxonomy that characterizes the methods linking the respective areas, and a process to assess alignment. The taxonomy can support researchers to identify new opportunities for investigation, as well as practitioners to compare alignment methods and evaluate alignment, or lack thereof. We constructed the REST taxonomy by analyzing alignment methods published in literature, iteratively validating the emerging dimensions. The resulting concept of an information dyad characterizes the exchange of information required for any alignment to take place. We demonstrate use of the taxonomy by applying it on five in-depth cases and illustrate angles of analysis on a set of thirteen alignment methods. In addition, we developed an assessment framework (REST-bench), applied it in an industrial assessment, and showed that it, with a low effort, can identify opportunities to improve REST alignment. Although we expect that the taxonomy can be further refined, we believe that the information dyad is a valid and useful construct to understand alignment.
Speech Emotion Recognition (SER) is a challenging task. In this paper, we introduce a modality conversion concept aimed at enhancing emotion recognition performance on the MELD dataset. We assess our approach through two experiments: first, a method named Modality-Conversion that employs automatic speech recognition (ASR) systems, followed by a text classifier; second, we assume perfect ASR output and investigate the impact of modality conversion on SER, this method is called Modality-Conversion++. Our findings indicate that the first method yields substantial results, while the second method outperforms state-of-the-art (SOTA) speech-based approaches in terms of SER weighted-F1 (WF1) score on the MELD dataset. This research highlights the potential of modality conversion for tasks that can be conducted in alternative modalities.
The number of modes in a probability density function is representative of the model's complexity and can also be viewed as the number of existing subpopulations. Despite its relevance, little research has been devoted to its estimation. Focusing on the univariate setting, we propose a novel approach targeting prediction accuracy inspired by some overlooked aspects of the problem. We argue for the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view blending global and local density properties. Our method builds upon a combination of flexible kernel estimators and parsimonious compositional splines. Feature exploration, model selection and mode testing are implemented in the Bayesian inference paradigm, providing soft solutions and allowing to incorporate expert judgement in the process. The usefulness of our proposal is illustrated through a case study in sports analytics, showcasing multiple companion visualisation tools. A thorough simulation study demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, our method emerges as a top-tier alternative offering innovative solutions for analysts.
Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code will be released at //github.com/mondalanindya/MSQNet.
This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, and thus, their efficacy for specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. Above limitations spawn another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this paper, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is pretrained on a speech dataset based on WavLM and takes into account emotional characteristics. To enhance sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Subsequently, Vesper employs hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers.
Emotion recognition in conversation (ERC) aims to detect the emotion label for each utterance. Motivated by recent studies which have proven that feeding training examples in a meaningful order rather than considering them randomly can boost the performance of models, we propose an ERC-oriented hybrid curriculum learning framework. Our framework consists of two curricula: (1) conversation-level curriculum (CC); and (2) utterance-level curriculum (UC). In CC, we construct a difficulty measurer based on "emotion shift" frequency within a conversation, then the conversations are scheduled in an "easy to hard" schema according to the difficulty score returned by the difficulty measurer. For UC, it is implemented from an emotion-similarity perspective, which progressively strengthens the model's ability in identifying the confusing emotions. With the proposed model-agnostic hybrid curriculum learning strategy, we observe significant performance boosts over a wide range of existing ERC models and we are able to achieve new state-of-the-art results on four public ERC datasets.
Emotion plays an important role in detecting fake news online. When leveraging emotional signals, the existing methods focus on exploiting the emotions of news contents that conveyed by the publishers (i.e., publisher emotion). However, fake news is always fabricated to evoke high-arousal or activating emotions of people to spread like a virus, so the emotions of news comments that aroused by the crowd (i.e., social emotion) can not be ignored. Furthermore, it needs to be explored whether there exists a relationship between publisher emotion and social emotion (i.e., dual emotion), and how the dual emotion appears in fake news. In the paper, we propose Dual Emotion Features to mine dual emotion and the relationship between them for fake news detection. And we design a universal paradigm to plug it into any existing detectors as an enhancement. Experimental results on three real-world datasets indicate the effectiveness of the proposed features.
Named entity recognition (NER) is the task to identify text spans that mention named entities, and to classify them into predefined categories such as person, location, organization etc. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems are successful in producing decent recognition accuracy, they often require much human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding stat-of-the-art performance. In this paper, we provide a comprehensive review on existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.