Gaze cueing is a fundamental part of social interactions and is broadly studied using Posner-style gaze cueing paradigms. While studies using human stimuli consistently yield a gaze cueing effect, results from studies using robotic stimuli are inconsistent. Typically, these studies use virtual agents or pictures of robots. As previous research has pointed to the significance of physical presence in human-robot interaction, it is of fundamental importance to understand its as yet unexplored role in interactions involving gaze cues. This paper investigates whether the physical presence of the iCub humanoid robot affects the strength of the gaze cueing effect in human-robot interaction. We exposed 42 participants to a gaze cueing task, asking them to react as quickly and accurately as possible to the appearance of a target stimulus that was either congruently or incongruently cued by the gaze of a copresent iCub robot or a virtual version of the same robot. Analysis of the reaction time measurements showed that participants were consistently affected by their robot interaction partner's gaze, independently of how the robot was presented. Additional analyses of participants' ratings of the robot's anthropomorphism, animacy, and likeability further support the impression that presence does not play a significant role in simple gaze-based interactions. Together, our findings open up interesting discussions about the possibility of generalizing results from studies using virtual agents to real-life interactions with copresent robots.
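To make the congruency manipulation concrete, the following sketch (synthetic data, hypothetical column names) computes per-participant mean reaction times and the gaze cueing effect as the incongruent-minus-congruent difference, followed by a standard paired t-test. It is an illustrative analysis only, not the authors' exact pipeline, which additionally compares presence conditions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical trial-level data: one row per trial with participant id,
# cue congruency, and reaction time in milliseconds.
rng = np.random.default_rng(1)
trials = pd.DataFrame({
    "participant": np.repeat(np.arange(42), 40),
    "congruency": np.tile(["congruent", "incongruent"] * 20, 42),
    "rt_ms": rng.normal(420, 60, 42 * 40),
})

# Per-participant mean RTs; the gaze cueing effect is incongruent minus congruent.
mean_rt = trials.groupby(["participant", "congruency"])["rt_ms"].mean().unstack()
cueing_effect = mean_rt["incongruent"] - mean_rt["congruent"]

t, p = stats.ttest_rel(mean_rt["incongruent"], mean_rt["congruent"])
print(f"mean cueing effect = {cueing_effect.mean():.1f} ms, t = {t:.2f}, p = {p:.3f}")
```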
Social Robotics and Human-Robot Interaction (HRI) research relies on different Affective Computing (AC) solutions for sensing, perceiving, and understanding human affective behaviour during interactions. This may include utilising off-the-shelf affect perception models that are pre-trained on popular affect recognition benchmarks and applied directly to situated interactions. However, the conditions in situated human-robot interactions differ significantly from the training data and settings of these models. Thus, there is a need to deepen our understanding of how AC solutions can best be leveraged, customised, and applied to situated HRI. This paper, while critiquing existing practices, presents four critical lessons for the hitchhiker applying AC to HRI research. These lessons conclude that: (i) the six basic emotion categories are irrelevant in situated interactions, (ii) affect recognition accuracy (%) improvements are unimportant, (iii) affect recognition does not generalise across contexts, and (iv) affect recognition alone is insufficient for adaptation and personalisation. By describing the background and context for each lesson, and demonstrating how these lessons have been learnt, this paper aims to enable the hitchhiker to successfully and insightfully leverage AC solutions for advancing HRI research.
Robots have become ubiquitous tools in various industries and households, highlighting the importance of human-robot interaction (HRI). This has increased the need for easy and accessible communication between humans and robots. Recent research has focused on the intersection of virtual assistant technology, such as Amazon's Alexa, with robots and its effect on HRI. This paper presents the Virtual Assistant, Human, and Robots in the loop (VAHR) system, which utilizes bidirectional communication to control multiple robots through Alexa. VAHR's performance was evaluated through a human-subjects experiment, comparing objective and subjective metrics of traditional keyboard and mouse interfaces to VAHR. The results showed that VAHR required 41% less Robot Attention Demand and ensured 91% more Fan-out time compared to the standard method. Additionally, VAHR led to a 62.5% improvement in multi-tasking, highlighting the potential for efficient human-robot interaction in physically and mentally demanding scenarios. However, subjective metrics revealed a need for human operators to build confidence and trust with this new method of operation.
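As a rough illustration of the workload metrics reported above, the sketch below computes Robot Attention Demand and an approximate fan-out from interaction and neglect times, using one common formulation (RAD = IT / (IT + NT)); the paper's exact operationalization, and the timings used here, are assumptions.

```python
def robot_attention_demand(interaction_time: float, neglect_time: float) -> float:
    """Fraction of time a robot demands operator attention: RAD = IT / (IT + NT)."""
    return interaction_time / (interaction_time + neglect_time)

def fan_out(interaction_time: float, neglect_time: float) -> float:
    """Rough fan-out estimate: how many robots one operator could serve, FO ~ (IT + NT) / IT."""
    return (interaction_time + neglect_time) / interaction_time

# Hypothetical timings (seconds) for one task episode with each interface.
for name, it, nt in [("keyboard+mouse", 60.0, 30.0), ("voice assistant", 35.0, 55.0)]:
    print(f"{name}: RAD = {robot_attention_demand(it, nt):.2f}, fan-out ~ {fan_out(it, nt):.1f}")
```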
Humans naturally change their environment through interactions, e.g., by opening doors or moving furniture. To reproduce such interactions in virtual spaces (e.g., the metaverse), we need to capture and model them, including changes in the scene geometry, ideally from egocentric input alone (head camera and body-worn inertial sensors). While the head camera can be used to localize the person in the scene, estimating dynamic object pose is much more challenging. As the object is often not visible from the head camera (e.g., a human not looking at a chair while sitting down), we cannot rely on visual object pose estimation. Instead, our key observation is that human motion tells us a lot about scene changes. Motivated by this, we present iReplica, the first human-object interaction reasoning method that can track objects and scene changes based solely on human motion. iReplica is an essential first step towards advanced AR/VR applications in immersive virtual universes and can provide human-centric training data to teach machines to interact with their surroundings. Our code, data and model will be available on our project page at //virtualhumans.mpi-inf.mpg.de/ireplica/
Even with the most advanced natural language processing and artificial intelligence approaches, effective summarization of long, multi-topic documents -- such as academic papers -- for readers from different domains remains a challenge. To address this, we introduce ConceptEVA, a mixed-initiative approach to generate, evaluate, and customize summaries of long, multi-topic documents. ConceptEVA incorporates a custom multi-task Longformer encoder-decoder to summarize longer documents. Interactive visualizations of document concepts as a network reflecting both semantic relatedness and co-occurrence help users focus on concepts of interest. Users can select these concepts and automatically update the summary to emphasize them. We present two iterations of ConceptEVA, evaluated through an expert review and a within-subjects study. We find that participants are more satisfied with summaries customized through ConceptEVA than with their own manually generated summaries, while incorporating critique into the summaries proved challenging. Based on our findings, we make recommendations for designing summarization systems that incorporate mixed-initiative interactions.
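ConceptEVA's summarizer is a custom multi-task Longformer encoder-decoder; as a rough off-the-shelf stand-in, the sketch below summarizes a long document with a publicly available LED checkpoint from Hugging Face Transformers, placing global attention on the first token as is usual for LED summarization. The checkpoint name and input file are assumptions, and this is not the authors' model.

```python
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

# Assumed publicly available LED checkpoint fine-tuned for long-document summarization.
checkpoint = "allenai/led-large-16384-arxiv"
tokenizer = LEDTokenizer.from_pretrained(checkpoint)
model = LEDForConditionalGeneration.from_pretrained(checkpoint)

long_document = open("paper.txt").read()  # hypothetical long, multi-topic document
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=16384)

# LED expects an explicit global attention mask; global attention on the first
# token is the usual choice for summarization.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    num_beams=4,
    max_length=512,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```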
Existing conversational models are typically backed by database (DB) and API-based systems. However, users' questions very often require information that such systems cannot handle. Nonetheless, answers to these questions are available in the form of customer reviews and FAQs. DSTC-11 proposes a three-stage pipeline consisting of knowledge-seeking turn detection, knowledge selection, and response generation to create a conversational model grounded in this subjective knowledge. In this paper, we focus on improving the knowledge selection module to enhance overall system performance. In particular, we propose entity retrieval methods that result in an accurate and faster knowledge search. Our proposed Named Entity Recognition (NER)-based entity retrieval method yields a 7x faster search compared to the baseline model. Additionally, we explore a keyword extraction method that can improve the accuracy of knowledge selection. Preliminary results show a 4% improvement in exact match score on the knowledge selection task. The code is available at //github.com/raja-kumar/knowledge-grounded-TODS
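As an illustration of the general idea behind NER-based entity retrieval (not the authors' implementation, which is in the linked repository), the sketch below uses spaCy to extract entities from a user turn and restricts the knowledge search to snippets indexed by the mentioned entities; the knowledge base and entity names are hypothetical.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical knowledge base: review/FAQ snippets indexed by the entity they describe.
knowledge_base = {
    "Hotel Aurora": ["Q: Is parking free? A: Yes, for guests.", "Review: quiet rooms."],
    "Cafe Verde":   ["Q: Do you serve vegan food? A: Yes."],
}

def retrieve_candidates(user_turn: str):
    """Narrow the knowledge search to entries whose entity is mentioned in the turn."""
    doc = nlp(user_turn)
    mentioned = {ent.text.lower() for ent in doc.ents}
    hits = [snippet
            for entity, snippets in knowledge_base.items()
            if entity.lower() in mentioned
            for snippet in snippets]
    # Fall back to the full knowledge base if no entity was recognized.
    return hits or [s for snippets in knowledge_base.values() for s in snippets]

print(retrieve_candidates("Does Hotel Aurora have free parking?"))
```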
Attention (and distraction) recognition is a key factor in improving human-robot collaboration. We present an assembly scenario where a human operator and a cobot collaborate equally to piece together a gearbox. The setup provides multiple opportunities for the cobot to adapt its behavior depending on the operator's attention, which can improve the collaboration experience and reduce psychological strain. As a first step, we recognize the areas in the workspace that the human operator is paying attention to, and consequently, detect when the operator is distracted. We propose a novel deep-learning approach to develop an attention recognition model. First, we train a convolutional neural network to estimate the gaze direction using a publicly available image dataset. Then, we use transfer learning with a small dataset to map the gaze direction onto pre-defined areas of interest. Models trained using this approach performed very well in leave-one-subject-out evaluation on the small dataset. We performed an additional validation of our models using video snippets collected from participants working as operators in the presented assembly scenario. Although the recall for the Distracted class was lower in this case, the models performed well in recognizing the areas the operator paid attention to. To the best of our knowledge, this is the first work that validated an attention recognition model using data from a setting that mimics industrial human-robot collaboration. Our findings highlight the need for validation of attention recognition solutions in such full-fledged, non-guided scenarios.
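A minimal sketch of the transfer-learning step described above, assuming a generic pretrained ResNet-18 as a stand-in for the paper's gaze-direction CNN: the backbone is frozen and a small head is trained to map images to a hypothetical set of workspace areas of interest.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_AREAS = 4  # hypothetical number of areas of interest (incl. a "distracted" class)

# Stand-in backbone: the paper first trains its own gaze-direction CNN; here we
# reuse a generic pretrained ResNet-18 as the frozen feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False  # freeze pretrained weights

# Replace the final layer with a small head mapping features to areas of interest.
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_AREAS),
)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One hypothetical training step on a batch of operator images and area labels.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_AREAS, (8,))
optimizer.zero_grad()
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```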
Effective multi-robot teams require the ability to move to goals in complex environments in order to address real-world applications such as search and rescue. Multi-robot teams should be able to operate in a completely decentralized manner, with individual robot team members being capable of acting without explicit communication between neighbors. In this paper, we propose a novel game theoretic model that enables decentralized and communication-free navigation to a goal position. Robots each play their own distributed game by estimating the behavior of their local teammates in order to identify behaviors that move them in the direction of the goal, while also avoiding obstacles and maintaining team cohesion without collisions. We prove theoretically that the generated actions approach a Nash equilibrium, which also corresponds to an optimal strategy for each robot. We show through extensive simulations that our approach enables decentralized and communication-free navigation by a multi-robot system to a goal position, and is able to avoid obstacles and collisions, maintain connectivity, and respond robustly to sensor noise.
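The sketch below illustrates one way a single robot could compute a communication-free action over a discretized set of headings, trading off goal progress, obstacle clearance, and cohesion with (assumed stationary) neighbors. It is an illustrative utility-maximization step under those assumptions, not the authors' exact game model or equilibrium analysis.

```python
import numpy as np

def best_response(pos, goal, neighbors, obstacles, step=0.2, n_headings=16,
                  w_goal=1.0, w_obst=2.0, w_coh=0.5, safe_dist=0.5):
    """Pick the heading that maximizes a local utility: progress toward the goal,
    a penalty for getting close to obstacles, and a cohesion term that keeps the
    robot near the centroid of its (predicted stationary) neighbors."""
    headings = np.linspace(0.0, 2.0 * np.pi, n_headings, endpoint=False)
    centroid = np.mean(neighbors, axis=0) if len(neighbors) else pos
    best_u, best_next = -np.inf, pos
    for th in headings:
        nxt = pos + step * np.array([np.cos(th), np.sin(th)])
        u = -w_goal * np.linalg.norm(nxt - goal)          # goal progress
        for obs in obstacles:                             # obstacle avoidance
            d = np.linalg.norm(nxt - obs)
            if d < safe_dist:
                u -= w_obst * (safe_dist - d)
        u -= w_coh * np.linalg.norm(nxt - centroid)       # team cohesion
        if u > best_u:
            best_u, best_next = u, nxt
    return best_next

# Hypothetical single step for one robot.
print(best_response(np.array([0.0, 0.0]), np.array([5.0, 0.0]),
                    neighbors=[np.array([0.5, 0.5])],
                    obstacles=[np.array([1.0, 0.0])]))
```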
Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we comprehensively review the development of AICA over the last two decades, focusing especially on state-of-the-art methods with respect to three main challenges -- the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and a description of available datasets for evaluation, with a quantitative comparison of label noise and dataset bias. We then summarize and compare representative approaches to (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods for dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA-based applications. Finally, we discuss some challenges and promising research directions for the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
Detection and recognition of text in natural images are two main problems in computer vision with a wide variety of applications, such as the analysis of sports videos, autonomous driving, and industrial automation. Both face common challenges arising from how text is represented and how it is affected by environmental conditions. Current state-of-the-art scene text detection and recognition methods have exploited recent advances in deep learning architectures and report superior accuracy on benchmark datasets when tackling multi-resolution and multi-oriented text. However, several challenges in wild images remain, causing existing methods to underperform because their models cannot generalize to unseen data and because labeled data are insufficient. Thus, unlike previous surveys in this field, the objectives of this survey are as follows. First, we offer the reader not only a review of recent advances in scene text detection and recognition, but also the results of extensive experiments in a unified evaluation framework that assesses pre-trained models of the selected methods on challenging cases and applies the same evaluation criteria to all of them. Second, we identify several existing challenges for detecting or recognizing text in wild images, namely in-plane rotation, multi-oriented and multi-resolution text, perspective distortion, illumination reflection, partial occlusion, complex fonts, and special characters. Finally, the paper provides insight into potential research directions to address some of the challenges that scene text detection and recognition techniques still encounter.
Reinforcement learning is one of the core components in designing artificial intelligence systems that emphasize real-time response. Reinforcement learning drives a system to take actions in an arbitrary environment, either with or without prior knowledge of the environment model. In this paper, we present a comprehensive study of reinforcement learning covering several dimensions, including challenges, recent developments in state-of-the-art techniques, and future directions. The fundamental objective of this paper is to provide a presentation of the available reinforcement learning methods that is informative and easy to follow for new researchers and academics in this domain, taking the latest concerns into account. First, we illustrate the core techniques of reinforcement learning in an easily understandable and comparable way. We then analyze and describe recent developments in reinforcement learning approaches. Our analysis points out that most models focus on tuning policy values rather than on other aspects of a particular state of reasoning.
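As one concrete example of the core techniques such a study covers, the sketch below runs tabular Q-learning on a toy chain environment; the environment, reward, and hyperparameters are illustrative assumptions only.

```python
import numpy as np

# Tabular Q-learning on a toy 1-D chain: states 0..4, actions left/right, reward at the end.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def step(state, action):
    """Move left (0) or right (1); reward 1 for reaching the terminal rightmost state."""
    nxt = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if nxt == n_states - 1 else 0.0
    return nxt, reward, nxt == n_states - 1

rng = np.random.default_rng(0)
for _ in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # Core value update: move Q(s, a) toward the bootstrapped target.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.round(Q, 2))
```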