Hand gestures are a new and promising interface for locomotion in virtual environments. While several previous studies have proposed different hand gestures for virtual locomotion, little is known about how they differ in performance and user preference in virtual locomotion tasks. In the present paper, we present three hand gesture interfaces and their algorithms for locomotion: the Finger Distance gesture, the Finger Number gesture and the Finger Tapping gesture. These gestures were inspired by previous studies of gesture-based locomotion interfaces and are typical gestures that people are familiar with in their daily lives. Implementing these hand gesture interfaces in the present study enabled us to compare them systematically. In addition, to compare the usability of these gestures with that of gamepad-based locomotion interfaces, we also designed and implemented a gamepad interface based on the Xbox One controller. We conducted empirical studies to compare these four interfaces through two virtual locomotion tasks. A desktop setup was used instead of sharing a head-mounted display among participants because of Covid-19 concerns. Through these tasks, we assessed the performance and user preference of these interfaces on speed control and waypoint navigation. Results showed that the user preference and performance of the Finger Distance gesture were close to those of the gamepad interface. The Finger Number gesture also had performance and user preference close to those of the Finger Distance gesture. Our study demonstrates that the Finger Distance gesture and the Finger Number gesture are very promising interfaces for virtual locomotion. We also discuss the further improvements that the Finger Tapping gesture needs before it can be used for virtual walking.
The combination of virtual reality (VR) and human-computer interaction (HCI) has radically changed the way users approach a virtual environment, increasing the feeling of VR immersion and improving the user experience and usability. The evolution of these two technologies has led to a focus on VR locomotion and interaction. Locomotion is generally controller-based, but today hand gesture recognition methods are also used for this purpose. However, hand gestures can be stressful for the user, who has to hold the activating gesture for a long time to sustain locomotion, especially continuous locomotion. Likewise, in a Head Mounted Display (HMD)-based virtual environment or a spherical-based system, the use of classic controllers for 3D scene interaction could be unnatural for the user compared to using hand gestures, e.g., pinching to grab 3D objects. To address these issues, we propose a user study comparing the use of classic controllers (six-degree-of-freedom (6-DOF) or trackballs) in HMD and spherical-based systems with hand tracking and gestures in both immersive VR modes. In particular, we focus on the possible differences between spherical-based systems and HMDs in terms of the level of immersion perceived by the user, the mode of user interaction (controller and hands), and the reaction of users concerning usefulness, easiness, and behavioral intention to use.
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset and baselines are available at //github.com/rossumai/docile.
Many tools empower analysts and data scientists to consume analysis results in a visual interface, such as a dashboard. When the underlying data changes, these results need to be updated, but this update can take a long time -- all while the user continues to explore the results. In this context, tools can either (i) hide away results that haven't been updated, hindering exploration; (ii) make the updated results immediately available to the user (on the same screen as old results), leading to confusion and incorrect insights; or (iii) present old -- and therefore stale -- results to the user during the update. To help users reason about these options and others, and make appropriate trade-offs, we introduce Transactional Panorama, a formal framework that adopts transactions to jointly model the system refreshing the analysis results and the user interacting with them. We introduce three key properties that are important for user perception in this context: visibility (allowing users to continuously explore results), consistency (ensuring that the results presented are from the same version of the data), and monotonicity (making sure that results don't "go back in time"). Within Transactional Panorama, we characterize all of the feasible property combinations, design new mechanisms (that we call lenses) for presenting analysis results to the user while preserving a given property combination, formally prove their relative orderings for various performance criteria, and discuss their use cases. We propose novel algorithms to preserve each property combination and efficiently present fresh analysis results. We implement our Transactional Panorama framework in a popular, open-source BI tool, illustrate the relative performance implications of different lenses, demonstrate the benefits of the novel lenses, and outline the performance improvement achieved by our optimizations.
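The three properties named in the abstract above can be illustrated with a minimal Python sketch. It is not the framework's formal definition: the `View` representation (a mapping from chart name to the data version it shows, with `None` for a hidden result) and all function names are assumptions made for illustration, as is the choice to ignore hidden results when checking consistency.

```python
from typing import Dict, Optional

# Hypothetical model: a view maps each dashboard chart to the data version
# it currently displays; None means the result is hidden during the refresh.
View = Dict[str, Optional[int]]


def is_visible(view: View) -> bool:
    """Visibility: no result is hidden from the user."""
    return all(version is not None for version in view.values())


def is_consistent(view: View) -> bool:
    """Consistency: every displayed result comes from the same data version.

    Assumption: hidden results are ignored, since nothing is displayed for them.
    """
    shown = {version for version in view.values() if version is not None}
    return len(shown) <= 1


def is_monotonic(before: View, after: View) -> bool:
    """Monotonicity: no chart moves back to an older data version."""
    for chart, old in before.items():
        new = after.get(chart)
        if old is not None and new is not None and new < old:
            return False
    return True


# The three options from the abstract, for a two-chart dashboard
# whose data moves from version 1 to version 2.
stale = {"sales": 1, "profit": 1}      # (iii) old results kept on screen
mixed = {"sales": 2, "profit": 1}      # (ii) fresh and old results side by side
hidden = {"sales": 2, "profit": None}  # (i) not-yet-updated result hidden
```

Under these assumptions, `stale` is visible and consistent but not fresh, `mixed` is visible but inconsistent, and `hidden` is consistent but not visible, which mirrors the trade-off the abstract describes.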
This article presents a novel telepresence system for advancing aerial manipulation in dynamic and unstructured environments. The proposed system not only features a haptic device, but also a virtual reality (VR) interface that provides real-time 3D displays of the robot's workspace as well as haptic guidance to its remotely located operator. To realize this, multiple sensors, namely a LiDAR, cameras and IMUs, are utilized. For processing of the acquired sensory data, pose estimation pipelines are devised for industrial objects of both known and unknown geometries. We further propose an active learning pipeline in order to increase the sample efficiency of a pipeline component that relies on Deep Neural Network (DNN)-based object detection. All these algorithms jointly address various challenges encountered during the execution of perception tasks in industrial scenarios. In the experiments, exhaustive ablation studies are provided to validate the proposed pipelines. Methodologically, these results commonly suggest how an awareness of the algorithms' own failures and uncertainty ('introspection') can be used to tackle the encountered problems. Moreover, outdoor experiments are conducted to evaluate the effectiveness of the overall system in enhancing aerial manipulation capabilities. In particular, with flight campaigns over days and nights, from spring to winter, and with different users and locations, we demonstrate over 70 robust executions of pick-and-place, force application and peg-in-hole tasks with the DLR cable-Suspended Aerial Manipulator (SAM). As a result, we show the viability of the proposed system in future industrial applications.
In recent years, thanks to the rapid development of deep learning (DL), DL-based multi-task learning (MTL) has made significant progress, and it has been successfully applied to recommendation systems (RS). However, in a recommender system, the correlations among the involved tasks are complex. Therefore, the existing MTL models designed for RS suffer from negative transfer to different degrees, which harms optimization in MTL. We find that the root cause of negative transfer is feature redundancy, in which features learned for different tasks interfere with each other. To alleviate the issue of negative transfer, we propose a novel multi-task learning method termed Feature Decomposition Network (FDN). The key idea of the proposed FDN is to reduce feature redundancy by explicitly decomposing features into task-specific features and task-shared features with carefully designed constraints. We demonstrate the effectiveness of the proposed method on two datasets, a synthetic dataset and a public dataset (i.e., Ali-CCP). Experimental results show that our proposed FDN can outperform the state-of-the-art (SOTA) methods by a noticeable margin.
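To make the idea of decomposing features into task-shared and task-specific parts more concrete, here is a minimal Python sketch. The abstract does not specify the "carefully designed constraints", so the orthogonality penalty below is only one plausible constraint chosen for illustration, and every name (`orthogonality_penalty`, `task_input`) is hypothetical rather than taken from FDN.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonality_penalty(shared: Vector, specific: Vector) -> float:
    """Assumed constraint: squared inner product of the two feature vectors.

    The penalty is zero when the task-shared and task-specific features are
    orthogonal, i.e. when they carry no linearly overlapping information.
    """
    return dot(shared, specific) ** 2


def task_input(shared: Vector, specific: Vector) -> Vector:
    """Each task's tower sees the shared features concatenated with its own."""
    return shared + specific
```

In a training loop, a penalty of this kind would be added to the task losses with some weight, pushing the shared and task-specific representations apart so that they are less redundant.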
3D-aware GANs offer new capabilities for creative content editing, such as view synthesis, while preserving the editing capability of their 2D counterparts. Using GAN inversion, these methods can reconstruct an image or a video by optimizing or predicting a latent code and achieve semantic editing by manipulating the latent code. However, a model pre-trained on a face dataset (e.g., FFHQ) often has difficulty handling faces with out-of-distribution (OOD) objects (e.g., heavy make-up or occlusions). We address this issue by explicitly modeling OOD objects in face videos. Our core idea is to represent the face in a video using two neural radiance fields, one for in-distribution and the other for out-of-distribution data, and compose them together for reconstruction. Such explicit decomposition alleviates the inherent trade-off between reconstruction fidelity and editability. We evaluate our method's reconstruction accuracy and editability on challenging real videos and showcase favorable results against other baselines.
Intention prediction has become a relevant field of research in Human-Machine and Human-Robot Interaction. Indeed, any artificial system (co)-operating with and along humans, designed to assist and coordinate its actions with a human partner, would benefit from first inferring the human's current intention. To spare the user the cognitive burden of explicitly uttering their goals, this inference relies mostly on behavioral cues deemed indicative of the current action. It has long been known that eye movements are highly anticipatory of the single steps unfolding during a task; hence, they can serve as a very early and reliable behavioral cue for intention recognition. This review aims to draw a line between insights in the psychological literature on visuomotor control and relevant applications of gaze-based intention recognition in technical domains, with a focus on teleoperated and assistive robotic systems. Starting from the cognitive principles underlying the relationship between intentions, eye movements, and action, the use of eye tracking and gaze-based models for intent recognition in Human-Robot Interaction is considered, with prevalent methodologies and their diverse applications. Finally, special consideration is given to relevant human factors issues and current limitations to be factored in when designing such systems.
Gaze cueing is a fundamental part of social interactions and is broadly studied using Posner-task-based gaze cueing paradigms. While studies using human stimuli consistently yield a gaze cueing effect, results from studies using robotic stimuli are inconsistent. Typically, these studies use virtual agents or pictures of robots. As previous research has pointed to the significance of physical presence in human-robot interaction, it is of fundamental importance to understand its yet unexplored role in interactions with gaze cues. This paper investigates whether the physical presence of the iCub humanoid robot affects the strength of the gaze cueing effect in human-robot interaction. We exposed 42 participants to a gaze cueing task. We asked participants to react as quickly and accurately as possible to the appearance of a target stimulus that was either congruently or incongruently cued by the gaze of a copresent iCub robot or a virtual version of the same robot. Analysis of the reaction time measurements showed that participants were consistently affected by their robot interaction partner's gaze, independently of the way the robot was presented. Additional analyses of participants' ratings of the robot's anthropomorphism, animacy and likeability further add to the impression that presence does not play a significant role in simple gaze-based interactions. Together, our findings open up interesting discussions about the possibility of generalizing results from studies using virtual agents to real-life interactions with copresent robots.
Meta-learning has gained wide popularity as a training framework that is more data-efficient than traditional machine learning methods. However, its generalization ability in complex task distributions, such as multimodal tasks, has not been thoroughly studied. Recently, some studies on multimodality-based meta-learning have emerged. This survey provides a comprehensive overview of the multimodality-based meta-learning landscape in terms of methodologies and applications. We first formalize the definitions of meta-learning and multimodality, along with the research challenges in this growing field, such as how to enrich the input in few-shot or zero-shot scenarios and how to generalize the models to new tasks. We then propose a new taxonomy to systematically discuss typical meta-learning algorithms combined with multimodal tasks. We investigate the contributions of related papers and summarize them according to our taxonomy. Finally, we propose potential research directions for this promising field.
Collaborative filtering often suffers from sparsity and cold start problems in real recommendation scenarios; therefore, researchers and engineers usually use side information to address these issues and improve the performance of recommender systems. In this paper, we consider knowledge graphs as the source of side information. We propose MKR, a Multi-task feature learning approach for Knowledge graph enhanced Recommendation. MKR is a deep end-to-end framework that utilizes a knowledge graph embedding task to assist the recommendation task. The two tasks are associated by cross&compress units, which automatically share latent features and learn high-order interactions between items in recommender systems and entities in the knowledge graph. We prove that cross&compress units have sufficient capability of polynomial approximation, and show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning. Through extensive experiments on real-world datasets, we demonstrate that MKR achieves substantial gains in movie, book, music, and news recommendation over state-of-the-art baselines. MKR is also shown to maintain decent performance even if user-item interactions are sparse.
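The abstract above does not spell out how a cross&compress unit works, so the following Python sketch rests on an assumed formulation: the unit first forms the outer product of an item vector and an entity vector (cross), then projects that matrix and its transpose back to two vectors with learned weight vectors and biases (compress). All parameter names (`w_vv`, `w_ev`, `w_ve`, `w_ee`, `b_v`, `b_e`) are hypothetical labels for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def cross_compress(
    v: Vector, e: Vector,
    w_vv: Vector, w_ev: Vector,
    w_ve: Vector, w_ee: Vector,
    b_v: Vector, b_e: Vector,
) -> Tuple[Vector, Vector]:
    """One assumed cross&compress step on item vector v and entity vector e."""
    d = len(v)
    # Cross: pairwise feature interactions, c[i][j] = v[i] * e[j].
    c = [[v[i] * e[j] for j in range(d)] for i in range(d)]

    def project(w: Vector) -> Vector:
        # Computes C w, mapping the d x d matrix back to a d-vector.
        return [sum(c[i][j] * w[j] for j in range(d)) for i in range(d)]

    def project_t(w: Vector) -> Vector:
        # Computes C^T w, using the transposed interaction matrix.
        return [sum(c[j][i] * w[j] for j in range(d)) for i in range(d)]

    # Compress: each output mixes C and C^T, so both tasks share features.
    v_next = [a + b + bias for a, b, bias in zip(project(w_vv), project_t(w_ev), b_v)]
    e_next = [a + b + bias for a, b, bias in zip(project(w_ve), project_t(w_ee), b_e)]
    return v_next, e_next


# Usage with toy two-dimensional vectors and hand-picked weights.
v_next, e_next = cross_compress(
    v=[1.0, 2.0], e=[3.0, 4.0],
    w_vv=[1.0, 0.0], w_ev=[0.0, 1.0],
    w_ve=[1.0, 1.0], w_ee=[0.0, 0.0],
    b_v=[0.0, 0.0], b_e=[1.0, 1.0],
)
```

Because the outputs have the same dimension as the inputs, units of this form can be stacked, which is one way the higher-order item-entity interactions mentioned in the abstract could arise.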