In this paper, we present a framework that unites obstacle avoidance and deliberate physical interaction for robotic manipulators. As humans and robots begin to coexist in work and household environments, pure collision avoidance is insufficient: human-robot contact is inevitable and, in some situations, desirable. Our work enables manipulators to anticipate, detect, and act on contact. To achieve this, we allow limited deviation from the robot's original trajectory through velocity reduction and motion restrictions. Then, if contact occurs, the robot can detect it and maneuver based on a novel dynamic contact-thresholding algorithm. The core contribution of this work is dynamic contact thresholding, which allows a manipulator with onboard proximity sensors to track nearby objects and reduce contact forces in anticipation of a collision. Our framework elicits natural behavior during physical human-robot interaction. We evaluate our system on a variety of scenarios using the Franka Emika Panda robot arm; collectively, our results demonstrate that our framework not only avoids and reacts to contact but also anticipates it.
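The abstract gives no implementation details, but the thresholding idea can be illustrated with a minimal Python sketch; all function names, force thresholds, and distances below are hypothetical placeholders, not the paper's actual algorithm:

```python
import numpy as np

def dynamic_contact_threshold(min_distance, tau_max=15.0, tau_min=3.0, d_max=0.5):
    """Scale the contact-force threshold (N) with the distance (m) to the
    nearest object reported by the proximity sensors: the closer an object,
    the smaller the force needed to register contact."""
    alpha = np.clip(min_distance / d_max, 0.0, 1.0)
    return tau_min + alpha * (tau_max - tau_min)

def control_step(q_dot_nominal, ext_force, min_distance):
    """One control cycle: slow down near obstacles, stop and react on contact."""
    threshold = dynamic_contact_threshold(min_distance)
    if np.linalg.norm(ext_force) > threshold:
        return np.zeros_like(q_dot_nominal), "contact"   # hand over to reaction maneuver
    scale = np.clip(min_distance / 0.5, 0.2, 1.0)        # velocity reduction near objects
    return scale * np.asarray(q_dot_nominal), "tracking"
```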
Cross-blockchain communication has gained traction due to the increasing fragmentation of blockchain networks and scalability solutions such as side-chaining and sharding. With SmartSync, we propose a novel concept for cross-blockchain smart contract interactions that creates client contracts on arbitrary blockchain networks supporting the same execution environment. Client contracts mirror the logic and state of the original instance and enable seamless on-chain function executions on recent states. Synchronized contracts supply instant read-only function calls to other applications hosted on the target blockchain. In this way, current limitations in cross-chain communication are alleviated and new forms of contract interaction are enabled. State updates are transmitted in a verifiable manner using Merkle proofs and do not require trusted intermediaries. To permit lightweight synchronization, we introduce transition confirmations, which facilitate the application of verifiable state transitions without re-executing transactions of the source blockchain. We demonstrate the concept's soundness by providing a prototypical implementation that enables smart contract forks, state synchronizations, and on-chain validation on EVM-compatible blockchains. Our evaluation confirms SmartSync's applicability for the presented use cases, providing third-party contracts on the target blockchain with access to recent states. Execution costs scale sub-linearly with the number of value updates and depend on the depth and index of the corresponding Merkle proofs.
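The Merkle-proof verification that underpins these state updates is standard; a minimal sketch in Python (using SHA-256 as a stand-in for the EVM's Keccak-256, with a hypothetical proof format) shows how a client contract could check that a transmitted value belongs to the source chain's state root:

```python
import hashlib

def hash_node(data: bytes) -> bytes:
    # SHA-256 stands in for the EVM's Keccak-256 in this sketch.
    return hashlib.sha256(data).digest()

def verify_merkle_proof(leaf: bytes, proof: list[tuple[bytes, str]], root: bytes) -> bool:
    """Recompute the root from a leaf and its sibling path; a client contract
    would accept a state update only if the recomputed root matches the root
    it trusts for the source blockchain."""
    node = hash_node(leaf)
    for sibling, side in proof:          # side: 'L' if the sibling is the left child
        pair = sibling + node if side == "L" else node + sibling
        node = hash_node(pair)
    return node == root
```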
Robotic systems are increasingly integrated into numerous areas of modern life. From cleaning houses to providing guidance and emotional support, robots now work directly with humans. Due to their far-reaching applications and progressively complex architectures, they are being targeted by adversarial attacks such as sensor-actuator attacks, data spoofing, malware, and network intrusion. Security for robotic systems has therefore become crucial. In this paper, we address the underserved area of malware detection in robotic software. Since robots work in close proximity to humans, often with direct interaction, malware could have life-threatening impacts. We therefore propose the RoboMal framework for static malware detection on binary executables, detecting malware before it has a chance to execute. Additionally, we address the paucity of data in this space by providing the RoboMal dataset, comprising controller executables of a small-scale autonomous car. The performance of the framework is compared against widely used supervised learning models: GRU, CNN, and ANN. Notably, the LSTM-based RoboMal model outperforms the other models with an accuracy of 85% and precision of 87% in 10-fold cross-validation, demonstrating the effectiveness of the proposed framework.
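As a rough illustration of such a detector (hyperparameters and architecture details are assumptions, not those reported for RoboMal), an LSTM over the raw byte sequence of an executable can be set up in a few lines of PyTorch:

```python
import torch
import torch.nn as nn

class MalwareLSTM(nn.Module):
    """Binary classifier over byte sequences extracted from executables
    (dimensions are illustrative placeholders, not the paper's)."""
    def __init__(self, vocab=256, embed=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)   # one token per byte value
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # benign vs. malware

    def forward(self, byte_ids):                  # byte_ids: (batch, seq_len) ints in [0, 255]
        _, (h, _) = self.lstm(self.embed(byte_ids))
        return self.head(h[-1])                   # logits over the two classes
```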
A real-time motion training system for skydiving is proposed. Aerial maneuvers are performed by changing the body posture and thus deflecting the surrounding airflow. The natural learning process is extremely slow due to the unfamiliar free-fall dynamics, stress-induced blocking of kinesthetic feedback, and the complexity of the required movements. The key idea is to augment the learner with an automatic control system that would be able to perform the trained activity if it had direct access to the learner's body as an actuator. The aiding system supplies three visual cues to the learner: (1) feedback of the current body posture; (2) the body posture that would bring the body to perform the desired maneuver; and (3) a prediction of the future inertial position and orientation if the body retains its present posture. The system will enable novices to maintain stability in free-fall and perceive the unfamiliar environmental dynamics, thus accelerating the initial stages of skill acquisition. This paper presents the results of a proof-of-concept experiment in which participants controlled a virtual skydiver free-falling in a computer simulation by means of their own bodies. The task was impossible without the aiding system, which enabled all participants to complete it on their first attempt.
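The third cue amounts to a forward rollout of the free-fall dynamics under a frozen posture. A minimal sketch under strong simplifying assumptions (a constant aerodynamic acceleration standing in for the airflow model; all names and constants are hypothetical):

```python
import numpy as np

def predict_trajectory(state, posture_accel, horizon=2.0, dt=0.05, g=9.81):
    """Roll the free-fall dynamics forward assuming the current posture (and
    hence its aerodynamic acceleration) is held fixed; the result would drive
    the 'future position and orientation' visual cue."""
    pos = np.array(state["pos"], dtype=float)
    vel = np.array(state["vel"], dtype=float)
    gravity = np.array([0.0, 0.0, -g])
    trajectory = []
    for _ in range(int(horizon / dt)):           # simple forward Euler rollout
        vel += (gravity + posture_accel) * dt
        pos += vel * dt
        trajectory.append(pos.copy())
    return np.array(trajectory)
```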
Learning to solve complex manipulation tasks from visual observations is a dominant challenge for real-world robot learning. Although deep reinforcement learning algorithms have recently demonstrated impressive results in this context, they still require an impractical amount of time-consuming trial-and-error iterations. In this work, we consider the promising alternative paradigm of interactive learning in which a human teacher provides feedback to the policy during execution, as opposed to imitation learning where a pre-collected dataset of perfect demonstrations is used. Our proposed CEILing (Corrective and Evaluative Interactive Learning) framework combines both corrective and evaluative feedback from the teacher to train a stochastic policy in an asynchronous manner, and employs a dedicated mechanism to trade off human corrections with the robot's own experience. We present results obtained with our framework in extensive simulation and real-world experiments to demonstrate that CEILing can effectively solve complex robot manipulation tasks directly from raw images in less than one hour of real-world training.
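The abstract does not spell out the trade-off mechanism; one plausible minimal sketch of how corrective and evaluative feedback could be combined per step (the teacher interface and weighting scheme are assumptions, not the paper's exact design):

```python
def ceiling_step(policy_action, teacher):
    """One execution step under interactive supervision (simplified): a
    corrective signal overrides the policy's action and is stored as a
    positive example; evaluative feedback tags the policy's own action
    as good or bad for later weighted training."""
    correction = teacher.get_correction()        # None if the teacher is idle
    if correction is not None:
        return correction, ("corrected", 1.0)    # imitate the teacher here
    label = teacher.get_evaluation()             # e.g. 1 = good, 0 = bad, None = silent
    weight = 1.0 if label in (None, 1) else 0.0  # silence treated as approval
    return policy_action, ("own", weight)
```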
Group testing can help maintain a widespread testing program using fewer resources amid a pandemic. In group testing, we are given $n$ samples, one per individual. These samples are arranged into $m < n$ pooled samples, where each pool is obtained by mixing a subset of the $n$ individual samples. Infected individuals are then identified using a group testing algorithm. In this paper, we use side information (SI) collected from contact tracing (CT) within nonadaptive/single-stage group testing algorithms. We generate CT SI data by incorporating characteristics of disease spread between individuals. These data are fed into two signal and measurement models for group testing, and numerical results show that our algorithms provide improved sensitivity and specificity. We also show how to incorporate CT SI into the design of the pooling matrix. That said, our numerical results suggest that the utilization of SI in the pooling matrix design based on the minimization of a weighted coherence measure does not yield significant performance gains beyond the incorporation of SI in the group testing algorithm.
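The nonadaptive measurement model and the use of CT SI as a prior can be illustrated with a small simulation (pool sizes, prevalence, and prior values below are arbitrary, illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 30                               # individuals, pooled tests (m < n)
x = (rng.random(n) < 0.05).astype(int)       # hidden infection status
A = (rng.random((m, n)) < 0.1).astype(int)   # pooling matrix: pool i contains sample j
y = (A @ x > 0).astype(int)                  # a pool is positive iff it holds an infected sample

# Contact-tracing SI as a prior: contacts of known positives get a higher
# prior probability of infection, which a group testing decoder can exploit.
prior = np.full(n, 0.05)
contacts_of_positives = rng.choice(n, size=10, replace=False)  # illustrative SI
prior[contacts_of_positives] = 0.3
```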
Active inference is a unifying theory for perception and action resting upon the idea that the brain maintains an internal model of the world by minimizing free energy. From a behavioral perspective, active inference agents can be seen as self-evidencing beings that act to fulfill their optimistic predictions, namely preferred outcomes or goals. In contrast, reinforcement learning requires human-designed rewards to accomplish any desired outcome. Although active inference could provide a more natural self-supervised objective for control, its applicability has been limited because of difficulties in scaling the approach to complex environments. In this work, we propose a contrastive objective for active inference that strongly reduces the computational burden of learning the agent's generative model and planning future actions. Our method performs notably better than likelihood-based active inference in image-based tasks, while also being computationally cheaper and easier to train. We compare against reinforcement learning agents that have access to human-designed reward functions, showing that our approach closely matches their performance. Finally, we show that contrastive methods perform significantly better in the presence of distractors in the environment and that our method generalizes goals to variations in the background.
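A common way to realize such a contrastive objective is an InfoNCE-style loss over latent embeddings; a minimal PyTorch sketch (not the paper's exact formulation) is:

```python
import torch
import torch.nn.functional as F

def info_nce(embed_pred, embed_obs, temperature=0.1):
    """Contrastive (InfoNCE-style) score: each predicted latent should match
    its own observation embedding and mismatch the other observations in the
    batch, avoiding an expensive pixel-level likelihood model."""
    logits = embed_pred @ embed_obs.T / temperature      # (batch, batch) similarities
    targets = torch.arange(embed_pred.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, targets)
```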
An agent that can understand natural-language instructions and carry out corresponding actions in the visual world is one of the long-term challenges of Artificial Intelligence (AI). Because instructions from humans are so varied, the agent must be able to link natural language to vision and action in unstructured, previously unseen environments. When the instruction specifies a navigation task, this challenge is called Vision-and-Language Navigation (VLN). It is a booming multi-disciplinary field of increasing importance and extraordinary practicality. Instead of focusing on the details of specific methods, this paper provides a comprehensive survey of VLN tasks and carefully classifies them according to the characteristics of their language instructions. Based on when the instructions are given, tasks can be divided into single-turn and multi-turn. We further divide single-turn tasks into goal-oriented and route-oriented, based on whether the instructions contain a route, and multi-turn tasks into imperative and interactive, based on whether the agent responds to the instructions. This taxonomy enables researchers to better grasp the key point of a specific task and identify directions for future research.
When we humans watch a video of human-object interaction, we can not only infer what is happening but also extract actionable information and imitate those interactions. Current recognition and geometric approaches, on the other hand, lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by using a physics simulator for supervision. Specifically, we use the simulator to predict effects and enforce that the estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the observed motions, (b) by jointly optimizing for contact point and force prediction, we can improve performance on both tasks compared to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few-shot examples.
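The simulator-as-supervision idea reduces to a loss on simulated versus observed motion; a minimal sketch (the simulator interface is a hypothetical placeholder):

```python
def force_supervision_loss(predicted_forces, contact_points, simulator, observed_trajectory):
    """Instead of ground-truth force labels, apply the predicted forces at the
    predicted contact points in a physics simulator and penalize the mismatch
    between the simulated object motion and the motion observed in the video."""
    simulated_trajectory = simulator.rollout(predicted_forces, contact_points)
    return ((simulated_trajectory - observed_trajectory) ** 2).mean()
```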
The latest deep learning methods for object detection provide remarkable performance, but have limits when used in robotic applications. One of the most relevant issues is the long training time, which is due to the large size and imbalance of the associated training sets, characterized by few positive and a large number of negative examples (i.e. background). Previously proposed approaches are based on end-to-end learning by back-propagation [22] or kernel methods trained with Hard Negatives Mining on top of deep features [8]. These solutions are effective, but prohibitively slow for on-line applications. In this paper we propose a novel pipeline for object detection that overcomes this problem and provides comparable performance with a 60x training speedup. Our pipeline combines (i) the Region Proposal Network and the deep feature extractor from [22], to efficiently select candidate RoIs and encode them into powerful representations, with (ii) FALKON [23], a novel kernel-based method that allows fast training on large-scale problems (millions of points). We address the size and imbalance of the training data by exploiting the stochastic subsampling intrinsic to the method and a novel, fast bootstrapping approach. We assess the effectiveness of the approach on a standard Computer Vision dataset (PASCAL VOC 2007 [5]) and demonstrate its applicability to a real robotic scenario with the iCubWorld Transformations dataset [18].
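The bootstrapping idea can be sketched as follows (the classifier interface is assumed to be scikit-learn-like; in the paper it would be a FALKON model, and all sizes below are illustrative):

```python
import numpy as np

def bootstrap_hard_negatives(features, labels, classifier, n_neg, rounds=3):
    """Approximate hard-negative bootstrapping: start from a random subset of
    the abundant negatives, then repeatedly refit and swap in the negatives
    the current model scores highest (i.e. the hardest ones)."""
    pos = features[labels == 1]
    neg = features[labels == 0]
    chosen = neg[np.random.choice(len(neg), n_neg, replace=False)]
    for _ in range(rounds):
        X = np.vstack([pos, chosen])
        y = np.r_[np.ones(len(pos)), np.zeros(len(chosen))]
        classifier.fit(X, y)                        # e.g. a FALKON kernel model
        scores = classifier.decision_function(neg)  # high score = hard negative
        chosen = neg[np.argsort(scores)[-n_neg:]]
    return classifier
```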
This paper introduces a novel neural network-based reinforcement learning approach for robot gaze control. Our approach enables a robot to learn and adapt its gaze control strategy for human-robot interaction without the use of external sensors or human supervision. The robot learns to focus its attention on groups of people from its own audio-visual experiences, independently of their number, positions, and physical appearance. In particular, we use a recurrent neural network architecture in combination with Q-learning to find an optimal action-selection policy; we pre-train the network using a simulated environment that mimics realistic scenarios involving speaking and silent participants, thus avoiding the need for tedious sessions of a robot interacting with people. Our experimental evaluation suggests that the proposed method is robust with respect to parameter estimation, i.e., the parameter values yielded by the method do not have a decisive impact on performance. The best results are obtained when audio and visual information are used jointly. Experiments with the Nao robot indicate that our framework is a step towards the autonomous learning of socially acceptable gaze behavior.
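As a rough sketch of such a recurrent Q-network (a GRU stands in for the paper's recurrent architecture; all dimensions are hypothetical placeholders):

```python
import torch
import torch.nn as nn

class GazeQNetwork(nn.Module):
    """Recurrent Q-network: audio-visual features from the recent time steps
    are summarized by a GRU, and a linear head scores each discrete gaze
    action for Q-learning."""
    def __init__(self, feat_dim=64, hidden=128, n_actions=5):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, feats):                # feats: (batch, time, feat_dim)
        _, h = self.rnn(feats)
        return self.q_head(h[-1])            # one Q-value per gaze action
```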