Triangle mesh-based maps have proven to be a powerful 3D representation of the environment, allowing robots to navigate with universal methods, indoors as well as in challenging outdoor environments with tunnels, hills, and varying slopes. However, any robot that navigates autonomously requires stable, accurate, and continuous localization in the mesh map in which it plans its paths and missions. We present MICP-L, a novel and very fast \textit{Mesh ICP Localization} method that registers one or more range sensors directly on a triangle mesh map to continuously localize a robot, determining its 6D pose in the map. Correspondences between a range sensor and the mesh are found through simulations accelerated by the latest RTX hardware. With MICP-L, pose corrections can be computed quickly and in parallel, even when combining data from different range sensor models. With this work, we aim to significantly advance development in the field of mesh-based environment representation for autonomous robotic applications. MICP-L is open source and fully integrated with ROS and tf.
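The core correction step resembles a point-to-plane ICP update. Below is a minimal Python sketch of such an update, assuming the mesh correspondences (hit points and surface normals obtained from the accelerated ray-casting simulation) are already given; it illustrates the general technique, not the MICP-L implementation itself.

```python
import numpy as np

def point_to_plane_correction(src_pts, map_pts, map_normals):
    """One linearized point-to-plane ICP step (small-angle approximation).

    src_pts      (N, 3): sensor points transformed into the map frame.
    map_pts      (N, 3): corresponding points on the mesh (e.g. ray-cast hits).
    map_normals  (N, 3): unit surface normals at the mesh correspondences.
    Returns a 4x4 incremental transform that reduces the point-to-plane error.
    """
    c = np.cross(src_pts, map_normals)            # rotational part of the Jacobian
    A = np.hstack([c, map_normals])               # (N, 6): [wx wy wz tx ty tz]
    b = np.einsum('ij,ij->i', map_normals, map_pts - src_pts)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)     # 6D correction [omega, t]

    wx, wy, wz, tx, ty, tz = x
    T = np.eye(4)
    T[:3, :3] = np.array([[ 1, -wz,  wy],
                          [ wz,  1, -wx],
                          [-wy,  wx,  1]])        # small-angle rotation I + [omega]x
    T[:3, 3] = [tx, ty, tz]
    return T
```

In a localization loop, this correction would be composed onto the current pose estimate and the correspondences re-simulated until convergence.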
An important challenge in vision-based action recognition is the embedding of spatiotemporal features from two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformability, local joint stride, and temporal stride attention. Two cross-modal tokens are fed into the 3D deformable attention module to create a cross-attention token that reflects their spatiotemporal correlation. Local joint stride attention spatially combines attention and pose tokens. Temporal stride attention reduces the number of input tokens in the attention module along the temporal axis and supports temporal expression learning without using all tokens simultaneously. The deformable transformer iterates L times and combines the last cross-modal token for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and Penn Action datasets, and achieved results better than or comparable to pre-trained state-of-the-art methods, even without a pre-training process. In addition, by visualizing the joints and correlations that matter during action recognition through spatial joint and temporal stride attention, we show the potential for explainable action recognition.
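As a concrete illustration of the token-reduction idea behind temporal stride attention, here is a minimal PyTorch sketch in which queries are sampled at a temporal stride while keys and values still cover every frame; the dimensions and residual structure are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class TemporalStrideAttention(nn.Module):
    """Illustrative sketch: queries taken at a temporal stride, so the output has
    fewer tokens than the input while attending over all frames."""

    def __init__(self, dim=256, heads=8, stride=2):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (B, T, dim) temporal tokens
        q = x[:, ::self.stride]            # (B, ~T/stride, dim) strided queries
        out, _ = self.attn(self.norm(q), self.norm(x), self.norm(x))
        return q + out                     # residual on the reduced token set

# usage: torch.randn(2, 64, 256) -> output with 32 temporal tokens
tokens = torch.randn(2, 64, 256)
reduced = TemporalStrideAttention()(tokens)
```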
Wireless Sensor Network (WSN) applications are reshaping warehouse monitoring systems, allowing them to track and locate massive numbers of logistic entities in real time. To support these tasks, classic Radio Frequency (RF)-based localization approaches (e.g., triangulation and trilateration) face challenges due to multi-path fading and signal loss in noisy warehouse environments. In this paper, we investigate machine learning methods on a new grid-based WSN platform called Sensor Floor that can overcome these issues. Sensor Floor consists of 345 nodes installed across the floor of our logistics research hall, each with dual-band RF and Inertial Measurement Unit (IMU) sensors. Our goal is to localize all logistic entities; for this study, we use a mobile robot. We record distributed sensing measurements of Received Signal Strength Indicator (RSSI) and IMU values as the dataset and position tracking from a Vicon system as the ground truth. The asynchronously collected data is pre-processed and used to train Random Forest and Convolutional Neural Network (CNN) models. The CNN model with regularization outperforms the Random Forest, achieving a localization accuracy of approximately 15 cm. Moreover, the CNN architecture can be configured flexibly depending on the scenario in the warehouse. The hardware, software, and CNN architecture of the Sensor Floor are open source under //github.com/FLW-TUDO/sensorfloor.
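To make the regression setup concrete, the following is a minimal PyTorch sketch of a regularized CNN that maps a grid of per-node RSSI readings to an (x, y) position; the grid layout, channel count, and layer sizes are assumptions for illustration, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class FloorLocalizerCNN(nn.Module):
    """Illustrative regression CNN: grid of RSSI readings -> (x, y) position.
    Layer sizes and the 15x23 node arrangement (345 nodes) are assumptions."""

    def __init__(self, channels=2, grid_h=15, grid_w=23):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.3),          # dropout as regularization
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 2),                      # predicted (x, y) in metres
        )

    def forward(self, rssi_grid):                   # (B, channels, grid_h, grid_w)
        return self.head(self.features(rssi_grid))
```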
Deformable image registration is a key task in medical image analysis. The Brain Tumor Sequence Registration challenge (BraTS-Reg) aims at establishing correspondences between pre-operative and follow-up scans of the same patient diagnosed with an adult brain diffuse high-grade glioma, and intends to address the challenging task of registering longitudinal data with major tissue appearance changes. In this work, we propose a two-stage cascaded network based on the Inception and TransMorph models. The dataset for each patient comprised a native pre-contrast (T1), a contrast-enhanced T1-weighted (T1-CE), a T2-weighted (T2), and a Fluid Attenuated Inversion Recovery (FLAIR) sequence. The Inception model was used to fuse the four image modalities and extract the most relevant information. Then, a variant of the TransMorph architecture was adapted to generate the displacement fields. The loss function was composed of a standard image similarity measure, a diffusion regularizer, and an edge-map similarity measure added to overcome intensity dependence and reinforce correct boundary deformation. We observed that the addition of the Inception module substantially increased the performance of the network. Additionally, performing an initial affine registration before training the model improved accuracy in the landmark error measurements between pre- and post-operative MRIs. Our best model, combining the Inception and TransMorph architectures with an affinely pre-registered dataset, achieved a median absolute error of 2.91 (initial error: 7.8). We achieved 6th place at the time of model submission in the final testing phase of the BraTS-Reg challenge.
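For clarity, here is a hedged PyTorch sketch of a composite loss of the kind described (image similarity plus a diffusion regularizer on the displacement field plus an edge-map similarity term); the specific similarity measures and weights used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def registration_loss(warped, fixed, flow, lambda_reg=0.1, lambda_edge=1.0):
    """Illustrative composite loss: similarity + diffusion regularizer + edge term.
    warped, fixed: (B, C, D, H, W) volumes; flow: (B, 3, D, H, W) displacement field."""

    sim = F.mse_loss(warped, fixed)                           # image similarity term

    # diffusion regularizer: penalize spatial gradients of the displacement field
    dz = flow[:, :, 1:] - flow[:, :, :-1]
    dy = flow[:, :, :, 1:] - flow[:, :, :, :-1]
    dx = flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]
    reg = dz.pow(2).mean() + dy.pow(2).mean() + dx.pow(2).mean()

    def edge_map(img):                                        # gradient-magnitude "edges"
        gz = img[:, :, 1:, :-1, :-1] - img[:, :, :-1, :-1, :-1]
        gy = img[:, :, :-1, 1:, :-1] - img[:, :, :-1, :-1, :-1]
        gx = img[:, :, :-1, :-1, 1:] - img[:, :, :-1, :-1, :-1]
        return torch.sqrt(gz**2 + gy**2 + gx**2 + 1e-8)

    edge = F.mse_loss(edge_map(warped), edge_map(fixed))      # boundary agreement
    return sim + lambda_reg * reg + lambda_edge * edge
```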
Timely and effective feedback within surgical training plays a critical role in developing the skills required to perform safe and efficient surgery. Feedback from expert surgeons, while especially valuable in this regard, is challenging to acquire due to their typically busy schedules, and may be subject to biases. Formal assessment procedures like OSATS and GEARS attempt to provide objective measures of skill, but remain time-consuming. With advances in machine learning, there is an opportunity for fast and objective automated feedback on technical skills. The SimSurgSkill 2021 challenge (hosted as a sub-challenge of EndoVis at MICCAI 2021) aimed to promote and foster work in this endeavor. Using virtual reality (VR) surgical tasks, competitors were tasked with localizing instruments and predicting surgical skill. Here we summarize the winning approaches and how they performed. Using the publicly available dataset and these results as a springboard, future work may enable more efficient training of surgeons through advances in surgical data science. The dataset can be accessed from //console.cloud.google.com/storage/browser/isi-simsurgskill-2021.
The ability to learn from human demonstration enables robots to automate various tasks. However, directly learning from human demonstration is challenging because the structure of the human hand can be very different from that of the desired robot gripper. In this work, we show that manipulation skills can be transferred from a human to a robot through micro-evolutionary reinforcement learning, in which a five-fingered dexterous hand robot gradually evolves into a commercial robot while repeatedly interacting in a physics simulator to continuously update a policy first learned from human demonstration. To deal with the high dimensionality of the robot parameters, we propose a multi-dimensional evolution path searching algorithm that allows joint optimization of both the robot evolution path and the policy. Through experiments on human object manipulation datasets, we show that our framework can efficiently transfer an expert policy trained from human demonstrations in diverse modalities to target commercial robots.
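To illustrate the overall evolution-and-adaptation loop, the Python sketch below linearly interpolates morphology parameters from a human-hand-like robot toward the target gripper and fine-tunes the policy at each intermediate step. It deliberately simplifies: the paper searches over the evolution path rather than fixing it, and `make_env` and `finetune` are hypothetical hooks standing in for the physics simulator and the RL update.

```python
import numpy as np

def evolve_and_finetune(policy, morph_human, morph_target, make_env, finetune, n_steps=10):
    """Conceptual sketch (not the paper's path-search algorithm): evolve the robot
    morphology toward the target gripper and keep adapting the policy.

    make_env(morph) -> simulated environment for a given morphology (hypothetical hook)
    finetune(policy, env) -> policy updated by RL in that environment (hypothetical hook)
    """
    morph_human = np.asarray(morph_human, dtype=float)
    morph_target = np.asarray(morph_target, dtype=float)
    for k in range(1, n_steps + 1):
        alpha = k / n_steps
        morph = (1 - alpha) * morph_human + alpha * morph_target  # intermediate robot
        env = make_env(morph)                                      # rebuild simulated robot
        policy = finetune(policy, env)                             # continue RL from previous policy
    return policy
```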
Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two, modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused through a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision captures the global status of the robot but can often suffer from occlusion, audio provides immediate feedback on key moments that may not even be visible, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.
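As an illustration of fusing the three sensory streams with self-attention, here is a minimal PyTorch sketch that treats the vision, audio, and touch embeddings as three tokens and passes them through a transformer encoder; the dimensions, depth, and pooling are assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn

class MultisensoryFusion(nn.Module):
    """Illustrative sketch: self-attention over per-modality embeddings."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.modality_embed = nn.Parameter(torch.randn(3, dim))  # marks which sense a token is

    def forward(self, vision, audio, touch):              # each: (B, dim) embedding
        tokens = torch.stack([vision, audio, touch], dim=1)      # (B, 3, dim)
        tokens = tokens + self.modality_embed                    # add modality embedding
        fused = self.encoder(tokens)                             # cross-modal self-attention
        return fused.mean(dim=1)                                 # pooled feature for the policy

# usage with random per-modality features
v, a, t = (torch.randn(4, 256) for _ in range(3))
feature = MultisensoryFusion()(v, a, t)                          # (4, 256)
```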
Autonomous Micro Aerial Vehicles are deployed for a variety of tasks, including surveillance and monitoring. Perching and staring allow the vehicle to monitor targets without flying, saving battery power and increasing the overall mission time without the need to frequently replace batteries. This paper addresses the Active Visual Perching (AVP) control problem of autonomously perching on inclined surfaces up to $90^\circ$. Our approach generates dynamically feasible trajectories to navigate to and perch on a desired target location while taking into account actuator and Field of View (FoV) constraints. By replanning in mid-flight, we take advantage of more accurate target localization, increasing the perching maneuver's robustness to target localization or control errors. We leverage the Karush-Kuhn-Tucker (KKT) conditions to identify the compatibility between planning objectives and the visual sensing constraint during the planned maneuver. Furthermore, we experimentally identify the corresponding boundary conditions that maximize the spatio-temporal target visibility during the perching maneuver. The proposed approach works on-board in real time under significant computational constraints, relying exclusively on cameras and an Inertial Measurement Unit (IMU). Experimental results validate the proposed approach and show a higher success rate as well as increased target interception precision and accuracy with respect to a one-shot planning approach, while still retaining aggressive capabilities with flight envelopes that include large excursions from the hover position on inclined surfaces up to 90$^\circ$, angular speeds up to 750~deg/s, and accelerations up to 10~m/s$^2$.
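To ground the trajectory-generation step, the sketch below solves for a quintic polynomial on a single axis that satisfies position, velocity, and acceleration boundary conditions over a horizon T; this is a simplified stand-in that ignores the actuator and FoV constraints handled by the full approach, and replanning amounts to re-solving it toward the refined perch target.

```python
import numpy as np

def quintic_axis(p0, v0, a0, pf, vf, af, T):
    """Coefficients c0..c5 of p(t) = c0 + c1*t + ... + c5*t^5 matching the boundary
    position/velocity/acceleration at t=0 and t=T on one axis."""
    A = np.array([
        [1, 0, 0,     0,       0,        0],
        [0, 1, 0,     0,       0,        0],
        [0, 0, 2,     0,       0,        0],
        [1, T, T**2,  T**3,    T**4,     T**5],
        [0, 1, 2*T,   3*T**2,  4*T**3,   5*T**4],
        [0, 0, 2,     6*T,     12*T**2,  20*T**3],
    ])
    b = np.array([p0, v0, a0, pf, vf, af])
    return np.linalg.solve(A, b)

# e.g. mid-flight replan: from the current state (0 m, 0.5 m/s) to a perch point
# 2 m away with zero terminal velocity and acceleration, over 1.5 s
coeffs = quintic_axis(p0=0.0, v0=0.5, a0=0.0, pf=2.0, vf=0.0, af=0.0, T=1.5)
```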
Interpretability methods are developed to understand the working mechanisms of black-box models, which is crucial to their responsible deployment. Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them. While the former has been addressed in prior work, the latter is often overlooked, resulting in informal model understanding derived from a handful of local explanations. In this paper, we introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding, and propose metrics for its quality assessment. On two domains, ExSum highlights various limitations in the current practice, helps develop accurate model understanding, and reveals easily overlooked properties of the model. We also connect understandability to other properties of explanations such as human alignment, robustness, and counterfactual minimality and plausibility.
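As a toy illustration of what quantifying model understanding over many local explanations can look like, the snippet below scores a generalized claim by its coverage and validity over feature-attribution instances; this is a simplified sketch, and the ExSum framework's formal definitions and metrics may differ.

```python
def rule_metrics(instances, applies, behavior):
    """Score a generalized claim over local explanations (simplified sketch).
    applies(x): does the rule cover attribution instance x?
    behavior(x): does the claimed behavior hold for x?"""
    covered = [x for x in instances if applies(x)]
    coverage = len(covered) / len(instances) if instances else 0.0
    validity = sum(behavior(x) for x in covered) / len(covered) if covered else 0.0
    return coverage, validity

# hypothetical claim: "stopwords receive near-zero saliency"
instances = [("the", 0.01), ("movie", 0.40), ("was", 0.02), ("great", 0.85)]
cov, val = rule_metrics(
    instances,
    applies=lambda x: x[0] in {"the", "was", "a"},
    behavior=lambda x: abs(x[1]) < 0.05,
)
```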
Estimating human pose and shape from monocular images is a long-standing problem in computer vision. Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention. With the same goal of obtaining well-aligned and physically plausible mesh results, two paradigms have been developed to overcome challenges in the 2D-to-3D lifting process: i) an optimization-based paradigm, where different data terms and regularization terms are exploited as optimization objectives; and ii) a regression-based paradigm, where deep learning techniques are embraced to solve the problem in an end-to-end fashion. Meanwhile, continuous efforts are devoted to improving the quality of 3D mesh labels for a wide range of datasets. Though remarkable progress has been achieved in the past decade, the task is still challenging due to flexible body motions, diverse appearances, complex environments, and insufficient in-the-wild annotations. To the best of our knowledge, this is the first survey to focus on the task of monocular 3D human mesh recovery. We start with the introduction of body models and then elaborate on recovery frameworks and training objectives, providing in-depth analyses of their strengths and weaknesses. We also summarize datasets, evaluation metrics, and benchmark results. Open issues and future directions are discussed at the end, hoping to motivate researchers and facilitate their research in this area. A regularly updated project page can be found at //github.com/tinatiansjz/hmr-survey.
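To make the optimization-based paradigm concrete, here is a minimal PyTorch sketch that fits body-model pose and shape parameters by minimizing a 2D reprojection data term plus simple regularizers; `body_model` and `project` are hypothetical hooks (e.g., a statistical body model and a camera projection), and no specific published method is reproduced here.

```python
import torch

def fit_body_model(body_model, project, keypoints_2d, conf, pose, shape, steps=200):
    """Sketch of optimization-based mesh recovery: fit pose/shape to 2D keypoints.
    body_model(pose, shape) -> (J, 3) joints; project(joints_3d) -> (J, 2) pixels.
    keypoints_2d: (J, 2) detected keypoints; conf: (J,) detection confidences."""
    pose = pose.clone().requires_grad_(True)
    shape = shape.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose, shape], lr=0.01)
    for _ in range(steps):
        joints_3d = body_model(pose, shape)                       # model joints in 3D
        joints_2d = project(joints_3d)                            # reproject to the image
        data = (conf[:, None] * (joints_2d - keypoints_2d) ** 2).sum()   # data term
        reg = 1e-3 * pose.pow(2).sum() + 1e-3 * shape.pow(2).sum()       # simple priors
        loss = data + reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach(), shape.detach()
```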
The deployment of Internet of Things (IoT) devices and data fusion techniques has gained popularity in public and government domains. It usually requires capturing and consolidating data from multiple sources. Because datasets do not necessarily originate from identical sensors, fusing them typically results in a complex data problem. Since the military is investigating how heterogeneous IoT devices can aid its processes and tasks, we investigate a multi-sensor approach. Moreover, we propose a signal-to-image encoding approach that transforms and fuses signals from IoT wearable devices into an image that is invertible and easier to visualize, supporting decision making. Furthermore, we investigate the challenge of enabling intelligent identification and detection operations and demonstrate the feasibility of the proposed Deep Learning and Anomaly Detection models, which can support future applications that utilize hand gesture data from wearable devices.
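As one hedged example of what an invertible signal-to-image encoding can look like, the Python sketch below min-max normalizes a 1D sensor window, reshapes it into a 2D image, and reconstructs the original signal from the image; the paper's exact encoding is not specified here, so this only illustrates the invertibility property.

```python
import numpy as np

def signal_to_image(window, height):
    """Encode a 1D sensor window as a 2D image; return the image plus the metadata
    needed to invert the encoding (scale and original length)."""
    lo, hi = window.min(), window.max()
    norm = (window - lo) / (hi - lo + 1e-12)        # min-max normalize to [0, 1]
    pad = (-len(norm)) % height                     # pad so it fills whole rows
    img = np.pad(norm, (0, pad)).reshape(height, -1)
    return img, (lo, hi, len(window))

def image_to_signal(img, meta):
    """Inverse of signal_to_image: recover the original 1D window."""
    lo, hi, n = meta
    return img.reshape(-1)[:n] * (hi - lo + 1e-12) + lo

# round-trip check on a synthetic wearable-sensor-like window
sig = np.sin(np.linspace(0, 20, 300))
img, meta = signal_to_image(sig, height=20)
assert np.allclose(image_to_signal(img, meta), sig)
```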