Understanding social interactions from egocentric views is crucial for many applications, ranging from assistive robotics to AR/VR. Key to reasoning about interactions is to understand the body pose and motion of the interaction partner from the egocentric view. However, research in this area is severely hindered by the lack of datasets. Existing datasets are limited in terms of either size, capture/annotation modalities, ground-truth quality, or interaction diversity. We fill this gap by proposing EgoBody, a novel large-scale dataset for human pose, shape and motion estimation from egocentric views, during interactions in complex 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human shapes and poses relative to the scene, over time. We collect 125 sequences, spanning diverse interaction scenarios, and propose the first benchmark for 3D full-body pose and shape estimation of the social partner from egocentric views. We extensively evaluate state-of-the-art methods, highlight their limitations in the egocentric scenario, and address such limitations by leveraging our high-quality annotations. Data and code are available at //sanweiliti.github.io/egobody/egobody.html.
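To make the multi-view fitting step concrete, the following is a minimal, schematic sketch (in PyTorch) of fitting body-model pose and shape parameters to 2D keypoints by minimizing multi-view reprojection error; the stand-in body model, camera matrices, keypoints, and prior weights are illustrative assumptions, not EgoBody's actual SMPL-X fitting pipeline.

```python
# Hypothetical sketch: fit body-model pose and shape parameters to multi-view
# 2D keypoints by minimizing reprojection error. The body model is a stand-in
# for an expressive model such as SMPL-X; cameras, keypoints, and priors are
# placeholders, not EgoBody's actual calibration data or fitting objective.
import torch

NUM_JOINTS, NUM_VIEWS = 25, 3

def body_model(pose, betas):
    # Placeholder: a real body model maps (pose, shape) to 3D joint locations.
    joints = torch.tanh(pose).view(NUM_JOINTS, 3) * (1.0 + 0.1 * betas.mean())
    return joints + torch.tensor([0.0, 0.0, 3.0])  # keep joints in front of cameras

def project(joints_3d, cam):
    # Pinhole projection with an assumed calibrated 3x4 camera matrix.
    homo = torch.cat([joints_3d, torch.ones(NUM_JOINTS, 1)], dim=1)
    uvw = homo @ cam.T
    return uvw[:, :2] / uvw[:, 2:3]

pose = torch.zeros(NUM_JOINTS * 3, requires_grad=True)
betas = torch.zeros(10, requires_grad=True)
cams = [torch.eye(3, 4) + 0.01 * torch.randn(3, 4) for _ in range(NUM_VIEWS)]
keypoints = [torch.randn(NUM_JOINTS, 2) for _ in range(NUM_VIEWS)]  # detected 2D joints

optimizer = torch.optim.Adam([pose, betas], lr=0.05)
for _ in range(200):
    optimizer.zero_grad()
    joints = body_model(pose, betas)
    loss = sum(((project(joints, c) - k) ** 2).mean() for c, k in zip(cams, keypoints))
    loss = loss + 1e-3 * (pose ** 2).mean() + 1e-3 * (betas ** 2).mean()  # simple priors
    loss.backward()
    optimizer.step()
```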
Semantic segmentation is a challenging computer vision task demanding a significant amount of pixel-level annotated data. Producing such data is a time-consuming and costly process, especially for domains with a scarcity of experts, such as medicine or forensic anthropology. While numerous semi-supervised approaches have been developed to make the most of the limited labeled data and the ample amount of unlabeled data, domain-specific real-world datasets often have characteristics that both reduce the effectiveness of off-the-shelf state-of-the-art methods and provide opportunities to create new methods that exploit them. We propose and evaluate a semi-supervised method that reuses available labels for unlabeled images of a dataset by exploiting existing similarities, while dynamically weighting the impact of these reused labels in the training process. We evaluate our method on a large dataset of human decomposition images and find that our method, while conceptually simple, outperforms state-of-the-art consistency and pseudo-labeling-based methods for the segmentation of this dataset. This paper includes graphic content of human decomposition.
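As a rough illustration of the label-reuse idea (not the paper's exact formulation), the sketch below assigns each unlabeled image the mask of its most similar labeled image and down-weights that reused label in the loss by the match quality; the feature representation, similarity measure, and weighting scheme are assumptions.

```python
# Schematic sketch of similarity-based label reuse with dynamic weighting.
# Feature vectors, the cosine similarity, and the per-image weighting are
# illustrative assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def reuse_labels(unlabeled_feats, labeled_feats, labeled_masks):
    """For each unlabeled image, borrow the mask of its most similar labeled
    image and return a per-image weight equal to the (clamped) similarity."""
    sim = F.normalize(unlabeled_feats, dim=1) @ F.normalize(labeled_feats, dim=1).T
    weight, idx = sim.max(dim=1)                 # best labeled match per unlabeled image
    return labeled_masks[idx], weight.clamp(min=0.0)

def weighted_seg_loss(logits, reused_masks, weights):
    """Cross-entropy on reused labels, scaled per image by match quality."""
    per_pixel = F.cross_entropy(logits, reused_masks, reduction='none')  # (B, H, W)
    per_image = per_pixel.mean(dim=(1, 2))
    return (weights * per_image).mean()

# Toy usage with random tensors standing in for real features and masks.
B_u, B_l, C, H, W = 4, 8, 5, 64, 64
unlabeled_feats = torch.randn(B_u, 128)
labeled_feats = torch.randn(B_l, 128)
labeled_masks = torch.randint(0, C, (B_l, H, W))
logits = torch.randn(B_u, C, H, W, requires_grad=True)

masks, weights = reuse_labels(unlabeled_feats, labeled_feats, labeled_masks)
loss = weighted_seg_loss(logits, masks, weights)
loss.backward()
```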
Wearable cameras make it possible to acquire images and videos from the user's perspective, and these data can be processed to understand human behavior. Although human behavior analysis has been thoroughly investigated in third-person vision, it is still understudied in egocentric settings, and in particular in industrial scenarios. To encourage research in this field, we present MECCANO, a multimodal dataset of egocentric videos to study human behavior understanding in industrial-like settings. The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset. The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first-person view, such as recognizing and anticipating human-object interactions. With the MECCANO dataset, we explored five different tasks: 1) Action Recognition, 2) Active Objects Detection and Recognition, 3) Egocentric Human-Objects Interaction Detection, 4) Action Anticipation and 5) Next-Active Objects Detection. We propose a benchmark aimed at studying human behavior in the considered industrial-like scenario, which demonstrates that the investigated tasks and the considered scenario are challenging for state-of-the-art algorithms. To support research in this field, we publicly release the dataset at //iplab.dmi.unict.it/MECCANO/.
Obstacle detection is a safety-critical problem in robot navigation, where stereo matching is a popular vision-based approach. While deep neural networks have shown impressive results in computer vision, most of the previous obstacle detection works only leverage traditional stereo matching techniques to meet the computational constraints for real-time feedback. This paper proposes a computationally efficient method that leverages a deep neural network to detect occupancy from stereo images directly. Instead of learning the point cloud correspondence from the stereo data, our approach extracts the compact obstacle distribution based on volumetric representations. In addition, we prune the computation of safety-irrelevant spaces in a coarse-to-fine manner based on octrees generated by the decoder. As a result, we achieve real-time performance on the onboard computer (NVIDIA Jetson TX2). Our approach detects obstacles accurately in the range of 32 meters and achieves better IoU (Intersection over Union) and CD (Chamfer Distance) scores with only 2% of the computation cost of the state-of-the-art stereo model. Furthermore, we validate our method's robustness and real-world feasibility through autonomous navigation experiments with a real robot. Hence, our work contributes toward closing the gap between stereo-based systems in robot perception and state-of-the-art stereo models in computer vision. To counter the scarcity of high-quality real-world indoor stereo datasets, we collect a 1.36-hour stereo dataset with a Jackal robot, which is used to fine-tune our model. The dataset, the code, and more visualizations are available at //lhy.xyz/stereovoxelnet/
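The coarse-to-fine pruning idea can be sketched as follows: evaluate occupancy on a cheap coarse grid first, then refine only the cells flagged as occupied. The occupancy predictor below is a synthetic placeholder (a hard-coded "obstacle" sphere), not the proposed network or its octree decoder.

```python
# Minimal sketch of coarse-to-fine occupancy decoding: a coarse pass over the
# whole volume, then a fine pass only inside coarse cells that look occupied.
# The predictor is a synthetic stand-in, not the StereoVoxelNet architecture.
import numpy as np

def predict_occupancy(centers, resolution):
    # Placeholder for a learned decoder; here, a synthetic obstacle sphere.
    return np.linalg.norm(centers - np.array([4.0, 0.0, 1.0]), axis=1) < 1.0 + resolution

def coarse_to_fine(extent=8.0, coarse=2.0, fine=0.5):
    # Coarse pass over the whole volume.
    axes = np.arange(-extent, extent, coarse) + coarse / 2
    coarse_centers = np.stack(np.meshgrid(axes, axes, axes, indexing='ij'), -1).reshape(-1, 3)
    occupied_coarse = coarse_centers[predict_occupancy(coarse_centers, coarse)]

    # Fine pass only inside coarse cells flagged as (possibly) occupied.
    offsets = np.arange(-coarse / 2, coarse / 2, fine) + fine / 2
    local = np.stack(np.meshgrid(offsets, offsets, offsets, indexing='ij'), -1).reshape(-1, 3)
    fine_voxels = []
    for center in occupied_coarse:
        children = center + local
        fine_voxels.append(children[predict_occupancy(children, fine)])
    return np.concatenate(fine_voxels) if fine_voxels else np.empty((0, 3))

voxels = coarse_to_fine()
print(f"{len(voxels)} fine voxels marked occupied")
```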
Learning-based visual odometry (VO) algorithms achieve remarkable performance on common static scenes, benefiting from high-capacity models and massive annotated data, but tend to fail in dynamic, populated environments. Semantic segmentation is commonly used to discard dynamic associations before estimating camera motion, but this comes at the cost of discarding static features and is hard to scale up to unseen categories. In this paper, we leverage the mutual dependence between camera ego-motion and motion segmentation and show that both can be jointly refined in a single learning-based framework. In particular, we present DytanVO, the first supervised learning-based VO method that deals with dynamic environments. It takes two consecutive monocular frames in real time and predicts camera ego-motion in an iterative fashion. Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments, and even performs competitively among dynamic visual SLAM systems which optimize the trajectory on the backend. Experiments on a wide range of unseen environments also demonstrate our method's generalizability.
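The joint refinement can be pictured as an alternating loop: estimate ego-motion from pixels currently believed static, re-segment dynamic pixels given that motion, and iterate. The two placeholder functions below stand in for learned networks and are illustrative assumptions, not DytanVO's architecture.

```python
# Schematic of the iterative interplay between ego-motion estimation and
# motion segmentation. Both "networks" are placeholder functions; the
# convergence criterion and iteration count are illustrative.
import numpy as np

def estimate_ego_motion(flow, static_mask):
    # Placeholder: a pose network would regress camera motion from flow
    # restricted to (likely) static pixels.
    return flow[static_mask].mean(axis=0) if static_mask.any() else np.zeros(2)

def segment_dynamic(flow, ego_motion, threshold=1.0):
    # Pixels whose flow deviates strongly from the ego-motion-induced flow
    # are labeled dynamic (a real method predicts this with a network).
    residual = np.linalg.norm(flow - ego_motion, axis=-1)
    return residual < threshold          # True = static

flow = np.random.randn(120, 160, 2) * 0.2
flow[40:80, 60:100] += np.array([3.0, 0.0])   # a synthetic moving object
static = np.ones(flow.shape[:2], dtype=bool)  # start by trusting every pixel

for _ in range(3):                            # a few refinement rounds
    motion = estimate_ego_motion(flow, static)
    static = segment_dynamic(flow, motion)
print("estimated ego-motion:", motion, "| dynamic pixels:", (~static).sum())
```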
Kitting refers to the task of preparing and grouping necessary parts and tools (or "kits") for assembly in a manufacturing environment. Automating this process simplifies the assembly task for human workers and improves efficiency. Existing automated kitting systems adhere to scripted instructions and predefined heuristics. However, given variability in the availability of parts and logistic delays, the inflexibility of existing systems can limit the overall efficiency of an assembly line. In this paper, we propose a bilevel optimization framework that enables a robot to perform task-segmentation-based part selection, kit arrangement, and delivery scheduling to provide custom-tailored kits just in time - i.e., right when they are needed. We evaluate the proposed approach both through a human-subjects study (n=18) involving the construction of a flat-pack furniture table and through a shop-flow simulation based on data from the study. Our results show that the just-in-time kitting system is objectively more efficient, resilient to upstream shop-flow delays, and subjectively preferable compared to baseline approaches that use either kits defined by the rigid task segmentation boundaries of the task graph itself or a single kit that includes all parts necessary to assemble a single unit.
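A highly schematic and purely illustrative way to picture the bilevel structure is an outer search over how the task is segmented into kits and an inner evaluation of the resulting delivery schedule; the cost terms, task steps, and candidate enumeration below are assumptions, not the paper's formulation.

```python
# Schematic bilevel search: the outer level picks a segmentation of the
# assembly steps into kits, the inner level scores a delivery schedule for
# that segmentation. All costs and candidates are illustrative assumptions.
from itertools import combinations

STEPS = ["legs", "frame", "tabletop", "fasteners", "finish"]

def inner_schedule_cost(kits, delay_per_delivery=2.0, idle_per_oversized_kit=1.0):
    # Placeholder inner problem: more deliveries cost travel time, larger kits
    # cost worker search/idle time. A real system would solve a scheduling problem.
    return delay_per_delivery * len(kits) + idle_per_oversized_kit * sum(
        max(0, len(kit) - 2) for kit in kits)

def outer_segmentations(steps):
    # Enumerate contiguous segmentations of the task sequence into kits.
    n = len(steps)
    for k in range(1, n + 1):
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0,) + cuts + (n,)
            yield [steps[a:b] for a, b in zip(bounds, bounds[1:])]

best = min(outer_segmentations(STEPS), key=inner_schedule_cost)
print("best kitting plan:", best, "| cost:", inner_schedule_cost(best))
```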
The human prioritization of image regions can be modeled in a time-invariant fashion with saliency maps or sequentially with scanpath models. However, while both types of models have steadily improved on several benchmarks and datasets, there is still a considerable gap in predicting human gaze. Here, we leverage two recent developments to reduce this gap: theoretical analyses establishing a principled framework for predicting the next gaze target and the empirical measurement of the human cost for gaze switches independently of image content. We introduce an algorithm in the framework of sequential decision making, which converts any static saliency map into a sequence of dynamic history-dependent value maps, which are recomputed after each gaze shift. These maps are based on 1) a saliency map provided by an arbitrary saliency model, 2) the recently measured human cost function quantifying preferences in magnitude and direction of eye movements, and 3) a sequential exploration bonus, which changes with each subsequent gaze shift. The parameters of the spatial extent and temporal decay of this exploration bonus are estimated from human gaze data. The relative contributions of these three components were optimized on the MIT1003 dataset for the NSS score and are sufficient to significantly outperform predictions of the next gaze target, in terms of NSS and AUC scores, for five state-of-the-art saliency models on three image datasets. Thus, we provide an implementation of human gaze preferences, which can be used to improve arbitrary saliency models' predictions of humans' next gaze targets.
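A minimal sketch of such history-dependent value maps is given below: at each step the next gaze target maximizes a weighted combination of the static saliency map, a penalty for large gaze shifts from the current fixation, and an exploration bonus that is depleted at visited locations. The weights, the Gaussian shift cost, and the decay rate are illustrative assumptions, not the parameters fitted to human data.

```python
# Schematic of turning a static saliency map into history-dependent value maps,
# recomputed after each gaze shift. Weights and spatial constants are assumptions.
import numpy as np

H, W = 64, 64
ys, xs = np.mgrid[0:H, 0:W]
saliency = np.exp(-((ys - 20) ** 2 + (xs - 45) ** 2) / 200.0)  # stand-in saliency map

def shift_cost(fix, sigma=15.0):
    # Larger eye movements are more costly; modeled here as 1 minus a Gaussian falloff.
    d2 = (ys - fix[0]) ** 2 + (xs - fix[1]) ** 2
    return 1.0 - np.exp(-d2 / (2 * sigma ** 2))

bonus = np.ones((H, W))            # exploration bonus, depleted where we fixate
fix = (32, 32)                     # initial fixation at the image center
w_sal, w_cost, w_bonus = 1.0, 0.5, 0.5

scanpath = [fix]
for _ in range(5):
    value = w_sal * saliency - w_cost * shift_cost(fix) + w_bonus * bonus
    fix = np.unravel_index(np.argmax(value), value.shape)   # next gaze target
    bonus *= 1.0 - 0.8 * np.exp(-((ys - fix[0]) ** 2 + (xs - fix[1]) ** 2) / 100.0)
    scanpath.append(fix)
print("predicted scanpath:", scanpath)
```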
Robustly classifying ground infrastructure such as roads and street crossings is an essential task for mobile robots operating alongside pedestrians. While many semantic segmentation datasets are available for autonomous vehicles, models trained on such datasets exhibit a large domain gap when deployed on robots operating in pedestrian spaces. Manually annotating images recorded from pedestrian viewpoints is both expensive and time-consuming. To overcome this challenge, we propose TrackletMapper, a framework for annotating ground surface types such as sidewalks, roads, and street crossings from object tracklets without requiring human-annotated data. To this end, we project the robot ego-trajectory and the paths of other traffic participants into the ego-view camera images, creating sparse semantic annotations for multiple types of ground surfaces from which a ground segmentation model can be trained. We further show that the model can be self-distilled for additional performance benefits by aggregating a ground surface map and projecting it into the camera images, creating a denser set of training annotations compared to the sparse tracklet annotations. We qualitatively and quantitatively validate our findings on a novel large-scale dataset for mobile robots operating in pedestrian areas. Code and dataset will be made available at //trackletmapper.cs.uni-freiburg.de.
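The tracklet-to-annotation projection can be illustrated with a minimal sketch: 3D ground points traversed by the robot (or another traffic participant) are projected into the ego camera and written as sparse per-pixel labels. The intrinsics, extrinsics, and class IDs below are placeholder assumptions, not TrackletMapper's calibration or label set.

```python
# Schematic tracklet-to-label projection: 3D trajectory points are projected
# into the ego camera and painted as sparse semantic labels. Camera parameters
# and class IDs are placeholders.
import numpy as np

H, W = 480, 640
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])    # assumed pinhole intrinsics
T_world_to_cam = np.eye(4)               # assumed extrinsics (identity for brevity)
IGNORE, SIDEWALK = 255, 1

def project_tracklet(points_world, label, label_img):
    """Project 3D trajectory points into the image and write sparse labels."""
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (T_world_to_cam @ homo.T).T[:, :3]
    cam = cam[cam[:, 2] > 0.1]                               # keep points in front of camera
    uvw = (K @ cam.T).T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    label_img[uv[valid, 1], uv[valid, 0]] = label
    return label_img

labels = np.full((H, W), IGNORE, dtype=np.uint8)             # unannotated by default
ego_path = np.stack([np.linspace(-1, 1, 200),                # x
                     np.full(200, 1.2),                      # y (ground below camera)
                     np.linspace(2, 20, 200)], axis=1)       # z (ahead of camera)
labels = project_tracklet(ego_path, SIDEWALK, labels)
print("annotated pixels:", (labels != IGNORE).sum())
```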
Deep learning algorithms for predicting neuroimaging data have shown considerable promise in various applications. Prior work has demonstrated that deep learning models that take advantage of the data's 3D structure can outperform standard machine learning on several learning tasks. However, most prior research in this area has focused on neuroimaging data from adults. Within the Adolescent Brain Cognitive Development (ABCD) dataset, a large longitudinal development study, we examine structural MRI data to predict gender and identify gender-related changes in brain structure. Results demonstrate that gender prediction accuracy is exceptionally high (>97%) with training epochs >200 and that this accuracy increases with age. Brain regions identified as the most discriminative in the task under study include predominantly frontal areas and the temporal lobe. When evaluating gender predictive changes specific to a two-year increase in age, a broader set of visual, cingulate, and insular regions are revealed. Our findings show a robust gender-related structural brain change pattern, even over a small age range. This suggests that it might be possible to study how the brain changes during adolescence by looking at how these changes are related to different behavioral and environmental factors.
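For readers unfamiliar with volumetric models, the following minimal 3D-CNN sketch illustrates how the 3D structure of structural MRI can be exploited for classification; the layer sizes, input resolution, and two-class head are illustrative assumptions, not the architecture used in the study.

```python
# Minimal 3D-CNN sketch for volumetric sMRI classification. Layer sizes and
# the input resolution are illustrative, not the paper's model.
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                    # x: (batch, 1, D, H, W) volumes
        return self.classifier(self.features(x).flatten(1))

model = Simple3DCNN()
volume = torch.randn(2, 1, 64, 64, 64)      # stand-in for preprocessed sMRI volumes
logits = model(volume)
print(logits.shape)                          # torch.Size([2, 2])
```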
Optical coherence tomography (OCT) is a micrometer-scale, volumetric imaging modality that has become a clinical standard in ophthalmology. OCT instruments image by raster-scanning a focused light spot across the retina, acquiring sequential cross-sectional images to generate volumetric data. Patient eye motion during the acquisition poses unique challenges: Non-rigid, discontinuous distortions can occur, leading to gaps in data and distorted topographic measurements. We present a new distortion model and a corresponding fully-automatic, reference-free optimization strategy for computational motion correction in orthogonally raster-scanned, retinal OCT volumes. Using a novel, domain-specific spatiotemporal parametrization of forward-warping displacements, eye motion can be corrected continuously for the first time. Parameter estimation with temporal regularization improves robustness and accuracy over previous spatial approaches. We correct each A-scan individually in 3D in a single mapping, including repeated acquisitions used in OCT angiography protocols. Specialized 3D forward image warping reduces median runtime to < 9 s, fast enough for clinical use. We present a quantitative evaluation on 18 subjects with ocular pathology and demonstrate accurate correction during microsaccades. Transverse correction is limited only by ocular tremor, whereas submicron repeatability is achieved axially (0.51 um median of medians), representing a dramatic improvement over previous work. This allows assessing longitudinal changes in focal retinal pathologies as a marker of disease progression or treatment response, and promises to enable multiple new capabilities such as supersampled/super-resolution volume reconstruction and analysis of pathological eye motion occurring in neurological diseases.
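One plausible way to picture the temporally regularized estimation (purely illustrative notation, not the paper's exact model): with $d_t$ the forward-warping displacement of the A-scan acquired at time $t$, $D(\cdot)$ a data term measuring the dissimilarity of the orthogonally scanned volumes after warping, and $\lambda$ a regularization weight, the displacements could be estimated by minimizing

$$E(d) = D\big(\{x_t + d_t\}\big) + \lambda \sum_{t} \lVert d_{t+1} - d_t \rVert^2,$$

where the second term encodes that eye motion, and hence the per-A-scan displacement, varies smoothly between consecutively acquired A-scans; the symbols $D$, $x_t$, and $\lambda$ are assumptions introduced here for illustration.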
This paper presents ADAMANT, a set of software modules that provides grasp planning capabilities to an existing robot planning and control software framework. Our presented work allows a user to adapt a manipulation task to be used under widely different scenarios with minimal user input, thus reducing the operator's cognitive load. The developed tools include (1) plugin-based components that make it easy to extend default capabilities and to use third-party grasp libraries, (2) an object-centric way to define task constraints, (3) a user-friendly Rviz interface to use the grasp planner utilities, and (4) interactive tools to use perception data to program a task. We tested our framework on a wide variety of robot simulations.
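The plugin-based extension point can be pictured with a minimal interface sketch (in Python, with hypothetical class and method names; ADAMANT's actual API is not reproduced here): default planners and third-party grasp libraries implement a common contract and are selected by name.

```python
# Schematic illustration of a plugin-based grasp-planner interface: built-in
# and third-party planners implement one contract and are looked up by name.
# Class and method names are hypothetical, not ADAMANT's API.
from abc import ABC, abstractmethod

class GraspPlannerPlugin(ABC):
    @abstractmethod
    def plan_grasps(self, object_mesh, task_constraints):
        """Return a list of candidate grasp poses for the given object."""

class TopDownGraspPlugin(GraspPlannerPlugin):
    def plan_grasps(self, object_mesh, task_constraints):
        # Trivial default behavior standing in for a real planner.
        return [{"pose": "top_down", "score": 1.0}]

REGISTRY = {"top_down": TopDownGraspPlugin}

def load_plugin(name):
    # Third-party libraries would register additional entries in REGISTRY.
    return REGISTRY[name]()

planner = load_plugin("top_down")
print(planner.plan_grasps(object_mesh=None, task_constraints={"keep_upright": True}))
```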