This paper proposes a novel application system for the generation of three-dimensional (3D) character animation driven by markerless human body motion capturing. The entire pipeline of the system consists of five stages: 1) the capturing of motion data using multiple cameras, 2) detection of the two-dimensional (2D) human body joints, 3) estimation of the 3D joints, 4) calculation of bone transformation matrices, and 5) generation of character animation. The main objective of this study is to generate a 3D skeleton and animation for 3D characters using multi-view images captured by ordinary cameras. The computational complexity of the 3D skeleton reconstruction based on 3D vision has been reduced as needed to achieve frame-by-frame motion capturing. The experimental results reveal that our system can effectively and efficiently capture human actions and use them to animate 3D cartoon characters in real-time.
Due to the lack of depth information of images and poor detection accuracy in monocular 3D object detection, we proposed the instance depth for multi-scale monocular 3D object detection method. Firstly, to enhance the model's processing ability for different scale targets, a multi-scale perception module based on dilated convolution is designed, and the depth features containing multi-scale information are re-refined from both spatial and channel directions considering the inconsistency between feature maps of different scales. Firstly, we designed a multi-scale perception module based on dilated convolution to enhance the model's processing ability for different scale targets. The depth features containing multi-scale information are re-refined from spatial and channel directions considering the inconsistency between feature maps of different scales. Secondly, so as to make the model obtain better 3D perception, this paper proposed to use the instance depth information as an auxiliary learning task to enhance the spatial depth feature of the 3D target and use the sparse instance depth to supervise the auxiliary task. Finally, by verifying the proposed algorithm on the KITTI test set and evaluation set, the experimental results show that compared with the baseline method, the proposed method improves by 5.27\% in AP40 in the car category, effectively improving the detection performance of the monocular 3D object detection algorithm.
Indoor 3D object detection is an essential task in single image scene understanding, impacting spatial cognition fundamentally in visual reasoning. Existing works on 3D object detection from a single image either pursue this goal through independent predictions of each object or implicitly reason over all possible objects, failing to harness relational geometric information between objects. To address this problem, we propose a dynamic sparse graph pipeline named Explicit3D based on object geometry and semantics features. Taking the efficiency into consideration, we further define a relatedness score and design a novel dynamic pruning algorithm followed by a cluster sampling method for sparse scene graph generation and updating. Furthermore, our Explicit3D introduces homogeneous matrices and defines new relative loss and corner loss to model the spatial difference between target pairs explicitly. Instead of using ground-truth labels as direct supervision, our relative and corner loss are derived from the homogeneous transformation, which renders the model to learn the geometric consistency between objects. The experimental results on the SUN RGB-D dataset demonstrate that our Explicit3D achieves better performance balance than the-state-of-the-art.
Event-based cameras are raising interest within the computer vision community. These sensors operate with asynchronous pixels, emitting events, or "spikes", when the luminance change at a given pixel since the last event surpasses a certain threshold. Thanks to their inherent qualities, such as their low power consumption, low latency and high dynamic range, they seem particularly tailored to applications with challenging temporal constraints and safety requirements. Event-based sensors are an excellent fit for Spiking Neural Networks (SNNs), since the coupling of an asynchronous sensor with neuromorphic hardware can yield real-time systems with minimal power requirements. In this work, we seek to develop one such system, using both event sensor data from the DSEC dataset and spiking neural networks to estimate optical flow for driving scenarios. We propose a U-Net-like SNN which, after supervised training, is able to make dense optical flow estimations. To do so, we encourage both minimal norm for the error vector and minimal angle between ground-truth and predicted flow, training our model with back-propagation using a surrogate gradient. In addition, the use of 3d convolutions allows us to capture the dynamic nature of the data by increasing the temporal receptive fields. Upsampling after each decoding stage ensures that each decoder's output contributes to the final estimation. Thanks to separable convolutions, we have been able to develop a light model (when compared to competitors) that can nonetheless yield reasonably accurate optical flow estimates.
This paper describes algorithms for the animation of men and women performing three dynamic athletic behaviors: running, bicycling, and vaulting. We animate these behaviors using control algorithms that cause a physically realistic model to perform the desired maneuver. For example, control algorithms allow the simulated humans to maintain balance while moving their arms, to run or bicycle at a variety of speeds, and to perform a handspring vault. Algorithms for group behaviors allow a number of simulated bicyclists to ride as a group while avoiding simple patterns of obstacles. We add secondary motion to the animations with spring-mass simulations of clothing driven by the rigid-body motion of the simulated human. For each simulation, we compare the computed motion to that of humans performing similar maneuvers both qualitatively through the comparison of real and simulated video images and quantitatively through the comparison of simulated and biomechanical data.
This article presents a novel telepresence system for advancing aerial manipulation in dynamic and unstructured environments. The proposed system not only features a haptic device, but also a virtual reality (VR) interface that provides real-time 3D displays of the robot's workspace as well as a haptic guidance to its remotely located operator. To realize this, multiple sensors namely a LiDAR, cameras and IMUs are utilized. For processing of the acquired sensory data, pose estimation pipelines are devised for industrial objects of both known and unknown geometries. We further propose an active learning pipeline in order to increase the sample efficiency of a pipeline component that relies on Deep Neural Networks (DNNs) based object detection. All these algorithms jointly address various challenges encountered during the execution of perception tasks in industrial scenarios. In the experiments, exhaustive ablation studies are provided to validate the proposed pipelines. Methodologically, these results commonly suggest how an awareness of the algorithms' own failures and uncertainty (`introspection') can be used tackle the encountered problems. Moreover, outdoor experiments are conducted to evaluate the effectiveness of the overall system in enhancing aerial manipulation capabilities. In particular, with flight campaigns over days and nights, from spring to winter, and with different users and locations, we demonstrate over 70 robust executions of pick-and-place, force application and peg-in-hole tasks with the DLR cable-Suspended Aerial Manipulator (SAM). As a result, we show the viability of the proposed system in future industrial applications.
We study the performance of state-of-the-art human keypoint detectors in the context of close proximity human-robot interaction. The detection in this scenario is specific in that only a subset of body parts such as hands and torso are in the field of view. In particular, (i) we survey existing datasets with human pose annotation from the perspective of close proximity images and prepare and make publicly available a new Human in Close Proximity (HiCP) dataset; (ii) we quantitatively and qualitatively compare state-of-the-art human whole-body 2D keypoint detection methods (OpenPose, MMPose, AlphaPose, Detectron2) on this dataset; (iii) since accurate detection of hands and fingers is critical in applications with handovers, we evaluate the performance of the MediaPipe hand detector; (iv) we deploy the algorithms on a humanoid robot with an RGB-D camera on its head and evaluate the performance in 3D human keypoint detection. A motion capture system is used as reference. The best performing whole-body keypoint detectors in close proximity were MMPose and AlphaPose, but both had difficulty with finger detection. Thus, we propose a combination of MMPose or AlphaPose for the body and MediaPipe for the hands in a single framework providing the most accurate and robust detection. We also analyse the failure modes of individual detectors -- for example, to what extent the absence of the head of the person in the image degrades performance. Finally, we demonstrate the framework in a scenario where a humanoid robot interacting with a person uses the detected 3D keypoints for whole-body avoidance maneuvers.
3D-aware GANs offer new capabilities for creative content editing, such as view synthesis, while preserving the editing capability of their 2D counterparts. Using GAN inversion, these methods can reconstruct an image or a video by optimizing/predicting a latent code and achieve semantic editing by manipulating the latent code. However, a model pre-trained on a face dataset (e.g., FFHQ) often has difficulty handling faces with out-of-distribution (OOD) objects, (e.g., heavy make-up or occlusions). We address this issue by explicitly modeling OOD objects in face videos. Our core idea is to represent the face in a video using two neural radiance fields, one for in-distribution and the other for out-of-distribution data, and compose them together for reconstruction. Such explicit decomposition alleviates the inherent trade-off between reconstruction fidelity and editability. We evaluate our method's reconstruction accuracy and editability on challenging real videos and showcase favorable results against other baselines.
Diffusion Models (DMs) achieve state-of-the-art performance in generative tasks, boosting a wave in AI for Art. Despite the success of commercialization, DMs meanwhile provide tools for copyright violations, where infringers benefit from illegally using paintings created by human artists to train DMs and generate novel paintings in a similar style. In this paper, we show that it is possible to create an image $x'$ that is similar to an image $x$ for human vision but unrecognizable for DMs. We build a framework to define and evaluate this adversarial example for diffusion models. Based on the framework, we further propose AdvDM, an algorithm to generate adversarial examples for DMs. By optimizing upon different latent variables sampled from the reverse process of DMs, AdvDM conducts a Monte-Carlo estimation of adversarial examples for DMs. Extensive experiments show that the estimated adversarial examples can effectively hinder DMs from extracting their features. Our method can be a powerful tool for human artists to protect their copyright against infringers with DM-based AI-for-Art applications.
For the problems of low recognition rate and slow recognition speed of traditional detection methods in IC appearance defect detection, we propose an IC appearance defect detection algo-rithm IH-ViT. Our proposed model takes advantage of the respective strengths of CNN and ViT to acquire image features from both local and global aspects, and finally fuses the two features for decision making to determine the class of defects, thus obtaining better accuracy of IC defect recognition. To address the problem that IC appearance defects are mainly reflected in the dif-ferences in details, which are difficult to identify by traditional algorithms, we improved the tra-ditional ViT by performing an additional convolution operation inside the batch. For the problem of information imbalance of samples due to diverse sources of data sets, we adopt a dual-channel image segmentation technique to further improve the accuracy of IC appearance defects. Finally, after testing, our proposed hybrid IH-ViT model achieved 72.51% accuracy, which is 2.8% and 6.06% higher than ResNet50 and ViT models alone. The proposed algorithm can quickly and accurately detect the defect status of IC appearance and effectively improve the productivity of IC packaging and testing companies.
Object detection is an important and challenging problem in computer vision. Although the past decade has witnessed major advances in object detection in natural scenes, such successes have been slow to aerial imagery, not only because of the huge variation in the scale, orientation and shape of the object instances on the earth's surface, but also due to the scarcity of well-annotated datasets of objects in aerial scenes. To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, we introduce a large-scale Dataset for Object deTection in Aerial images (DOTA). To this end, we collect $2806$ aerial images from different sensors and platforms. Each image is of the size about 4000-by-4000 pixels and contains objects exhibiting a wide variety of scales, orientations, and shapes. These DOTA images are then annotated by experts in aerial image interpretation using $15$ common object categories. The fully annotated DOTA images contains $188,282$ instances, each of which is labeled by an arbitrary (8 d.o.f.) quadrilateral To build a baseline for object detection in Earth Vision, we evaluate state-of-the-art object detection algorithms on DOTA. Experiments demonstrate that DOTA well represents real Earth Vision applications and are quite challenging.