In this paper, we propose a mushroom detection, localization and 3D pose estimation algorithm using RGB-D data acquired from a low-cost consumer RGB-D sensor. We use the RGB and depth information for different purposes. From the RGB image, we first extract initial contour locations of the mushrooms and then provide both these initial contours and the original image to an active contour model for mushroom segmentation. The segmented mushrooms are then passed to a circular Hough transform, which detects each mushroom along with its center and radius. Once each mushroom's center position in the RGB image is known, we use the depth information to locate it in 3D space, i.e. in the world coordinate system. If depth is missing at the detected center of a mushroom, we estimate it from the nearest available depth measurement within that mushroom's radius. We also estimate the 3D pose of each mushroom using a pre-prepared upright mushroom model, via global registration followed by local refined registration. From the estimated 3D pose, we use only the rotation part, expressed as a quaternion, as the orientation of each mushroom. The estimated (X,Y,Z) positions, diameters and orientations of the mushrooms are used for robotic-picking applications. We carry out extensive experiments on both 3D-printed and real mushrooms, which show that our method achieves promising performance.
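To make the detection-and-depth-lookup step concrete, the following is a minimal Python sketch (not the authors' implementation): it detects circular caps with OpenCV's circular Hough transform, reads the depth at each detected center with a fallback to the nearest valid depth inside the cap radius, and back-projects to camera coordinates with a pinhole model. The intrinsics fx, fy, cx, cy, a metric depth map, and zero-valued missing depth are assumptions.

```python
import numpy as np
import cv2

def detect_and_localize(rgb, depth, fx, fy, cx, cy):
    """Illustrative sketch: circular Hough detection + depth lookup + back-projection."""
    gray = cv2.medianBlur(cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY), 5)
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=30,
                               param1=100, param2=40, minRadius=10, maxRadius=80)
    results = []
    if circles is None:
        return results
    ys, xs = np.ogrid[:depth.shape[0], :depth.shape[1]]
    for u, v, r in np.round(circles[0]).astype(int):
        z = depth[v, u]
        if z == 0:  # missing depth: use the nearest valid depth within the cap radius
            dist2 = (xs - u) ** 2 + (ys - v) ** 2
            mask = (dist2 <= r ** 2) & (depth > 0)
            if not mask.any():
                continue
            z = depth[mask][np.argmin(dist2[mask])]
        # pinhole back-projection to camera coordinates (assuming metric depth)
        results.append(((u - cx) * z / fx, (v - cy) * z / fy, float(z), int(r)))
    return results
```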
Anomaly detection among a large number of processes arises in many applications ranging from dynamic spectrum access to cybersecurity. In such problems one can often obtain noisy observations aggregated from a chosen subset of processes that conforms to a tree structure. The distribution of these observations, based on which the presence of anomalies is detected, may be only partially known. This gives rise to the need for a search strategy designed to account for both the sample complexity and the detection accuracy, as well as cope with statistical models that are known only up to some missing parameters. In this work we propose a sequential search strategy using two variations of the Generalized Local Likelihood Ratio statistic. Our proposed Hierarchical Dynamic Search (HDS) strategy is shown to be order-optimal with respect to the size of the search space and asymptotically optimal with respect to the detection accuracy. An explicit upper bound on the error probability of HDS is established for the finite sample regime. Extensive experiments are conducted, demonstrating the performance gains of HDS over existing methods.
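As a rough illustration of the kind of statistic involved — a generic generalized log-likelihood ratio in which the unknown (missing) parameter is replaced by its constrained maximum-likelihood estimate, not the exact HDS statistics — consider testing a positive mean shift of unknown size in a Gaussian model:

```python
import numpy as np

def gllr_positive_mean_shift(x, mu0=0.0, mu_min=0.5, sigma=1.0):
    """Generalized log-likelihood ratio for H1: mean >= mu_min vs H0: mean = mu0.
    The unknown anomalous mean is replaced by its constrained MLE (illustrative only)."""
    mu_hat = max(float(np.mean(x)), mu_min)                  # constrained MLE under H1
    ll_alt = -0.5 * np.sum((x - mu_hat) ** 2) / sigma ** 2   # log-likelihood under H1 (up to a constant)
    ll_null = -0.5 * np.sum((x - mu0) ** 2) / sigma ** 2     # log-likelihood under H0 (same constant)
    return ll_alt - ll_null
```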
Recent advances in deep learning and computer vision offer an excellent opportunity to investigate high-level visual analysis tasks such as human localization and human pose estimation. Although the performance of human localization and human pose estimation has improved significantly in recent reports, these methods are not perfect, and erroneous localizations and pose estimates are to be expected across video frames. Studies on integrating these techniques into a generic pipeline that is robust to the noise introduced by such errors are still lacking. This paper fills this gap. We explored and developed two working pipelines suited to visual-based positioning and pose estimation tasks. Analyses of the proposed pipelines were conducted on a badminton game. We showed that the tracking-by-detection concept works well, and that errors in position and pose can be handled effectively by a linear interpolation technique using information from nearby frames. The results showed that the Visual-based Positioning and Pose Estimation pipelines can deliver position and pose estimates with good spatial and temporal resolution.
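As a sketch of the interpolation idea (the exact handling in our pipelines may differ), missing or rejected keypoints can be filled by linear interpolation over time; the (T, K, 2) layout and the NaN convention for missing values are assumptions.

```python
import numpy as np

def interpolate_missing_keypoints(poses):
    """Fill NaN keypoint coordinates by linear interpolation along the time axis.
    poses: (T, K, 2) array with NaN where localization/pose estimation failed."""
    filled = poses.copy()
    t = np.arange(poses.shape[0])
    for k in range(poses.shape[1]):
        for d in range(poses.shape[2]):
            series = filled[:, k, d]
            valid = ~np.isnan(series)
            if not valid.any():
                continue  # nothing to interpolate from for this coordinate
            # np.interp clamps to the first/last valid value at the sequence edges
            filled[:, k, d] = np.interp(t, t[valid], series[valid])
    return filled
```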
The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images. Effectively extracting inter-image correspondence is therefore crucial for the CoSOD task. In this paper, we propose a global-and-local collaborative learning architecture, which includes a global correspondence modeling (GCM) module and a local correspondence modeling (LCM) module to capture comprehensive inter-image correspondence among different images from both global and local perspectives. Firstly, we treat different images as different time slices and use 3D convolution to integrate all intra-image features, which more fully extracts the global group semantics. Secondly, we design a pairwise correlation transformation (PCT) to explore similarity correspondence between pairs of images, and combine the multiple local pairwise correspondences to generate the local inter-image relationship. Thirdly, the inter-image relationships from the GCM and LCM are integrated through a global-and-local correspondence aggregation (GLA) module to explore more comprehensive inter-image collaboration cues. Finally, the intra- and inter-features are adaptively integrated by an intra-and-inter weighting fusion (AEWF) module to learn co-saliency features and predict the co-saliency map. The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model, trained on a small dataset (about 3k images), still outperforms eleven state-of-the-art competitors trained on much larger datasets (about 8k-200k images).
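A minimal PyTorch sketch of the idea behind the global branch — stacking the images of a group as time slices and fusing them with a 3D convolution — is given below; the channel size, kernel shape, and pooling step are assumptions rather than the exact GCM design.

```python
import torch
import torch.nn as nn

class GlobalGroupFusion(nn.Module):
    """Illustrative group fusion: 3D convolution over per-image features stacked along a time axis."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: (N, C, H, W) intra-image features for the N images of a query group
        x = feats.permute(1, 0, 2, 3).unsqueeze(0)  # (1, C, N, H, W): images as the temporal axis
        x = torch.relu(self.conv3d(x))              # mix information across the whole group
        return x.mean(dim=2)                        # pool over images -> (1, C, H, W) group semantics
```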
Ball 3D localization in team sports has various applications, including automatic offside detection in soccer and shot release localization in basketball. Today, this task is solved either with expensive multi-view setups or by restricting the analysis to ballistic trajectories. In this work, we propose to address the task on a single image from a calibrated monocular camera by estimating the ball diameter in pixels and using the known real ball diameter in meters. This approach is suitable for any game situation where the ball is (even partly) visible. To achieve this, we use a small neural network trained on image patches around candidates generated by a conventional ball detector. Besides predicting the ball diameter, our network outputs the confidence of having a ball in the image patch. Validation on 3 basketball datasets reveals that our model gives accurate predictions of the ball's 3D location. In addition, through its confidence output, our model improves the detection rate by filtering the candidates produced by the detector. The contributions of this work are (i) the first model to address 3D ball localization from a single image, (ii) an effective method for 3D ball annotation from single calibrated images, and (iii) a high-quality 3D ball evaluation dataset annotated from a single viewpoint. In addition, the code to reproduce this research is made freely available at //github.com/gabriel-vanzandycke/deepsport.
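The geometric step that turns a predicted pixel diameter into a 3D position follows directly from the pinhole model; below is a minimal sketch, assuming an intrinsic matrix K and a basketball diameter of roughly 0.24 m (both values are illustrative).

```python
import numpy as np

def ball_center_3d(u, v, diameter_px, K, ball_diameter_m=0.24):
    """Recover the ball center in camera coordinates from its pixel position,
    apparent diameter, and the known real diameter (pinhole approximation)."""
    fx = K[0, 0]
    z = fx * ball_diameter_m / diameter_px          # apparent size ≈ f * real size / depth
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray through the detected pixel center
    return ray * z                                  # scale the ray to the estimated depth
```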
Whole-body 3D human mesh estimation aims to reconstruct the 3D human body, hands, and face simultaneously. Although several methods have been proposed, accurate prediction of 3D hands, which consist of the 3D wrist and fingers, remains challenging for two reasons. First, the human kinematic chain has not been carefully considered when predicting the 3D wrists. Second, previous works utilize body features for the 3D fingers, although the body feature barely contains finger information. To resolve these limitations, we present Hand4Whole, which has two strong points over previous works. First, we design Pose2Pose, a module that utilizes joint features for 3D joint rotations. Using Pose2Pose, Hand4Whole utilizes hand MCP joint features to predict 3D wrists, as MCP joints largely contribute to 3D wrist rotations in the human kinematic chain. Second, Hand4Whole discards the body feature when predicting 3D finger rotations. Our Hand4Whole is trained in an end-to-end manner and produces much better 3D hand results than previous whole-body 3D human mesh estimation methods. The code is available at //github.com/mks0601/Hand4Whole_RELEASE.
This paper presents GoPose, a 3D skeleton-based human pose estimation system that uses WiFi devices at home. Our system leverages the WiFi signals reflected off the human body for 3D pose estimation. In contrast to prior systems that need specialized hardware or dedicated sensors, our system does not require the user to wear or carry any sensors and can reuse the WiFi devices that already exist in a home environment, enabling mass adoption. To realize such a system, we leverage the 2D AoA spectrum of the signals reflected from the human body together with deep learning techniques. In particular, the 2D AoA spectrum is proposed to locate different parts of the human body as well as to enable environment-independent pose estimation. Deep learning is incorporated to model the complex relationship between the 2D AoA spectra and the 3D skeletons of the human body for pose tracking. Our evaluation results show that GoPose achieves an accuracy of around 4.7 cm under various scenarios, including tracking unseen activities and operating under NLoS conditions.
Multi-camera vehicle tracking is one of the most complicated tasks in computer vision, as it involves distinct sub-tasks including vehicle detection, tracking, and re-identification. Despite the challenges, multi-camera vehicle tracking has immense potential in transportation applications, including speed, volume, origin-destination (O-D), and routing data generation. Several recent works have addressed the multi-camera tracking problem. However, most of the effort has gone towards improving accuracy on high-quality benchmark datasets, while disregarding lower camera resolutions, compression artifacts, and the overwhelming amount of computational power and time needed to carry out this task at the edge, which makes such systems prohibitive for large-scale, real-time deployment. In this work, we therefore shed light on practical issues that should be addressed in the design of a multi-camera tracking system in order to provide actionable and timely insights. Moreover, we propose a real-time, city-scale multi-camera vehicle tracking system that compares favorably to computationally intensive alternatives and handles real-world, low-resolution CCTV footage instead of idealized and curated video streams. To show its effectiveness, in addition to integration into the Regional Integrated Transportation Information System (RITIS), we participated in the 2021 NVIDIA AI City multi-camera tracking challenge, where our method ranked among the top five performers on the public leaderboard.
Leveraging line features to improve the localization accuracy of point-based visual-inertial SLAM (VINS) is gaining interest, as lines provide additional constraints on scene structure. However, real-time performance when incorporating line features into VINS has not been addressed. This paper presents PL-VINS, a real-time optimization-based monocular VINS method with point and line features, developed on top of the state-of-the-art point-based VINS-Mono \cite{vins}. We observe that current works use the LSD \cite{lsd} algorithm to extract line features; however, LSD is designed for scene shape representation rather than for the pose estimation problem, and its high computational cost makes it the bottleneck for real-time performance. In this paper, a modified LSD algorithm is presented, obtained by studying a hidden-parameter tuning and length-rejection strategy. The modified LSD runs at least three times as fast as LSD. Further, by representing space lines with Pl\"{u}cker coordinates, the residual error in line estimation is modeled in terms of the point-to-line distance, which is then minimized by iteratively updating the minimal four-parameter orthonormal representation of the Pl\"{u}cker coordinates. Experiments on a public benchmark dataset show that the localization error of our method is 12-16\% less than that of VINS-Mono at the same pose-update frequency. The source code of our method is available at: //github.com/cnqiangfu/PL-VINS.
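To illustrate the line residual, the sketch below projects the moment (normal) part of a Plücker line, expressed in the camera frame, to a 2D image line using the standard line projection matrix, and evaluates the point-to-line distance at the two observed segment endpoints. It is an illustrative reconstruction from the description above, not the PL-VINS source code.

```python
import numpy as np

def project_plucker_line(n_c, K):
    """Project the moment vector n_c of a Plücker line (camera frame) to a 2D line l = (l1, l2, l3)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    K_line = np.array([[fy, 0.0, 0.0],
                       [0.0, fx, 0.0],
                       [-fy * cx, -fx * cy, fx * fy]])
    return K_line @ n_c

def point_to_line_residual(l, p1, p2):
    """Point-to-line distances of the two observed endpoints p1, p2 (homogeneous pixel coordinates)."""
    norm = np.hypot(l[0], l[1])
    return np.array([np.dot(l, p1), np.dot(l, p2)]) / norm
```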
In recent years, substantial efforts have been devoted to developing methods for object detection in optical remote sensing images. However, current surveys of datasets and deep learning based methods for object detection in optical remote sensing images are not adequate. Moreover, most of the existing datasets have shortcomings: for example, the numbers of images and object categories are small, and the image diversity and variations are insufficient. These limitations greatly hinder the development of deep learning based object detection methods. In this paper, we first provide a comprehensive review of recent deep learning based object detection progress in both the computer vision and earth observation communities. Then, we propose a large-scale, publicly available benchmark for object DetectIon in Optical Remote sensing images, which we name DIOR. The dataset contains 23463 images and 192472 instances, covering 20 object classes. The proposed DIOR dataset 1) is large-scale in the number of object categories, object instances, and total images; 2) has a large range of object size variations, not only in terms of spatial resolution, but also in terms of inter- and intra-class size variability across objects; 3) holds big variations because the images are obtained under different imaging conditions, weather, seasons, and image quality; and 4) has high inter-class similarity and intra-class diversity. The proposed benchmark can help researchers to develop and validate their data-driven methods. Finally, we evaluate several state-of-the-art approaches on our DIOR dataset to establish a baseline for future research.
This work addresses the novel and challenging problem of estimating the full 3D hand shape and pose from a single RGB image. Most current methods for 3D hand analysis from monocular RGB images focus only on estimating the 3D locations of hand keypoints, which cannot fully express the 3D shape of the hand. In contrast, we propose a Graph Convolutional Neural Network (Graph CNN) based method to reconstruct a full 3D mesh of the hand surface that contains richer information about both 3D hand shape and pose. To train networks with full supervision, we create a large-scale synthetic dataset containing both ground-truth 3D meshes and 3D poses. When fine-tuning the networks on real-world datasets without 3D ground truth, we propose a weakly-supervised approach that leverages the depth map as weak supervision during training. Through extensive evaluations on our proposed new datasets and two public datasets, we show that our proposed method can produce accurate and reasonable 3D hand meshes, and can achieve superior 3D hand pose estimation accuracy compared with state-of-the-art methods.
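For intuition on the Graph CNN building block, here is a minimal sketch of a graph convolution over mesh vertices using a row-normalized adjacency matrix; the feature sizes and normalization choice are assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """Illustrative graph convolution: aggregate neighboring vertex features, then transform."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        # adjacency: (V, V) mesh connectivity; add self-loops and row-normalize
        A = adjacency + torch.eye(adjacency.shape[0])
        self.register_buffer("A_hat", A / A.sum(dim=1, keepdim=True))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: (B, V, in_dim) per-vertex features
        return torch.relu(self.linear(self.A_hat @ x))
```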