Generative models can be used to synthesize 3D objects of high quality and diversity. However, there is typically no control over the properties of the generated object. This paper proposes a novel generative adversarial network (GAN) setup that generates 3D point cloud shapes conditioned on a continuous parameter. In an exemplary application, we use this to guide the generative process to create a 3D object with a custom-fit shape. We formulate this generation process in a multi-task setting by using the concept of auxiliary classifier GANs. Further, we propose to sample the generator's label input for training from a kernel density estimation (KDE) of the dataset. Our ablations show that this leads to a significant performance increase in regions with few samples. Extensive quantitative and qualitative experiments show that we gain explicit control over the object dimensions while maintaining good generation quality and diversity.
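As a rough illustration of the KDE-based label sampling, the sketch below draws conditioning labels from a kernel density estimate fitted to the dataset's labels rather than from a uniform prior. The label values, bandwidth, and batch size are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Placeholder continuous shape labels (e.g., target object heights); the real
# labels and the bandwidth would come from the dataset and are assumptions here.
train_labels = np.random.uniform(0.5, 2.0, size=(1000, 1))

kde = KernelDensity(kernel="gaussian", bandwidth=0.1).fit(train_labels)

def sample_generator_labels(batch_size):
    """Draw conditioning labels from the dataset KDE instead of a uniform prior."""
    return kde.sample(batch_size)  # (batch_size, 1) array of labels

labels = sample_generator_labels(64)  # fed alongside the noise z into the generator
```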
Recently, subsampling or refining images generated by unconditional GANs has been actively studied as a way to improve overall image quality. Unfortunately, these methods are often observed to be less effective or inefficient when handling conditional GANs (cGANs), i.e., GANs conditioned on a class label (class-conditional GANs) or a continuous variable (continuous cGANs, or CcGANs). In this work, we introduce an effective and efficient subsampling scheme, named conditional density ratio-guided rejection sampling (cDR-RS), to sample high-quality images from cGANs. Specifically, we first develop a novel conditional density ratio estimation method, termed cDRE-F-cSP, by proposing the conditional Softplus (cSP) loss and an improved feature extraction mechanism. We then derive the error bound of a density ratio model trained with the cSP loss. Finally, we accept or reject a fake image based on its estimated conditional density ratio. A filtering scheme is also developed to increase the label consistency of fake images without losing diversity when sampling from CcGANs. We extensively test the effectiveness and efficiency of cDR-RS in sampling from both class-conditional GANs and CcGANs on five benchmark datasets. When sampling from class-conditional GANs, cDR-RS outperforms modern state-of-the-art methods by a large margin (except DRE-F-SP+RS) in terms of effectiveness. Although the effectiveness of cDR-RS is often comparable to that of DRE-F-SP+RS, cDR-RS is substantially more efficient. When sampling from CcGANs, the superiority of cDR-RS is even more noticeable in terms of both effectiveness and efficiency. Notably, at a reasonable computational cost, cDR-RS can substantially reduce the Label Score without decreasing the diversity of CcGAN-generated images, whereas other methods often trade away much diversity for a slightly improved Label Score.
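The core accept/reject step can be sketched as follows. Here `generate` and `density_ratio` are hypothetical stand-ins for the trained cGAN sampler and the cDRE-F-cSP ratio model, and the bound `ratio_max` is an assumed constant rather than anything specified in the abstract.

```python
import numpy as np

def cdr_rejection_sample(generate, density_ratio, label, n_keep, ratio_max=10.0):
    """Keep a fake sample x with probability density_ratio(x, label) / ratio_max,
    thinning regions where fakes are over-represented relative to real data."""
    kept = []
    while len(kept) < n_keep:
        x = generate(label)
        if np.random.rand() < min(density_ratio(x, label) / ratio_max, 1.0):
            kept.append(x)
    return kept
```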
We propose an unsupervised method for 3D geometry-aware representation learning of articulated objects. Though photorealistic images of articulated objects can be rendered with explicit pose control through existing 3D neural representations, these methods require ground-truth 3D poses and foreground masks for training, which are expensive to obtain. We obviate this need by learning the representations through GAN training: given random poses and latent vectors, the generator is trained adversarially to produce realistic images of articulated objects. To keep the computational cost of GAN training manageable, we propose an efficient neural representation for articulated objects based on tri-planes and then present a GAN-based framework for its unsupervised training. Experiments demonstrate the efficiency of our method and show that GAN-based training enables the learning of controllable 3D representations without supervision.
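To make the tri-plane idea concrete, here is a minimal feature lookup in PyTorch: 3D query points are projected onto three axis-aligned feature planes, and the bilinearly sampled features are summed. The resolution, channel count, and downstream decoder are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

C, R = 32, 64                       # feature channels and plane resolution (assumed)
planes = torch.randn(3, C, R, R)    # learnable XY, XZ, YZ feature planes

def triplane_features(xyz):
    """Project points in [-1, 1]^3 onto the three planes and sum bilinear samples."""
    projections = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
    feats = 0
    for plane, uv in zip(planes, projections):
        grid = uv.view(1, -1, 1, 2)                                # (1, N, 1, 2)
        sampled = F.grid_sample(plane[None], grid, align_corners=True)  # (1, C, N, 1)
        feats = feats + sampled.squeeze(0).squeeze(-1).t()         # (N, C)
    return feats

pts = torch.rand(128, 3) * 2 - 1    # query points in the normalized cube
features = triplane_features(pts)   # (128, 32), ready for a small MLP decoder
```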
In this work, we present point-level region contrast, a self-supervised pre-training approach for the task of object detection. This approach is motivated by the two key factors in detection: localization and recognition. While accurate localization favors models that operate at the pixel or point level, correct recognition typically relies on a more holistic, region-level view of objects. Incorporating this perspective into pre-training, our approach performs contrastive learning by directly sampling individual point pairs from different regions. Compared to an aggregated representation per region, our approach is more robust to changes in input region quality, and further enables us to implicitly improve initial region assignments via online knowledge distillation during training. Both advantages are important when dealing with the imperfect regions encountered in the unsupervised setting. Experiments show that point-level region contrast improves on state-of-the-art pre-training methods for object detection and segmentation across multiple tasks and datasets, and we provide extensive ablation studies and visualizations to aid understanding. Code will be made available.
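A hedged sketch of such a point-level contrastive objective is given below: points sharing a (possibly noisy) region id act as positives and all other points as negatives in an InfoNCE loss. The exact pair sampling and the online distillation of the paper are omitted.

```python
import torch
import torch.nn.functional as F

def point_region_contrast(feats, region_ids, tau=0.2):
    """InfoNCE over point pairs: same region id -> positive, otherwise negative."""
    feats = F.normalize(feats, dim=1)                  # (N, D) point embeddings
    sim = feats @ feats.t() / tau                      # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                  # drop self-pairs from the softmax
    pos = region_ids[:, None] == region_ids[None, :]
    pos.fill_diagonal_(False)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)     # keep only positive pairs
    loss = -pos_log_prob.sum(1) / pos.sum(1).clamp(min=1)
    return loss[pos.any(1)].mean()                     # anchors with >= 1 positive

# Toy usage with random embeddings and noisy region assignments:
loss = point_region_contrast(torch.randn(256, 128), torch.randint(0, 8, (256,)))
```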
Expensive annotation is notoriously the main constraint on the development of point cloud semantic segmentation techniques. Active learning methods endeavor to reduce this cost by selecting and labeling only a subset of the point clouds, yet previous attempts ignore the spatial-structural diversity of the selected samples, inducing the model to select clustered candidates with similar shapes in a local area while missing other representative ones in the global environment. In this paper, we propose a new 3D region-based active learning method to tackle this problem. Dubbed SSDR-AL, our method groups the original point clouds into superpoints and incrementally selects the most informative and representative ones for label acquisition. We achieve this selection mechanism via a graph reasoning network that considers both the spatial and structural diversity of superpoints. To deploy SSDR-AL in a more practical scenario, we design a noise-aware iterative labeling strategy to confront the "noisy annotation" problem introduced by the previous "dominant labeling" strategy for superpoints. Extensive experiments on two point cloud benchmarks demonstrate the effectiveness of SSDR-AL in the semantic segmentation task. In particular, SSDR-AL significantly outperforms the baseline method, reducing the annotation cost by up to 63.0% and 24.0% on the two benchmarks, respectively, when reaching 90% of fully supervised performance.
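The selection step can be caricatured as a greedy trade-off between informativeness and diversity. In the toy sketch below, the paper's graph reasoning network is replaced by plain prediction entropy and feature-space distance, purely for illustration; the weighting `alpha` is an assumption.

```python
import numpy as np

def select_superpoints(probs, feats, budget, alpha=0.5):
    """probs: (N, C) class posteriors per superpoint; feats: (N, D) descriptors."""
    entropy = -(probs * np.log(probs + 1e-8)).sum(1)      # informativeness
    selected = []
    for _ in range(budget):
        if selected:
            # diversity: distance to the nearest already-selected superpoint
            d = np.linalg.norm(feats[:, None] - feats[selected][None], axis=-1).min(1)
        else:
            d = np.ones(len(feats))
        score = alpha * entropy + (1 - alpha) * d
        score[selected] = -np.inf                         # never pick twice
        selected.append(int(score.argmax()))
    return selected
```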
Locating 3D objects from a single RGB image via Perspective-n-Points (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, so that 2D-3D point correspondences can be partly learned by backpropagating the gradient w.r.t. object pose. Yet, learning the entire set of unrestricted 2D-3D points from scratch fails to converge with existing approaches, since the deterministic pose is inherently non-differentiable. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose on the SE(3) manifold, essentially bringing categorical Softmax to the continuous domain. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle unifies the existing approaches and resembles the attention mechanism. EPro-PnP significantly outperforms competitive baselines, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation and nuScenes 3D object detection benchmarks.
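With a delta-distributed target, the KL loss reduces to the negative log-density of the ground-truth pose under p(y) ∝ exp(-cost(y)), the continuous analogue of a Softmax cross-entropy. A rough Monte Carlo version is sketched below; `reproj_cost` is a hypothetical stand-in for the weighted reprojection cost, and the plain sample average ignores the importance-sampling machinery a real implementation would need (the constant proposal-volume term is dropped, as it has zero gradient).

```python
import torch

def pose_nll(reproj_cost, gt_pose, pose_samples):
    """-log p(gt_pose) with p(y) proportional to exp(-cost(y));
    the normalizer is a Monte Carlo estimate over sampled poses."""
    cost_gt = reproj_cost(gt_pose)                         # scalar cost at ground truth
    costs = torch.stack([reproj_cost(p) for p in pose_samples])
    log_z = torch.logsumexp(-costs, 0) - torch.log(torch.tensor(float(len(pose_samples))))
    return cost_gt + log_z       # gradients flow into the 2D-3D points and weights
```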
Deep learning depends on large amounts of labeled training data. Manual labeling is expensive and represents a bottleneck, especially for tasks such as segmentation, where labels must be assigned down to the level of individual points. That challenge is even more daunting for 3D data: 3D point clouds contain millions of points per scene, and their accurate annotation is markedly more time-consuming. The situation is further aggravated by the added complexity of user interfaces for 3D point clouds, which slows down annotation even more. For the case of 2D image segmentation, interactive techniques have become common, where user feedback in the form of a few clicks guides a segmentation algorithm -- nowadays usually a neural network -- to achieve an accurate labeling with minimal effort. Surprisingly, interactive segmentation of 3D scenes has not been explored much. Previous work has attempted to obtain accurate 3D segmentation masks using human feedback from the 2D domain, which is only possible if correctly aligned images are available together with the 3D point cloud, and it involves switching between the 2D and 3D domains. Here, we present an interactive 3D object segmentation method in which the user interacts directly with the 3D point cloud. Importantly, our model does not require training data from the target domain: when trained on ScanNet, it performs well on several other datasets with different data characteristics as well as different object classes. Moreover, our method is orthogonal to supervised (instance) segmentation methods and can be combined with them to refine automatic segmentations with minimal human effort.
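As a loose illustration of how direct 3D user feedback might be fed to such a network, the sketch below encodes positive and negative clicks as per-point distance features that can be concatenated with the point coordinates. This encoding is an assumption for illustration, not the paper's exact design.

```python
import numpy as np

def encode_clicks(points, pos_clicks, neg_clicks):
    """points: (N, 3) cloud; clicks: lists of 3D coordinates. Returns (N, 2) features."""
    def min_dist(clicks):
        if len(clicks) == 0:
            return np.full(len(points), 1.0)               # "no click yet" sentinel
        d = np.linalg.norm(points[:, None] - np.asarray(clicks)[None], axis=-1)
        return d.min(axis=1)                               # distance to nearest click
    return np.stack([min_dist(pos_clicks), min_dist(neg_clicks)], axis=1)

# Toy usage: one positive click on the object, one negative click on background.
cloud = np.random.rand(1024, 3)
click_feats = encode_clicks(cloud, [[0.5, 0.5, 0.5]], [[0.0, 0.0, 0.0]])
```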
Generative adversarial networks (GANs) have been extensively studied in the past few years. Arguably, the most revolutionary techniques are in the area of computer vision, such as plausible image generation, image-to-image translation, facial attribute manipulation, and similar applications. Despite the significant success achieved in the computer vision field, applying GANs to real-world problems still poses three main challenges: (1) high-quality image generation; (2) diverse image generation; and (3) stable training. Considering the numerous GAN-related studies in the literature, we provide a study of the architecture variants and loss variants that have been proposed to handle these three challenges. We classify the most popular GANs by loss and architecture variants, and discuss potential improvements with a focus on these two aspects. While several reviews of GANs have been presented, no work has focused on reviewing GAN variants based on how they handle the challenges mentioned above. In this paper, we review and critically discuss 7 architecture-variant GANs and 9 loss-variant GANs for remedying those three challenges. The objective of this review is to provide insight into how current GAN research pursues performance improvement. Code related to the GAN variants studied in this work is summarized at https://github.com/sheqi/GAN_Review.
Distant supervision can effectively label data for relation extraction, but it suffers from the noisy labeling problem. Recent works mainly apply soft, bag-level noise reduction strategies to find the relatively better samples within a sentence bag, which is suboptimal compared with making hard decisions about false positive samples at the sentence level. In this paper, we introduce an adversarial learning framework, which we name DSGAN, to learn a sentence-level true-positive generator. Inspired by generative adversarial networks, we regard the positive samples generated by the generator as negative samples for training the discriminator. The optimal generator is obtained when the discrimination ability of the discriminator declines the most. We use the generator to filter the distant supervision training dataset and redistribute the false positive instances into the negative set, thereby providing a cleaned dataset for relation classification. The experimental results show that the proposed strategy significantly improves the performance of distant supervision relation extraction compared to state-of-the-art systems.
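The final filtering step amounts to a hard sentence-level decision. A toy version is sketched below, where `generator_score` is a hypothetical trained generator returning the probability that a sentence expresses a true positive relation, and the threshold is an assumption.

```python
def clean_dataset(positive_bag, generator_score, threshold=0.5):
    """Split a distantly supervised positive bag into cleaned positives and
    redistributed negatives based on the generator's true-positive score."""
    cleaned_pos, new_neg = [], []
    for sentence in positive_bag:
        if generator_score(sentence) >= threshold:
            cleaned_pos.append(sentence)       # likely true positive
        else:
            new_neg.append(sentence)           # likely noisy label -> negative set
    return cleaned_pos, new_neg
```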
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristics of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we propose an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we use a spectral-spatial generator and a discriminator to identify the land cover categories of hyperspectral cubes. Moreover, to take advantage of the large amount of unlabeled data, we adopt a conditional random field to refine the preliminary classification results generated by the GANs. Experimental results obtained on two commonly studied datasets demonstrate that the proposed framework achieves encouraging classification accuracy using a small amount of training data.
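As a stand-in for the CRF refinement step, the sketch below runs a simple mean-field-style smoothing pass in which each pixel's class probabilities are mixed with those of its spatial neighbors. This is an illustrative simplification, not the paper's CRF formulation; the mixing weight and iteration count are assumptions.

```python
import numpy as np

def refine_with_neighbors(unary, n_iters=5, w=0.5):
    """unary: (H, W, C) class probabilities from the GAN-based classifier."""
    q = unary.copy()
    for _ in range(n_iters):
        pad = np.pad(q, ((1, 1), (1, 1), (0, 0)), mode="edge")
        # average of the 4-connected neighborhood at every pixel
        neighbors = (pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] + pad[1:-1, 2:]) / 4
        q = (1 - w) * unary + w * neighbors
        q /= q.sum(-1, keepdims=True)          # renormalize to probabilities
    return q
```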