In this paper, we study the mathematical imaging problem of diffraction tomography (DT), an inverse scattering technique used to determine material properties of an object by illuminating it with probing waves and recording the scattered waves. Conventional DT relies on the Fourier diffraction theorem, which is applicable only under the condition of weak scattering. However, if the object has high contrast or is large relative to the wavelength, it tends to produce multiple scattering, which complicates the reconstruction. We give a survey of diffraction tomography and compare the reconstruction of low- and high-contrast objects. We also implement and compare reconstruction using the full waveform inversion method, which, unlike the Born and Rytov approximations, works with the total field and is more robust to multiple scattering.
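For reference, a common 2D statement of the Fourier diffraction theorem under the Born approximation is sketched below (plane-wave incidence along the $y$-axis, measurements on the line $y = r_M$, wave number $k_0$); the exact normalization depends on the chosen conventions and is indicative only:
\[
\widehat{u_{\mathrm{B}}}(\kappa) \;=\; \frac{i\, e^{i\sqrt{k_0^2-\kappa^2}\, r_M}}{2\sqrt{k_0^2-\kappa^2}}\; \widehat{f}\!\left(\kappa,\ \sqrt{k_0^2-\kappa^2}-k_0\right), \qquad |\kappa| < k_0,
\]
i.e., the 1D Fourier transform of the scattered (Born) field measured on a line equals, up to a known factor, the Fourier transform of the scattering potential $f$ on a semicircular arc (half of the Ewald circle). This is precisely the weak-scattering assumption that full waveform inversion avoids by working with the total field.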
We address the problem of multi-person 3D body pose and shape estimation from a single image. While this problem can be addressed by applying single-person approaches multiple times to the same scene, recent works have shown the advantages of building upon deep architectures that simultaneously reason about all people in the scene in a holistic manner, e.g., by enforcing depth-order constraints or minimizing interpenetration among reconstructed bodies. However, existing approaches are still unable to capture the size variability of people caused by the inherent ambiguity between body scale and depth. In this work, we tackle this challenge by devising a novel optimization scheme that learns the appropriate body scale and relative camera pose by enforcing that the feet of all people remain on the ground plane. A thorough evaluation on the MuPoTS-3D and 3DPW datasets demonstrates that our approach robustly estimates the body translation and shape of multiple people while retrieving their spatial arrangement, consistently improving on the current state of the art, especially in scenes with people of very different heights.
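To make the ground-contact idea concrete, here is a minimal PyTorch sketch (not the authors' optimization scheme); the per-person camera rays, foot joints, and single contact loss are illustrative assumptions.

import torch

# Minimal sketch: jointly optimize per-person scale and depth along the camera ray
# so that all feet touch the plane y = 0 (y-up world). All quantities are toy values.
P = 3
rays = torch.nn.functional.normalize(torch.randn(P, 3), dim=1)           # per-person camera rays
feet_local = torch.randn(P, 2, 3) * 0.1 - torch.tensor([0.0, 0.9, 0.0])  # feet below the body root
scale = torch.ones(P, requires_grad=True)                                # per-person body scale
depth = torch.full((P,), 5.0, requires_grad=True)                        # per-person depth along ray

opt = torch.optim.Adam([scale, depth], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    root = depth[:, None] * rays                                         # (P, 3) body roots in world
    feet_world = root[:, None, :] + scale[:, None, None] * feet_local    # (P, 2, 3)
    loss = (feet_world[..., 1] ** 2).mean()                              # feet should lie on y = 0
    loss.backward()
    opt.step()

Because each person is placed along a fixed camera ray, scale and depth trade off against each other, and the ground-contact term is what resolves the ambiguity.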
We show that reconstructing a curve in $\mathbb{R}^d$ for $d\geq 2$ from a $0.66$-sample is always possible using an algorithm similar to the classical NN-Crust algorithm. Previously, this was only known to be possible for $0.47$-samples in $\mathbb{R}^2$ and $\frac{1}{3}$-samples in $\mathbb{R}^d$ for $d\geq 3$. In addition, we show that there is not always a unique way to reconstruct a curve from a $0.72$-sample; this was previously only known for $1$-samples. We also extend this non-uniqueness result to hypersurfaces in all higher dimensions.
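For readers unfamiliar with the classical algorithm, a minimal NumPy sketch of NN-Crust (nearest-neighbor edges followed by "half-neighbor" edges) is given below; it is illustrative only and not the modified algorithm analyzed for $0.66$-samples.

import numpy as np

def nn_crust(points):
    """Classical NN-Crust curve reconstruction.

    points: (n, d) array of samples of a closed curve.
    Returns a set of undirected edges (i, j).
    """
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)

    # Step 1: connect every point to its nearest neighbor.
    nn = d2.argmin(axis=1)
    edges = {tuple(sorted((i, int(nn[i])))) for i in range(n)}

    # Step 2: for points with fewer than two incident edges, add the closest
    # "half-neighbor", i.e. the nearest point making an angle > 90 degrees
    # with the existing nearest-neighbor edge.
    deg = {i: 0 for i in range(n)}
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i in range(n):
        if deg[i] >= 2:
            continue
        q = points[int(nn[i])] - points[i]
        best, best_d = None, np.inf
        for r in range(n):
            if r != i and np.dot(q, points[r] - points[i]) < 0 and d2[i, r] < best_d:
                best, best_d = r, d2[i, r]
        if best is not None:
            edges.add(tuple(sorted((i, best))))
    return edges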
We develop novel methods for using persistent homology to infer the homology of an unknown Riemannian manifold $(M, g)$ from a point cloud sampled from an arbitrary smooth probability density function. Standard distance-based filtered complexes, such as the \v{C}ech complex, often have trouble distinguishing noise from features that are simply small. We address this problem by defining a family of "density-scaled filtered complexes" that includes a density-scaled \v{C}ech complex and a density-scaled Vietoris--Rips complex. We show that the density-scaled \v{C}ech complex is homotopy-equivalent to $M$ for filtration values in an interval whose starting point converges to $0$ in probability as the number of points $N \to \infty$ and whose ending point approaches infinity as $N \to \infty$. By contrast, the standard \v{C}ech complex may only be homotopy-equivalent to $M$ for a very small range of filtration values. The density-scaled filtered complexes also have the property that they are invariant under conformal transformations, such as scaling. We implement a filtered complex $\widehat{DVR}$ that approximates the density-scaled Vietoris--Rips complex, and we empirically test the performance of our implementation. As examples, we use $\widehat{DVR}$ to identify clusters that have different densities, and we apply $\widehat{DVR}$ to a time-delay embedding of the Lorenz dynamical system. Our implementation is stable (under conditions that are almost surely satisfied) and designed to handle outliers in the point cloud that do not lie on $M$.
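A rough illustration of the density-scaling idea is sketched below; the k-NN density estimate and the conformal factor $(f(x_i)f(x_j))^{1/(2d)}$ applied to each pairwise distance are assumptions for illustration, not the paper's $\widehat{DVR}$ construction.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from ripser import ripser  # any persistent-homology backend accepting a distance matrix works

def density_scaled_rips_diagrams(X, k=15, maxdim=1):
    """Persistence diagrams of a crudely density-rescaled Vietoris--Rips filtration.

    X: (N, d) point cloud. Local density is estimated from the k-th nearest
    neighbor distance, and each pairwise distance is conformally rescaled so
    that dense and sparse regions are treated on a comparable scale.
    """
    N, d = X.shape
    D = squareform(pdist(X))
    rk = np.sort(D, axis=1)[:, k]                 # k-NN radius of each point
    dens = k / (N * rk ** d + 1e-12)              # crude k-NN density estimate
    scale = (np.outer(dens, dens)) ** (1.0 / (2 * d))
    D_scaled = D * scale                          # density-scaled distances
    return ripser(D_scaled, maxdim=maxdim, distance_matrix=True)["dgms"]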
Spinning LiDAR data are prevalent in 3D perception tasks, yet their cylindrical image form is less studied. Conventional approaches treat scans as point clouds and either rely on expensive Euclidean 3D nearest-neighbor search for data association or depend on projected range images for further processing. We revisit LiDAR scan formation and present a cylindrical range image representation for raw scan data, equipped with an efficient calibrated spherical projective model. With our formulation, we 1) collect a large dataset of LiDAR data consisting of both indoor and outdoor sequences accompanied by pseudo-ground-truth poses; 2) evaluate projective and conventional registration approaches on the sequences with both synthetic and real-world transformations; and 3) transfer state-of-the-art RGB-D algorithms to LiDAR that run at up to 180 Hz for registration and 150 Hz for dense reconstruction. The dataset and tools will be released.
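The cylindrical projection itself is simple; below is a minimal NumPy sketch. Sensor-specific calibration (e.g., per-ring elevation angles) is omitted, and the image size and field of view are assumed values.

import numpy as np

def to_cylindrical_range_image(points, height=64, width=1024,
                               fov_up=np.deg2rad(15.0), fov_down=np.deg2rad(-15.0)):
    """Project an (N, 3) LiDAR scan into a (height, width) range image.

    Columns index azimuth, rows index elevation; each pixel stores range.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)                                  # [-pi, pi)
    elevation = np.arcsin(z / np.maximum(r, 1e-9))

    col = ((azimuth + np.pi) / (2 * np.pi) * width).astype(int) % width
    row = (fov_up - elevation) / (fov_up - fov_down) * (height - 1)
    row = np.clip(row.astype(int), 0, height - 1)

    image = np.zeros((height, width), dtype=np.float32)
    image[row, col] = r                                         # last-write-wins per pixel
    return image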
Recently, data-driven single-view reconstruction methods have shown great progress in modeling 3D dressed humans. However, such methods suffer heavily from the depth ambiguities and occlusions inherent to single-view inputs. In this paper, we tackle this problem by considering a small set of input views and investigate the best strategy to suitably exploit information from these views. We propose a data-driven end-to-end approach that reconstructs an implicit 3D representation of dressed humans from sparse camera views. Specifically, we introduce three key components: first, a spatially consistent reconstruction that allows for arbitrary placement of the person in the input views using a perspective camera model; second, an attention-based fusion layer that learns to aggregate visual information from several viewpoints; and third, a mechanism that encodes local 3D patterns under the multi-view context. In the experiments, we show that the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively. To demonstrate the spatially consistent reconstruction, we apply our approach to dynamic scenes. Additionally, we apply our method to real data acquired with a multi-camera platform and demonstrate that it obtains results comparable to multi-view stereo with dramatically fewer views.
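To make the fusion step concrete, here is a minimal PyTorch sketch of attention-weighted aggregation of per-view features at a query point; the feature dimension and scoring network are assumptions, not the paper's exact layer.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse per-view features (V views, C channels) into a single feature."""
    def __init__(self, channels=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(), nn.Linear(channels, 1))

    def forward(self, feats):                          # feats: (B, V, C)
        w = torch.softmax(self.score(feats), dim=1)    # (B, V, 1) per-view attention weights
        return (w * feats).sum(dim=1)                  # (B, C) fused feature

fused = AttentionFusion(256)(torch.randn(4, 3, 256))   # 4 query points, 3 views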
In this paper, we present an alternative formulation of the exact inverse formula for the Radon transform on circle arcs arising in a modality of Compton Scattering Tomography in translational geometry proposed by Webber and Miller (Inverse Problems 36(2), 025007, 2020). The original study proposes a first reconstruction method based on the theory of Volterra integral equations. The numerical realization of such an inverse formula can be difficult, mainly due to stability issues. Here, we provide a suitable formulation for exact inversion that can be implemented straightforwardly in the Fourier domain. Simulations are carried out to illustrate the efficiency of the proposed reconstruction algorithm.
3D Morphable Model (3DMM) based methods have achieved great success in recovering 3D face shapes from single-view images. However, the facial textures recovered by such methods lack the fidelity exhibited in the input images. Recent work demonstrates high-quality facial texture recovery using generative networks trained on a large-scale database of high-resolution UV maps of face textures, which is hard to prepare and not publicly available. In this paper, we introduce a method to reconstruct 3D facial shapes with high-fidelity textures from single-view in-the-wild images, without the need to capture a large-scale face texture database. The main idea is to refine the initial texture generated by a 3DMM-based method with facial details from the input image. To this end, we propose to use graph convolutional networks to reconstruct the detailed colors for the mesh vertices instead of reconstructing the UV map. Experiments show that our method generates high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
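As a generic stand-in for the kind of graph convolution involved (not the paper's architecture), here is a minimal PyTorch sketch of symmetric-normalized GCN layers over the mesh adjacency predicting per-vertex RGB; the toy sizes and feature dimensions are assumptions.

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat @ H @ W), with A_hat the normalized adjacency."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.lin = nn.Linear(in_ch, out_ch)

    def forward(self, H, A_hat):                 # H: (V, in_ch), A_hat: (V, V)
        return torch.relu(self.lin(A_hat @ H))

def normalize_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2} for a dense (V, V) mesh adjacency."""
    A = A + torch.eye(A.shape[0])
    d = A.sum(dim=1)
    D = torch.diag(d.pow(-0.5))
    return D @ A @ D

# Toy sizes; in practice V is the mesh vertex count and A the mesh adjacency matrix.
V, F = 500, 128
A_hat = normalize_adjacency(torch.zeros(V, V))
g1, g2, head = GCNLayer(F, 64), GCNLayer(64, 64), nn.Linear(64, 3)
feats = torch.randn(V, F)                        # per-vertex features sampled from the image
colors = torch.sigmoid(head(g2(g1(feats, A_hat), A_hat)))   # (V, 3) RGB in [0, 1]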
Semantic reconstruction of indoor scenes refers to both scene understanding and object reconstruction. Existing works either address one part of this problem or focus on independent objects. In this paper, we bridge the gap between understanding and reconstruction, and propose an end-to-end solution to jointly reconstruct room layout, object bounding boxes and meshes from a single image. Instead of separately resolving scene understanding and object reconstruction, our method builds upon a holistic scene context and adopts a coarse-to-fine hierarchy with three components: 1. room layout with camera pose; 2. 3D object bounding boxes; 3. object meshes. We argue that understanding the context of each component can assist the task of parsing the others, which enables joint understanding and reconstruction. Experiments on the SUN RGB-D and Pix3D datasets demonstrate that our method consistently outperforms existing methods in indoor layout estimation, 3D object detection and mesh reconstruction.
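A highly simplified PyTorch sketch of the coarse-to-fine conditioning structure (the feature extractor, output dimensions, and linear heads are placeholders, not the paper's networks):

import torch
import torch.nn as nn

class HolisticSceneHeads(nn.Module):
    """Coarse-to-fine: layout + camera -> 3D boxes -> meshes, each conditioned on earlier outputs."""
    def __init__(self, feat_dim=2048, n_obj=8, n_verts=642):
        super().__init__()
        self.layout_head = nn.Linear(feat_dim, 7)                     # layout box + camera angles (toy)
        self.box_head = nn.Linear(feat_dim + 7, n_obj * 7)            # per-object center, size, yaw
        self.mesh_head = nn.Linear(feat_dim + 7 + n_obj * 7, n_obj * n_verts * 3)  # coarse vertex offsets

    def forward(self, img_feat):                                      # img_feat: (B, feat_dim)
        layout = self.layout_head(img_feat)
        boxes = self.box_head(torch.cat([img_feat, layout], dim=1))
        meshes = self.mesh_head(torch.cat([img_feat, layout, boxes], dim=1))
        return layout, boxes, meshes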
We present a unified framework tackling two problems: class-specific 3D reconstruction from a single image, and generation of new 3D shape samples. These tasks have received considerable attention recently; however, existing approaches rely on 3D supervision, annotation of 2D images with keypoints or poses, and/or training with multiple views of each object instance. Our framework is very general: it can be trained in similar settings to these existing approaches, while also supporting weaker supervision scenarios. Importantly, it can be trained purely from 2D images, without ground-truth pose annotations, and with a single view per instance. We employ meshes as an output representation, instead of the voxels used in most prior work. This allows us to exploit shading information during training, which previous 2D-supervised methods cannot do. Thus, our method can learn to generate and reconstruct concave object classes. We evaluate our approach on synthetic data in various settings, showing that (i) it learns to disentangle shape from pose; (ii) using shading in the loss improves performance; (iii) our model is comparable or superior to state-of-the-art voxel-based approaches on quantitative metrics, while producing results that are visually more pleasing; (iv) it still performs well when given weaker supervision than in prior works.
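As a concrete example of using shading in a reconstruction loss, here is a minimal Lambertian sketch in PyTorch; the directional light, grayscale image, and pre-rendered normal map are simplifying assumptions and may differ from the paper's shading model.

import torch

def lambertian_shading_loss(normals, albedo, light_dir, image, mask):
    """L1 loss between a Lambertian rendering and the observed image.

    normals:   (H, W, 3) unit normals rendered from the predicted mesh.
    albedo:    (H, W) predicted grayscale albedo.
    light_dir: (3,) unit light direction.
    image:     (H, W) observed grayscale image.
    mask:      (H, W) silhouette of the rendered object.
    """
    shading = torch.clamp((normals * light_dir).sum(-1), min=0.0)   # n . l, clamped to [0, inf)
    rendered = albedo * shading
    return (torch.abs(rendered - image) * mask).sum() / mask.sum().clamp(min=1.0)

Because shading depends on surface normals rather than only on the silhouette, such a term provides a supervision signal inside concave regions that silhouette-only 2D losses cannot see.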
The limited capture range of optimization-based 2D/3D image registration methods, and their requirement for high-quality initialization, can significantly degrade the performance of 3D image reconstruction and motion compensation pipelines. Challenging clinical imaging scenarios that contain significant subject motion, such as fetal in-utero imaging, complicate the 3D image and volume reconstruction process. In this paper, we present a learning-based image registration method capable of predicting 3D rigid transformations of arbitrarily oriented 2D image slices with respect to a learned canonical atlas coordinate system. Only image slice intensity information is used to perform registration and canonical alignment; no spatial transform initialization is required. To find image transformations, we utilize a Convolutional Neural Network (CNN) architecture to learn the regression function that maps 2D image slices to the 3D canonical atlas space. We extensively evaluate the effectiveness of our approach quantitatively on simulated Magnetic Resonance Imaging (MRI) fetal brain imagery with synthetic motion, and further demonstrate qualitative results on real fetal MRI data where our method is integrated into a full reconstruction and motion compensation pipeline. Our learning-based registration achieves an average spatial prediction error of 7 mm on simulated data and produces qualitatively improved reconstructions for heavily moving fetuses with gestational ages of approximately 20 weeks. Our model provides a general and computationally efficient solution to the 2D/3D registration initialization problem and is suitable for real-time scenarios.
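For illustration, a minimal PyTorch sketch of a CNN that regresses a rigid transform (3 translations plus a unit quaternion) from a single 2D slice; the backbone and output parametrization are assumptions, not the paper's network.

import torch
import torch.nn as nn

class SliceToAtlasRegressor(nn.Module):
    """Regress a 3D rigid transform (translation + unit quaternion) from one 2D slice."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(128, 7)                  # (tx, ty, tz, qw, qx, qy, qz)

    def forward(self, slice_img):                      # slice_img: (B, 1, H, W)
        x = self.features(slice_img).flatten(1)
        out = self.head(x)
        t, q = out[:, :3], out[:, 3:]
        q = q / q.norm(dim=1, keepdim=True).clamp(min=1e-8)   # project to a unit quaternion
        return t, q

t, q = SliceToAtlasRegressor()(torch.randn(2, 1, 96, 96))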