
3D reconstruction aims to reconstruct 3D objects from 2D views. Previous works on 3D reconstruction mainly focus on feature matching between views or on using CNNs as backbones. Recently, Transformers have been shown to be effective in multiple applications of computer vision. However, whether Transformers can be used for 3D reconstruction remains unclear. In this paper, we fill this gap by proposing 3D-RETR, which performs end-to-end 3D REconstruction with TRansformers. 3D-RETR first uses a pretrained Transformer to extract visual features from 2D input images. It then uses a Transformer decoder to obtain voxel features, and a CNN decoder takes the voxel features as input to produce the reconstructed objects. 3D-RETR is capable of 3D reconstruction from a single view or from multiple views. Experimental results on two datasets show that 3D-RETR reaches state-of-the-art performance on 3D reconstruction. An additional ablation study also demonstrates that 3D-RETR benefits from using Transformers.
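
As a rough illustration of the pipeline described above, the sketch below wires together a Transformer image encoder, a Transformer decoder over learned voxel queries, and a 3D CNN decoder. It is not the authors' implementation: the module sizes, the 16x16 patch embedding, the 8^3 query grid, and the 32^3 output resolution are assumptions made for the sketch, and positional embeddings are omitted for brevity.

```python
# Minimal sketch of a 3D-RETR-style encoder-decoder (illustrative, not the authors' code).
import torch
import torch.nn as nn

class Sketch3DRETR(nn.Module):
    def __init__(self, d_model=512, num_queries=8 ** 3, voxel_res=32):
        super().__init__()
        # Image encoder: a small ViT-like stand-in for the pretrained Transformer.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Transformer decoder: learned voxel queries attend to the image features.
        self.voxel_queries = nn.Parameter(torch.randn(num_queries, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        # CNN decoder: upsample the 8^3 query grid to the final 32^3 occupancy grid.
        self.cnn_decoder = nn.Sequential(
            nn.ConvTranspose3d(d_model, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 1, 3, padding=1),
        )

    def forward(self, images):                       # images: (B, V, 3, 224, 224)
        b, v = images.shape[:2]
        x = self.patch_embed(images.flatten(0, 1))   # (B*V, d, 14, 14)
        x = x.flatten(2).transpose(1, 2)             # (B*V, 196, d)
        feats = self.encoder(x).reshape(b, -1, x.size(-1))  # concat views: (B, V*196, d)
        q = self.voxel_queries.unsqueeze(0).expand(b, -1, -1)
        vox = self.decoder(q, feats)                 # (B, 512, d)
        vox = vox.transpose(1, 2).reshape(b, -1, 8, 8, 8)
        return torch.sigmoid(self.cnn_decoder(vox))  # (B, 1, 32, 32, 32)
```

In this sketch, tokens from all input views are simply concatenated before the voxel queries attend to them, which is one straightforward way to keep the decoder agnostic to the number of views; the paper's own view-fusion strategy may differ.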

Related content

In computer vision, 3D reconstruction refers to the process of recovering 3D information from single-view or multi-view images. Because a single view provides incomplete information, 3D reconstruction from a single view has to rely on prior knowledge. Multi-view 3D reconstruction (analogous to human binocular localization) is comparatively easier: the cameras are first calibrated, i.e., the relationship between the camera's image coordinate system and the world coordinate system is computed, and 3D information is then reconstructed from the information in multiple 2D images. Object 3D reconstruction is a common scientific problem and core technology in computer-aided geometric design (CAGD), computer graphics (CG), computer animation, computer vision, medical image processing, scientific computing, virtual reality, digital media creation, and related fields. There are two main ways to generate a 3D representation of an object in a computer. One is to use geometric modeling software to build a human-controlled 3D geometric model through human-computer interaction; the other is to acquire the geometric shape of a real object by some measurement means. The former is technologically mature and supported by a number of software packages, such as 3DMAX, Maya, AutoCAD, and UG, which generally represent geometric shapes with curves and surfaces that have mathematical expressions. The latter is usually called the 3D reconstruction process: 3D reconstruction refers to the mathematical process and computer techniques for recovering the 3D information of an object (its shape, etc.) from 2D projections, including steps such as data acquisition, preprocessing, point cloud registration, and feature analysis.
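
The calibration step mentioned above is usually formalized with the pinhole camera model, which relates world coordinates to pixel coordinates; the equation below is the standard textbook form rather than anything specific to a particular reconstruction method.

```latex
% Pinhole camera model: relation between world and pixel coordinates after calibration.
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = K \,[\,R \mid t\,]
    \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix},
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
```

Here (u, v) are pixel coordinates, (X_w, Y_w, Z_w) are world coordinates, K holds the intrinsic parameters, [R | t] is the extrinsic rotation and translation estimated by calibration, and s is an arbitrary scale factor; multi-view reconstruction then recovers 3D points that are consistent with several such projections.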

We consider the reconstruction problem of video compressive sensing (VCS) under the deep unfolding/unrolling structure, aiming to build a flexible and concise model with a minimum number of stages. Unlike existing deep unfolding networks for inverse problems, where more stages are used for higher performance but without flexibility to different masks and scales, we show that a 2-stage deep unfolding network can achieve state-of-the-art (SOTA) results in VCS (with a 1.7 dB gain in PSNR over the single-stage model, RevSCI). Thanks to the advantages of deep unfolding, the proposed method adapts to new masks and readily scales to larger data without any additional training. Furthermore, we extend the proposed model to color VCS to perform joint reconstruction and demosaicing. Experimental results demonstrate that our 2-stage model also achieves SOTA on color VCS reconstruction, leading to a >2.3 dB gain in PSNR over the previous SOTA algorithm based on the plug-and-play framework, while speeding up reconstruction by more than 17 times. In addition, we find that our network is also flexible with respect to mask modulation and spatial size for color VCS reconstruction, so that a single trained network can be applied to different hardware systems. The code and models will be released to the public.
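
To make the unfolding structure concrete, here is a minimal sketch of a 2-stage unfolding loop for the standard VCS forward model, where a 2D measurement is a mask-modulated sum of video frames: each stage alternates a data-fidelity gradient step with a learned denoiser. The tiny 3D-CNN denoiser, the learned step sizes, and the back-projection initialization are placeholders, not the network proposed in the paper.

```python
# Minimal sketch of a 2-stage deep unfolding network for video compressive sensing
# (illustrative only; masks are binary modulation patterns, y is the coded snapshot).
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Learned prior for one unfolding stage (a stand-in for the paper's denoiser)."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, 1, 3, padding=1),
        )

    def forward(self, x):          # x: (B, 1, T, H, W)
        return x + self.net(x)     # residual refinement

class TwoStageUnfolding(nn.Module):
    def __init__(self, stages=2):
        super().__init__()
        self.denoisers = nn.ModuleList(TinyDenoiser() for _ in range(stages))
        self.step = nn.Parameter(torch.ones(stages))   # learned step sizes

    def forward(self, y, masks):
        # y: (B, H, W) coded measurement; masks: (B, T, H, W) modulation masks.
        # Initialization by mask-weighted back-projection.
        x = y.unsqueeze(1) * masks / masks.sum(dim=1, keepdim=True).clamp(min=1)
        for k, denoiser in enumerate(self.denoisers):
            residual = y - (masks * x).sum(dim=1)                  # data-fidelity residual
            x = x + self.step[k] * masks * residual.unsqueeze(1)   # gradient step
            x = denoiser(x.unsqueeze(1)).squeeze(1)                # learned prior step
        return x                                                   # (B, T, H, W)
```

Because the masks enter only through the forward model inside the loop, the same trained stages can in principle be applied under different masks and frame sizes, which is the flexibility the abstract refers to.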

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining. We will release code and pretrained checkpoints.
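
The multiview idea can be illustrated with a toy two-view encoder: two Transformer encoders operate on tubelet tokens of different temporal granularity, and a lateral linear connection passes a summary of the fine view into the coarse view before classification. The sizes, the single-direction fusion, and the mean-pooled lateral summary are simplifying assumptions; MTV itself uses more views and richer fusion, and positional embeddings are omitted here.

```python
# Minimal sketch of a two-view video encoder with a lateral connection (illustrative).
import torch
import torch.nn as nn

class TwoViewVideoEncoder(nn.Module):
    def __init__(self, d_fine=128, d_coarse=256, num_classes=400):
        super().__init__()
        # Tubelet embeddings with different temporal granularity (assumed sizes).
        self.embed_fine = nn.Conv3d(3, d_fine, kernel_size=(2, 16, 16), stride=(2, 16, 16))
        self.embed_coarse = nn.Conv3d(3, d_coarse, kernel_size=(8, 16, 16), stride=(8, 16, 16))
        self.enc_fine = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_fine, nhead=4, batch_first=True), num_layers=2)
        self.enc_coarse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_coarse, nhead=4, batch_first=True), num_layers=2)
        self.lateral = nn.Linear(d_fine, d_coarse)   # fuse fine view into coarse view
        self.head = nn.Linear(d_coarse, num_classes)

    def forward(self, video):                        # video: (B, 3, 16, 224, 224)
        tok_f = self.embed_fine(video).flatten(2).transpose(1, 2)    # (B, Nf, d_fine)
        tok_c = self.embed_coarse(video).flatten(2).transpose(1, 2)  # (B, Nc, d_coarse)
        tok_f = self.enc_fine(tok_f)
        # Lateral connection: summarize the fine view and add it to every coarse token.
        tok_c = tok_c + self.lateral(tok_f.mean(dim=1, keepdim=True))
        tok_c = self.enc_coarse(tok_c)
        return self.head(tok_c.mean(dim=1))          # (B, num_classes)
```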

Sketches are the most abstract 2D representations of real-world objects. Although a sketch usually has geometrical distortion and lacks visual cues, humans can effortlessly envision a 3D object from it. This suggests that sketches encode the information necessary for reconstructing 3D shapes. Despite great progress achieved in 3D reconstruction from distortion-free line drawings, such as CAD and edge maps, little effort has been made to reconstruct 3D shapes from free-hand sketches. We study this task and aim to enhance the power of sketches in 3D-related applications such as interactive design and VR/AR games. Unlike previous works, which mostly study distortion-free line drawings, our 3D shape reconstruction is based on free-hand sketches. A major challenge in free-hand sketch 3D reconstruction comes from insufficient training data and the diversity of free-hand sketches, e.g., individualized sketching styles. We thus propose data generation and standardization mechanisms. Instead of distortion-free line drawings, synthesized sketches are adopted as input training data. Additionally, we propose a sketch standardization module to handle different sketch distortions and styles. Extensive experiments demonstrate the effectiveness of our model and its strong generalizability to various free-hand sketches. Our code is publicly available at //github.com/samaonline/3D-Shape-Reconstruction-from-Free-Hand-Sketches.

Recently, self-supervised vision transformers have attracted unprecedented attention for their impressive representation learning ability. However, the dominant method, contrastive learning, mainly relies on an instance discrimination pretext task, which learns a global understanding of the image. This paper incorporates local feature learning into self-supervised vision transformers via Reconstructive Pre-training (RePre). Our RePre extends contrastive frameworks by adding a branch for reconstructing raw image pixels in parallel with the existing contrastive objective. RePre is equipped with a lightweight convolution-based decoder that fuses the multi-hierarchy features from the transformer encoder. These multi-hierarchy features provide rich supervision ranging from low-level to high-level semantic information, which is crucial for our RePre. Our RePre brings decent improvements to various contrastive frameworks with different vision transformer architectures. Transfer performance in downstream tasks outperforms supervised pre-training and state-of-the-art (SOTA) self-supervised counterparts.
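
A minimal sketch of how such a reconstructive branch can be bolted onto a contrastive framework is shown below: a lightweight convolutional decoder fuses multi-level encoder features into an RGB prediction, and its L1 loss is simply added to whatever contrastive loss is already being used. The channel sizes, the 1x1-projection-plus-sum fusion, and the loss weight are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a RePre-style objective: contrastive loss + pixel reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoder(nn.Module):
    """Fuses multi-level feature maps and predicts an RGB image."""
    def __init__(self, channels=(96, 192, 384), out_size=224):
        super().__init__()
        self.out_size = out_size
        self.proj = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in channels)
        self.to_rgb = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, feats):            # feats: list of (B, C_i, H_i, W_i) maps
        fused = 0
        for f, proj in zip(feats, self.proj):
            fused = fused + F.interpolate(proj(f), size=self.out_size,
                                          mode="bilinear", align_corners=False)
        return self.to_rgb(fused)        # (B, 3, out_size, out_size)

def repre_style_loss(contrastive_loss, feats, images, decoder, weight=1.0):
    """Total loss = existing contrastive objective + weighted pixel reconstruction."""
    recon_loss = F.l1_loss(decoder(feats), images)
    return contrastive_loss + weight * recon_loss
```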

In this paper, we present a novel double-diffusion-based neural radiance field, dubbed DD-NeRF, to reconstruct human body geometry and render the human body appearance in novel views from a sparse set of images. We first propose a double diffusion mechanism to achieve expressive representations of input images by fully exploiting human body priors and image appearance details at two levels. At the coarse level, we model the coarse human body poses and shapes via an unclothed 3D deformable vertex model as guidance. At the fine level, we present a multi-view sampling network to capture subtle geometric deformations and detailed image appearance, such as clothing and hair, from multiple input views. Considering the sparsity of the two-level features, we diffuse them into feature volumes in the canonical space to construct neural radiance fields. Then, we present a signed distance function (SDF) regression network to construct body surfaces from the diffused features. Thanks to our double diffused representations, our method can even synthesize novel views of unseen subjects. Experiments on various datasets demonstrate that our approach outperforms the state-of-the-art in both geometric reconstruction and novel view synthesis.

3D Morphable Model (3DMM) based methods have achieved great success in recovering 3D face shapes from single-view images. However, the facial textures recovered by such methods lack the fidelity exhibited in the input images. Recent work demonstrates high-quality facial texture recovery with generative networks trained on a large-scale database of high-resolution UV maps of face textures, which is hard to prepare and not publicly available. In this paper, we introduce a method to reconstruct 3D facial shapes with high-fidelity textures from single-view images in the wild, without the need to capture a large-scale face texture database. The main idea is to refine the initial texture generated by a 3DMM-based method with facial details from the input image. To this end, we propose to use graph convolutional networks to reconstruct the detailed colors of the mesh vertices instead of reconstructing the UV map. Experiments show that our method can generate high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
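
The per-vertex color regression can be sketched with a basic graph convolution over the mesh adjacency, as below. This is only an illustrative stand-in: the actual network, its input features, and its depth differ, and the row-normalized adjacency with self-loops is an assumption of the sketch.

```python
# Minimal sketch of per-vertex color regression with graph convolutions (illustrative).
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Simple mesh graph convolution: aggregate neighbor features, then a linear map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (V, in_dim) per-vertex features; adj: (V, V) row-normalized adjacency
        # (with self-loops) of the face mesh.
        return self.linear(adj @ x)

class VertexColorNet(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.gc1 = GraphConv(feat_dim, hidden)
        self.gc2 = GraphConv(hidden, 3)           # RGB per vertex

    def forward(self, vertex_feats, adj):
        h = torch.relu(self.gc1(vertex_feats, adj))
        return torch.sigmoid(self.gc2(h, adj))    # colors in [0, 1]
```

Regressing colors directly on the mesh vertices keeps the prediction aligned with the geometry and sidesteps the need for a high-resolution UV texture database, which is the design choice the abstract emphasizes.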

In this paper, we propose a new deep-learning-based dense monocular SLAM method. Compared to existing methods, the proposed framework constructs a dense 3D model via sparse-to-dense mapping using learned surface normals. With single-view learned depth estimation as a prior for monocular visual odometry, we obtain both accurate positioning and high-quality depth reconstruction. The depth and normals are predicted by a single network trained in a tightly coupled manner. Experimental results show that our method significantly improves the performance of visual tracking and depth prediction in comparison to the state-of-the-art in deep monocular dense SLAM.

Single-image piece-wise planar 3D reconstruction aims to simultaneously segment plane instances and recover 3D plane parameters from an image. Most recent approaches leverage convolutional neural networks (CNNs) and achieve promising results. However, these methods are limited to detecting a fixed number of planes in a certain learned order. To tackle this problem, we propose a novel two-stage method based on associative embedding, inspired by its recent success in instance segmentation. In the first stage, we train a CNN to map each pixel to an embedding space where pixels from the same plane instance have similar embeddings. The plane instances are then obtained by grouping the embedding vectors in planar regions via an efficient mean shift clustering algorithm. In the second stage, we estimate the parameters of each plane instance by considering both pixel-level and instance-level consistencies. With the proposed method, we are able to detect an arbitrary number of planes. Extensive experiments on public datasets validate the effectiveness and efficiency of our method. Furthermore, our method runs at 30 fps at test time and thus could facilitate many real-time applications such as visual SLAM and human-robot interaction. Code is available at //github.com/svip-lab/PlanarReconstruction.
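
The grouping step of the first stage can be sketched as follows. Note the hedge: this uses scikit-learn's off-the-shelf MeanShift rather than the paper's efficient clustering variant, and the bandwidth value is an arbitrary placeholder.

```python
# Minimal sketch: grouping per-pixel embeddings into plane instances via mean shift.
import numpy as np
from sklearn.cluster import MeanShift

def group_plane_instances(embeddings, planar_mask, bandwidth=0.5):
    """embeddings: (H, W, D) per-pixel embedding map from the CNN.
    planar_mask: (H, W) boolean mask of pixels predicted to lie on some plane.
    Returns an (H, W) integer map of plane-instance labels (-1 for non-planar pixels)."""
    h, w, _ = embeddings.shape
    labels = np.full((h, w), -1, dtype=np.int64)
    planar_embeddings = embeddings[planar_mask]          # (N, D)
    if len(planar_embeddings) == 0:
        return labels
    clustering = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    labels[planar_mask] = clustering.fit_predict(planar_embeddings)
    return labels
```

Because mean shift discovers the number of modes from the data, the number of detected plane instances is not fixed in advance, which is what lets the method handle an arbitrary number of planes.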

With the advent of deep neural networks, learning-based approaches for 3D reconstruction have gained popularity. However, unlike for images, in 3D there is no canonical representation which is both computationally and memory efficient yet allows for representing high-resolution geometry of arbitrary topology. Many of the state-of-the-art learning-based 3D reconstruction approaches can hence only represent very coarse 3D geometry or are limited to a restricted domain. In this paper, we propose occupancy networks, a new representation for learning-based 3D reconstruction methods. Occupancy networks implicitly represent the 3D surface as the continuous decision boundary of a deep neural network classifier. In contrast to existing approaches, our representation encodes a description of the 3D output at infinite resolution without excessive memory footprint. We validate that our representation can efficiently encode 3D structure and can be inferred from various kinds of input. Our experiments demonstrate competitive results, both qualitatively and quantitatively, for the challenging tasks of 3D reconstruction from single images, noisy point clouds and coarse discrete voxel grids. We believe that occupancy networks will become a useful tool in a wide variety of learning-based 3D tasks.
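
The core idea can be sketched as a small conditional MLP that maps a 3D query point plus an encoding of the input observation to an occupancy probability; the implicit surface is the 0.5 decision boundary. Layer sizes and the simple concatenation-based conditioning are assumptions of the sketch, not the paper's architecture.

```python
# Minimal sketch of an occupancy-network-style decoder (illustrative).
import torch
import torch.nn as nn

class OccupancyDecoder(nn.Module):
    def __init__(self, code_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, code):
        # points: (B, N, 3) query coordinates; code: (B, code_dim) encoding of the
        # input observation (image, point cloud, or coarse voxel grid).
        code = code.unsqueeze(1).expand(-1, points.size(1), -1)
        logits = self.net(torch.cat([points, code], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)      # occupancy probability in [0, 1]
```

At inference, the continuous field can be evaluated at any set of points, so a mesh can be extracted at whatever resolution is needed (e.g. via marching cubes), which is where the resolution independence claimed above comes from.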

We present a unified framework tackling two problems: class-specific 3D reconstruction from a single image, and generation of new 3D shape samples. These tasks have received considerable attention recently; however, existing approaches rely on 3D supervision, annotation of 2D images with keypoints or poses, and/or training with multiple views of each object instance. Our framework is very general: it can be trained in similar settings to these existing approaches, while also supporting weaker supervision scenarios. Importantly, it can be trained purely from 2D images, without ground-truth pose annotations, and with a single view per instance. We employ meshes as an output representation, instead of voxels used in most prior work. This allows us to exploit shading information during training, which previous 2D-supervised methods cannot. Thus, our method can learn to generate and reconstruct concave object classes. We evaluate our approach on synthetic data in various settings, showing that (i) it learns to disentangle shape from pose; (ii) using shading in the loss improves performance; (iii) our model is comparable or superior to state-of-the-art voxel-based approaches on quantitative metrics, while producing results that are visually more pleasing; (iv) it still performs well when given supervision weaker than in prior works.
