Recent advances in neural radiance fields (NeRFs) achieve state-of-the-art novel view synthesis and facilitate dense estimation of scene properties. However, NeRFs often fail for large, unbounded scenes that are captured under very sparse views with the scene content concentrated far away from the camera, as is typical for field robotics applications. In particular, NeRF-style algorithms perform poorly: (1) when there are insufficient views with little pose diversity, (2) when scenes contain saturation and shadows, and (3) when finely sampling large unbounded scenes with fine structures becomes computationally intensive. This paper proposes CLONeR, which significantly improves upon NeRF by allowing it to model large outdoor driving scenes that are observed from sparse input sensor views. This is achieved by decoupling occupancy and color learning within the NeRF framework into separate Multi-Layer Perceptrons (MLPs) trained using LiDAR and camera data, respectively. In addition, this paper proposes a novel method to build differentiable 3D Occupancy Grid Maps (OGM) alongside the NeRF model, and leverage this occupancy grid for improved sampling of points along a ray for volumetric rendering in metric space. Through extensive quantitative and qualitative experiments on scenes from the KITTI dataset, this paper demonstrates that the proposed method outperforms state-of-the-art NeRF models on both novel view synthesis and dense depth prediction tasks when trained on sparse input data.
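The occupancy-guided sampling idea mentioned above can be illustrated with a minimal sketch. It is not CLONeR's implementation: the abstract describes a differentiable occupancy grid learned alongside the NeRF, whereas the sketch assumes a fixed set of occupied voxel indices. The function name `occupied_samples` and all of its parameters are hypothetical. It only shows how a grid can prune candidate depths along a ray so that rendering samples concentrate where the scene has content.

```python
import math
from typing import List, Sequence, Set, Tuple

Voxel = Tuple[int, int, int]


def occupied_samples(
    origin: Sequence[float],
    direction: Sequence[float],
    occupied: Set[Voxel],
    voxel_size: float,
    t_near: float,
    t_far: float,
    n_candidates: int,
) -> List[float]:
    """Return the candidate depths t whose sample point lies in an occupied voxel.

    Assumption: `occupied` is a plain set of integer voxel indices, standing in
    for a thresholded occupancy grid. Nothing here is differentiable.
    """
    step = (t_far - t_near) / n_candidates
    kept = []
    for i in range(n_candidates):
        # Midpoint of the i-th uniform bin along the ray, in metric units.
        t = t_near + (i + 0.5) * step
        point = [o + t * d for o, d in zip(origin, direction)]
        # Map the metric point to the index of the voxel that contains it.
        voxel = tuple(int(math.floor(c / voxel_size)) for c in point)
        if voxel in occupied:
            kept.append(t)
    return kept
```

With a single occupied voxel far from the camera, most candidates are discarded and only the depths inside that voxel remain, which is the behaviour that makes sampling large, mostly empty scenes cheaper.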

Related content

Recent advancements in the domain of text-to-image synthesis have culminated in a multitude of enhancements pertaining to quality, fidelity, and diversity. Contemporary techniques enable the generation of highly intricate visuals which rapidly approach near-photorealistic quality. Nevertheless, as progress is achieved, the complexity of these methodologies increases, consequently intensifying the comprehension barrier between individuals within the field and those external to it. In an endeavor to mitigate this disparity, we propose a streamlined approach for text-to-image generation, which encompasses both the training paradigm and the sampling process. Despite its remarkable simplicity, our method yields aesthetically pleasing images with few sampling iterations, allows for intriguing ways for conditioning the model, and imparts advantages absent in state-of-the-art techniques. To demonstrate the efficacy of this approach in achieving outcomes comparable to existing works, we have trained a one-billion parameter text-conditional model, which we refer to as "Paella". In the interest of fostering future exploration in this field, we have made our source code and models publicly accessible for the research community.

We present a novel method for populating 3D indoor scenes with virtual humans that can navigate the environment and interact with objects in a realistic manner. Existing approaches rely on high-quality training sequences that capture a diverse range of human motions in 3D scenes. However, such motion data is costly, difficult to obtain and can never cover the full range of plausible human-scene interactions in complex indoor environments. To address these challenges, we propose a reinforcement learning-based approach to learn policy networks that predict latent variables of a powerful generative motion model that is trained on a large-scale motion capture dataset (AMASS). For navigating in a 3D environment, we propose a scene-aware policy training scheme with a novel collision avoidance reward function. Combined with the powerful generative motion model, we can synthesize highly diverse human motions navigating 3D indoor scenes while effectively avoiding obstacles. For detailed human-object interactions, we carefully curate interaction-aware reward functions by leveraging a marker-based body representation and the signed distance field (SDF) representation of the 3D scene. With a number of important training design schemes, our method can synthesize realistic and diverse human-object interactions (e.g., sitting on a chair and then getting up) even for out-of-distribution test scenarios with different object shapes, orientations, starting body positions, and poses. Experimental results demonstrate that our approach outperforms state-of-the-art human-scene interaction synthesis frameworks in terms of both motion naturalness and diversity. Video results are available on the project page: //zkf1997.github.io/DIMOS.

Indoor scene reconstruction from monocular images has long been sought after by augmented reality and robotics developers. Recent advances in neural field representations and monocular priors have led to remarkable results in scene-level surface reconstructions. The reliance on Multilayer Perceptrons (MLP), however, significantly limits speed in training and rendering. In this work, we propose to directly use signed distance function (SDF) in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. Our globally sparse and locally dense data structure exploits surfaces' spatial sparsity, enables cache-friendly queries, and allows direct extensions to multi-modal data such as color and semantic labels. To apply this representation to monocular scene reconstruction, we develop a scale calibration algorithm for fast geometric initialization from monocular depth priors. We apply differentiable volume rendering from this initialization to refine details with fast convergence. We also introduce efficient high-dimensional Continuous Random Fields (CRFs) to further exploit the semantic-geometry consistency between scene objects. Experiments show that our approach is 10x faster in training and 100x faster in rendering while achieving comparable accuracy to state-of-the-art neural implicit methods.
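The abstract above does not spell out its scale calibration algorithm, so the following is only a generic sketch of the underlying problem: monocular depth predictions are typically known only up to an unknown scale and shift, and can be aligned to a few metric reference depths by least squares. The function name `fit_scale_shift` and the choice of a closed-form affine fit are assumptions for illustration, not the authors' method.

```python
from typing import Sequence, Tuple


def fit_scale_shift(mono: Sequence[float], ref: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (scale, shift) such that scale * mono + shift approximates ref.

    Assumption: `mono` holds monocular depth predictions and `ref` holds metric
    reference depths at the same pixels (for example from sparse reconstruction).
    """
    n = len(mono)
    if n != len(ref) or n < 2:
        raise ValueError("need at least two paired depth values")
    mean_m = sum(mono) / n
    mean_r = sum(ref) / n
    var_m = sum((m - mean_m) ** 2 for m in mono)
    if var_m == 0.0:
        raise ValueError("monocular depths are constant; scale is undetermined")
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(mono, ref))
    # Closed-form solution of ordinary least squares for a line.
    scale = cov / var_m
    shift = mean_r - scale * mean_m
    return scale, shift
```

Applying the fitted `scale` and `shift` to the full monocular depth map would give a metric depth estimate that could seed a geometric initialization. A robust variant (for example with outlier rejection) would likely be needed on real data.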

No significant work has been done to directly merge two partially overlapping scenes using NeRF representations. Given pre-trained NeRF models of a 3D scene that partially overlap, this paper aligns them with a rigid transform by generalizing the traditional registration pipeline, that is, key point detection and point set registration, to operate on 3D density fields. To describe corner points as key points in 3D, we propose to use universal pre-trained descriptor-generating neural networks that can be trained and tested on different scenes. We perform experiments to demonstrate that the descriptor networks can be conveniently trained using a contrastive learning strategy. We demonstrate that our method, as a global approach, can effectively register NeRF models, thus making future large-scale NeRF construction possible by registering smaller, overlapping NeRFs that are captured individually.

Recently, several morphologies, each with its advantages, have been proposed for the GelSight high-resolution tactile sensors. However, existing simulation methods are limited to flat-surface sensors, which prevents their usage with the newer sensors of non-flat morphologies in Sim2Real experiments. In this paper, we extend a previously proposed GelSight simulation method developed for flat-surface sensors and propose a novel method for curved sensors. In particular, we address the simulation of light rays travelling through a curved tactile membrane in the form of geodesic paths. The method is validated by simulating the finger-shaped GelTip sensor and comparing the generated synthetic tactile images against the corresponding real images. Our extensive experiments show that combining the illumination generated from the geodesic paths with a background image from the real sensor produces the best results when compared to the lighting generated by direct linear paths in the same conditions. As the method is parameterised by the sensor mesh, it can be applied in principle to simulate a tactile sensor of any morphology. The proposed method not only unlocks simulating existing optical tactile sensors of complex morphologies but also enables experimenting with sensors of novel morphologies, before the fabrication of the real sensor. Project website: //danfergo.github.io/geltip-sim

Neural Radiance Field training can be accelerated through the use of grid-based representations in NeRF's learned mapping from spatial coordinates to colors and volumetric density. However, these grid-based approaches lack an explicit understanding of scale and therefore often introduce aliasing, usually in the form of jaggies or missing scene content. Anti-aliasing has previously been addressed by mip-NeRF 360, which reasons about sub-volumes along a cone rather than points along a ray, but this approach is not natively compatible with current grid-based techniques. We show how ideas from rendering and signal processing can be used to construct a technique that combines mip-NeRF 360 and grid-based models such as Instant NGP to yield error rates that are 8% - 77% lower than either prior technique, and that trains 24x faster than mip-NeRF 360.

To implement autonomous driving, one essential step is to model the vehicle environment based on the sensor inputs. Radars, with their well-known advantages, became a popular option to infer the occupancy state of grid cells surrounding the vehicle. To tackle data sparsity and noise of radar detections, we propose a deep learning-based Inverse Sensor Model (ISM) to learn the mapping from sparse radar detections to polar measurement grids. Improved lidar-based measurement grids are used as reference. The learned radar measurement grids, combined with radar Doppler velocity measurements, are further used to generate a Dynamic Grid Map (DGM). Experiments in real-world highway scenarios show that our approach outperforms the hand-crafted geometric ISMs. In comparison to state-of-the-art deep learning methods, our approach is the first one to learn a single-frame measurement grid in the polar scheme from radars with a limited Field Of View (FOV). The learning framework makes the learned ISM independent of the radar mounting. This enables us to flexibly use one or more radar sensors without network retraining and without requirements on 360° sensor coverage.

Text to Speech (TTS) models can generate natural and high-quality speech, but they are not expressive enough when synthesizing speech with dramatic expressiveness, such as stand-up comedy. Because comedians have diverse personal speech styles, including personal prosody, rhythm, and fillers, synthesizing their speech requires real-world datasets and strong speech style modeling capabilities, both of which are challenging to obtain. In this paper, we construct a new dataset and develop ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in low-resource scenarios. First, we extract a prosody representation with a prosody encoder and condition the TTS model on it in a flexible way. Second, we enhance personal rhythm modeling with a conditional duration predictor. Third, we model personal fillers by introducing comedian-related special tokens. Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten minutes of training data for each comedian. The audio samples are available at //xh621.github.io/stand-up-comedy-demo/

Artificial neural networks (ANNs) have emerged as an essential tool in machine learning, achieving remarkable success across diverse domains, including image and speech generation, game playing, and robotics. However, there exist fundamental differences between ANNs' operating mechanisms and those of the biological brain, particularly concerning learning processes. This paper presents a comprehensive review of current brain-inspired learning representations in artificial neural networks. We investigate the integration of more biologically plausible mechanisms, such as synaptic plasticity, to enhance these networks' capabilities. Moreover, we delve into the potential advantages and challenges accompanying this approach. Ultimately, we pinpoint promising avenues for future research in this rapidly advancing field, which could bring us closer to understanding the essence of intelligence.

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance while reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
