四虎亚洲精品高清在线观看_丰满人妻被公侵犯高清版_97久久超碰中文字幕潮喷户外_日韩欧美精品一区二区三区久久蜜_亚洲一区二区手机在线播放_色婷婷六月亚洲综合香蕉_国产日韩欧美另类视频

Diffusion-based methods have achieved remarkable achievements in 2D image or 3D object generation, however, the generation of 3D scenes and even $360^{\circ}$ images remains constrained, due to the limited number of scene datasets, the complexity of 3D scenes themselves, and the difficulty of generating consistent multi-view images. To address these issues, we first establish a large-scale panoramic video-text dataset containing millions of consecutive panoramic keyframes with corresponding panoramic depths, camera poses, and text descriptions. Then, we propose a novel text-driven panoramic generation framework, termed DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation. Specifically, benefiting from the powerful generative capabilities of stable diffusion, we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset. We further design a spherical epipolar-aware multi-view diffusion model to ensure the multi-view consistency of the generated panoramic images. Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images with given unseen text descriptions and camera poses.

相關內容

關注 36

3D是(shi)英文“Three Dimensions”的(de)(de)簡稱，中文是(shi)指三(san)維、三(san)個(ge)維度(du)、三(san)個(ge)坐標，即有(you)(you)長、有(you)(you)寬(kuan)、有(you)(you)高，換句話說，就是(shi)立體的(de)(de)，是(shi)相對于(yu)只有(you)(you)長和寬(kuan)的(de)(de)平面（2D）而言。

可理解性 · MoDELS · 詞元分析器 · 語言模型化 · 多峰值 ·

2024 年 12 月 12 日

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Hao Li,Changyao Tian,Jie Shao,Xizhou Zhu,Zhaokai Wang,Jinguo Zhu,Wenhan Dou,Xiaogang Wang,Hongsheng Li,Lewei Lu,Jifeng Dai

The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.

3D · 控制器 · Learning · MoDELS · 查準率/準確率 ·

2024 年 12 月 12 日

LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors

Yabo Chen,Chen Yang,Jiemin Fang,Xiaopeng Zhang,Lingxi Xie,Wei Shen,Wenrui Dai,Hongkai Xiong,Qi Tian

from arxiv, Project page: //liftimage3d.github.io/

Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions, (2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with controllable small motions. Then we use robust neural matching models, i.e. MASt3R, to calibrate the camera poses of generated frames and produce corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which can learn independent distortions between frames and output undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on two challenging datasets, i.e. LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.

MoDELS · Branch · Neural Networks · Processing（編程語言） · Networking ·

2024 年 12 月 12 日

BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics

Keyi Shen,Jiangwei Yu,Huan Zhang,Yunzhu Li

from arxiv, The first two authors contributed equally. Project Page: //robopil.github.io/bab-nd/

Neural-network-based dynamics models learned from observational data have shown strong predictive capabilities for scene dynamics in robotic manipulation tasks. However, their inherent non-linearity presents significant challenges for effective planning. Current planning methods, often dependent on extensive sampling or local gradient descent, struggle with long-horizon motion planning tasks involving complex contact events. In this paper, we present a GPU-accelerated branch-and-bound (BaB) framework for motion planning in manipulation tasks that require trajectory optimization over neural dynamics models. Our approach employs a specialized branching heuristics to divide the search space into subdomains, and applies a modified bound propagation method, inspired by the state-of-the-art neural network verifier alpha-beta-CROWN, to efficiently estimate objective bounds within these subdomains. The branching process guides planning effectively, while the bounding process strategically reduces the search space. Our framework achieves superior planning performance, generating high-quality state-action trajectories and surpassing existing methods in challenging, contact-rich manipulation tasks such as non-prehensile planar pushing with obstacles, object sorting, and rope routing in both simulated and real-world settings. Furthermore, our framework supports various neural network architectures, ranging from simple multilayer perceptrons to advanced graph neural dynamics models, and scales efficiently with different model sizes.

3D · 估計/估計量 · 真實值 · Projection · 縮放 ·

2024 年 12 月 12 日

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

Wanshui Gan,Fang Liu,Hongbin Xu,Ningkai Mo,Naoto Yokoya

from arxiv, Project page: //ganwanshui.github.io/GaussianOcc/

We introduce GaussianOcc, a systematic method that investigates the two usages of Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground truth 6D poses from sensors during training. To address this limitation, we propose Gaussian Splatting for Projection (GSP) module to provide accurate scale information for fully self-supervised training from adjacent view projection. Additionally, existing methods rely on volume rendering for final 3D voxel representation learning using 2D signals (depth maps, semantic maps), which is both time-consuming and less effective. We propose Gaussian Splatting from Voxel space (GSV) to leverage the fast rendering properties of Gaussian splatting. As a result, the proposed GaussianOcc method enables fully self-supervised (no ground truth pose) 3D occupancy estimation in competitive performance with low computational cost (2.7 times faster in training and 5 times faster in rendering). The relevant code is available in //github.com/GANWANSHUI/GaussianOcc.git.

3D · MoDELS · VR · Microsoft Surface · 數據集 ·

2024 年 12 月 11 日

Textured Mesh Saliency: Bridging Geometry and Texture for Human Perception in 3D Graphics

Kaiwei Zhang,Dandan Zhu,Xiongkuo Min,Guangtao Zhai

from arxiv, to be published in AAAI 2025

Textured meshes significantly enhance the realism and detail of objects by mapping intricate texture details onto the geometric structure of 3D models. This advancement is valuable across various applications, including entertainment, education, and industry. While traditional mesh saliency studies focus on non-textured meshes, our work explores the complexities introduced by detailed texture patterns. We present a new dataset for textured mesh saliency, created through an innovative eye-tracking experiment in a six degrees of freedom (6-DOF) VR environment. This dataset addresses the limitations of previous studies by providing comprehensive eye-tracking data from multiple viewpoints, thereby advancing our understanding of human visual behavior and supporting more accurate and effective 3D content creation. Our proposed model predicts saliency maps for textured mesh surfaces by treating each triangular face as an individual unit and assigning a saliency density value to reflect the importance of each local surface region. The model incorporates a texture alignment module and a geometric extraction module, combined with an aggregation module to integrate texture and geometry for precise saliency prediction. We believe this approach will enhance the visual fidelity of geometric processing algorithms while ensuring efficient use of computational resources, which is crucial for real-time rendering and high-detail applications such as VR and gaming.

語音合成 · MoDELS · TOOLS · 標注 · ForCES ·

2024 年 12 月 11 日

Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration

Haowei Lou,Helen Paik,Wen Hu,Lina Yao

Recent advancements in text-to-speech (TTS) systems, such as FastSpeech and StyleSpeech, have significantly improved speech generation quality. However, these models often rely on duration generated by external tools like the Montreal Forced Aligner, which can be time-consuming and lack flexibility. The importance of accurate duration is often underestimated, despite their crucial role in achieving natural prosody and intelligibility. To address these limitations, we propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. This approach reduces dependence on external tools and enhances alignment accuracy. We further explore the impact of different acoustic features, including Mel-Spectrograms, MFCCs, and latent features, on TTS model performance. Our experimental results show that aligner-guided duration labelling can achieve up to a 16\% improvement in word error rate and significantly enhance phoneme and tone alignment. These findings highlight the effectiveness of our approach in optimizing TTS systems for more natural and intelligible speech generation.

數據集 · 磁流變材料 · VR · 增強現實（AR） · 層 ·

2024 年 12 月 10 日

XRZoo: A Large-Scale and Versatile Dataset of Extended Reality (XR) Applications

Shuqing Li,Chenran Zhang,Cuiyun Gao,Michael R. Lyu

The rapid advancement of Extended Reality (XR, encompassing AR, MR, and VR) and spatial computing technologies forms a foundational layer for the emerging Metaverse, enabling innovative applications across healthcare, education, manufacturing, and entertainment. However, research in this area is often limited by the lack of large, representative, and highquality application datasets that can support empirical studies and the development of new approaches benefiting XR software processes. In this paper, we introduce XRZoo, a comprehensive and curated dataset of XR applications designed to bridge this gap. XRZoo contains 12,528 free XR applications, spanning nine app stores, across all XR techniques (i.e., AR, MR, and VR) and use cases, with detailed metadata on key aspects such as application descriptions, application categories, release dates, user review numbers, and hardware specifications, etc. By making XRZoo publicly available, we aim to foster reproducible XR software engineering and security research, enable cross-disciplinary investigations, and also support the development of advanced XR systems by providing examples to developers. Our dataset serves as a valuable resource for researchers and practitioners interested in improving the scalability, usability, and effectiveness of XR applications. XRZoo will be released and actively maintained.

GPU · 有向 · 查準率/準確率 · 確切的 · 線性的 ·

2024 年 12 月 10 日

Direct Low-Dose CT Image Reconstruction on GPU using Out-Of-Core: Precision and Quality Study

M. Chillarón,G. Quintana-Ortí,V. Vidal,G. Verdú

from arxiv, 22 pages, 12 figures, 9 tables

Algebraic methods applied to the reconstruction of Sparse-view Computed Tomography (CT) can provide both a high image quality and a decrease in the dose received by patients, although with an increased reconstruction time since their computational costs are higher. In our work, we present a new algebraic implementation that obtains an exact solution to the system of linear equations that models the problem and based on single-precision floating-point arithmetic. By applying Out-Of-Core (OOC) techniques, the dimensions of the system can be increased regardless of the main memory size and as long as there is enough secondary storage (disk). These techniques have allowed to process images of 768 x 768$ pixels. A comparative study of our method on a GPU using both single-precision and double-precision arithmetic has been carried out. The goal is to assess the single-precision arithmetic implementation both in terms of time improvement and quality of the reconstructed images to determine if it is sufficient to consider it a viable option. Results using single-precision arithmetic approximately halves the reconstruction time of the double-precision implementation, whereas the obtained images retain all internal structures despite having higher noise levels.

MoDELS · 控制器 · Integration · 多峰值 · 語言模型化 ·

2024 年 12 月 10 日

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

Jianzong Wu,Chao Tang,Jingbo Wang,Yanhong Zeng,Xiangtai Li,Yunhai Tong

from arxiv, The project page is //jianzongwu.github.io/projects/diffsensei/

Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task: \textbf{customized manga generation} and introduce \textbf{DiffSensei}, an innovative framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible adjustments in character expressions, poses, and actions. We also introduce \textbf{MangaZero}, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, supporting the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The project page is //jianzongwu.github.io/projects/diffsensei/.

在線 · 縮放 · MoDELS · INFORMS · INTERACT ·

2024 年 12 月 10 日

BayesCNS: A Unified Bayesian Approach to Address Cold Start and Non-Stationarity in Search Systems at Scale

Randy Ardywibowo,Rakesh Sunki,Lucy Kuo,Sankalp Nayak

Information Retrieval (IR) systems used in search and recommendation platforms frequently employ Learning-to-Rank (LTR) models to rank items in response to user queries. These models heavily rely on features derived from user interactions, such as clicks and engagement data. This dependence introduces cold start issues for items lacking user engagement and poses challenges in adapting to non-stationary shifts in user behavior over time. We address both challenges holistically as an online learning problem and propose BayesCNS, a Bayesian approach designed to handle cold start and non-stationary distribution shifts in search systems at scale. BayesCNS achieves this by estimating prior distributions for user-item interactions, which are continuously updated with new user interactions gathered online. This online learning procedure is guided by a ranker model, enabling efficient exploration of relevant items using contextual information provided by the ranker. We successfully deployed BayesCNS in a large-scale search system and demonstrated its efficacy through comprehensive offline and online experiments. Notably, an online A/B experiment showed a 10.60% increase in new item interactions and a 1.05% improvement in overall success metrics over the existing production baseline.