
Score distillation sampling (SDS) has emerged as an effective framework in text-driven 3D editing tasks, leveraging diffusion models for 3D consistent editing. However, existing SDS-based 3D editing methods suffer from long training times and produce low-quality results. We identify that the root cause of this performance degradation is their conflict with the sampling dynamics of diffusion models. Addressing this conflict allows us to treat SDS as a diffusion reverse process for 3D editing via sampling from data space. In contrast, existing methods naively distill the score function using diffusion models. From these insights, we propose DreamCatalyst, a novel framework that considers these sampling dynamics in the SDS framework. Specifically, we devise the optimization process of our DreamCatalyst to approximate the diffusion reverse process in editing tasks, thereby aligning with diffusion sampling dynamics. As a result, DreamCatalyst successfully reduces training time and improves editing quality. Our method offers two modes: (1) a fast mode that edits Neural Radiance Fields (NeRF) scenes approximately 23 times faster than current state-of-the-art NeRF editing methods, and (2) a high-quality mode that produces superior results about 8 times faster than these methods. Notably, our high-quality mode outperforms current state-of-the-art NeRF editing methods in terms of both speed and quality. DreamCatalyst also surpasses the state-of-the-art 3D Gaussian Splatting (3DGS) editing methods, establishing itself as an effective and model-agnostic 3D editing solution. See more extensive results on our project page: //dream-catalyst.github.io.

Related content


3D is short for "Three Dimensions". It refers to three dimensions, or three coordinates: length, width, and height. In other words, it is solid rather than flat, in contrast to a plane (2D), which has only length and width.

Few-shot learning · Knowledge · MoDELS · Distillation · Learning ·

November 8, 2024

Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion

Nan Song, Xiaofeng Yang, Ze Yang, Guosheng Lin

Lifelong few-shot customization for text-to-image diffusion aims to continually generalize existing models for new tasks with minimal data while preserving old knowledge. Current customization diffusion models excel in few-shot tasks but struggle with catastrophic forgetting problems in lifelong generations. In this study, we identify and categorize the catastrophic forgetting problems into two folds: relevant concepts forgetting and previous concepts forgetting. To address these challenges, we first devise a data-free knowledge distillation strategy to tackle relevant concepts forgetting. Unlike existing methods that rely on additional real data or offline replay of original concept data, our approach enables on-the-fly knowledge distillation to retain the previous concepts while learning new ones, without accessing any previous data. Second, we develop an In-Context Generation (ICGen) paradigm that allows the diffusion model to be conditioned upon the input vision context, which facilitates the few-shot generation and mitigates the issue of previous concepts forgetting. Extensive experiments show that the proposed Lifelong Few-Shot Diffusion (LFS-Diffusion) method can produce high-quality and accurate images while maintaining previously learned knowledge.

MoDELS · FAST · Decoding · Language modeling · Logits ·

November 8, 2024

An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking

Zijian Chen, Ronak Pradeep, Jimmy Lin

Recent advances have demonstrated that large language models (LLMs) excel as listwise rerankers, but their high computational demands remain a barrier to widespread adoption. Further, the traditional language modeling (LM) objective is not ideally suited for reranking tasks. FIRST is a novel approach that addresses these challenges by integrating a learning-to-rank objective and leveraging the logits of only the first generated token, thereby significantly reducing inference latency compared to traditional LLM rerankers. In this study, we extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains. We investigate the influence of different first-stage retrievers on FIRST rerankers, observing diminishing returns and patterns consistent with traditional LLM rerankers. Through applying the FIRST objective to a broader range of backbone models, we achieve effectiveness surpassing the original implementation. Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality. To better quantify the computational savings in the original study, we measure and compare latency to find a 21%-42% gain across various models and benchmarks. Moreover, while LM training implicitly improves zero-shot single-token reranking, our experiments also raise questions about whether LM pre-training may hinder subsequent fine-tuning with the FIRST objective. These findings pave the way for more efficient and effective listwise reranking in future applications.
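
As a rough illustration of the single-token idea described above, the sketch below orders candidates by the logit assigned to each candidate's identifier token at the first generated position. The identifier labels, the input dictionary, and the function name are assumptions made for illustration; they are not taken from the FIRST implementation, and the learning-to-rank training objective is not shown.

```python
from typing import Dict, List


def rank_by_first_token_logits(first_token_logits: Dict[str, float]) -> List[str]:
    """Return candidate identifiers ordered from most to least relevant.

    `first_token_logits` maps each candidate's identifier token (e.g. "A", "B")
    to the logit the model assigned to that token at the first generated
    position. This input format is hypothetical and assumed for the sketch.
    """
    # Sort by logit in descending order; break ties by identifier so the
    # result is deterministic.
    return sorted(first_token_logits, key=lambda k: (-first_token_logits[k], k))
```

For example, calling `rank_by_first_token_logits({"A": 1.2, "B": 3.4, "C": -0.5, "D": 2.0})` orders the candidates B, D, A, C. The full ranking is read off one set of logits, so no further tokens need to be decoded.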

MoDELS · Benchmark · Performer · Scenario · Sparse ·

November 7, 2024

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality, including feed-forward networks, attention matrices, and layer normalization, enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).

Normalized · 3D · Optimizer · Regularization term · Dataset ·

November 7, 2024

DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing

Matias Turkulainen, Xuqian Ren, Iaroslav Melekhov, Otto Seiskari, Esa Rahtu, Juho Kannala

From arXiv. To be published in the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

High-fidelity 3D reconstruction of common indoor scenes is crucial for VR and AR applications. 3D Gaussian splatting, a novel differentiable rendering technique, has achieved state-of-the-art novel view synthesis results with high rendering speeds and relatively low training times. However, its performance on scenes commonly seen in indoor datasets is poor due to the lack of geometric constraints during optimization. In this work, we explore the use of readily accessible geometric cues to enhance Gaussian splatting optimization in challenging, ill-posed, and textureless scenes. We extend 3D Gaussian splatting with depth and normal cues to tackle challenging indoor datasets and showcase techniques for efficient mesh extraction. Specifically, we regularize the optimization procedure with depth information, enforce local smoothness of nearby Gaussians, and use off-the-shelf monocular networks to achieve better alignment with the true scene geometry. We propose an adaptive depth loss based on the gradient of color images, improving depth estimation and novel view synthesis results over various baselines. Our simple yet effective regularization technique enables direct mesh extraction from the Gaussian representation, yielding more physically accurate reconstructions of indoor scenes.

Trace · ReQuEST · INFORMS · Reducible · Analysis ·

November 7, 2024

Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis

Haiyu Huang, Cheng Chen, Kunyi Chen, Pengfei Chen, Guangba Yu, Zilong He, Yilun Wang, Huxing Zhang, Qi Zhou

From arXiv. Accepted by ASPLOS'25.

Distributed traces contain valuable information but are often massive in volume, posing a core challenge in tracing framework design: balancing the tradeoff between preserving essential trace information and reducing trace volume. To address this tradeoff, previous approaches typically used a '1 or 0' sampling strategy: retaining sampled traces while completely discarding unsampled ones. However, based on an empirical study on real-world production traces, we discover that the '1 or 0' strategy actually fails to effectively balance this tradeoff. To achieve a more balanced outcome, we shift the strategy from the '1 or 0' paradigm to the 'commonality + variability' paradigm. The core of 'commonality + variability' paradigm is to first parse traces into common patterns and variable parameters, then aggregate the patterns and filter the parameters. We propose a cost-efficient tracing framework, Mint, which implements the 'commonality + variability' paradigm on the agent side to enable all requests capturing. Our experiments show that Mint can capture all traces and retain more trace information while optimizing trace storage (reduced to an average of 2.7%) and network overhead (reduced to an average of 4.2%). Moreover, experiments also demonstrate that Mint is lightweight enough for production use.
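
The parse-then-aggregate step of the 'commonality + variability' idea can be sketched as follows. This is a toy illustration under stated assumptions, not Mint's implementation: spans are plain strings, every run of digits is treated as a variable parameter, and the names `parse_span` and `aggregate` are hypothetical. The parameter-filtering step mentioned in the abstract is left out.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption for this sketch: any run of digits is a variable parameter.
_NUMBER = re.compile(r"\d+")


def parse_span(text: str) -> Tuple[str, List[str]]:
    """Split a span string into a common pattern and its variable parameters."""
    params = _NUMBER.findall(text)
    pattern = _NUMBER.sub("<*>", text)
    return pattern, params


def aggregate(spans: List[str]) -> Dict[str, List[List[str]]]:
    """Group spans by pattern, keeping the parameters of each occurrence."""
    grouped: Dict[str, List[List[str]]] = defaultdict(list)
    for span in spans:
        pattern, params = parse_span(span)
        grouped[pattern].append(params)
    return dict(grouped)
```

Under these assumptions, `GET /user/1` and `GET /user/2` collapse into the single pattern `GET /user/<*>`, stored once, with the parameter lists `["1"]` and `["2"]` kept alongside it. Storing each pattern once is the intuition for how every request can be captured while the stored volume shrinks.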

MoDELS · Latent · Optimizer · Language modeling · Large language model ·

November 6, 2024

Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang

Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at //github.com/SalesforceAIResearch/LaTRO.

anchor · Code · Nearest neighbor · Estimation/estimator · Processing (programming language) ·

November 6, 2024

Inter-Frame Coding for Dynamic Meshes via Coarse-to-Fine Anchor Mesh Generation

He Huang, Lizhi Hou, Qi Yang, Yiling Xu

In the current Video-based Dynamic Mesh Coding (V-DMC) standard, inter-frame coding is restricted to mesh frames with constant topology. Consequently, temporal redundancy is not fully leveraged, resulting in suboptimal compression efficacy. To address this limitation, this paper introduces a novel coarse-to-fine scheme to generate anchor meshes for frames with time-varying topology. Initially, we generate a coarse anchor mesh using an octree-based nearest neighbor search. Motion estimation compensates for regions with significant motion changes during this process. However, the quality of the coarse mesh is low due to its suboptimal vertices. To enhance details, the fine anchor mesh is further optimized using the Quadric Error Metrics (QEM) algorithm to calculate more precise anchor points. The inter-frame anchor mesh generated herein retains the connectivity of the reference base mesh, while concurrently preserving superior quality. Experimental results show that our method achieves 7.2% ~ 10.3% BD-rate gain compared to the existing V-DMC test model version 7.
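
For readers unfamiliar with Quadric Error Metrics, the sketch below shows the standard quantity QEM minimizes: the sum of squared distances from a candidate point to a set of planes, written as a quadratic form over a 4x4 matrix. It shows only the generic metric. How the paper applies QEM to anchor points is not described in the abstract and is not modeled here, and the function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def plane_quadric(plane: Sequence[float]) -> Matrix:
    """Build the 4x4 quadric K = p p^T for a plane p = (a, b, c, d).

    The plane is ax + by + cz + d = 0, with (a, b, c) assumed to be unit length.
    """
    return [[pi * pj for pj in plane] for pi in plane]


def add_quadrics(q1: Matrix, q2: Matrix) -> Matrix:
    """Sum two quadrics element-wise (errors to several planes accumulate)."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(q1, q2)]


def quadric_error(quadric: Matrix, point: Sequence[float]) -> float:
    """Evaluate v^T Q v for the homogeneous point v = (x, y, z, 1)."""
    v = [point[0], point[1], point[2], 1.0]
    return sum(v[i] * quadric[i][j] * v[j] for i in range(4) for j in range(4))
```

As a check, take the planes z = 0 and x = 1 and the point (1, 2, 3). The point lies on the second plane and is 3 units from the first, so the accumulated error is 3² = 9.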

Streaming · Comprehensibility · Transcript · Multimodal · Language modeling ·

November 6, 2024

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, Maosong Sun

The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. StreamingBench assesses three core aspects of streaming video understanding: (1) real-time visual understanding, (2) omni-source understanding, and (3) contextual understanding. The benchmark consists of 18 tasks, featuring 900 videos and 4,500 human-curated QA pairs. Each video features five questions presented at different time points to simulate a continuous streaming scenario. We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary MLLMs like Gemini 1.5 Pro and GPT-4o perform significantly below human-level streaming video understanding capabilities. We hope our work can facilitate further advancements for MLLMs, empowering them to approach human-level video comprehension and interaction in more realistic scenarios.

GPU · Graph · Better · Neural Networks · Visual question answering ·

March 31, 2020

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen

From arXiv. Published as a CVPR 2020 paper.

Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and obviously facilitates the performances on two VQA tasks that require reading scene texts.

Loss function (machine learning) · Learning to learn · Learned · entity · Functional ·

September 9, 2019

Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification

Jiawei Wu, Wenhan Xiong, William Yang Wang

From arXiv. 11 pages, 5 figures, accepted to EMNLP 2019.

Many tasks in natural language processing can be viewed as multi-label classification problems. However, most of the existing models are trained with the standard cross-entropy loss function and use a fixed prediction policy (e.g., a threshold of 0.5) for all the labels, which completely ignores the complexity and dependencies among different labels. In this paper, we propose a meta-learning method to capture these complex label dependencies. More specifically, our method utilizes a meta-learner to jointly learn the training policies and prediction policies for different labels. The training policies are then used to train the classifier with the cross-entropy loss function, and the prediction policies are further implemented for prediction. Experimental results on fine-grained entity typing and text classification demonstrate that our proposed method can obtain more accurate multi-label classification results.
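
To make the contrast between a fixed prediction policy and per-label policies concrete, the sketch below applies a separate threshold to each label's predicted probability. The thresholds in the usage note are hand-picked for illustration. In the paper they come from a meta-learner, which is not modeled here, and the function name is hypothetical.

```python
from typing import List, Sequence


def predict_labels(probs: Sequence[float], thresholds: Sequence[float]) -> List[int]:
    """Return indices of labels whose probability meets that label's threshold."""
    if len(probs) != len(thresholds):
        raise ValueError("probs and thresholds must have the same length")
    return [i for i, (p, t) in enumerate(zip(probs, thresholds)) if p >= t]
```

With probabilities `[0.6, 0.4, 0.55]`, a fixed threshold of 0.5 for every label selects labels 0 and 2, whereas per-label thresholds `[0.7, 0.3, 0.5]` select labels 1 and 2. The same classifier output can therefore yield different label sets depending on the prediction policy.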