Spatial audio in Extended Reality (XR) provides users with better awareness of where virtual elements are placed and efficiently guides them to events such as notifications, system alerts from different windows, or approaching avatars. Humans, however, are inaccurate at localizing sound cues, especially when multiple sources are present, due to limitations of human auditory perception such as angular discrimination error and front-back confusion. This reduces the efficiency of XR interfaces because users misidentify which XR element a sound is coming from. To address this, we propose Auptimize, a novel computational approach for placing XR sound sources that mitigates such localization errors by exploiting the ventriloquist effect. Auptimize disentangles sound source locations from their visual elements and relocates the sound sources to positions that make sound cues unambiguously identifiable, avoiding errors due to inter-source proximity and front-back confusion. Our evaluation shows that Auptimize reduces spatial audio-based source identification errors compared to playing sound cues at the paired visual-sound locations. We demonstrate the applicability of Auptimize to diverse spatial audio-based interactive XR scenarios.
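To make the placement idea above concrete, here is a minimal sketch (not Auptimize's actual algorithm) of searching for audio azimuths near each visual element while penalizing pairs that are close directly or via front-back mirroring; the tolerance, separation threshold, and penalty values are illustrative placeholders.

```python
import itertools
import math

# Illustrative sketch only: choose audio azimuths for visual elements so that
# sound cues stay easy to tell apart. Constants below are made-up placeholders.
VENTRILOQUISM_TOLERANCE = 30.0   # assumed max audio-visual offset (degrees)
MIN_SEPARATION = 25.0            # desired angular gap between any two sources

def front_back_mirror(azimuth):
    """Azimuth commonly confused with the given one (front-back confusion)."""
    return (180.0 - azimuth) % 360.0

def ambiguity(a, b):
    """Penalty when two azimuths are close, directly or via mirroring."""
    direct = min(abs(a - b) % 360.0, 360.0 - abs(a - b) % 360.0)
    mirrored = min(abs(front_back_mirror(a) - b) % 360.0,
                   360.0 - abs(front_back_mirror(a) - b) % 360.0)
    return max(0.0, MIN_SEPARATION - min(direct, mirrored))

def place_audio_sources(visual_azimuths, step=10):
    """Brute-force search over candidate audio azimuths near each visual element."""
    candidates = [
        [(v + o) % 360.0 for o in range(-int(VENTRILOQUISM_TOLERANCE),
                                        int(VENTRILOQUISM_TOLERANCE) + 1, step)]
        for v in visual_azimuths
    ]
    best, best_cost = None, math.inf
    for assignment in itertools.product(*candidates):
        cost = sum(ambiguity(a, b) for a, b in itertools.combinations(assignment, 2))
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best

print(place_audio_sources([0.0, 170.0, 200.0]))
```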
Recommender systems suffer from the cold-start problem whenever a new user joins the platform or a new item is added to the catalog. To address item cold-start, we propose to replace the embedding layer in sequential recommenders with dynamic storage that has no learnable weights and can hold an arbitrary number of representations. In this paper, we present FELRec, a large embedding network that refines the existing representations of users and items recursively as new information becomes available. In contrast to similar approaches, our model represents new users and items without side information or time-consuming fine-tuning; instead, it runs a single forward pass over a sequence of existing representations. During item cold-start, our method outperforms similar approaches by 29.50%-47.45%. Furthermore, our proposed model generalizes well to previously unseen datasets in zero-shot settings. The source code is publicly available at //github.com/kweimann/FELRec.
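The following sketch illustrates the general idea described above, not the FELRec code itself: representations live in a plain dictionary rather than a learnable embedding table, new items are representable immediately, and a recurrent encoder (here a GRU, an assumed stand-in) refines a representation from the stored sequence in a single forward pass.

```python
import torch
import torch.nn as nn

DIM = 64  # representation size; arbitrary for this sketch

class RecursiveRefiner(nn.Module):
    def __init__(self, dim=DIM):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, history):                # history: (1, seq_len, dim)
        _, h = self.rnn(history)
        return h.squeeze(0).squeeze(0)         # refined representation, (dim,)

storage = {}                                   # id -> tensor; grows arbitrarily
refiner = RecursiveRefiner()

def get_repr(item_id):
    # Cold items get a default vector: no side information, no fine-tuning needed.
    return storage.setdefault(item_id, torch.zeros(DIM))

def update_user(user_id, interacted_item_ids):
    history = torch.stack([get_repr(i) for i in interacted_item_ids]).unsqueeze(0)
    with torch.no_grad():                      # single forward pass, no gradient step
        storage[user_id] = refiner(history)

update_user("user_42", ["item_a", "item_b", "item_new"])
print(storage["user_42"].shape)                # torch.Size([64])
```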
Multimodal deep learning enhances decision-making by integrating diverse information sources such as text, images, audio, and video. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty impacts these models. We propose LUMA, a unique benchmark dataset featuring audio, image, and textual data from 50 classes, for learning from uncertain and multimodal data. It extends the well-known CIFAR-10/100 datasets with audio samples extracted from three audio corpora and text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty to tailor specific experiments and benchmarking initiatives. LUMA is also available as a Python package that includes functions for generating multiple variants of the dataset, controlling the diversity of the data and the amount of noise for each modality, and adding out-of-distribution samples. A baseline pre-trained model is also provided, alongside three uncertainty quantification methods: Monte Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning. This comprehensive dataset and its benchmarking tools are intended to promote and support the development, evaluation, and benchmarking of trustworthy and robust multimodal deep learning approaches. We anticipate that the LUMA dataset will help the ICLR community design more trustworthy and robust machine learning approaches for safety-critical applications.
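As a rough illustration of controlled uncertainty injection of the kind described above: the snippet below adds input noise, label noise, and an out-of-distribution indicator to a toy modality. It is a stand-in, not the LUMA package's actual API, and all parameter names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_uncertainty(images, labels, num_classes,
                       noise_std=0.1, label_noise=0.05, ood_fraction=0.02):
    """Add input noise, flip a fraction of labels, and flag synthetic OOD samples."""
    noisy = images + rng.normal(0.0, noise_std, size=images.shape)     # input noise
    flip = rng.random(len(labels)) < label_noise                        # label noise
    labels = np.where(flip, rng.integers(0, num_classes, len(labels)), labels)
    ood = rng.random(len(labels)) < ood_fraction                        # OOD indicator
    return noisy, labels, ood

images = rng.random((100, 32, 32, 3))
labels = rng.integers(0, 50, 100)
noisy_images, noisy_labels, ood_mask = inject_uncertainty(images, labels, num_classes=50)
print(noisy_images.shape, noisy_labels[:5], int(ood_mask.sum()))
```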
Conditional diffusion models have shown remarkable success in visual content generation, producing high-quality samples across various domains, largely due to classifier-free guidance (CFG). Recent attempts to extend guidance to unconditional models have relied on heuristic techniques, resulting in suboptimal generation quality and unintended effects. In this work, we propose Smoothed Energy Guidance (SEG), a novel training- and condition-free approach that leverages the energy-based perspective of the self-attention mechanism to enhance image generation. By defining the energy of self-attention, we introduce a method to reduce the curvature of the attention energy landscape and use the resulting output as the unconditional prediction. In practice, we control the curvature of the energy landscape by adjusting the Gaussian kernel parameter while keeping the guidance scale fixed. Additionally, we present a query-blurring method that is equivalent to blurring the entire attention weights without incurring quadratic complexity in the number of tokens. In our experiments, SEG achieves a Pareto improvement in both generation quality and the reduction of side effects. The code is available at //github.com/SusungHong/SEG-SDXL.
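The sketch below illustrates the query-blurring idea mentioned above, under stated assumptions: because the pre-softmax logits are linear in the queries, smoothing the queries along the token axis with a 1D Gaussian smooths the logits along the query-token axis without ever materializing and blurring the full tokens-by-tokens map. The kernel size and sigma are arbitrary illustrative choices, and this is not the SEG-SDXL implementation.

```python
import math
import torch
import torch.nn.functional as F

def gaussian_kernel1d(sigma, radius):
    x = torch.arange(-radius, radius + 1, dtype=torch.float32)
    k = torch.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur_queries(q, sigma):
    """q: (batch, tokens, dim); blur along the token axis, one kernel per channel."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel1d(sigma, radius).to(q.dtype).to(q.device)
    k = k.view(1, 1, -1).repeat(q.shape[-1], 1, 1)          # depthwise kernels
    q = q.transpose(1, 2)                                    # (batch, dim, tokens)
    q = F.conv1d(q, k, padding=radius, groups=q.shape[1])
    return q.transpose(1, 2)

def smoothed_attention(q, k, v, sigma=2.0):
    q = blur_queries(q, sigma)                               # smoothed-energy branch
    logits = q @ k.transpose(1, 2) / math.sqrt(q.shape[-1])
    return torch.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(1, 16, 64)
print(smoothed_attention(q, k, v).shape)                     # torch.Size([1, 16, 64])
```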
Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these foundation models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose Mask Matching Cost (MMC), a metric that quantifies this variability, and FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism across the full set of attention features, i.e., the temporal, cross-, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning while enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
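As a simple stand-in for the mask-selection step described above: the snippet scores each timestep's cross-attention mask by its disagreement with the timestep-averaged mask and keeps the best-matching ones. The actual Mask Matching Cost in the paper may be defined differently; this is only an illustration of selecting masks by a matching score.

```python
import numpy as np

rng = np.random.default_rng(0)

def matching_cost(masks):
    """masks: (timesteps, H, W) soft cross-attention masks in [0, 1]."""
    reference = masks.mean(axis=0, keepdims=True)
    return np.abs(masks - reference).mean(axis=(1, 2))      # one cost per timestep

def select_masks(masks, keep_ratio=0.5):
    costs = matching_cost(masks)
    keep = np.argsort(costs)[: max(1, int(len(costs) * keep_ratio))]
    return np.sort(keep), costs

masks = rng.random((50, 64, 64))                             # e.g. 50 denoising steps
selected_steps, costs = select_masks(masks)
print(selected_steps[:10], float(costs.min()), float(costs.max()))
```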
Large language models (LLMs) have exhibited impressive abilities in multimodal content comprehension and reasoning with proper prompting in zero- or few-shot settings. Despite the proliferation of interactive systems developed to support prompt engineering for LLMs across various tasks, most have focused primarily on textual or visual inputs, neglecting the complex interplay between modalities within multimodal inputs. This oversight hinders the development of effective prompts that guide a model's multimodal reasoning process by fully exploiting the rich context provided by multiple modalities. In this paper, we present POEM, a visual analytics system that facilitates efficient prompt engineering for enhancing the multimodal reasoning performance of LLMs. The system enables users to explore interaction patterns across modalities at varying levels of detail, providing a comprehensive understanding of the multimodal knowledge elicited by various prompts. Through diverse recommendations of demonstration examples and instructional principles, POEM supports users in iteratively crafting and refining prompts to better align model knowledge with human insights. The effectiveness and efficiency of our system are validated through two case studies and interviews with experts.
Good user experience in real-time video applications requires continuously adjusting video encoding bitrates to match available network capacity, which hinges on accurate bandwidth estimation (BWE). However, network heterogeneity prevents a one-size-fits-all solution to BWE, motivating the demand for personalized approaches. Although personalizing BWE algorithms offers benefits such as improved adaptability to individual network conditions, it faces the challenge of data drift, where estimators degrade over time due to evolving network environments. To address this, we introduce Ivy, a novel method for BWE that leverages offline meta-learning to tackle data drift and maximize end-user Quality of Experience (QoE). Our key insight is that dynamically selecting the most suitable BWE algorithm for the current network conditions allows for more effective adaptation to changing environments. Ivy is trained entirely offline using Implicit Q-learning, enabling it to learn from individual network conditions without a single live videoconferencing interaction, thereby reducing deployment complexity and making Ivy more practical for real-world personalization. We implemented our method in a popular videoconferencing application and demonstrated that Ivy can enhance QoE by 5.9% to 11.2% over individual BWE algorithms and by 6.3% to 11.4% compared to existing online meta-heuristics.
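A conceptual sketch of the selection idea above follows: a small Q-network scores each candidate bandwidth estimator for the current network observation, and the highest-scoring one produces the estimate. Offline training (e.g., with Implicit Q-learning) is not shown, and the estimators, feature layout, and network sizes are placeholders rather than Ivy's actual design.

```python
import torch
import torch.nn as nn

# Placeholder candidate estimators, not the ones used in the paper.
ESTIMATORS = {
    0: lambda obs: 0.9 * obs["recv_rate"],                  # conservative heuristic
    1: lambda obs: obs["recv_rate"] + 0.5 * obs["trend"],   # trend-following heuristic
    2: lambda obs: obs["recv_rate"] * (1.0 - obs["loss"]),  # loss-aware heuristic
}

class SelectorQNet(nn.Module):
    def __init__(self, obs_dim=3, num_estimators=len(ESTIMATORS)):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_estimators))

    def forward(self, features):
        return self.net(features)                            # one Q-value per estimator

q_net = SelectorQNet()                                        # assume trained offline

def estimate_bandwidth(obs):
    features = torch.tensor([obs["recv_rate"], obs["trend"], obs["loss"]])
    with torch.no_grad():
        action = int(q_net(features).argmax())                # pick estimator for this state
    return ESTIMATORS[action](obs)

print(estimate_bandwidth({"recv_rate": 2.5e6, "trend": 1e5, "loss": 0.01}))
```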
We introduce Epic-Sounds, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of egocentric videos. We propose an annotation pipeline in which annotators temporally label distinguishable audio segments and describe the action that could have caused the sound. By grouping these free-form descriptions of audio into classes, we identify actions that can be discriminated purely from audio. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g., a glass object being placed on a wooden surface), which we verify from video, discarding ambiguities. Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions distributed across 44 classes, as well as 39.2k non-categorised segments. We train and evaluate state-of-the-art audio recognition and detection models on our dataset, for both audio-only and audio-visual methods. We also analyse the temporal overlap between audio events, the temporal and label correlations between audio and visual modalities, the ambiguities in annotating materials from audio-only input, the importance of audio-only labels, and the limitations of current models in understanding actions that sound. Project page: //epic-kitchens.github.io/epic-sounds/
Mixed Reality (MR) platforms enable users to interact with three-dimensional holographic instructions during the assembly and fabrication of highly custom and parametric architectural constructions without the need for two-dimensional drawings. Previous MR fabrication projects have primarily relied on digital menus and custom buttons as the interface for user interaction with the MR environment. Although this approach is widely adopted, it limits users' ability to directly interact with physical objects in order to modify fabrication instructions within the MR environment. This research integrates user interaction with physical objects through real-time gesture recognition as input to modify, update, or generate new digital information, enabling reciprocal stimuli between the physical and the virtual environment. Consequently, the digital environment responds generatively to the user's interaction with physical objects, allowing seamless feedback in the fabrication process. This research investigates gesture recognition for feedback-based MR workflows for robotic fabrication, human assembly, and quality control in the construction of the UnLog Tower.
Dashboard cameras (dashcams) record millions of driving videos daily, offering a valuable potential data source for various applications, including driving map production and updates. A necessary step in utilizing these dashcam data is the estimation of camera poses. However, the low-quality images captured by dashcams, characterized by motion blur and dynamic objects, pose challenges for existing image-matching methods in accurately estimating camera poses. In this study, we propose a precise pose estimation method for dashcam images that leverages the inherent camera motion prior. Image sequences captured by dash cameras typically exhibit a pronounced motion prior, such as forward movement or lateral turns, which serves as an essential cue for correspondence estimation. Building upon this observation, we devise a pose regression module that learns the camera motion prior and subsequently integrate this prior into both the correspondence and pose estimation processes. Experiments show that, on a real dashcam dataset, our method outperforms the baseline by 22% in pose estimation measured by AUC@5°, and it can estimate poses for 19% more images with lower reprojection error in Structure from Motion (SfM).
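The snippet below is a rough sketch of one way a motion prior can inform correspondence estimation, under assumptions not taken from the paper: with mostly forward motion, optical flow roughly radiates outward from the image center, so matches whose flow disagrees with that expectation are down-weighted before pose estimation. The regression module that would actually predict the prior is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_flow_direction(pts, image_center):
    """Under forward motion, flow roughly radiates outward from the image center."""
    d = pts - image_center
    return d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-8)

def prior_consistency_weights(pts0, pts1, image_center):
    """Weight in [0, 1] per match; 1 means the match agrees with the motion prior."""
    flow = pts1 - pts0
    flow = flow / (np.linalg.norm(flow, axis=1, keepdims=True) + 1e-8)
    cos = np.sum(flow * expected_flow_direction(pts0, image_center), axis=1)
    return np.clip(cos, 0.0, 1.0)

pts0 = rng.uniform(0, 1000, (200, 2))                        # matched keypoints, frame t
pts1 = pts0 + rng.normal(0, 5, (200, 2))                     # matched keypoints, frame t+1
w = prior_consistency_weights(pts0, pts1, image_center=np.array([500.0, 500.0]))
print(float(w.mean()), int((w > 0.5).sum()))
```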
Rhetorical Role Labeling (RRL) of legal documents is pivotal for various downstream tasks such as summarization, semantic case search, and argument mining. Existing approaches often overlook the varying difficulty levels inherent in legal document discourse styles and rhetorical roles. In this work, we propose HiCuLR, a hierarchical curriculum learning framework for RRL. It nests two curricula: a Rhetorical Role-level Curriculum (RC) on the outer layer and a Document-level Curriculum (DC) on the inner layer. DC categorizes documents by difficulty, using metrics such as deviation from a standard discourse structure, and exposes the model to them in an easy-to-difficult fashion. RC progressively strengthens the model to discern coarse-to-fine-grained distinctions between rhetorical roles. Our experiments on four RRL datasets demonstrate the efficacy of HiCuLR, highlighting the complementary nature of DC and RC.
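To illustrate the document-level curriculum idea above with a minimal sketch: each document's rhetorical-role sequence is scored by its edit distance from a canonical ordering, and progressively larger easy-to-difficult subsets are revealed over training stages. The canonical order, role names, and staging schedule below are illustrative assumptions, not the paper's exact setup.

```python
STANDARD_ORDER = ["PREAMBLE", "FACTS", "ISSUE", "ARGUMENT", "REASONING", "DECISION"]

def edit_distance(a, b):
    """Levenshtein distance between two role sequences."""
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a)][len(b)]

def difficulty(role_sequence):
    """Deviation of a document's role sequence from the standard discourse structure."""
    return edit_distance(role_sequence, STANDARD_ORDER)

def curriculum_stages(documents, num_stages=3):
    """Yield progressively larger easy-to-difficult subsets of the corpus."""
    ranked = sorted(documents, key=lambda d: difficulty(d["roles"]))
    for stage in range(1, num_stages + 1):
        yield ranked[: len(ranked) * stage // num_stages]

docs = [
    {"id": 1, "roles": ["PREAMBLE", "FACTS", "ARGUMENT", "DECISION"]},
    {"id": 2, "roles": ["FACTS", "DECISION", "ARGUMENT"]},
    {"id": 3, "roles": STANDARD_ORDER},
]
for stage, subset in enumerate(curriculum_stages(docs), start=1):
    print(f"stage {stage}: {[d['id'] for d in subset]}")
```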