Enabled by large annotated datasets, tracking and segmentation of objects in videos has made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning-based mask-level tracking algorithms with events is not available. To this end, we introduce: (i) a new task termed space-time instance segmentation, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the input is quasi-continuous events and optionally aligned frames); and (ii) MouseSIS, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground-truth labels (pixel-level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event-aided tracking in difficult scenarios. We hope our dataset opens the field of event-based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions. //github.com/tub-rip/MouseSIS

Related content

Noise · Color · Reducible · Discretization · Separable

October 31, 2024

HoloChrome: Polychromatic Illumination for Speckle Reduction in Holographic Near-Eye Displays

Florian Schiffers,Grace Kuo,Nathan Matsuda,Douglas Lanman,Oliver Cossairt

Holographic displays hold the promise of providing authentic depth cues, resulting in enhanced immersive visual experiences for near-eye applications. However, current holographic displays are hindered by speckle noise, which limits accurate reproduction of color and texture in displayed images. We present HoloChrome, a polychromatic holographic display framework designed to mitigate these limitations. HoloChrome utilizes an ultrafast, wavelength-adjustable laser and a dual-Spatial Light Modulator (SLM) architecture, enabling the multiplexing of a large set of discrete wavelengths across the visible spectrum. By leveraging spatial separation in our dual-SLM setup, we independently manipulate speckle patterns across multiple wavelengths. This novel approach effectively reduces speckle noise through incoherent averaging achieved by wavelength multiplexing. Our method is complementary to existing speckle reduction techniques, offering a new pathway to address this challenge. Furthermore, the use of polychromatic illumination broadens the achievable color gamut compared to traditional three-color primary holographic displays. Our simulations and tabletop experiments validate that HoloChrome significantly reduces speckle noise and expands the color gamut. These advancements enhance the performance of holographic near-eye displays, moving us closer to practical, immersive next-generation visual experiences.
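The incoherent-averaging argument in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' pipeline: it assumes each wavelength produces a statistically independent, fully developed speckle pattern (a circular complex Gaussian field), which is an idealization, and the function names are hypothetical. Under that assumption, summing N patterns in intensity is expected to lower the speckle contrast (standard deviation over mean) to roughly 1/sqrt(N).

```python
import random
import statistics


def speckle_intensity(n_pixels, rng):
    """Intensity of a circular complex Gaussian field (fully developed speckle)."""
    return [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(n_pixels)]


def speckle_contrast(intensity):
    """Speckle contrast: standard deviation of the intensity divided by its mean."""
    return statistics.pstdev(intensity) / statistics.fmean(intensity)


def averaged_contrast(n_patterns, n_pixels=10000, seed=0):
    """Contrast after summing n_patterns independent speckle patterns in intensity.

    Independence between patterns is an assumption of this sketch; it stands in
    for the decorrelation that different wavelengths are meant to provide.
    """
    rng = random.Random(seed)
    total = [0.0] * n_pixels
    for _ in range(n_patterns):
        pattern = speckle_intensity(n_pixels, rng)
        total = [t + p for t, p in zip(total, pattern)]
    # Dividing by n_patterns would rescale mean and std equally, so the
    # contrast of the sum equals the contrast of the average.
    return speckle_contrast(total)


for n in (1, 4, 16):
    # Prints the simulated contrast next to the 1/sqrt(n) prediction.
    print(n, round(averaged_contrast(n), 3), round(n ** -0.5, 3))
```

In a real display the patterns at nearby wavelengths are only partially decorrelated, so the reduction would be smaller than this idealized bound.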

Episode · Approximation · Agent · Robustness · Performer

October 29, 2024

No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Alexander Rutherford,Michael Beukman,Timon Willi,Bruno Lacerda,Nick Hawes,Jakob Foerster

What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks. This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics. Surprisingly, despite methods aiming to maximise regret in theory, the practical approximations do not correlate with regret but with success rate. As a result, a significant portion of an agent's experience comes from environments it has already mastered, offering little to no contribution toward enhancing its abilities. Put differently, current methods fail to predict intuitive measures of "learnability." Specifically, they are unable to consistently identify those scenarios that the agent can sometimes solve, but not always. Based on our analysis, we develop a method that directly trains on scenarios with high learnability. This simple and intuitive approach outperforms existing UED methods in several binary-outcome environments, including the standard domain of Minigrid and a novel setting closely inspired by a real-world robotics problem. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: //github.com/amacrutherford/sampling-for-learnability.
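The idea of preferring scenarios that the agent solves sometimes but not always can be sketched for binary-outcome tasks. The abstract does not give a formula, so the score below is an assumption: it uses the Bernoulli variance p(1 - p) of the success rate p, which is zero for levels that are always or never solved and largest at p = 0.5. The function names and level names are hypothetical.

```python
def learnability(success_rate):
    """Bernoulli variance p * (1 - p): zero at p = 0 or p = 1, maximal at p = 0.5."""
    return success_rate * (1.0 - success_rate)


def select_levels(success_rates, k):
    """Return the names of the k levels with the highest learnability score."""
    ranked = sorted(
        success_rates,
        key=lambda name: learnability(success_rates[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical per-level success rates, e.g. estimated from rollouts.
rates = {
    "mastered": 1.0,
    "impossible": 0.0,
    "borderline": 0.5,
    "mostly_solved": 0.9,
}
print(select_levels(rates, 2))
```

A score that tracks success rate alone would rank "mastered" first, which is the failure mode the abstract describes.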

MoDELS · Performer · Reducible · Autoencoder · Latent

October 29, 2024

PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher

Dongjun Kim,Chieh-Hsin Lai,Wei-Hsiang Liao,Yuhta Takida,Naoki Murata,Toshimitsu Uesaka,Yuki Mitsufuji,Stefano Ermon

from arxiv, NeurIPS 2024

Diffusion models perform remarkably well at generating high-dimensional content but are computationally intensive, especially during training. We propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a novel pipeline that reduces training costs through three stages: training diffusion on downsampled data, distilling the pretrained diffusion, and progressive super-resolution. With the proposed pipeline, PaGoDA achieves a $64\times$ reduction in the cost of training its diffusion model on 8x downsampled data; at inference, its single-step generator achieves state-of-the-art performance on ImageNet across all resolutions from 64x64 to 512x512, and on text-to-image generation. PaGoDA's pipeline can be applied directly in the latent space, adding compression alongside the pre-trained autoencoder in Latent Diffusion Models (e.g., Stable Diffusion). The code is available at //github.com/sony/pagoda.

Tokenizer · Discretization · MoDELS · Learning · Augmented reality (AR)

October 28, 2024

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

Hanyu Wang,Saksham Suri,Yixuan Ren,Hao Chen,Abhinav Shrivastava

from arxiv, Project page: //hywang66.github.io/larp/

We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).

MoDELS · Generalization theory · Controller · Generative model · Robot

October 26, 2024

GHIL-Glue: Hierarchical Control with Filtered Subgoal Images

Kyle B. Hatch,Ashwin Balakrishna,Oier Mees,Suraj Nair,Seohong Park,Blake Wulfe,Masha Itkina,Benjamin Eysenbach,Sergey Levine,Thomas Kollar,Benjamin Burchfiel

from arxiv, Code, model checkpoints and videos can be found at //ghil-glue.github.io

Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively "glue together" language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.

Dataset · Comprehensibility · Question answering · MoDELS · Performance

October 21, 2024

CinePile: A Long Video Question Answering Dataset and Benchmark

Ruchit Rawal,Khalid Saifullah,Miquel Farré,Ronen Basri,David Jacobs,Gowthami Somepalli,Tom Goldstein

from arxiv, Project page with all the artifacts - //ruchitrawal.github.io/cinepile/. Updated version with adversarial refinement pipeline and more model evaluations

Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The findings indicate that although current models underperform compared to humans, fine-tuning these models can lead to significant improvements in their performance.

MoDELS · Subspace · Fidelity · Performer · Backbone

October 16, 2024

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

Jaehong Yoon,Shoubin Yu,Vaidehi Patil,Huaxiu Yao,Mohit Bansal

from arxiv, The first two authors contributed equally; Project page: //safree-safe-t2i-t2v.github.io/

Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) They cannot instantly remove harmful concepts without training. (2) Their safe generation capabilities depend on collected training data. (3) They alter model weights, risking degradation in quality for content unrelated to toxic concepts. To address these, we propose SAFREE, a novel, training-free approach for safe T2I and T2V, that does not alter the model's weights. Specifically, we detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace, thereby filtering out harmful content while preserving intended semantics. To balance the trade-off between filtering toxicity and preserving safe concepts, SAFREE incorporates a novel self-validating filtering mechanism that dynamically adjusts the denoising steps when applying the filtered embeddings. Additionally, we incorporate adaptive re-attention mechanisms within the diffusion latent space to selectively diminish the influence of features related to toxic concepts at the pixel level. In the end, SAFREE ensures coherent safety checking, preserving the fidelity, quality, and safety of the output. SAFREE achieves SOTA performance in suppressing unsafe content in T2I generation compared to training-free baselines and effectively filters targeted concepts while maintaining high-quality images. It also shows competitive results against training-based methods. We extend SAFREE to various T2I backbones and T2V tasks, showcasing its flexibility and generalization. SAFREE provides a robust and adaptable safeguard for ensuring safe visual generation.
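Steering an embedding away from a concept subspace can be illustrated with a plain orthogonal projection. This is only a sketch of that one geometric step, under the assumption that "steering away" means removing the component lying in the span of the concept vectors; it is not SAFREE's implementation, which the abstract says also involves self-validating filtering and re-attention. All names and vectors below are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(embedding, concept_vectors):
    """Subtract the component of `embedding` that lies in span(concept_vectors)."""
    out = list(embedding)
    for b in orthonormal_basis(concept_vectors):
        c = dot(embedding, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy 3-D example: the concept subspace is the x-y plane.
concepts = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
prompt = [3.0, 4.0, 5.0]
filtered = remove_subspace(prompt, concepts)
print(filtered)
```

After the projection, the filtered vector is orthogonal to every concept vector, while the component outside the subspace is left untouched.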

Approximation · Optimizer · Representation · Mutually independent · Next

October 16, 2024

Triplet: Triangle Patchlet for Mesh-Based Inverse Rendering and Scene Parameters Approximation

Jiajie Yang

from arxiv, //github.com/RANDO11199/Triplet

Recent advancements in Radiance Fields have significantly improved novel-view synthesis. However, in many real-world applications, the more advanced challenge lies in inverse rendering, which seeks to derive the physical properties of a scene, including light, geometry, textures, and materials. Meshes, a traditional representation adopted by many simulation pipelines, still have limited influence on radiance fields for inverse rendering. This paper introduces a novel framework called Triangle Patchlet (abbr. Triplet), a mesh-based representation, to comprehensively approximate these scene parameters. We begin by assembling Triplets with either randomly generated points or sparse points obtained from camera calibration, where each face is treated as an independent element. Next, we simulate the physical interaction of light and optimize the scene parameters using traditional graphics rendering techniques like rasterization and ray tracing, accompanied by density control and propagation. An iterative mesh extraction process is also suggested, where we continue to optimize geometry and materials with graph-based operations. We also introduce several regularization terms to enable better generalization of material properties. Our framework can estimate light, materials, and geometry with meshes in a unified framework, without priors on any of them. Experiments demonstrate that our approach can achieve state-of-the-art visual quality while reconstructing high-quality geometry and accurate material properties.

MoDELS · Learned · contrastive · Mutually independent · Downstream task

May 26, 2021

GeomCA: Geometric Evaluation of Data Representations

Petra Poklukar,Anastasia Varava,Danica Kragic

from arxiv, ICML2021 camera ready version

Evaluating the quality of learned representations without relying on a downstream task remains one of the challenges in representation learning. In this work, we present the Geometric Component Analysis (GeomCA) algorithm, which evaluates representation spaces based on their geometric and topological properties. GeomCA can be applied to representations of any dimension, independently of the model that generated them. We demonstrate its applicability by analyzing representations obtained from a variety of scenarios, such as contrastive learning models, generative models and supervised learning models.

entity · Performer · Event extraction · MoDELS · INFORMS

December 1, 2018

One for All: Neural Joint Modeling of Entities and Events

Trung Minh Nguyen,Thien Huu Nguyen

from arxiv, Accepted at The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) (Honolulu, Hawaii, USA)

Previous work on event extraction has mainly focused on predicting event triggers and argument roles, treating entity mentions as being provided by human annotators. This is unrealistic, as entity mentions are usually predicted by existing toolkits whose errors might propagate to event trigger and argument role recognition. A few recent works have addressed this problem by jointly predicting entity mentions, event triggers and arguments. However, such work is limited to using discrete engineered features to represent contextual information for the individual tasks and their interactions. In this work, we propose a novel model to jointly predict entity mentions, event triggers and arguments based on shared hidden representations from deep learning. The experiments demonstrate the benefits of the proposed method, leading to state-of-the-art performance for event extraction.