
Topological correctness plays a critical role in many image segmentation tasks, yet most networks are trained using pixel-wise loss functions, such as Dice, neglecting topological accuracy. Existing topology-aware methods often lack robust topological guarantees, are limited to specific use cases, or impose high computational costs. In this work, we propose a novel, graph-based framework for topologically accurate image segmentation that is both computationally efficient and generally applicable. Our method constructs a component graph that fully encodes the topological information of both the prediction and ground truth, allowing us to efficiently identify topologically critical regions and aggregate a loss based on local neighborhood information. Furthermore, we introduce a strict topological metric capturing the homotopy equivalence between the union and intersection of prediction-label pairs. We formally prove the topological guarantees of our approach and empirically validate its effectiveness on binary and multi-class datasets. Our loss demonstrates state-of-the-art performance with up to fivefold faster loss computation compared to persistent homology methods.
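The gap between pixel-wise overlap and topological accuracy can be illustrated with a small sketch. It is not the component graph or the metric proposed in the abstract. It only compares a Dice score with a connected-component count on a toy mask, and the helper names `dice` and `count_components` as well as the choice of 4-connectivity are assumptions made for illustration.

```python
from collections import deque


def dice(pred, label):
    """Dice coefficient of two binary masks given as lists of rows of 0/1."""
    overlap = sum(p & l for pr, lr in zip(pred, label) for p, l in zip(pr, lr))
    total = sum(map(sum, pred)) + sum(map(sum, label))
    return 2 * overlap / total if total else 1.0


def count_components(mask):
    """Count 4-connected foreground components with a breadth-first flood fill."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = 0
    for i in range(height):
        for j in range(width):
            if not mask[i][j] or seen[i][j]:
                continue
            components += 1
            seen[i][j] = True
            queue = deque([(i, j)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return components


# Toy example (hypothetical data): the label is one unbroken horizontal line,
# the prediction misses a single pixel in the middle of it.
label = [[0] * 9, [1] * 9, [0] * 9]
pred = [row[:] for row in label]
pred[1][4] = 0

print(f"Dice: {dice(pred, label):.3f}")  # 0.941
print(count_components(label), count_components(pred))  # 1 2
```

One missing pixel barely changes the Dice score, yet it splits a single structure into two, which is the kind of error a pixel-wise loss does not penalize strongly.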

Related content

Loss


Example · Processing (programming language) · MoDELS · Decoding · Knowledge

18 December 2024

MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing

Chuang Yang,Bingxuan Zhao,Qing Zhou,Qi Wang

The rapid progress of deep generative models (DGMs) has significantly advanced research in computer vision, providing a cost-effective alternative to acquiring vast quantities of expensive imagery. However, existing methods predominantly focus on synthesizing remote sensing (RS) images aligned with real images in a global layout view, which limits their applicability in RS image object detection (RSIOD) research. To address these challenges, we propose a multi-class and multi-scale object image generator based on DGMs, termed MMO-IG, designed to generate RS images with supervised object labels from global and local aspects simultaneously. Specifically, from the local view, MMO-IG encodes various RS instances using an iso-spacing instance map (ISIM). During the generation process, it decodes each instance region with its iso-spacing value in the ISIM (corresponding to both background and foreground instances) to produce RS images through the denoising process of diffusion models. Considering the complex interdependencies among multi-class and multi-scale objects (MMOs), we construct a spatial-cross dependency knowledge graph (SCDKG). This ensures a realistic and reliable multidirectional distribution among MMOs for region embedding, thereby reducing the discrepancy between source and target domains. In addition, we propose a structured object distribution instruction (SODI) that, together with the SCDKG-based ISIM, guides the generation of synthesized RS image content from a global aspect. Extensive experimental results demonstrate that our MMO-IG exhibits superior generation capabilities for RS images with dense MMO-supervised labels, and RS detectors pre-trained with MMO-IG show excellent performance on real-world datasets.

Robot · Diversity · EASE · Reviewer · Understandability

18 December 2024

TelePhantom: A User-Friendly Teleoperation System with Virtual Assistance for Enhanced Effectiveness

Jingxiang Guo,Jiayu Luo,Zhenyu Wei,Yiwen Hou,Zhixuan Xu,Xiaoyi Lin,Chongkai Gao,Lin Shao

from arxiv, Submitted to RA-L

Dexterous manipulation is a critical area of robotics. In this field, teleoperation faces three key challenges: user-friendliness for novices, safety assurance, and transferability across different platforms. Collecting real-robot dexterous manipulation data through teleoperation to train robots has shown impressive results on diverse tasks. However, the morphological differences between human and robot hands make the action mapping hard for new users to understand and raise potential safety concerns during operation. To address these limitations, we introduce TelePhantom. This teleoperation system offers real-time visual feedback on robot actions based on human user inputs, with a total hardware cost of less than $1,000. TelePhantom allows the user to see a virtual robot that represents the outcome of the user's next movement. By enabling flexible switching between command visualization and actual execution, this system helps new users learn how to demonstrate quickly and safely. We demonstrate its superiority over other teleoperation systems across five tasks, emphasize its ease of use, and highlight its ease of deployment across diverse input sensors and robotic platforms. We will release our code and a deployment document on our website: //telephantom.github.io/.

Controller · Learning · Smoothing · Extensibility · Design

17 December 2024

MotionBridge: Dynamic Video Inbetweening with Flexible Controls

Maham Tanveer,Yang Zhou,Simon Niklaus,Ali Mahdavi Amiri,Hao Zhang,Krishna Kumar Singh,Nanxuan Zhao

By generating plausible and smooth transitions between two image frames, video inbetweening is an essential tool for video editing and long video synthesis. Traditional methods cannot generate complex, large motions. While recent video generation techniques are powerful in creating high-quality results, they often lack fine control over the details of intermediate frames, which can lead to results that do not match the creator's intent. We introduce MotionBridge, a unified video inbetweening framework that allows flexible controls, including trajectory strokes, keyframes, masks, guide pixels, and text. However, learning such multi-modal controls in a unified framework is a challenging task. We thus design two generators to extract the control signal faithfully and encode features through dual-branch embedders to resolve ambiguities. We further introduce a curriculum training strategy to smoothly learn various controls. Extensive qualitative and quantitative experiments demonstrate that such multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

MoDELS · Controller · Diversity · Projection · Output

17 December 2024

Move-in-2D: 2D-Conditioned Human Motion Generation

Hsin-Ping Huang,Yang Zhou,Jui-Hsien Wang,Difan Liu,Feng Liu,Ming-Hsuan Yang,Zhan Xu

from arxiv, Project page: //hhsinping.github.io/Move-in-2D/

Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Current approaches often reuse motion extracted from other videos, which restricts applications to specific motion types and global scene matching. We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach utilizes a diffusion model that accepts both a scene image and a text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities, annotating each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequence improves human motion quality in video synthesis tasks.

Graph · Knowledge · Knowledge graph · Reviewer · SQL

17 December 2024

SynthCypher: A Fully Synthetic Data Generation Framework for Text-to-Cypher Querying in Knowledge Graphs

Aman Tiwari,Shiva Krishna Reddy Malay,Vikas Yadav,Masoud Hashemi,Sathwik Tejaswi Madhusudhan

Cypher, the query language for Neo4j graph databases, plays a critical role in enabling graph-based analytics and data exploration. While substantial research has been dedicated to natural language to SQL query generation (Text2SQL), the analogous problem for graph databases, referred to as Text2Cypher, remains underexplored. In this work, we introduce SynthCypher, a fully synthetic and automated data generation pipeline designed to address this gap. SynthCypher employs a novel LLM-Supervised Generation-Verification framework, ensuring syntactically and semantically correct Cypher queries across diverse domains and query complexities. Using this pipeline, we create the SynthCypher Dataset, a large-scale benchmark containing 29.8k Text2Cypher instances. Fine-tuning open-source large language models (LLMs), including LLaMa-3.1-8B, Mistral-7B, and QWEN-7B, on SynthCypher yields significant performance improvements of up to 40% on the Text2Cypher test set and 30% on the SPIDER benchmark adapted for graph databases. This work demonstrates that high-quality synthetic data can effectively advance the state-of-the-art in Text2Cypher tasks.

Dataset · Multimodal · MoDELS · Modality · Correlation coefficient

17 December 2024

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

Xiaowei Chi,Yatian Wang,Aosong Cheng,Pengjun Fang,Zeyue Tian,Yingqing He,Zhaoyang Liu,Xingqun Qi,Jiahao Pan,Rongyu Zhang,Mengfei Li,Ruibin Yuan,Yanbing Jiang,Wei Xue,Wenhan Luo,Qifeng Chen,Shanghang Zhang,Qifeng Liu,Yike Guo

from arxiv, 15 Pages. Dataset report

Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames and treat audio as weakly related information. They usually overlook the potential of the inherent audio-visual correlation, leading to monotonous annotation within each modality instead of comprehensive and precise descriptions. This oversight makes many cross-modality studies difficult. To fill this gap, we present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions, and 2M high-quality clips with multimodal captions. Trailers preview full-length video works and integrate context, visual frames, and background music. In particular, trailers have two main advantages: (1) the topics are diverse, and the content is of various types, e.g., film, news, and gaming; (2) the corresponding background music is custom-designed, making it more coherent with the visual context. Building on these insights, we propose a systematic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos. Here, to ensure the caption retains the music perspective while preserving the authority of visual context, we leverage an advanced LLM to merge all annotations adaptively. In this fashion, our MMTrail dataset potentially paves the path for fine-grained large multimodal-language model training. In experiments, we provide evaluation metrics and benchmark results on our dataset, demonstrating the high quality of our annotation and its effectiveness for model training.

Same · Robustness · Design · Dataset · Processing (programming language)

15 December 2024

VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping

Hao Shao,Shulun Wang,Yang Zhou,Guanglu Song,Dailan He,Shuo Qin,Zhuofan Zong,Bingqi Ma,Yu Liu,Hongsheng Li

from arxiv, project page: //hao-shao.com/projects/vividface.html

Video face swapping is becoming increasingly popular across various applications, yet existing methods primarily focus on static images and struggle with video face swapping because of the demands of temporal consistency and complex scenarios. In this paper, we present the first diffusion-based framework specifically designed for video face swapping. Our approach introduces a novel image-video hybrid training framework that leverages both abundant static image data and temporal video sequences, addressing the inherent limitations of video-only training. The framework incorporates a specially designed diffusion model coupled with a VidFaceVAE that effectively processes both types of data to better maintain temporal coherence of the generated videos. To further disentangle identity and pose features, we construct the Attribute-Identity Disentanglement Triplet (AIDT) Dataset, where each triplet has three face images, with two images sharing the same pose and two sharing the same identity. Enhanced with a comprehensive occlusion augmentation, this dataset also improves robustness against occlusions. Additionally, we integrate 3D reconstruction techniques as input conditioning to our network for handling large pose variations. Extensive experiments demonstrate that our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods, while requiring fewer inference steps. Our approach effectively mitigates key challenges in video face swapping, including temporal flickering, identity preservation, and robustness to occlusions and pose variations.

AI · Processing (programming language) · MoDELS · Design · Neural Networks

13 December 2024

Neuro-Photonix: Enabling Near-Sensor Neuro-Symbolic AI Computing on Silicon Photonics Substrate

Deniz Najafi,Hamza Errahmouni Barkam,Mehrdad Morsali,SungHeon Jeong,Tamoghno Das,Arman Roohi,Mahdi Nikdast,Mohsen Imani,Shaahin Angizi

from arxiv, 12 pages, 15 figures

Neuro-symbolic Artificial Intelligence (AI) models, blending neural networks with symbolic AI, have facilitated transparent reasoning and context understanding without the need for explicit rule-based programming. However, implementing such models in Internet of Things (IoT) sensor nodes presents hurdles due to computational constraints and intricacies. In this work, for the first time, we propose a near-sensor neuro-symbolic AI computing accelerator named Neuro-Photonix for vision applications. Neuro-Photonix processes neural dynamic computations on analog data while inherently supporting granularity-controllable convolution operations through the efficient use of photonic devices. Additionally, an innovative, low-cost ADC that works seamlessly with photonic technology removes the need for costly ADCs. Moreover, Neuro-Photonix facilitates the generation of HyperDimensional (HD) vectors for HD-based symbolic AI computing. This approach allows the proposed design to substantially reduce the energy consumption and latency of conversion, transmission, and processing compared with the established cloud-centric architecture and recently designed accelerators. Our device-to-architecture results show that Neuro-Photonix achieves 30 GOPS/W and reduces power consumption on neural dynamics by an average factor of 20.8 and 4.1 compared to ASIC baselines and photonic accelerators, respectively, while preserving accuracy.

3D · Robustness · Controller · Pruning · MoDELS

13 December 2024

SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video

Jongmin Park,Minh-Quan Viet Bui,Juan Luis Gonzalez Bello,Jaeho Moon,Jihyong Oh,Munchurl Kim

from arxiv, The first two authors contributed equally to this work. The last two authors advised this work equally. Please visit our project page at //kaist-viclab.github.io/splinegs-site/

Synthesizing novel views from in-the-wild monocular videos is challenging due to scene dynamics and the lack of multi-view cues. To address this, we propose SplineGS, a COLMAP-free dynamic 3D Gaussian Splatting (3DGS) framework for high-quality reconstruction and fast rendering from monocular videos. At its core is a novel Motion-Adaptive Spline (MAS) method, which represents continuous dynamic 3D Gaussian trajectories using cubic Hermite splines with a small number of control points. For MAS, we introduce a Motion-Adaptive Control points Pruning (MACP) method to model the deformation of each dynamic 3D Gaussian across varying motions, progressively pruning control points while maintaining dynamic modeling integrity. Additionally, we present a joint optimization strategy for camera parameter estimation and 3D Gaussian attributes, leveraging photometric and geometric consistency. This eliminates the need for Structure-from-Motion preprocessing and enhances SplineGS's robustness in real-world conditions. Experiments show that SplineGS significantly outperforms state-of-the-art methods in novel view synthesis quality for dynamic scenes from monocular videos, while rendering thousands of times faster.
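As a rough illustration of how a cubic Hermite spline turns a few control points into a continuous trajectory, the sketch below evaluates a 3D position at an arbitrary time. The abstract only states that cubic Hermite splines with a small number of control points are used. The uniform time spacing, the finite-difference tangent rule, and the function names are all assumptions made for this example, not details of SplineGS.

```python
def hermite_segment(p0, p1, m0, m1, t):
    """Evaluate one cubic Hermite segment at t in [0, 1].

    p0, p1 are the end points and m0, m1 the tangents at those points.
    """
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1   # weight of p0
    h10 = t3 - 2 * t2 + t       # weight of m0
    h01 = -2 * t3 + 3 * t2      # weight of p1
    h11 = t3 - t2               # weight of m1
    return tuple(
        h00 * a + h10 * c + h01 * b + h11 * d
        for a, b, c, d in zip(p0, p1, m0, m1)
    )


def finite_difference_tangents(points):
    """Assumed tangent rule: central differences, one-sided at the two ends."""
    last = len(points) - 1
    tangents = []
    for i in range(len(points)):
        lo, hi = max(i - 1, 0), min(i + 1, last)
        tangents.append(
            tuple((b - a) / (hi - lo) for a, b in zip(points[lo], points[hi]))
        )
    return tangents


def evaluate_trajectory(points, u):
    """Position at time u in [0, len(points) - 1].

    Control point i is assumed to sit at integer time i. At least two
    control points are required.
    """
    tangents = finite_difference_tangents(points)
    i = min(int(u), len(points) - 2)  # index of the segment containing u
    return hermite_segment(
        points[i], points[i + 1], tangents[i], tangents[i + 1], u - i
    )


# Hypothetical control points for the center of one moving Gaussian.
control_points = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 2.0, 1.0)]
for u in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(u, evaluate_trajectory(control_points, u))
```

The curve passes through every control point exactly, so removing a control point changes the trajectory only locally, which is one reason a spline with few control points is a compact motion representation.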

Pyramid · MoDELS · Extensibility · state-of-the-art · Performer

1 December 2022

Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis

Wan-Cyuan Fan,Yen-Chun Chen,Dongdong Chen,Yu Cheng,Lu Yuan,Yu-Chiang Frank Wang

from arxiv, AAAI 2023

Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, properly describing both global image structures and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output. During this multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditioned and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.