We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark at //github.com/wayveai/LingoQA as an evaluation platform for vision-language models in autonomous driving.
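As a concrete illustration of how a learned judge is compared against human evaluation, the sketch below computes the Spearman rank correlation between a judge's truthfulness scores and human ratings. The scores shown are placeholders; in practice they would come from the benchmark's annotated answers.

```python
# Hypothetical sketch: correlate an automatic judge's scores with human
# ratings, as in the Spearman comparison reported for Lingo-Judge.
from scipy.stats import spearmanr

judge_scores = [0.9, 0.2, 0.7, 0.4, 0.95]  # placeholder judge outputs
human_scores = [1.0, 0.0, 1.0, 0.5, 1.0]   # placeholder averaged human ratings

rho, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```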

Related Content

Question answering (QA) is the task of using computers to automatically answer questions posed by users in order to satisfy their information needs. Unlike existing search engines, a QA system is an advanced form of information service: rather than a list of documents ranked by keyword matching, it returns a precise natural-language answer. In recent years, with the rapid development of artificial intelligence, automatic question answering has become a research direction that attracts wide attention and holds broad prospects.
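As a minimal working example of the extractive flavor of this task, the sketch below uses the Hugging Face transformers question-answering pipeline; the checkpoint named is one public default, and any extractive QA model would do.

```python
# Minimal extractive QA sketch using the transformers pipeline API.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
context = ("Question answering systems return a precise natural-language answer "
           "rather than a list of documents ranked by keyword matching.")
result = qa(question="What do QA systems return?", context=context)
print(result["answer"], round(result["score"], 3))
```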


Endoscopic surgery relies on two-dimensional views, posing challenges for surgeons in depth perception and instrument manipulation. While Monocular Visual Simultaneous Localization and Mapping (MVSLAM) has emerged as a promising solution, its implementation in endoscopic procedures faces significant challenges due to hardware limitations, such as the use of a monocular camera and the absence of odometry sensors. This study presents BodySLAM, a robust deep learning-based MVSLAM approach that addresses these challenges through three key components: CycleVO, a novel unsupervised monocular pose estimation module; the integration of the state-of-the-art Zoe architecture for monocular depth estimation; and a 3D reconstruction module creating a coherent surgical map. The approach is rigorously evaluated using three publicly available datasets (Hamlyn, EndoSLAM, and SCARED) spanning laparoscopy, gastroscopy, and colonoscopy scenarios, and benchmarked against four state-of-the-art methods. Results demonstrate that CycleVO achieves competitive performance with the lowest inference time among pose estimation methods while maintaining robust generalization capabilities, and that Zoe significantly outperforms existing algorithms for depth estimation in endoscopy. BodySLAM's strong performance across diverse endoscopic scenarios demonstrates its potential as a viable MVSLAM solution for endoscopic applications.
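The geometric core that any depth-plus-pose mapping pipeline of this kind relies on can be sketched in a few lines: back-project a predicted depth map through the camera intrinsics and move the points into a common world frame using the estimated pose. This is generic multi-view geometry, not BodySLAM's released code.

```python
# Back-project a predicted depth map to 3D and transform into the world frame.
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """depth: HxW metric depth; K: 3x3 intrinsics; T_wc: 4x4 camera-to-world pose."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                  # normalized camera rays
    pts_cam = rays * depth.reshape(-1, 1)            # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ T_wc.T)[:, :3]                   # world-frame points

# Toy example: constant 10 cm depth, identity pose.
K = np.array([[525.0, 0, 320], [0, 525.0, 240], [0, 0, 1]])
cloud = backproject(np.full((480, 640), 0.1), K, np.eye(4))
print(cloud.shape)  # (307200, 3)
```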

In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied contexts. This paper introduces the OneDiff model, a novel generalist approach that utilizes a robust vision-language model architecture, integrating a siamese image encoder with a Visual Delta Module. This configuration allows for the precise detection and articulation of fine-grained differences between image pairs. OneDiff is trained through a dual-phase strategy, encompassing Coupled Sample Training and multi-task learning across a diverse array of data types, supported by our newly developed DiffCap Dataset. This dataset merges real-world and synthetic data, enhancing the training process and bolstering the model's robustness. Extensive testing on diverse IDC benchmarks, such as Spot-the-Diff, Image-Editing-Request, and Birds-to-Words, shows that OneDiff consistently outperforms existing state-of-the-art models in accuracy and adaptability, achieving improvements of up to 97% CIDEr points on average. By setting a new benchmark in IDC, OneDiff paves the way for more versatile and effective applications in detecting and describing visual differences. The code, models, and data will be made publicly available.
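A rough shape of the siamese-encoder-plus-delta idea is sketched below; the layer choices and the way the difference signal is formed are my assumptions, not OneDiff's released architecture.

```python
# Hedged sketch: a shared (siamese) encoder plus a stand-in delta module.
import torch
import torch.nn as nn

class SiameseDelta(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.Flatten(2))
        self.delta_proj = nn.Linear(2 * dim, dim)  # stand-in for a Visual Delta Module

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        fa = self.encoder(img_a).mean(-1)          # same weights encode both images
        fb = self.encoder(img_b).mean(-1)
        # combine per-image features with their explicit difference signal
        return self.delta_proj(torch.cat([fa - fb, fa + fb], dim=-1))

feat = SiameseDelta()(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 256])
```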

Integration of diverse visual prompts like clicks, scribbles, and boxes in interactive image segmentation significantly facilitates user interaction and improves interaction efficiency. However, existing studies primarily encode only the position or pixel regions of prompts, without considering the contextual areas around them, resulting in insufficient prompt feedback that limits segmentation performance. To tackle this problem, this paper proposes a simple yet effective Probabilistic Visual Prompt Unified Transformer (PVPUFormer) for interactive image segmentation, which allows users to flexibly input diverse visual prompts and applies probabilistic prompt encoding with feature post-processing to extract rich, robust prompt features. Specifically, we first propose a Probabilistic Prompt-unified Encoder (PPuE) that generates a unified one-dimensional vector by exploring both prompt and non-prompt contextual information, offering richer feedback cues. On this basis, we further present a Prompt-to-Pixel Contrastive (P$^2$C) loss to accurately align prompt and pixel features, bridging the representation gap between them and providing consistent feature representations for mask prediction. Moreover, our approach designs a Dual-cross Merging Attention (DMA) module to implement bidirectional feature interaction between image and prompt features, strengthening the features used for prediction. Comprehensive experiments on several challenging datasets demonstrate that the proposed components achieve consistent improvements, yielding state-of-the-art interactive segmentation performance. Our code is available at //github.com/XuZhang1211/PVPUFormer.
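The prompt-to-pixel contrastive idea can be illustrated with a standard InfoNCE-style objective over matched prompt/pixel feature pairs; the paper's exact P$^2$C formulation may differ.

```python
# InfoNCE-style sketch of a prompt-to-pixel contrastive objective.
import torch
import torch.nn.functional as F

def prompt_pixel_contrastive(prompt_feats, pixel_feats, temperature=0.07):
    """prompt_feats: (B, D); pixel_feats: (B, D), matched pairs along the batch."""
    p = F.normalize(prompt_feats, dim=-1)
    x = F.normalize(pixel_feats, dim=-1)
    logits = p @ x.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(p.size(0))       # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

loss = prompt_pixel_contrastive(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```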

Training transformer models requires substantial GPU compute and memory resources. In homogeneous clusters, distributed strategies allocate resources evenly, but this approach is inefficient for heterogeneous clusters, where GPUs differ in power and memory. As high-end GPUs are costly and limited in availability, heterogeneous clusters with diverse GPU types are becoming more common. Existing methods attempt to balance compute across GPUs based on capacity but often underutilize compute due to memory constraints. We present Cephalo, a system that optimizes compute and memory usage by decoupling compute distribution from training state assignment. Cephalo outperforms state-of-the-art methods by achieving significantly higher training throughput while supporting larger models and batch sizes.
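The core allocation idea, splitting a global batch by compute capacity and then clamping each GPU to its memory limit, can be sketched as follows. Cephalo's actual policy is more sophisticated; treat this as a toy illustration.

```python
# Toy allocator: compute-proportional batch split with per-GPU memory clamps.
def allocate_microbatches(total: int, tflops: list[float], max_batch: list[int]) -> list[int]:
    # First split the global batch proportionally to compute capacity...
    shares = [round(total * t / sum(tflops)) for t in tflops]
    # ...then clamp each GPU to what its memory allows and spill the remainder
    # to the GPUs with the most remaining headroom.
    alloc = [min(s, m) for s, m in zip(shares, max_batch)]
    leftover = total - sum(alloc)
    for i in sorted(range(len(alloc)), key=lambda i: max_batch[i] - alloc[i], reverse=True):
        give = min(leftover, max_batch[i] - alloc[i])
        alloc[i] += give
        leftover -= give
    return alloc

# One big GPU (312 TFLOPS, room for 32 samples) alongside two smaller ones.
print(allocate_microbatches(48, [312.0, 65.0, 65.0], [32, 16, 16]))  # [32, 9, 7]
```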

We present ConceptFactory, a novel approach that facilitates more efficient annotation of 3D object knowledge by recognizing 3D objects through generalized concepts (i.e., object conceptualization), aiming to promote machine intelligence that learns comprehensive object knowledge from both vision and robotics perspectives. This idea originates from findings in human cognition research that the perceptual recognition of objects can be explained as a process of arranging generalized geometric components (e.g., cuboids and cylinders). ConceptFactory consists of two critical parts: i) the ConceptFactory Suite, a unified toolbox that adopts the Standard Concept Template Library (STL-C) to drive a web-based platform for object conceptualization, and ii) the ConceptFactory Asset, a large collection of conceptualized objects acquired using the ConceptFactory Suite. Our approach enables researchers to effortlessly acquire or customize extensive varieties of object knowledge to comprehensively study different object understanding tasks. We validate our idea on a wide range of benchmark tasks from both vision and robotics aspects with state-of-the-art algorithms, demonstrating the high quality and versatility of annotations provided by our approach. Our website is available at //apeirony.github.io/ConceptFactory.
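One plausible data model for such conceptualization, an object described as an arrangement of parameterized geometric primitives, is sketched below; the field names and structure are illustrative assumptions rather than the STL-C schema.

```python
# Illustrative (assumed) data model for primitive-based object concepts.
from dataclasses import dataclass, field

@dataclass
class Primitive:
    kind: str                       # e.g. "cuboid" or "cylinder"
    size: tuple[float, ...]         # dimensions in meters
    offset: tuple[float, float, float] = (0.0, 0.0, 0.0)  # rotation omitted

@dataclass
class ObjectConcept:
    name: str
    parts: list[Primitive] = field(default_factory=list)

mug = ObjectConcept("mug", [
    Primitive("cylinder", (0.04, 0.10)),                          # body: radius, height
    Primitive("cuboid", (0.01, 0.03, 0.06), (0.05, 0.0, 0.05)),   # handle
])
print(len(mug.parts))  # 2
```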

The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics like ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE incorporates information from the audio signal as well as from the predicted and reference captions, and weights it with a fluency penalty. Our experiments demonstrate MACE's superior performance in predicting human quality judgments compared to traditional metrics. Specifically, MACE achieves a 3.28% and 4.36% relative accuracy improvement over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets, respectively. Moreover, it significantly outperforms all previous metrics on the audio captioning evaluation task. The metric is open-sourced at //github.com/satvik-dixit/mace
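The overall shape of such a multimodal metric can be sketched as a weighted blend of audio-candidate and reference-candidate embedding similarity, with a fluency penalty applied afterward. The weights and penalty rule below are placeholders rather than MACE's published formulation, and the sketch assumes audio and text embeddings live in a shared space (CLAP-style).

```python
# Schematic multimodal caption score: blend two similarities, penalize disfluency.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def multimodal_score(audio_emb, cand_emb, ref_emb, fluent: bool, alpha=0.5, penalty=0.9):
    sim = alpha * cosine(audio_emb, cand_emb) + (1 - alpha) * cosine(ref_emb, cand_emb)
    return sim if fluent else sim * (1 - penalty)   # heavily discount disfluent captions

rng = np.random.default_rng(0)
a, c, r = rng.normal(size=(3, 512))                 # stand-in embeddings
print(multimodal_score(a, c, r, fluent=True))
```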

In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute them, playing a central role in autonomous agents. However, there is a lack of systematic and standardized benchmarks to promote the development of LLMs in task automation. To address this, we introduce TaskBench, a comprehensive framework to evaluate the capability of LLMs in task automation. Specifically, task automation can be divided into three critical stages: task decomposition, tool selection, and parameter prediction. To tackle the complexities inherent in these stages, we introduce the concept of Tool Graph to represent decomposed tasks and adopt a back-instruct method to generate high-quality user instructions. We propose TaskEval, a multi-faceted evaluation methodology that assesses LLM performance across these three stages. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation. Experimental results demonstrate that TaskBench effectively reflects the capabilities of various LLMs in task automation. It provides insights into model performance across different task complexities and domains, pushing the boundaries of what current models can achieve. TaskBench offers a scalable, adaptable, and reliable benchmark for advancing LLM-based autonomous agents.
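A Tool Graph can be represented minimally as a dependency mapping over tools, after which checking whether a predicted invocation order is valid reduces to a topological-order test. The tool names below are invented for illustration.

```python
# Minimal Tool Graph sketch: tools as nodes, data dependencies as edges.
from graphlib import TopologicalSorter

tool_graph = {                      # tool -> tools whose output it consumes
    "summarize": {"transcribe"},
    "translate": {"summarize"},
    "transcribe": set(),
}

valid_order = list(TopologicalSorter(tool_graph).static_order())
print(valid_order)                  # ['transcribe', 'summarize', 'translate']

def respects_dependencies(order: list[str]) -> bool:
    """True if every tool appears after all of its dependencies."""
    pos = {t: i for i, t in enumerate(order)}
    return all(pos[dep] < pos[t] for t, deps in tool_graph.items() for dep in deps)

print(respects_dependencies(["transcribe", "summarize", "translate"]))  # True
```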

We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation. By incorporating a Lightning T2I branch with a standard diffusion one, PuLID introduces both a contrastive alignment loss and an accurate ID loss, minimizing disruption to the original model and ensuring high ID fidelity. Experiments show that PuLID achieves superior performance in both ID fidelity and editability. Another attractive property of PuLID is that the image elements (e.g., background, lighting, composition, and style) before and after the ID insertion are kept as consistent as possible. Code and models are available at //github.com/ToTheBeginning/PuLID
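One common form of an identity loss, cosine distance between face-recognition embeddings of the generated and reference images, is sketched below. Whether PuLID's accurate ID loss takes exactly this form is an assumption here; the embeddings would come from a frozen face encoder such as ArcFace.

```python
# Assumed form of an ID loss over face-recognition embeddings.
import torch
import torch.nn.functional as F

def id_loss(gen_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """gen_emb, ref_emb: (B, D) embeddings from a frozen face encoder."""
    return (1 - F.cosine_similarity(gen_emb, ref_emb, dim=-1)).mean()

print(id_loss(torch.randn(4, 512), torch.randn(4, 512)).item())
```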

We present CoDEx, a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false. To characterize CoDEx, we contribute thorough empirical analyses and benchmarking experiments. First, we analyze each CoDEx dataset in terms of logical relation patterns. Next, we report baseline link prediction and triple classification results on CoDEx for five extensively tuned embedding models. Finally, we differentiate CoDEx from the popular FB15K-237 knowledge graph completion dataset by showing that CoDEx covers more diverse and interpretable content, and is a more difficult link prediction benchmark. Data, code, and pretrained models are available at //bit.ly/2EPbrJs.
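Triple classification with hard negatives reduces to scoring candidate (head, relation, tail) triples and thresholding. The TransE-style scorer below uses random embeddings purely to make the sketch runnable; a benchmarked model would supply trained ones, with the threshold tuned on validation data.

```python
# Sketch of triple classification against a plausible hard negative.
import numpy as np

rng = np.random.default_rng(0)
ent = {e: rng.normal(size=32) for e in ["paris", "france", "berlin"]}
rel = {"capital_of": rng.normal(size=32)}

def score(h: str, r: str, t: str) -> float:
    # TransE-style score: higher (less negative) means more plausible
    return -float(np.linalg.norm(ent[h] + rel[r] - ent[t]))

triples = [("paris", "capital_of", "france", True),
           ("berlin", "capital_of", "france", False)]  # plausible but false
threshold = -10.0                                      # tuned on validation data
preds = [score(h, r, t) > threshold for h, r, t, _ in triples]
print(preds)
```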

We study the problem of learning to reason in large-scale knowledge graphs (KGs). More specifically, we describe a novel reinforcement learning framework for learning multi-hop relational paths: we use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path. In contrast to prior work, our approach includes a reward function that takes accuracy, diversity, and efficiency into consideration. Experimentally, we show that our proposed method outperforms a path-ranking based algorithm and knowledge graph embedding methods on the Freebase and Never-Ending Language Learning datasets.
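The composite reward can be sketched directly: an accuracy term for reaching the target entity, an efficiency term that favors shorter paths, and a diversity term that penalizes similarity to previously found paths. The weights and the cosine-based diversity measure below are illustrative choices rather than the paper's exact constants.

```python
# Sketch of a path reward balancing accuracy, efficiency, and diversity.
import numpy as np

def path_reward(reached_target: bool, path_len: int, path_emb, prev_embs,
                w=(1.0, 0.8, 0.2)):
    r_accuracy = 1.0 if reached_target else -1.0
    r_efficiency = 1.0 / path_len                 # shorter paths score higher
    if prev_embs:                                 # discourage redundant paths
        sims = [path_emb @ p / (np.linalg.norm(path_emb) * np.linalg.norm(p))
                for p in prev_embs]
        r_diversity = -float(np.mean(sims))
    else:
        r_diversity = 0.0
    return w[0] * r_accuracy + w[1] * r_efficiency + w[2] * r_diversity

print(path_reward(True, 3, np.ones(8), [np.ones(8)]))
```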
