
Sparse · MoDELS · state-of-the-art · Performer · Feature extraction ·

January 29, 2024

Synchformer: Efficient Synchronization from Sparse Cues

Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

from arxiv, Extended version of the ICASSP 24 paper. Project page: //www.robots.ox.ac.uk/~vgg/research/synchformer/ Code: //github.com/v-iashin/Synchformer

Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.

Related content

MoDELS · Precision/Accuracy · ConvNets · Neural Networks · Networking ·

March 11, 2024

Deep Learning Approaches for Human Action Recognition in Video Data

Human action recognition in videos is a critical task with significant implications for numerous applications, including surveillance, sports analytics, and healthcare. The challenge lies in creating models that are both precise in their recognition capabilities and efficient enough for practical use. This study conducts an in-depth analysis of various deep learning models to address this challenge. Utilizing a subset of the UCF101 Videos dataset, we focus on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Two-Stream ConvNets. The research reveals that while CNNs effectively capture spatial features and RNNs encode temporal sequences, Two-Stream ConvNets exhibit superior performance by integrating spatial and temporal dimensions. These insights are distilled from the evaluation metrics of accuracy, precision, recall, and F1-score. The results of this study underscore the potential of composite models in achieving robust human action recognition and suggest avenues for future research in optimizing these models for real-world deployment.
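The four evaluation metrics named above can be illustrated with a minimal sketch. The labels and predictions below are invented for illustration and are not results from the study. Accuracy is computed over all classes, while precision, recall, and F1 are computed one-vs-rest for a single chosen class, which is one common convention and an assumption here.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Dict[str, float]:
    """Overall accuracy plus one-vs-rest precision, recall, and F1 for `positive`."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard every division so empty inputs or absent classes yield 0.0.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical action labels, for illustration only.
y_true = ["run", "jump", "run", "walk", "run", "jump"]
y_pred = ["run", "run", "run", "walk", "jump", "jump"]
metrics = classification_metrics(y_true, y_pred, positive="run")
print(metrics)
```

For a multi-class dataset, the per-class values would typically be averaged across classes (macro averaging); the abstract does not say which averaging was used.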

Constraints · Language modeling · MoDELS · Category · entity ·

March 10, 2024

From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification

Fei Wang, Chao Shang, Sarthak Jain, Shuai Wang, Qiang Ning, Bonan Min, Vittorio Castelli, Yassine Benajiba, Dan Roth

User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the satisfaction rate of constraints is feasible. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute the constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and automatically collects preference labels based on their CSR. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby improving task performance. Further experiments show that the constraint-following capabilities are transferable.
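The verifier-to-preference step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the two toy verifiers, and the sample responses are all hypothetical, and CSR is assumed here to be the fraction of verifiers a response passes.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# A verifier checks one constraint and returns True when it is satisfied.
Verifier = Callable[[str], bool]


def constraint_satisfaction_rate(response: str, verifiers: Sequence[Verifier]) -> float:
    """Assumed definition: share of constraints that the response satisfies."""
    if not verifiers:
        return 1.0
    return sum(1 for verify in verifiers if verify(response)) / len(verifiers)


def preference_pairs(
    responses: Sequence[str], verifiers: Sequence[Verifier]
) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) pairs; responses with equal CSR yield no pair."""
    scored = [(r, constraint_satisfaction_rate(r, verifiers)) for r in responses]
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored, 2):
        if score_a > score_b:
            pairs.append((a, b))
        elif score_b > score_a:
            pairs.append((b, a))
    return pairs


# Hypothetical constraints: at most five words, and ends with a period.
verifiers = [
    lambda r: len(r.split()) <= 5,
    lambda r: r.endswith("."),
]

# Hypothetical sampled responses for one prompt.
responses = [
    "Paris is the capital.",
    "The capital of France is the city of Paris",
    "Paris",
]

for preferred, rejected in preference_pairs(responses, verifiers):
    print(f"{preferred!r} > {rejected!r}")
```

Pairs of this kind could then feed a ranking-based objective; the abstract does not specify which one, so that step is left out.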

Agent · Large language models · MoDELS · Processing (programming language) · Object detection ·

March 10, 2024

Reframe Anything: LLM Agent for Open World Video Reframing

Jiawang Cao, Yongliang Wu, Weiheng Chi, Wenbo Zhu, Ziyue Su, Jay Wu

from arxiv, 14 pages, 6 figures

The proliferation of mobile devices and social media has revolutionized content dissemination, with short-form video becoming increasingly prevalent. This shift has introduced the challenge of video reframing to fit various screen aspect ratios, a process that highlights the most compelling parts of a video. Traditionally, video reframing is a manual, time-consuming task requiring professional expertise, which incurs high production costs. A potential solution is to adopt machine learning models, such as video salient object detection, to automate the process. However, these methods often lack generalizability due to their reliance on specific training data. The advent of powerful large language models (LLMs) opens new avenues for AI capabilities. Building on this, we introduce Reframe Any Video Agent (RAVA), an LLM-based agent that leverages visual foundation models and human instructions to restructure visual content for video reframing. RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video. Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.

INTERACT · Recommender systems · INFORMS · Less · Language modeling ·

March 9, 2024

MuseChat: A Conversational Music Recommendation System for Videos

Zhikang Dong, Bin Chen, Xiulong Liu, Pawel Polak, Peng Zhang

Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring users' preferences. Their inability to interact with users for further refinements or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video, along with optional information including previously suggested music and the user's preferences, as input and retrieves an appropriate music track matching the context. The reasoning module, equipped with the power of a Large Language Model (Vicuna-7B) and extended to multi-modal inputs, is able to provide a reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build a large-scale dataset, conversational music recommendation for videos, that simulates a two-turn interaction between a user and a recommender based on accurate music track information. Experimental results show that MuseChat achieves significant improvements over existing video-based music retrieval methods and offers strong interpretability and interactability.

MoDELS · GIF · Guidance · Transformation · Extensibility ·

March 8, 2024

Pix2Gif: Motion-Guided Diffusion for GIF Generation

Hitesh Kandala, Jianfeng Gao, Jianwei Yang

We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We tackle this problem differently by formulating the task as an image translation problem steered by text and motion magnitude prompts, as shown in teaser fig. To ensure that the model adheres to motion guidance, we propose a new motion-guided warping module to spatially transform the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss to ensure the transformed feature map remains within the same space as the target image, ensuring content consistency and coherence. In preparation for the model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model -- it not only captures the semantic prompt from text but also the spatial ones from motion guidance. We train all our models using a single node of 16xV100 GPUs. Code, dataset and models are made public at: //hiteshk03.github.io/Pix2Gif/.

Vision · Transformation · Extensibility · Performer · INFORMS ·

March 7, 2024

AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

Kaishen Yuan, Zitong Yu, Xin Liu, Weicheng Xie, Huanjing Yue, Jingyu Yang

from arxiv, 19 pages, 6 figures

Facial Action Units (AUs) are a vital concept in the realm of affective computing, and AU detection has always been a hot research topic. Existing methods suffer from overfitting due to the use of a large number of learnable parameters on scarce AU-annotated datasets, or rely heavily on substantial additional relevant data. Parameter-Efficient Transfer Learning (PETL) provides a promising paradigm to address these challenges, but its existing methods are not designed for AU characteristics. Therefore, we investigate the PETL paradigm for AU detection, introducing AUFormer and proposing a novel Mixture-of-Knowledge Expert (MoKE) collaboration mechanism. An individual MoKE specific to a certain AU with minimal learnable parameters first integrates personalized multi-scale and correlation knowledge. Then the MoKE collaborates with other MoKEs in the expert group to obtain aggregated information and inject it into the frozen Vision Transformer (ViT) to achieve parameter-efficient AU detection. Additionally, we design a Margin-truncated Difficulty-aware Weighted Asymmetric Loss (MDWA-Loss), which can encourage the model to focus more on activated AUs, differentiate the difficulty of unactivated AUs, and discard potentially mislabeled samples. Extensive experiments from various perspectives, including within-domain, cross-domain, data efficiency, and micro-expression domain, demonstrate AUFormer's state-of-the-art performance and robust generalization abilities without relying on additional relevant data. The code for AUFormer is available at //github.com/yuankaishen2001/AUFormer.

Processing (programming language) · MoDELS · Generative AI · Automator · INTERACT ·

March 7, 2024

ProMoAI: Process Modeling with Generative AI

Humam Kourani, Alessandro Berti, Daniel Schuster, Wil M. P. van der Aalst

ProMoAI is a novel tool that leverages Large Language Models (LLMs) to automatically generate process models from textual descriptions, incorporating advanced prompt engineering, error handling, and code generation techniques. Beyond automating the generation of complex process models, ProMoAI also supports process model optimization. Users can interact with the tool by providing feedback on the generated model, which is then used for refining the process model. ProMoAI utilizes the capabilities of LLMs to offer a novel, AI-driven approach to process modeling, significantly reducing the barrier to entry for users without deep technical knowledge in process modeling.

Multimodal · Discretization · MoDELS · Large language models · Language modeling ·

March 7, 2024

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu

from arxiv, 28 pages, 16 figures, under review, work in progress

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in //junzhan2000.github.io/AnyGPT.github.io/

Generative AI · Design · Diversity · AI · Processing (programming language) ·

March 7, 2024

CreativeConnect: Supporting Reference Recombination for Graphic Design Ideation with Generative AI

DaEun Choi, Sumin Hong, Jeongeon Park, John Joon Young Chung, Juho Kim

Graphic designers often get inspiration through the recombination of references. Our formative study (N=6) reveals that graphic designers focus on conceptual keywords during this process, and want support for discovering the keywords, expanding them, and exploring diverse recombination options, while still having room for their own creativity. We propose CreativeConnect, a system with generative AI pipelines that helps users discover useful elements from the reference image using keywords, recommends relevant keywords, generates diverse recombination options with user-selected keywords, and shows recombinations as sketches with text descriptions. Our user study (N=16) showed that CreativeConnect helped users discover keywords from the reference and generate multiple ideas based on them, ultimately helping users produce more design ideas with higher self-reported creativity compared to the baseline system without generative pipelines. While CreativeConnect was shown to be effective in ideation, we discuss how it can be extended to support other types of tasks in creativity support.

Task-oriented dialogue systems · INFORMS · Graph · Networking · entity ·

August 11, 2020

KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue

Xiaoze Jiang, Siyi Du, Zengchang Qin, Yajing Sun, Jing Yu

from arxiv, Accepted by the 28th ACM International Conference on Multimedia (ACM MM 2020)

Visual dialogue is a challenging task that requires extracting implicit information from both visual (image) and textual (dialogue history) contexts. Classical approaches pay more attention to the integration of the current question, vision knowledge and text knowledge, disregarding the heterogeneous semantic gaps between the cross-modal information. Meanwhile, the concatenation operation has become the de facto standard for cross-modal information fusion, although it has limited ability in information retrieval. In this paper, we propose a novel Knowledge-Bridge Graph Network (KBGN) model that uses a graph to bridge the cross-modal semantic relations between vision and text knowledge at fine granularity, and retrieves the required knowledge via an adaptive information selection mode. Moreover, the reasoning clues for visual dialogue can be clearly drawn from intra-modal entities and inter-modal bridges. Experimental results on the VisDial v1.0 and VisDial-Q datasets demonstrate that our model outperforms existing models with state-of-the-art results.
