亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

·

圖像檢索 · 秩 · Learning · 多樣性 · 學習的學習 ·

2023 年 8 月 16 日

Ranking-aware Uncertainty for Text-guided Image Retrieval

Junyang Chen,Hanjiang Lai

Text-guided image retrieval is to incorporate conditional text to better capture users' intent. Traditionally, the existing methods focus on minimizing the embedding distances between the source inputs and the targeted image, using the provided triplets $\langle$source image, source text, target image$\rangle$. However, such triplet optimization may limit the learned retrieval model to capture more detailed ranking information, e.g., the triplets are one-to-one correspondences and they fail to account for many-to-many correspondences arising from semantic diversity in feedback languages and images. To capture more ranking information, we propose a novel ranking-aware uncertainty approach to model many-to-many correspondences by only using the provided triplets. We introduce uncertainty learning to learn the stochastic ranking list of features. Specifically, our approach mainly comprises three components: (1) In-sample uncertainty, which aims to capture semantic diversity using a Gaussian distribution derived from both combined and target features; (2) Cross-sample uncertainty, which further mines the ranking information from other samples' distributions; and (3) Distribution regularization, which aligns the distributional representations of source inputs and targeted image. Compared to the existing state-of-the-art methods, our proposed method achieves significant results on two public datasets for composed image retrieval.

相關內容

圖像檢索

從20世紀70年代開始，有關圖像檢索的研究就已開始，當時主要是基于文本的圖像檢索技術（Text-based Image Retrieval，簡稱TBIR），利用文本描述的方式描述圖像的特征，如繪畫作品的作者、年代、流派、尺寸等。到90年代以后，出現了對圖像的內容語義，如圖像的顏色、紋理、布局等進行分析和檢索的圖像檢索技術，即基于內容的圖像檢索（Content-based Image Retrieval，簡稱CBIR）技術。CBIR屬于基于內容檢索（Content-based Retrieval，簡稱CBR）的一種，CBR中還包括對動態視頻、音頻等其它形式多媒體信息的檢索技術。

知識薈萃

精品入門和進階教程、論文和代碼整理等

更多

查看相關VIP內容、論文、資訊等

INFORMS · MoDELS · Performer · 語言模型化 · 端到端 ·

2023 年 10 月 4 日

DOMINO: A Dual-System for Multi-step Visual Language Reasoning

Peifang Wang,Olga Golovneva,Armen Aghajanyan,Xiang Ren,Muhao Chen,Asli Celikyilmaz,Maryam Fazel-Zarandi

Visual language reasoning requires a system to extract text or numbers from information-dense images like charts or plots and perform logical or arithmetic reasoning to arrive at an answer. To tackle this task, existing work relies on either (1) an end-to-end vision-language model trained on a large amount of data, or (2) a two-stage pipeline where a captioning model converts the image into text that is further read by another large language model to deduce the answer. However, the former approach forces the model to answer a complex question with one single step, and the latter approach is prone to inaccurate or distracting information in the converted text that can confuse the language model. In this work, we propose a dual-system for multi-step multimodal reasoning, which consists of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning. Given an input, System-2 breaks down the question into atomic sub-steps, each guiding System-1 to extract the information required for reasoning from the image. Experiments on chart and plot datasets show that our method with a pre-trained System-2 module performs competitively compared to prior work on in- and out-of-distribution data. By fine-tuning the System-2 module (LLaMA-2 70B) on only a small amount of data on multi-step reasoning, the accuracy of our method is further improved and surpasses the best fully-supervised end-to-end approach by 5.7% and a pipeline approach with FlanPaLM (540B) by 7.5% on a challenging dataset with human-authored questions.

Learning · MoDELS · 泛函 · 近似 · 易處理的 ·

2023 年 10 月 4 日

Learning Object-Centric Neural Scattering Functions for Free-Viewpoint Relighting and Scene Composition

Hong-Xing Yu,Michelle Guo,Alireza Fathi,Yen-Yu Chang,Eric Ryan Chan,Ruohan Gao,Thomas Funkhouser,Jiajun Wu

from arxiv, Journal extension of arXiv:2012.08503 (TMLR 2023). The first two authors contributed equally to this work. Project page: //kovenyu.com/osf/

Photorealistic object appearance modeling from 2D images is a constant topic in vision and graphics. While neural implicit methods (such as Neural Radiance Fields) have shown high-fidelity view synthesis results, they cannot relight the captured objects. More recent neural inverse rendering approaches have enabled object relighting, but they represent surface properties as simple BRDFs, and therefore cannot handle translucent objects. We propose Object-Centric Neural Scattering Functions (OSFs) for learning to reconstruct object appearance from only images. OSFs not only support free-viewpoint object relighting, but also can model both opaque and translucent objects. While accurately modeling subsurface light transport for translucent objects can be highly complex and even intractable for neural methods, OSFs learn to approximate the radiance transfer from a distant light to an outgoing direction at any spatial location. This approximation avoids explicitly modeling complex subsurface scattering, making learning a neural implicit model tractable. Experiments on real and synthetic data show that OSFs accurately reconstruct appearances for both opaque and translucent objects, allowing faithful free-viewpoint relighting as well as scene composition.

模態 · 數據集 · 可理解性 · 模型評估 · 特征空間 ·

2023 年 10 月 3 日

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu,Bin Lin,Munan Ning,Yang Yan,Jiaxi Cui,Wang HongFa,Yatian Pang,Wenhao Jiang,Junwu Zhang,Zongwei Li,Cai Wan Zhang,Zhifeng Li,Wei Liu,Li Yuan

from arxiv, Under review as a conference paper at ICLR 2024

The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 1.2% R@1 on the MSR-VTT dataset with only 15% of the parameters in the zero-shot video-text retrieval, validating the high quality of our dataset. Beyond this, our LanguageBind has achieved great improvement in the zero-shot video, audio, depth, and infrared understanding tasks. For instance, on the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind-huge with 23.8% and 11.1% top-1 accuracy.

3D · MoDELS · Learning · 得分 · 蒸餾 ·

2023 年 10 月 2 日

MVDream: Multi-view Diffusion for 3D Generation

Yichun Shi,Peng Wang,Jianglong Ye,Mai Long,Kejie Li,Xiao Yang

from arxiv, Our project page is //MV-Dream.github.io

We introduce MVDream, a multi-view diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view prior can serve as a generalizable 3D prior that is agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation.

MoDELS · Prompt · 通用動力公司 · Guidance · 語言模型化 ·

2023 年 10 月 2 日

LLM-grounded Video Diffusion Models

Long Lian,Baifeng Shi,Adam Yala,Trevor Darrell,Boyi Li

from arxiv, Project Page: //llm-grounded-video-diffusion.github.io/

Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion (e.g., even lacking the ability to be prompted for objects moving from left to right). To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.

Extensibility · 樣本 · Integration · 回合 · Microsoft Surface ·

2023 年 9 月 30 日

Diffusion Posterior Illumination for Ambiguity-aware Inverse Rendering

Linjie Lyu,Ayush Tewari,Marc Habermann,Shunsuke Saito,Michael Zollh?fer,Thomas Leimkühler,Christian Theobalt

from arxiv, SIGGRAPH Asia 2023

Inverse rendering, the process of inferring scene properties from images, is a challenging inverse problem. The task is ill-posed, as many different scene configurations can give rise to the same image. Most existing solutions incorporate priors into the inverse-rendering pipeline to encourage plausible solutions, but they do not consider the inherent ambiguities and the multi-modal distribution of possible decompositions. In this work, we propose a novel scheme that integrates a denoising diffusion probabilistic model pre-trained on natural illumination maps into an optimization framework involving a differentiable path tracer. The proposed method allows sampling from combinations of illumination and spatially-varying surface materials that are, both, natural and explain the image observations. We further conduct an extensive comparative study of different priors on illumination used in previous work on inverse rendering. Our method excels in recovering materials and producing highly realistic and diverse environment map samples that faithfully explain the illumination of the input images.

圖形處理器 · 圖 · Networking · Neural Networks · 結點 ·

2021 年 1 月 25 日

Identity-aware Graph Neural Networks

Jiaxuan You,Jonathan Gomes-Selman,Rex Ying,Jure Leskovec

from arxiv, AAAI 2021

Message passing Graph Neural Networks (GNNs) provide a powerful modeling framework for relational data. However, the expressive power of existing GNNs is upper-bounded by the 1-Weisfeiler-Lehman (1-WL) graph isomorphism test, which means GNNs that are not able to predict node clustering coefficients and shortest path distances, and cannot differentiate between different d-regular graphs. Here we develop a class of message passing GNNs, named Identity-aware Graph Neural Networks (ID-GNNs), with greater expressive power than the 1-WL test. ID-GNN offers a minimal but powerful solution to limitations of existing GNNs. ID-GNN extends existing GNN architectures by inductively considering nodes' identities during message passing. To embed a given node, ID-GNN first extracts the ego network centered at the node, then conducts rounds of heterogeneous message passing, where different sets of parameters are applied to the center node than to other surrounding nodes in the ego network. We further propose a simplified but faster version of ID-GNN that injects node identity information as augmented node features. Altogether, both versions of ID-GNN represent general extensions of message passing GNNs, where experiments show that transforming existing GNNs to ID-GNNs yields on average 40% accuracy improvement on challenging node, edge, and graph property prediction tasks; 3% accuracy improvement on node and graph classification benchmarks; and 15% ROC AUC improvement on real-world link prediction tasks. Additionally, ID-GNNs demonstrate improved or comparable performance over other task-specific graph networks.

圖形處理器 · 圖 · Better · Neural Networks · 視覺問答 ·

2020 年 3 月 31 日

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Difei Gao,Ke Li,Ruiping Wang,Shiguang Shan,Xilin Chen

from arxiv, Published as a CVPR2020 paper

Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and obviously facilitates the performances on two VQA tasks that require reading scene texts.

知識表示 · Things · 推薦系統 · MoDELS · 邊 ·

2018 年 5 月 10 日

A Unified Knowledge Representation and Context-aware Recommender System in Internet of Things

Yinhao Li,Awa Alqahtani,Ellis Solaiman,Charith Perera,Prem Prakash Jayaraman,Boualem Benatallah,Rajiv Ranjan

Within the rapidly developing Internet of Things (IoT), numerous and diverse physical devices, Edge devices, Cloud infrastructure, and their quality of service requirements (QoS), need to be represented within a unified specification in order to enable rapid IoT application development, monitoring, and dynamic reconfiguration. But heterogeneities among different configuration knowledge representation models pose limitations for acquisition, discovery and curation of configuration knowledge for coordinated IoT applications. This paper proposes a unified data model to represent IoT resource configuration knowledge artifacts. It also proposes IoT-CANE (Context-Aware recommendatioN systEm) to facilitate incremental knowledge acquisition and declarative context driven knowledge recommendation.

INTERACT · 推薦系統 · 學成 · 強化學習 · Processing（編程語言） ·

2018 年 1 月 5 日

Deep Reinforcement Learning for List-wise Recommendations

Xiangyu Zhao,Liang Zhang,Zhuoye Ding,Dawei Yin,Yihong Zhao,Jiliang Tang

Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wide recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.

閱讀: 0 點贊: 0

小貼士

登錄享

相關主題

學習的學習

北京阿比特科技有限公司

注冊地址：北京市海淀區羊坊店路18號2幢3層301-191

<tr id='Kbgzq'><strong id='Ds4Eu'></strong><small id='pAR5R'></small><button id='15TXG'></button><li id='GPg9L'><noscript id='4YhwP'><big id='JlSTc'></big><dt id='cNj14'></dt></noscript></li></tr><ol id='Ckqiy'><option id='fndZd'><table id='Puh9U'><blockquote id='56hXw'><tbody id='PkHZU'></tbody></blockquote></table></option></ol><u id='eR93P'></u><kbd id='mU35d'><kbd id='wz8KF'></kbd></kbd>

<code id='tm41T'><strong id='nFc57'></strong></code>

<fieldset id='1HqZR'></fieldset>

<span id='ZCvLq'></span>

<ins id='17PB1'></ins>

<acronym id='t8MtG'><em id='mB2u3'></em><td id='cVqnT'><div id='QWBlD'></div></td></acronym><address id='H0J6Q'><big id='F1HXr'><big id='YkQlf'></big><legend id='9KoKc'></legend></big></address>

<i id='0AqX2'><div id='KEXvZ'><ins id='E6TAo'></ins></div></i>

<i id='UqF74'></i>