We study how to generate captions that are not only accurate in describing an image but also discriminative across different images. The problem is both fundamental and interesting, as most machine-generated captions, despite remarkable research progress in the past several years, are expressed in a very monotonous and featureless format. While such captions are normally accurate, they often lack two important characteristics of human language: distinctiveness for each caption and diversity across different images. To address this problem, we propose a novel conditional generative adversarial network for generating diverse captions across images. Instead of estimating the quality of a caption solely on one image, the proposed comparative adversarial learning framework better assesses the quality of captions by comparing a set of captions within the image-caption joint space. By contrasting with human-written captions and image-mismatched captions, the caption generator effectively exploits the inherent characteristics of human language and generates more discriminative captions. We show that our proposed network is capable of producing accurate and diverse captions across images.

Related content

Adversarial learning

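Adversarial learning is a machine learning technique that aims to fool models by supplying deceptive input. The most common goal is to cause a machine learning model to malfunction. Most machine learning techniques are designed to work on specific problem sets in which the training and test data are generated from the same statistical distribution (IID). When these models are applied to the real world, adversaries may supply data that violates that statistical assumption. This data can be arranged to exploit specific vulnerabilities and compromise the results.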
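As an illustration of what "deceptive input" can look like, here is a minimal sketch of the fast gradient sign method (FGSM), one well-known way to craft such input. It is applied to a toy logistic-regression classifier; the weights `w`, bias `b`, input `x`, and step size `eps` are made-up values chosen for illustration, not taken from any of the papers listed on this page.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    """Shift x by eps per coordinate in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    # Gradient of the binary cross-entropy loss with respect to the input x.
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Hypothetical toy model and input (assumptions for illustration only).
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.3, 0.1])
y = 1.0  # true label

x_adv = fgsm_perturb(x, y, w, b, eps=0.4)
clean_score = sigmoid(w @ x + b)      # model's probability of class 1 on x
adv_score = sigmoid(w @ x_adv + b)    # same probability on the perturbed input
```

In this toy setting the clean input is scored above 0.5 and the perturbed input below 0.5, so a bounded change to each coordinate is enough to flip the predicted class.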

Latent variable · Video captioning · Diversity · Latent · Conditional generative adversarial network

March 5, 2020

Diverse Video Captioning Through Latent Variable Expansion with Conditional GAN

Huanhou Xiao,Jinglun Shi

Automatically describing video content with text is a challenging but important task that has been attracting a lot of attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring sentence diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework. Concretely, for a given video, the intermediate latent variables of the conventional encode-decode process are used as input to a conditional generative adversarial network (CGAN) with the purpose of generating diverse sentences. We adopt different CNNs as our generator, which produces descriptions conditioned on latent variables, and our discriminator, which assesses the quality of generated sentences. In addition, a novel DCE metric is designed to assess the diverse captions. We evaluate our method on benchmark datasets, where it demonstrates its ability to generate diverse descriptions and achieves superior results against other state-of-the-art methods.

INFORMS · Visual question answering · Question answering · MoDELS · Image captioning

June 14, 2019

Improving Visual Question Answering by Referring to Generated Paragraph Captions

Hyounghun Kim,Mohit Bansal

from arxiv, ACL 2019 (7 pages)

Paragraph-style image captions describe diverse aspects of an image, as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering. Moreover, this textual information is complementary to the visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance the chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.

Capsule · Networking · Generative adversarial network · GAN · Convolutional neural network

November 20, 2018

Generative Adversarial Network Architectures For Image Synthesis Using Capsule Networks

Yash Upadhyay,Paul Schrater

from arxiv, Pre-print

In this paper, we propose Generative Adversarial Network (GAN) architectures that use Capsule Networks for image synthesis. Based on the principle of positional equivariance of features, a Capsule Network's ability to encode spatial relationships between the features of the image helps it become a more powerful critic than the Convolutional Neural Networks (CNNs) used in current architectures for image synthesis. Our proposed GAN architectures learn the data manifold much faster and therefore synthesize visually accurate images with significantly fewer training samples and training epochs than GANs and their variants that use CNNs. Apart from analyzing the quantitative results corresponding to the images generated by different architectures, we also explore the reasons for the lower coverage and diversity explored by the GAN architectures that use CNN critics.

Generative adversarial network · GANs · Networking · INFORMS · Hyperparameter

October 25, 2018

Training Generative Adversarial Networks Via Turing Test

Jianlin Su

In this article, we introduce a new mode for training Generative Adversarial Networks (GANs). Rather than minimizing the distance between the evidence distribution $\tilde{p}(x)$ and the generative distribution $q(x)$, we minimize the distance between $\tilde{p}(x_r)q(x_f)$ and $\tilde{p}(x_f)q(x_r)$. This adversarial pattern can be interpreted as a Turing test in GANs. It allows us to use information from real samples while training the generator and accelerates the whole training procedure. We even find that, by just proportionally increasing the size of the discriminator and generator, it succeeds at 256x256 resolution without careful hyperparameter tuning.

Image captioning · Extensibility · Discriminator · MoDELS · CASE

May 18, 2018

Improving Image Captioning with Conditional Generative Adversarial Nets

Chen Chen,Shuai Mu,Wanpeng Xiao,Zexiong Ye,Liesi Wu,Fuming Ma,Qi Ju

from arxiv, 12 pages; 3 figures; 32 references

In this paper, we propose a novel image captioning framework based on conditional generative adversarial nets as an extension of the traditional reinforcement learning (RL) based encoder-decoder architecture. To deal with the inconsistent evaluation problem between objective language metrics and subjective human judgements, we design "discriminator" networks to automatically and progressively determine whether a generated caption is human-described or machine-generated. Two kinds of discriminator architecture (CNN-based and RNN-based structures) are introduced since each has its own advantages. The proposed algorithm is generic, so it can enhance any existing encoder-decoder based image captioning model, and we show that the conventional RL training method is just a special case of our framework. Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models.

GAN · Generative adversarial network · Controller · Networking · Sample

May 1, 2018

Controllable Generative Adversarial Network

Minhyeok Lee,Junhee Seok

The recently introduced generative adversarial network (GAN) has shown numerous promising results in generating realistic samples. The essential task of GAN is to control the features of samples generated from a random distribution. While current GAN structures, such as conditional GAN, successfully generate samples with desired major features, they often fail to produce detailed features that bring specific differences among samples. To overcome this limitation, here we propose a controllable GAN (ControlGAN) structure. By separating a feature classifier from a discriminator, the generator of ControlGAN is designed to learn to generate synthetic samples with the specific detailed features. Evaluated with multiple image datasets, ControlGAN shows the ability to generate improved samples with well-controlled features. Furthermore, we demonstrate that ControlGAN can generate intermediate features and opposite features for interpolated and extrapolated input labels that are not used in the training process. This implies that ControlGAN can significantly contribute to the variety of generated samples.

Generative adversarial network · Feature space · Networking · Gaussian mixture (model) · Gaussian mixture model

March 27, 2018

An Improved Evaluation Framework for Generative Adversarial Networks

Shaohui Liu,Yi Wei,Jiwen Lu,Jie Zhou

from arxiv, 21 pages, 9 figures, 8 tables

In this paper, we propose an improved quantitative evaluation framework for Generative Adversarial Networks (GANs) on generating domain-specific images, where we improve conventional evaluation methods on two levels: the feature representation and the evaluation metric. Unlike most existing evaluation frameworks, which transfer the representation of the ImageNet inception model to map images onto the feature space, our framework uses a specialized encoder to acquire a fine-grained domain-specific representation. Moreover, for datasets with multiple classes, we propose Class-Aware Frechet Distance (CAFD), which employs a Gaussian mixture model on the feature space to better fit the multi-manifold feature distribution. Experiments and analysis on both the feature level and the image level were conducted to demonstrate improvements of our proposed framework over the recently proposed state-of-the-art FID method. To the best of our knowledge, we are the first to provide counterexamples where FID gives results inconsistent with human judgments. The experiments show that our framework is able to overcome the shortcomings of FID and improves robustness. Code will be made available.
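For context on the FID baseline mentioned above, the following is a minimal sketch of the standard Fréchet distance between two Gaussians, $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2})$, which FID evaluates on feature statistics. It is not the paper's CAFD metric, whose details are not given in this abstract. The function names `frechet_distance` and `gaussian_stats` are hypothetical, and the feature extractor is assumed to exist elsewhere and to produce at least two feature dimensions.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) equals the sum of the square roots of the eigenvalues
    # of S1 S2, which are real and non-negative for PSD covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def gaussian_stats(features):
    """Mean and covariance of an (n_samples, n_dims) feature matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

# Toy check with hand-picked Gaussians (illustrative values only).
identity = np.eye(3)
d_same = frechet_distance(np.zeros(3), identity, np.zeros(3), identity)
d_shift = frechet_distance(np.zeros(3), identity, np.array([1.0, 2.0, 2.0]), identity)
```

In practice the two sets of statistics would come from `gaussian_stats` applied to features of real and generated images. The eigenvalue route is used here only to avoid a matrix square root dependency; it is one of several equivalent ways to compute the trace term.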

Re-ID · Adversarial learning · Extensibility · Learned · Performer

February 6, 2018

Adversarial Attribute-Image Person Re-identification

Zhou Yin,Wei-Shi Zheng,Ancong Wu,Hong-Xing Yu,Hai Wang,Jianhuang Lai

While attributes have been widely used for person re-identification (Re-ID), which matches images of the same person across disjoint camera views, they are used either as extra features or for performing multi-task learning to assist the image-image person matching task. However, how to find a set of person images according to a given attribute description, which is very practical in many surveillance applications, remains a rarely investigated cross-modal matching problem in Person Re-ID. In this work, we present this challenge and employ adversarial learning to formulate the attribute-image cross-modal person Re-ID model. By imposing a regularization on the semantic consistency constraint across modalities, the adversarial learning enables generating image-analogous concepts for query attributes and matching them with images at both the global level and the semantic ID level. We conducted extensive experiments on three attribute datasets and demonstrated that adversarial modelling is so far the most effective approach for the attribute-image cross-modal person Re-ID problem.

Pair · Joint distribution · Generative adversarial network · Networking · Re-ID

January 4, 2018

Crossing Generative Adversarial Networks for Cross-View Person Re-identification

Chengyuan Zhang,Lin Wu,Yang Wang

from arxiv, 12 pages. arXiv admin note: text overlap with arXiv:1702.03431 by other authors

Person re-identification (re-id) refers to matching pedestrians across disjoint, non-overlapping camera views. The most effective way to match pedestrians undergoing significant visual variations is to seek reliably invariant features that can describe the person of interest faithfully. Most existing methods are supervised and produce discriminative features by relying on labeled paired images in correspondence. However, annotating pair-wise images is prohibitively expensive in labor, and thus not practical for large-scale networked cameras. Moreover, seeking comparable representations across camera views demands a flexible model to address the complex distributions of images. In this work, we study the co-occurrence statistic patterns between pairs of images, and propose a crossing Generative Adversarial Network (Cross-GAN) for learning a joint distribution for cross-image representations in an unsupervised manner. Given a pair of person images, the proposed model consists of a variational auto-encoder to encode the pair into respective latent variables, a proposed cross-view alignment to reduce the view disparity, and an adversarial layer to seek the joint distribution of latent representations. The learned latent representations are well-aligned to reflect the co-occurrence patterns of paired images. We empirically evaluate the proposed model on challenging datasets, and our results show the importance of joint invariant features in improving the matching rates of person re-id compared with semi-supervised and unsupervised state-of-the-art methods.