While attributes have been widely used for person re-identification (Re-ID) that matches the same person images across disjoint camera views, they are used either as extra features or for performing multi-task learning to assist the image-image person matching task. However, how to find a set of person images according to a given attribute description, which is very practical in many surveillance applications, remains a rarely investigated cross-modal matching problem in Person Re-ID. In this work, we present this challenge and employ adversarial learning to formulate the attribute-image cross-modal person Re-ID model. By imposing the regularization on the semantic consistency constraint across modalities, the adversarial learning enables generating image-analogous concepts for query attributes and getting it matched with image in both global level and semantic ID level. We conducted extensive experiments on three attribute datasets and demonstrated that the adversarial modelling is so far the most effective for the attributeimage cross-modal person Re-ID problem.
For many computer vision applications such as image captioning, visual question answering, and person search, learning discriminative feature representations at both image and text level is an essential yet challenging problem. Its challenges originate from the large word variance in the text domain as well as the difficulty of accurately measuring the distance between the features of the two modalities. Most prior work focuses on the latter challenge, by introducing loss functions that help the network learn better feature representations but fail to account for the complexity of the textual input. With that in mind, we introduce TIMAM: a Text-Image Modality Adversarial Matching approach that learns modality-invariant feature representations using adversarial and cross-modal matching objectives. In addition, we demonstrate that BERT, a publicly-available language model that extracts word embeddings, can successfully be applied in the text-to-image matching domain. The proposed approach achieves state-of-the-art cross-modal matching performance on four widely-used publicly-available datasets resulting in absolute improvements ranging from 2% to 5% in terms of rank-1 accuracy.
Person re-identification (re-ID) has attracted much attention recently due to its great importance in video surveillance. In general, distance metrics used to identify two person images are expected to be robust under various appearance changes. However, our work observes the extreme vulnerability of existing distance metrics to adversarial examples, generated by simply adding human-imperceptible perturbations to person images. Hence, the security danger is dramatically increased when deploying commercial re-ID systems in video surveillance, especially considering the highly strict requirement of public safety. Although adversarial examples have been extensively applied for classification analysis, it is rarely studied in metric analysis like person re-identification. The most likely reason is the natural gap between the training and testing of re-ID networks, that is, the predictions of a re-ID network cannot be directly used during testing without an effective metric. In this work, we bridge the gap by proposing Adversarial Metric Attack, a parallel methodology to adversarial classification attacks, which can effectively generate adversarial examples for re-ID. Comprehensive experiments clearly reveal the adversarial effects in re-ID systems. Moreover, by benchmarking various adversarial settings, we expect that our work can facilitate the development of robust feature learning with the experimental conclusions we have drawn.
Person re-identification (Person ReID) is a challenging task due to the large variations in camera viewpoint, lighting, resolution, and human pose. Recently, with the advancement of deep learning technologies, the performance of Person ReID has been improved swiftly. Feature extraction and feature matching are two crucial components in the training and deployment stages of Person ReID. However, many existing Person ReID methods have measure inconsistency between the training stage and the deployment stage, and they couple magnitude and orientation information of feature vectors in feature representation. Meanwhile, traditional triplet loss methods focus on samples within a mini-batch and lack knowledge of global feature distribution. To address these issues, we propose a novel homocentric hypersphere embedding scheme to decouple magnitude and orientation information for both feature and weight vectors, and reformulate classification loss and triplet loss to their angular versions and combine them into an angular discriminative loss. We evaluate our proposed method extensively on the widely used Person ReID benchmarks, including Market1501, CUHK03 and DukeMTMC-ReID. Our method demonstrates leading performance on all datasets.
Person re-identification (ReID) aims at matching persons across different views/scenes. In addition to accuracy, the matching efficiency has received more and more attention because of demanding applications using large-scale data. Several binary coding based methods have been proposed for efficient ReID, which either learn projections to map high-dimensional features to compact binary codes, or directly adopt deep neural networks by simply inserting an additional fully-connected layer with tanh-like activations. However, the former approach requires time-consuming hand-crafted feature extraction and complicated (discrete) optimizations; the latter lacks the necessary discriminative information greatly due to the straightforward activation functions. In this paper, we propose a simple yet effective framework for efficient ReID inspired by the recent advances in adversarial learning. Specifically, instead of learning explicit projections or adding fully-connected mapping layers, the proposed Adversarial Binary Coding (ABC) framework guides the extraction of binary codes implicitly and effectively. The discriminability of the extracted codes is further enhanced by equipping the ABC with a deep triplet network for the ReID task. More importantly, the ABC and triplet network are simultaneously optimized in an end-to-end manner. Extensive experiments on three large-scale ReID benchmarks demonstrate the superiority of our approach over the state-of-the-art methods.
Person re-identification plays an important role in realistic video surveillance with increasing demand for public safety. In this paper, we propose a novel framework with rules of updating images for person re-identification in real-world surveillance system. First, Image Pool is generated by using mean-shift tracking method to automatically select video frame fragments of the target person. Second, features extracted from Image Pool by convolutional network work together to re-rank original ranking list of the main image and matching results will be generated. In addition, updating rules are designed for replacing images in Image Pool when a new image satiating with our updating critical formula in video system. These rules fall into two categories: if the new image is from the same camera as the previous updated image, it will replace one of assist images; otherwise, it will replace the main image directly. Experiments are conduced on Market-1501, iLIDS-VID and PRID-2011 and our ITSD datasets to validate that our framework outperforms on rank-1 accuracy and mAP for person re-identification. Furthermore, the update ability of our framework provides consistently remarkable accuracy rate in real-world surveillance system.
Person Re-identification (re-id) faces two major challenges: the lack of cross-view paired training data and learning discriminative identity-sensitive and view-invariant features in the presence of large pose variations. In this work, we address both problems by proposing a novel deep person image generation model for synthesizing realistic person images conditional on pose. The model is based on a generative adversarial network (GAN) and used specifically for pose normalization in re-id, thus termed pose-normalization GAN (PN-GAN). With the synthesized images, we can learn a new type of deep re-id feature free of the influence of pose variations. We show that this feature is strong on its own and highly complementary to features learned with the original images. Importantly, we now have a model that generalizes to any new re-id dataset without the need for collecting any training data for model fine-tuning, thus making a deep re-id model truly scalable. Extensive experiments on five benchmarks show that our model outperforms the state-of-the-art models, often significantly. In particular, the features learned on Market-1501 can achieve a Rank-1 accuracy of 68.67% on VIPeR without any model fine-tuning, beating almost all existing models fine-tuned on the dataset.
In recent years, person re-identification (re-id) catches great attention in both computer vision community and industry. In this paper, we propose a new framework for person re-identification with a triplet-based deep similarity learning using convolutional neural networks (CNNs). The network is trained with triplet input: two of them have the same class labels and the other one is different. It aims to learn the deep feature representation, with which the distance within the same class is decreased, while the distance between the different classes is increased as much as possible. Moreover, we trained the model jointly on six different datasets, which differs from common practice - one model is just trained on one dataset and tested also on the same one. However, the enormous number of possible triplet data among the large number of training samples makes the training impossible. To address this challenge, a double-sampling scheme is proposed to generate triplets of images as effective as possible. The proposed framework is evaluated on several benchmark datasets. The experimental results show that, our method is effective for the task of person re-identification and it is comparable or even outperforms the state-of-the-art methods.
Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task.
Person re-identification (re-ID) models trained on one domain often fail to generalize well to another. In our attempt, we present a "learning via translation" framework. In the baseline, we translate the labeled images from source to target domain in an unsupervised manner. We then train re-ID models with the translated images by supervised methods. Yet, being an essential part of this framework, unsupervised image-image translation suffers from the information loss of source-domain labels during translation. Our motivation is two-fold. First, for each image, the discriminative cues contained in its ID label should be maintained after translation. Second, given the fact that two domains have entirely different persons, a translated image should be dissimilar to any of the target IDs. To this end, we propose to preserve two types of unsupervised similarities, 1) self-similarity of an image before and after translation, and 2) domain-dissimilarity of a translated source image and a target image. Both constraints are implemented in the similarity preserving generative adversarial network (SPGAN) which consists of a Siamese network and a CycleGAN. Through domain adaptation experiment, we show that images generated by SPGAN are more suitable for domain adaptation and yield consistent and competitive re-ID accuracy on two large-scale datasets.
Person re-identification (\textit{re-id}) refers to matching pedestrians across disjoint yet non-overlapping camera views. The most effective way to match these pedestrians undertaking significant visual variations is to seek reliably invariant features that can describe the person of interest faithfully. Most of existing methods are presented in a supervised manner to produce discriminative features by relying on labeled paired images in correspondence. However, annotating pair-wise images is prohibitively expensive in labors, and thus not practical in large-scale networked cameras. Moreover, seeking comparable representations across camera views demands a flexible model to address the complex distributions of images. In this work, we study the co-occurrence statistic patterns between pairs of images, and propose to crossing Generative Adversarial Network (Cross-GAN) for learning a joint distribution for cross-image representations in a unsupervised manner. Given a pair of person images, the proposed model consists of the variational auto-encoder to encode the pair into respective latent variables, a proposed cross-view alignment to reduce the view disparity, and an adversarial layer to seek the joint distribution of latent representations. The learned latent representations are well-aligned to reflect the co-occurrence patterns of paired images. We empirically evaluate the proposed model against challenging datasets, and our results show the importance of joint invariant features in improving matching rates of person re-id with comparison to semi/unsupervised state-of-the-arts.