One of the challenges of language teaching is organizing the rules of syntax, semantics, and phonology in a meaningful manner. This requires not only pedagogical skill but also a deep understanding of the language. While comprehensive materials for developing such curricula are available in English and some other widely spoken languages, for many other languages teachers must create them manually in response to their students' needs. This process is challenging because i) it requires that experts in the language be accessible and have the necessary resources, and ii) even when such experts exist, describing all the intricacies of a language is time-consuming and prone to omission. In this article, we present a framework that aims to facilitate this process by automatically discovering and visualizing descriptions of different aspects of grammar. Specifically, we extract descriptions from a natural text corpus that answer questions about morphosyntax (the learning of word order, agreement, case marking, and word formation) and semantics (the learning of vocabulary), and we show illustrative examples. We apply this method to two Indian languages, Kannada and Marathi, which, unlike English, do not have well-developed pedagogical resources and are therefore likely to benefit from this exercise. To assess the perceived utility of the extracted material, we enlist the help of language educators from schools in North America who teach these languages to perform a manual evaluation. Overall, teachers find the materials interesting as a reference for their own lesson preparation and even for learner evaluation.
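To make the extraction step concrete, below is a minimal sketch of one such pipeline stage: mining subject-verb agreement examples from a dependency treebank in CoNLL-U format. The treebank filename is hypothetical, and the sketch assumes the third-party `conllu` package and contiguous token IDs; it is an illustration, not the paper's actual system.

```python
# Mine subject-verb pairs that share morphological features (e.g., Number,
# Gender, Person) as candidate agreement examples for teaching materials.
from conllu import parse_incr

def agreement_examples(conllu_path, max_examples=10):
    """Collect (subject, verb, shared features) triples from a UD treebank."""
    examples = []
    with open(conllu_path, encoding="utf-8") as f:
        for sentence in parse_incr(f):
            for token in sentence:
                # Universal Dependencies marks nominal subjects as "nsubj".
                if token["deprel"] != "nsubj":
                    continue
                head = sentence[token["head"] - 1]  # heads are 1-indexed;
                                                    # assumes contiguous IDs
                feats = token["feats"] or {}
                head_feats = head["feats"] or {}
                # Keep only the features the subject and its verb share.
                shared = {k: v for k, v in feats.items()
                          if head_feats.get(k) == v}
                if shared:
                    examples.append((token["form"], head["form"], shared))
                if len(examples) >= max_examples:
                    return examples
    return examples

# Hypothetical Marathi UD treebank file:
print(agreement_examples("mr_ufal-ud-train.conllu"))
```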
Over the years, 2D GANs have achieved great success in photorealistic portrait generation. However, they lack 3D understanding in the generation process and thus suffer from the multi-view inconsistency problem. To alleviate this issue, many 3D-aware GANs have been proposed with notable results, but they struggle to edit semantic attributes, and their controllability and interpretability have not been much explored. In this work, we propose two solutions to overcome these weaknesses of 2D GANs and 3D-aware GANs. We first introduce a novel 3D-aware GAN, SURF-GAN, which can discover semantic attributes during training and control them in an unsupervised manner. We then inject the prior of SURF-GAN into StyleGAN to obtain a high-fidelity 3D-controllable generator. Unlike existing latent-based methods, which allow only implicit pose control, the proposed 3D-controllable StyleGAN enables explicit pose control over portrait generation. This distillation makes 3D control directly compatible with many StyleGAN-based techniques (e.g., inversion and stylization) and also brings an advantage in terms of computational resources. Our code is available at //github.com/jgkwak95/SURF-GAN.
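To illustrate what "explicit pose control" means in practice, here is a minimal sketch of inference-time camera control. The `generator(z, pose)` interface is hypothetical and stands in for any pose-conditioned generator; it is not the actual SURF-GAN or 3D-controllable StyleGAN API.

```python
# Render the same identity under an explicit sweep of camera yaws,
# as opposed to steering pose indirectly through latent directions.
import torch

@torch.no_grad()
def render_view_sweep(generator, z, yaws_deg=(-30, -15, 0, 15, 30)):
    """Return a stack of images of one identity seen from several yaw angles."""
    images = []
    for yaw in yaws_deg:
        pose = torch.tensor([[float(yaw), 0.0]])  # (yaw, pitch) in degrees
        images.append(generator(z, pose))          # pose is an explicit input
    return torch.stack(images)

# z = torch.randn(1, 512)  # one latent code -> one identity, many views
```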
Given a set of images and their corresponding paragraph captions, a challenging task is to learn how to produce a semantically coherent paragraph describing the visual content of an image. Inspired by recent successes in integrating semantic topics into this task, this paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework, which couples a visual extractor with a deep topic model to guide the learning of a language model. To capture the correlations between image and text at multiple levels of abstraction and to learn semantic topics from images, we design a variational inference network that builds a mapping from image features to textual captions. To guide paragraph generation, the learned hierarchical topics and visual features are integrated into the language model, either a Long Short-Term Memory (LSTM) network or a Transformer, and jointly optimized. Experiments on public datasets demonstrate that the proposed models, which are competitive with many state-of-the-art approaches on standard evaluation metrics, can both distill interpretable multi-layer semantic topics and generate diverse, coherent captions. We release our code at //github.com/DandanGuo1993/VTCM-based-image-paragraph-caption.git
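As a rough illustration of the variational inference network, the sketch below maps pooled image features to topic proportions using a logistic-normal posterior and the reparameterization trick. The layer sizes, choice of posterior, and KL form are assumptions made for illustration, not the paper's exact model.

```python
# An inference network that turns image features into topic proportions
# which can then condition the language model during caption generation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicInferenceNet(nn.Module):
    def __init__(self, feat_dim=2048, num_topics=50):
        super().__init__()
        self.hidden = nn.Linear(feat_dim, 512)
        self.mu = nn.Linear(512, num_topics)
        self.logvar = nn.Linear(512, num_topics)

    def forward(self, image_feats):
        h = F.relu(self.hidden(image_feats))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick keeps sampling differentiable.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        theta = F.softmax(z, dim=-1)  # topic proportions guiding the LM
        # KL divergence against a standard normal prior, used in the ELBO.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return theta, kl
```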
While vision-and-language models perform well on tasks such as visual question answering, they struggle with basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game for collecting vision-and-language associations (e.g., werewolves to a full moon), used as a dynamic benchmark to evaluate state-of-the-art models. Inspired by the popular card game Codenames, our game has a spymaster give a textual cue related to several visual candidates, which another player must then identify. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances and find that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models: the best model (ViLT) achieves a score of 52%, succeeding mostly when the cue is visually salient. Our analysis, together with the feedback we collect from players, indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, and abstraction, among others. We release the dataset, the code, and the interactive game, aiming to enable future data collection that can be used to develop models with better association abilities.
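For reference, the Jaccard index used above can be computed as follows; the example cue and image names are invented for illustration.

```python
# Jaccard index between a model's selected candidates and the gold set.
def jaccard(predicted, gold):
    """Intersection over union of two sets of selected image candidates."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

# Cue "full moon": the model picks two of the three associated images.
print(jaccard({"werewolf", "night_sky"},
              {"werewolf", "night_sky", "tides"}))  # 0.666...
```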
Image captioning, the process of automatically generating a natural-language description of an image, is one of the significant challenges in image understanding: it requires recognizing not only the salient objects in an image but also their attributes and the ways they interact. The system must then generate a syntactically and semantically correct caption describing the image content in natural language. With the significant progress in deep learning models and their ability to effectively encode large sets of images and generate correct sentences, several neural captioning approaches have been proposed recently, each trying to achieve better accuracy and caption quality. This paper introduces an encoder-decoder image captioning system in which the encoder extracts spatial features from the image using ResNet-101. This stage is followed by a refining model that uses an attention-on-attention mechanism to extract the visual features of the target image objects and determine their interactions. The decoder consists of an attention-based recurrent module and a reflective attention module, which collaboratively apply attention to the visual and textual features to enhance the decoder's ability to model long-term sequential dependencies. Extensive experiments on Flickr30K show the effectiveness of the proposed approach and the high quality of the generated captions.
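Below is a minimal sketch of an attention-on-attention block, following the general formulation of Huang et al. (2019): the attended result and its query jointly produce an information vector and a sigmoid gate, whose elementwise product is the output. The dimensions and the use of multi-head attention are illustrative assumptions, not the paper's exact design.

```python
# Attention-on-attention: gate the attention result so the decoder can
# discount attended features that are not actually relevant to the query.
import torch
import torch.nn as nn

class AttentionOnAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.info = nn.Linear(2 * dim, dim)  # "information" branch
        self.gate = nn.Linear(2 * dim, dim)  # gating branch

    def forward(self, query, visual_feats):
        # query: (B, Tq, dim); visual_feats: (B, Tv, dim)
        attended, _ = self.attn(query, visual_feats, visual_feats)
        fused = torch.cat([attended, query], dim=-1)
        # The gate decides how much of the attention result to pass through.
        return self.info(fused) * torch.sigmoid(self.gate(fused))
```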
We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker from a mixture. Typical approaches exploit properties of the audio signal, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues: we extract the speech of speakers talking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to represent a target concept directly. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier into a shared embedding space. This modality-independent space can be built by means of deep metric learning on paired data consisting of images and their spoken captions. We use it to bridge the modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept: segments whose embeddings lie close to the concept embedding are identified, and the target speech is then extracted using their acoustic characteristics. As a proof of concept, we performed experiments using a set of images associated with spoken captions: we generated speech mixtures from these spoken captions and used the images or speech signals as concept specifiers. We compare ConceptBeam with two baselines, one based on keywords obtained from recognition systems and another based on sound source separation, and show that ConceptBeam clearly outperforms both, effectively extracting speech based on the semantic representation.
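A minimal sketch of how such a modality-independent space can be trained: paired image and spoken-caption embeddings are aligned with a symmetric InfoNCE-style contrastive loss. The specific loss is an assumption standing in for the paper's deep metric learning objective, and the encoders producing the embeddings are placeholders.

```python
# Pull paired (image, spoken caption) embeddings together and push
# mismatched pairs apart, yielding a shared cross-modal embedding space.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, speech_emb, temperature=0.07):
    """image_emb, speech_emb: (batch, dim) embeddings of paired data."""
    image_emb = F.normalize(image_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = image_emb @ speech_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs lie on the diagonal; supervise both directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```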
Developing robust vision-guided controllers for quadrupedal robots in complex environments, with various obstacles, dynamic surroundings, and uneven terrain, is very challenging. While Reinforcement Learning (RL) provides a promising paradigm for learning agile locomotion skills from vision inputs in simulation, deploying the RL policy in the real world remains very challenging. Our key insight is that, aside from the domain gap in visual appearance between simulation and the real world, the latency of the control pipeline is also a major cause of difficulty. In this paper, we propose Multi-Modal Delay Randomization (MMDR) to address this issue when training RL agents. Specifically, we simulate the latency of real hardware by feeding the policy past observations, sampled with randomized periods, for both proprioception and vision. We train the RL policy for end-to-end control in a physical simulator without any predefined controller or reference motion, and deploy it directly on a real A1 quadruped robot running in the wild. We evaluate our method in different outdoor environments with complex terrain and obstacles, demonstrating that the robot can maneuver smoothly at high speed, avoid obstacles, and significantly improve over the baselines. Our project page with videos is at //mehooz.github.io/mmdr-wild/.
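A minimal sketch of the delay-randomization idea: keep short histories of recent observations and feed the policy stale readings, with independently randomized delays per modality. The buffer lengths are illustrative, not the paper's actual values.

```python
# Simulate sensor/control latency during training by serving the policy
# past observations, with a separate random delay for each modality.
import random
from collections import deque

class DelayedObservations:
    def __init__(self, max_delay_proprio=3, max_delay_vision=6):
        self.proprio = deque(maxlen=max_delay_proprio + 1)
        self.vision = deque(maxlen=max_delay_vision + 1)

    def push(self, proprio_obs, vision_obs):
        # Newest observation goes to index 0; older ones shift back.
        self.proprio.appendleft(proprio_obs)
        self.vision.appendleft(vision_obs)

    def sample(self):
        """Return a (possibly stale) observation pair, as real hardware would."""
        dp = random.randrange(len(self.proprio))  # random proprioceptive delay
        dv = random.randrange(len(self.vision))   # random visual delay
        return self.proprio[dp], self.vision[dv]
```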
Explaining the decisions of an Artificial Intelligence (AI) model is increasingly critical in many real-world, high-stakes applications. Hundreds of papers have proposed new feature attribution methods or discussed and harnessed these tools in their work. However, despite humans being the target end-users, most attribution methods have only been evaluated on proxy automatic-evaluation metrics (Zhang et al. 2018; Zhou et al. 2016; Petsiuk et al. 2018). In this paper, we conduct the first user study to measure the effectiveness of attribution maps in assisting humans with ImageNet classification and Stanford Dogs fine-grained classification, on both natural and adversarial images (i.e., images containing adversarial perturbations). Overall, feature attribution is surprisingly no more effective than showing humans nearest training-set examples. On the harder task of fine-grained dog categorization, presenting attribution maps to humans does not help but instead hurts the performance of human-AI teams compared to AI alone. Importantly, we find that automatic attribution-map evaluation measures correlate poorly with actual human-AI team performance. Our findings encourage the community to rigorously test their methods on downstream human-in-the-loop applications and to rethink the existing evaluation metrics.
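For context, the nearest-example baseline that attribution maps are compared against can be sketched as follows: show the user the k training images closest to the query in the classifier's feature space. The feature extractor and similarity measure are assumptions for illustration.

```python
# Nearest-example explanations: instead of a heatmap, present the training
# images most similar to the query under the model's own features.
import torch
import torch.nn.functional as F

def nearest_training_examples(query_feat, train_feats, k=3):
    """query_feat: (dim,); train_feats: (n, dim) precomputed features.
    Returns indices of the k training images to show the user."""
    sims = F.cosine_similarity(train_feats, query_feat.unsqueeze(0), dim=-1)
    return torch.topk(sims, k).indices
```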
Training a deep architecture using a ranking loss has become standard for the person re-identification task. Increasingly, these deep architectures include additional components that leverage part detections, attribute predictions, pose estimators, and other auxiliary information in order to more effectively localize and align discriminative image regions. In this paper, we adopt a different approach and carefully design each component of a simple deep architecture and, critically, the strategy for training it effectively for person re-identification. We extensively evaluate each design choice, leading to a list of good practices for person re-identification. By following these practices, our approach outperforms the state of the art, including more complex methods with auxiliary components, by large margins on four benchmark datasets. We also provide a qualitative analysis of our trained representation, which indicates that, while compact, it is able to capture information from localized and discriminative regions in a manner akin to an implicit attention mechanism.
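As one concrete instance of training with a ranking loss, here is a sketch of the widely used batch-hard triplet loss (Hermans et al. 2017). This names a standard formulation in the re-identification literature, not necessarily the exact loss used in this paper; it assumes every identity appears at least twice per batch.

```python
# Batch-hard triplet loss: for each anchor, take its hardest positive
# (farthest same-identity sample) and hardest negative (closest other
# identity) within the batch, and enforce a margin between them.
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """embeddings: (n, dim); labels: (n,) integer identity labels."""
    dist = torch.cdist(embeddings, embeddings)            # (n, n) pairwise L2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    hardest_pos = (dist * (same & ~eye).float()).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```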
Most of the internet today is composed of digital media, including videos and images. With pixels becoming the currency in which most transactions happen on the internet, it is increasingly important to have a way of browsing this ocean of information with relative ease. YouTube receives 400 hours of video uploads every minute, and many millions of images are browsed on Instagram, Facebook, and other platforms. Inspired by recent advances in deep learning and its success on problems such as image captioning, machine translation, word2vec, and skip-thoughts, we present DeepSeek, a natural-language-processing-based deep learning model that allows users to enter a description of the kind of images they want to search for; in response, the system retrieves all the images that relate semantically and contextually to the query. Two approaches are described in the following sections.
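A minimal sketch of the retrieval idea underlying such a system: embed the query and all images into a shared space and rank the images by cosine similarity. The `text_encoder` and the precomputed `image_embs` are placeholders, not DeepSeek's actual models.

```python
# Rank a precomputed image index against a natural-language query by
# cosine similarity in a shared embedding space.
import torch
import torch.nn.functional as F

def search(query_text, image_embs, text_encoder, k=5):
    """image_embs: (n, dim) precomputed image embeddings;
    text_encoder: maps a query string to a (dim,) embedding."""
    q = F.normalize(text_encoder(query_text), dim=-1)  # (dim,)
    sims = F.normalize(image_embs, dim=-1) @ q         # (n,) cosine scores
    return torch.topk(sims, k).indices                 # top-k image indices
```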
Most previous event extraction studies have relied heavily on features derived from annotated event mentions and thus cannot be applied to new event types without additional annotation effort. In this work, we take a fresh look at event extraction and model it as a grounding problem. We design a transferable neural architecture that maps event mentions and types jointly into a shared semantic space using structural and compositional neural networks, where the type of each event mention is determined by the closest of all candidate types. By leveraging (1) available manual annotations for a small set of existing event types and (2) existing event ontologies, our framework applies to new event types without requiring additional annotation. Experiments on both existing event types (e.g., ACE, ERE) and new event types (e.g., FrameNet) demonstrate the effectiveness of our approach. Without any manual annotations for 23 new event types, our zero-shot framework achieves performance comparable to a state-of-the-art supervised model trained on the annotations of 500 event mentions.
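A minimal sketch of the grounding step: embed an event mention and every candidate type into the shared space, then assign the nearest type. The embeddings here stand in for the outputs of the paper's structural and compositional networks, which are not reproduced.

```python
# Zero-shot event typing: a mention is labeled with whichever candidate
# type embedding lies closest to it in the shared semantic space, so new
# (unannotated) types can be assigned as long as they can be embedded.
import torch
import torch.nn.functional as F

def classify_mention(mention_emb, type_embs, type_names):
    """mention_emb: (dim,); type_embs: (num_types, dim);
    type_names: list of num_types candidate type labels."""
    sims = F.cosine_similarity(type_embs, mention_emb.unsqueeze(0), dim=-1)
    return type_names[sims.argmax().item()]  # nearest type, seen or unseen
```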