The study proposes and tests a technique for automated emotion recognition through mouth detection via Convolutional Neural Networks (CNN), intended to support people with health conditions that impair communication (e.g., muscle wasting, stroke, autism, or simply pain) by recognizing emotions and generating real-time feedback or data for supporting systems. The software first determines whether a face is present in the acquired image, then locates the mouth and extracts the corresponding features. Both tasks are carried out using Haar feature-based classifiers, which guarantee fast execution and promising performance. Whereas our previous works focused on visual micro-expressions for personalized training on a single user, this strategy aims to train the system also on generalized face data sets.
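As a minimal illustration of the cascade-then-CNN pipeline described above, the following sketch uses OpenCV's bundled Haar cascades to locate a face and then a mouth region; the bundled smile cascade stands in for the mouth detector, and the downstream emotion CNN is only stubbed out, since the study's own cascades and network are not specified here.

```python
# Minimal sketch of the cascade-then-CNN pipeline (assumptions: OpenCV's bundled
# cascades as stand-ins for the paper's detectors; the emotion CNN is only stubbed).
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
mouth_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")  # stand-in mouth detector

def extract_mouth(image_bgr):
    """Return the cropped mouth region of the first detected face, or None."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Search for the mouth only in the lower half of the face box.
        roi = gray[y + h // 2 : y + h, x : x + w]
        mouths = mouth_cascade.detectMultiScale(roi, scaleFactor=1.3, minNeighbors=15)
        for (mx, my, mw, mh) in mouths:
            return image_bgr[y + h // 2 + my : y + h // 2 + my + mh,
                             x + mx : x + mx + mw]
    return None

# crop = extract_mouth(cv2.imread("frame.jpg"))
# if crop is not None: resize it and feed it to the emotion CNN.
```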
Autonomous vehicles that operate in urban environments must comply with existing rules and reason about their interactions with other decision-making agents. In this paper, we introduce a decentralized, communication-free, interaction-aware motion planner and apply it to Autonomous Surface Vessels (ASVs) in urban canals. We build upon a sampling-based method, Model Predictive Path Integral control (MPPI), and employ it at each time step to compute both a collision-free trajectory for the vehicle and a prediction of other agents' trajectories, thus modeling interactions. To improve the method's efficiency in multi-agent scenarios, we introduce a two-stage sample evaluation strategy and define an appropriate cost function to achieve rule compliance. We evaluate this decentralized approach in simulations with multiple vessels in real scenarios extracted from Amsterdam's canals, showing superior performance compared to a state-of-the-art trajectory optimization framework and robustness when encountering different types of agents.
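For readers unfamiliar with MPPI, the following toy sketch shows only the sampling-and-weighting core of the method on a 2D unicycle with a quadratic goal cost; the paper's interaction-aware prediction, rule-compliance cost terms, and two-stage sample evaluation are not reproduced, and the dynamics and parameters are illustrative assumptions.

```python
# Toy MPPI step: sample perturbed control sequences, roll them out,
# and average them with softmax weights on the negative trajectory cost.
import numpy as np

def rollout(x0, controls, dt=0.1):
    """Integrate simple unicycle dynamics (x, y, heading) for one control sequence."""
    x = x0.copy()
    traj = []
    for v, w in controls:
        x[0] += v * np.cos(x[2]) * dt
        x[1] += v * np.sin(x[2]) * dt
        x[2] += w * dt
        traj.append(x.copy())
    return np.array(traj)

def mppi_step(x0, u_nominal, goal, n_samples=256, sigma=(0.3, 0.3), lam=1.0):
    H = u_nominal.shape[0]
    noise = np.random.randn(n_samples, H, 2) * np.array(sigma)
    costs = np.empty(n_samples)
    for k in range(n_samples):
        traj = rollout(x0, u_nominal + noise[k])
        costs[k] = np.sum(np.linalg.norm(traj[:, :2] - goal, axis=1))  # distance-to-goal cost
    weights = np.exp(-(costs - costs.min()) / lam)
    weights /= weights.sum()
    return u_nominal + np.tensordot(weights, noise, axes=1)  # updated control sequence

# u = mppi_step(np.zeros(3), np.zeros((20, 2)), goal=np.array([5.0, 2.0]))
```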
Entropy is a fundamental concept in information theory. Conventional entropy measures are sensitive to changes in the length and amplitude of a time series. A new entropy metric, neural network entropy (NNetEn), has been developed to overcome these limitations. NNetEn is computed using a modified LogNNet neural network classification model. The algorithm contains a reservoir matrix of N = 19625 elements that must be filled with the given data. The contribution of this paper is threefold. First, this work investigates different methods of filling the reservoir with time series (signal) elements. The reservoir filling method determines the accuracy of the entropy estimation via the convolution of the studied time series with the LogNNet test data. The present study proposes six methods for filling the reservoir with a time series. Two of them (Method 3 and Method 6) employ the novel approach of stretching the time series to create intermediate elements that complement it without changing its dynamics. The most reliable methods for short time series are Method 3 and Method 5. The second part of the study examines the influence of noise and constant bias on entropy values. We examine three time series data types (chaotic, periodic, and binary) with different dynamic properties, signal-to-noise ratios (SNR), and offsets. The NNetEn entropy calculation errors are below 10% when the SNR exceeds 30 dB, and entropy decreases as the bias component increases. The third part of the article analyzes EEG biosignal data collected from emotion recognition experiments. The NNetEn measures remain robust under low-amplitude noise when various filters are applied. Thus, NNetEn measures entropy effectively in real-world environments with ambient noise, white noise, and 1/f noise.
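As a hedged illustration of the "stretching" idea attributed to Methods 3 and 6, the sketch below resamples a short time series onto the fixed reservoir size by linear interpolation, creating intermediate elements without altering the signal's dynamics; the actual LogNNet reservoir layout and the exact stretching procedure may differ.

```python
# Hedged sketch: fill a fixed-size reservoir by stretching a short time series
# with linear interpolation. The LogNNet model itself is not shown here.
import numpy as np

RESERVOIR_SIZE = 19625  # N elements, as stated in the abstract

def fill_reservoir_by_stretching(signal):
    """Resample `signal` onto RESERVOIR_SIZE points without changing its shape."""
    signal = np.asarray(signal, dtype=float)
    old_grid = np.linspace(0.0, 1.0, num=len(signal))
    new_grid = np.linspace(0.0, 1.0, num=RESERVOIR_SIZE)
    return np.interp(new_grid, old_grid, signal)

# reservoir = fill_reservoir_by_stretching(short_eeg_epoch)
```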
Synthesizing realistic animations of humans, animals, and even imaginary creatures has long been a goal for artists and computer graphics professionals. Compared to the imaging domain, which is rich with large available datasets, the number of data instances in the motion domain is limited, particularly for the animation of animals and exotic creatures (e.g., dragons), which have unique skeletons and motion patterns. In this work, we present a Single Motion Diffusion Model, dubbed SinMDM, designed to learn the internal motifs of a single motion sequence with arbitrary topology and to synthesize motions of arbitrary length that are faithful to them. We harness the power of diffusion models and present a denoising network designed specifically for the task of learning from a single input motion. Our transformer-based architecture avoids overfitting by using local attention layers that narrow the receptive field, and encourages motion diversity by using relative positional embedding. SinMDM can be applied in a variety of contexts, including spatial and temporal in-betweening, motion expansion, style transfer, and crowd animation. Our results show that SinMDM outperforms existing methods in both quality and time-space efficiency. Moreover, while current approaches require additional training for different applications, our work facilitates these applications at inference time. Our code and trained models are available at //sinmdm.github.io/SinMDM-page.
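The following sketch illustrates only one ingredient named above, a local attention step whose mask narrows the receptive field to a temporal window; the real SinMDM denoising network, its relative positional embedding, and the diffusion process are not shown, and the window size and feature shapes are placeholders.

```python
# Illustrative local (banded) self-attention over motion frames: each frame
# attends only to a +/- window neighborhood, narrowing the receptive field.
import torch
import torch.nn.functional as F

def local_self_attention(x, window=8):
    """x: (T, D) frame features; returns locally mixed features of the same shape."""
    T, D = x.shape
    scores = (x @ x.T) / D ** 0.5                        # (T, T) attention logits
    idx = torch.arange(T)
    mask = (idx[None, :] - idx[:, None]).abs() > window  # True outside the local band
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

# y = local_self_attention(torch.randn(120, 64), window=8)
```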
We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding. Our key insight is to explicitly recognize the role of phonology in sign production to achieve more accurate ISLR than existing work which does not consider sign language phonology. We train ISLR models that take in pose estimations of a signer producing a single sign to predict not only the sign but additionally its phonological characteristics, such as the handshape. These auxiliary predictions lead to a nearly 9% absolute gain in sign recognition accuracy on the WLASL benchmark, with consistent improvements in ISLR regardless of the underlying prediction model architecture. This work has the potential to accelerate linguistic research in the domain of signed languages and reduce communication barriers between deaf and hearing people.
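A minimal sketch of the multi-task setup described above: one pose encoder with a head for the sign label and an auxiliary head for a phonological property such as handshape. The encoder, feature dimensions, and class counts are placeholder assumptions, not the paper's architecture.

```python
# Pose encoder with a main sign-classification head and an auxiliary
# phonological head (handshape); the auxiliary loss regularizes the main task.
import torch
import torch.nn as nn

class PhonologyAwareISLR(nn.Module):
    def __init__(self, pose_dim=150, hidden=256, n_signs=2000, n_handshapes=50):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pose_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.sign_head = nn.Linear(hidden, n_signs)            # main ISLR prediction
        self.handshape_head = nn.Linear(hidden, n_handshapes)  # auxiliary phonology prediction

    def forward(self, pose):                # pose: (batch, pose_dim), e.g. pooled keypoints
        h = self.encoder(pose)
        return self.sign_head(h), self.handshape_head(h)

# Training would combine both losses, e.g.:
# loss = ce(sign_logits, sign_y) + aux_weight * ce(handshape_logits, handshape_y)
```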
If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have made progress in generating visuals from text inputs, the translation of sound into images remains largely unexplored. Generally, sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user and provide a means to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, referred to here as robot synesthesia. For general sound, we encode the simulated paintings and input sounds into the same latent space. For speech, we decouple the signal into its transcribed text and its tone: the text controls the content, while emotions estimated from the tone guide the mood of the painting. Our approach has been fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities, such as text and style. In two surveys, participants correctly guessed the emotion or natural sound used to generate a given painting at more than twice the rate of random chance. We also discuss qualitative results for sound-guided image manipulation and music-guided paintings.
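As a hedged sketch of encoding paintings and sounds into the same latent space, the snippet below aligns paired sound and painting embeddings with a symmetric contrastive (InfoNCE-style) objective; the actual encoders, training objective, and FRIDA integration are assumptions, not details taken from the paper.

```python
# Symmetric contrastive alignment of paired sound and painting embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(sound_emb, paint_emb, temperature=0.07):
    """sound_emb, paint_emb: (batch, dim) embeddings of matching sound/painting pairs."""
    s = F.normalize(sound_emb, dim=-1)
    p = F.normalize(paint_emb, dim=-1)
    logits = s @ p.T / temperature                       # pairwise similarities
    targets = torch.arange(s.size(0), device=s.device)   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```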
Recent advancements in wireless technology enable connected autonomous vehicles (CAVs) to gather data via vehicle-to-vehicle (V2V) communication, such as processed LIDAR and camera data from other vehicles. In this work, we design an integrated information sharing and safe multi-agent reinforcement learning (MARL) framework for CAVs that takes advantage of this extra information when making decisions, improving traffic efficiency and safety. We first use weight-pruned convolutional neural networks (CNNs) to process the raw image and point cloud LIDAR data locally at each autonomous vehicle and share the CNN outputs with neighboring CAVs. We then design a safe actor-critic algorithm that utilizes both a vehicle's local observation and the information received via V2V communication to learn an efficient behavior planning policy with safety guarantees. Using the CARLA simulator for experiments, we show that our approach improves the CAV system's efficiency in terms of average velocity and comfort under different CAV ratios and traffic densities. We also show that our approach avoids executing unsafe actions and always maintains a safe distance from other vehicles. We construct an obstacle-at-corner scenario to show that the shared vision helps CAVs observe obstacles earlier and take action to avoid traffic jams.
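A toy sketch of the decision step implied above: concatenate the ego observation with CNN features received over V2V, query the policy, and override any action that would violate a minimum-gap rule. The threshold, action set, and safety check are illustrative assumptions and not the paper's safe actor-critic formulation.

```python
# Combine local and V2V-shared features, then apply a simple safety override.
import numpy as np

SAFE_GAP_M = 10.0  # assumed minimum distance to the leading vehicle

def choose_action(policy, ego_obs, v2v_features, gap_to_leader_m):
    """policy: callable mapping an observation vector to a discrete action id."""
    obs = np.concatenate([ego_obs, v2v_features])
    action = policy(obs)                       # e.g. 0 = keep, 1 = accelerate, 2 = brake
    if gap_to_leader_m < SAFE_GAP_M and action == 1:
        action = 2                             # safety layer: never accelerate when too close
    return action
```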
Visual recognition is currently one of the most important and active research areas in computer vision, pattern recognition, and artificial intelligence more broadly. It is of great fundamental importance and meets strong industrial needs. Deep neural networks (DNNs) have greatly boosted performance on many concrete tasks, aided by large amounts of training data and powerful new computational resources. Although recognition accuracy is usually the first concern for new methods, efficiency is equally important, and sometimes critical, for both academic research and industrial applications. Moreover, insightful views on the opportunities and challenges of efficiency are highly valuable to the entire community. While general surveys on the efficiency of DNNs have been conducted from various perspectives, as far as we are aware, scarcely any of them has focused systematically on visual recognition, so it remains unclear which advances apply to it and what else should be considered. In this paper, we review recent advances and offer suggestions on possible new directions for improving the efficiency of DNN-based visual recognition approaches. We investigate not only the model but also the data perspective (which is not covered by existing surveys), and focus on the three most studied data types: images, videos, and point clouds. This paper attempts to provide a systematic summary, via a comprehensive survey, that can serve as a valuable reference and inspire both researchers and practitioners who work on visual recognition problems.
Emotion plays an important role in detecting fake news online. When leveraging emotional signals, existing methods focus on exploiting the emotions conveyed in the news content by its publishers (i.e., publisher emotion). However, fake news is often fabricated to evoke high-arousal or activating emotions so that it spreads like a virus, so the emotions aroused in the crowd and expressed in news comments (i.e., social emotion) cannot be ignored. Furthermore, it remains to be explored whether there is a relationship between publisher emotion and social emotion (i.e., dual emotion) and how this dual emotion manifests in fake news. In this paper, we propose Dual Emotion Features to mine the dual emotion and the relationship between its two components for fake news detection, and we design a universal paradigm that plugs these features into any existing detector as an enhancement. Experimental results on three real-world datasets demonstrate the effectiveness of the proposed features.
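A minimal sketch of assembling dual-emotion features as described: a publisher-emotion vector from the news text, an aggregated social-emotion vector from the comments, and their gap, concatenated onto whatever features an existing detector already uses. The emotion_vector extractor is a placeholder for any lexicon- or model-based emotion scorer, not the paper's specific feature set.

```python
# Build publisher-emotion, social-emotion, and gap features for a detector.
import numpy as np

def dual_emotion_features(emotion_vector, news_text, comments):
    """Concatenate publisher emotion, social emotion, and their gap."""
    pub = np.asarray(emotion_vector(news_text))                       # publisher emotion
    if comments:
        soc = np.mean([emotion_vector(c) for c in comments], axis=0)  # social emotion
    else:
        soc = np.zeros_like(pub)                                      # no comments yet
    return np.concatenate([pub, soc, pub - soc])                      # gap captures their relationship

# detector_input = np.concatenate([base_text_features,
#                                  dual_emotion_features(scorer, text, comments)])
```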
Verifiability is one of the core editing principles in Wikipedia: editors are encouraged to provide citations for added statements. A statement can be an arbitrary piece of text, ranging from a sentence up to a paragraph. However, in many cases, citations are outdated, missing, or link to non-existing references (e.g., dead URLs or moved content). In total, in 20% of cases such citations refer to news articles, which represent the second most cited type of source. Even when citations are provided, there are no explicit indicators of the span of text a citation covers. Beyond issues related to the verifiability principle, many Wikipedia entity pages are incomplete and lack relevant information that is already available in online news sources. Even for existing citations, there is often a delay between the news publication time and the reference time. In this thesis, we address the aforementioned issues and propose automated approaches that enforce the verifiability principle in Wikipedia and suggest relevant, missing news references to further enrich Wikipedia entity pages.
We consider the problem of zero-shot recognition: learning a visual classifier for a category with zero training examples, using only the word embedding of the category and its relationships to other categories for which visual data are provided. The key to dealing with an unfamiliar or novel category is to transfer knowledge obtained from familiar classes to describe the unfamiliar one. In this paper, we build upon the recently introduced Graph Convolutional Network (GCN) and propose an approach that uses both semantic embeddings and categorical relationships to predict the classifiers. Given a learned knowledge graph (KG), our approach takes as input the semantic embedding of each node (representing a visual category). After a series of graph convolutions, we predict the visual classifier for each category. During training, the visual classifiers of a few categories are given to learn the GCN parameters. At test time, the learned graph convolution filters are used to predict the visual classifiers of unseen categories. We show that our approach is robust to noise in the KG. More importantly, our approach provides a significant improvement in performance over the current state-of-the-art results (from 2-3% on some metrics to a whopping 20% on a few).
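The sketch below conveys the core mechanism: propagate per-category word embeddings over the knowledge graph with graph convolutions and regress one classifier weight vector per node. It is a two-layer NumPy GCN with placeholder sizes and weights, not the paper's model.

```python
# Two-layer GCN that maps word embeddings to per-category classifier weights.
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize A + I, as in standard GCNs."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def predict_classifiers(adj, word_emb, w1, w2):
    """word_emb: (n_nodes, d_in); returns (n_nodes, d_cls) visual classifier weights."""
    a = normalize_adjacency(adj)
    h = np.maximum(a @ word_emb @ w1, 0.0)    # graph convolution + ReLU
    return a @ h @ w2                         # predicted per-category classifiers

# Training fits w1, w2 so the predicted classifiers match the known ones for
# seen categories; the same propagation then yields classifiers for unseen nodes.
```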