Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with TF attention, aimed at noisy and reverberant environments. We dub this new architecture Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed $\text{Sep-TFAnet}^{\text{VAD}}$, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-Tasnet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use the short-time Fourier transform (STFT) and the inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is specifically developed for human-robot interaction and is designed to support online operation. The separation capabilities of $\text{Sep-TFAnet}^{\text{VAD}}$ and Sep-TFAnet were evaluated and extensively analyzed under several acoustic conditions, demonstrating their advantages over competing methods. Since separation networks trained on simulated data tend to perform poorly on real recordings, we also demonstrate the ability of the proposed scheme to generalize better to realistic examples recorded in our acoustic lab by a humanoid robot. Project page: //Sep-TFAnet.github.io
This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM) incorporating explicit periodic signals as auxiliary conditioning signals. Recently, DDPM-based neural vocoders have gained prominence as non-autoregressive models that can generate high-quality waveforms. The neural vocoders based on DDPM have the advantage of training with a simple time-domain loss. In practical applications, such as singing voice synthesis, there is a demand for neural vocoders to generate high-fidelity speech waveforms with flexible pitch control. However, conventional DDPM-based neural vocoders struggle to generate speech waveforms under such conditions. Our proposed model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals. Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders.
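As a rough illustration of what an explicit periodic conditioning signal can look like, the sketch below builds a sine wave from a per-sample F0 contour by phase accumulation. This is an assumption made for illustration only: the abstract does not specify the signal's form, and the function name `sine_excitation` and its parameters are hypothetical.

```python
import math

def sine_excitation(f0, sample_rate):
    """Build a sample-level sine signal from a per-sample F0 contour in Hz.

    Hypothetical helper: samples with f0 <= 0 are treated as unvoiced,
    emit 0.0, and reset the running phase.
    """
    signal = []
    phase = 0.0
    for hz in f0:
        if hz > 0:
            # Advance the phase by one sample's worth of the current frequency.
            phase = (phase + 2.0 * math.pi * hz / sample_rate) % (2.0 * math.pi)
            signal.append(math.sin(phase))
        else:
            phase = 0.0
            signal.append(0.0)
    return signal

# Usage: a 100 Hz voiced segment followed by an unvoiced segment at 8 kHz.
f0_contour = [100.0] * 160 + [0.0] * 10
excitation = sine_excitation(f0_contour, 8000)

# Pitch control in this sketch amounts to scaling the contour before synthesis.
octave_up = sine_excitation([hz * 2.0 for hz in f0_contour], 8000)
```

Because the period is encoded directly in the conditioning signal, changing the pitch only requires editing the F0 contour, which is the kind of control the abstract targets.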
We present a novel method that, for the first time, uses chest-movement-modulated radio signals for non-contact acquisition of the photoplethysmography (PPG) signal. Under the proposed method, a software-defined radio (SDR) exposes the chest of a subject sitting nearby to an orthogonal frequency division multiplexing signal with 64 sub-carriers at a center frequency of 5.24 GHz, while another SDR in close vicinity collects the modulated radio signal reflected off the chest. In this way, we construct a custom dataset by collecting 160 minutes of labeled data (both raw radio data and the reference PPG signal) from 16 healthy young subjects. With this dataset, we first apply principal component analysis for dimensionality reduction of the radio data. Next, we denoise the radio signal and the reference PPG signal using a wavelet technique, followed by segmentation and Z-score normalization. We then synchronize the radio and PPG segments using cross-correlation. Finally, we proceed to the waveform translation (regression) task, whereby we first convert the radio and PPG segments into the frequency domain using the discrete cosine transform (DCT), and then learn the non-linear regression between them. We reconstruct the synthetic PPG signal by taking the inverse DCT of the output of the regression block, with a mean absolute error of 8.1294. The synthetic PPG waveform has considerable clinical significance, as it could be used for non-contact performance assessment of the cardiovascular and respiratory systems of patients suffering from infectious diseases, e.g., COVID-19.
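Three of the preprocessing steps named above (Z-score normalization, cross-correlation synchronization, and the DCT/inverse DCT pair) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the DCT is the unnormalized DCT-II, and the PCA, wavelet denoising, and regression stages are omitted.

```python
import math

def zscore(x):
    """Normalize a segment to zero mean and unit variance (assumes non-constant x)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / std for v in x]

def best_lag(a, b, max_lag):
    """Return the lag k in [-max_lag, max_lag] that maximizes sum_n a[n] * b[n + k]."""
    best_k, best_score = 0, float("-inf")
    for k in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(a)):
            m = n + k
            if 0 <= m < len(b):
                score += a[n] * b[m]
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct2(X):
    """Inverse of dct2 (a scaled DCT-III)."""
    N = len(X)
    return [(X[0] / 2.0
             + sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k) for k in range(1, N)))
            * 2.0 / N
            for n in range(N)]

# Usage: a toy "PPG" reference and a "radio" segment delayed by 3 samples.
ppg = [math.sin(2.0 * math.pi * n / 16.0) for n in range(64)]
radio = [0.0] * 3 + ppg[:-3]
lag = best_lag(ppg, radio, 8)        # estimated delay of radio relative to ppg
coeffs = dct2(zscore(ppg))           # frequency-domain regression target
recovered = idct2(coeffs)            # inverse DCT returns the normalized segment
```

In the pipeline described above, a regression model would map the DCT coefficients of the radio segment to those of the PPG segment, and the inverse DCT of the model output would give the synthetic PPG waveform.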
We propose two novel extensions of the Wyner common information optimization problem. Each relaxes one of the fundamental constraints in Wyner's formulation. The \textit{Variational Wyner Common Information} relaxes the matching constraint to the known distribution while imposing conditional independence on the feasible solution set. We derive a tight surrogate upper bound of the resulting unconstrained Lagrangian via the theory of variational inference, which can be minimized efficiently. Our solver caters to problems where conditional independence holds, with significantly reduced computational complexity. On the other hand, the \textit{Bipartite Wyner Common Information} relaxes the conditional independence constraint, whereas the matching condition is enforced on the feasible set. By leveraging the difference-of-convex structure of the formulated optimization problem, we show that our solver is resilient to conditionally dependent sources. Both solvers provably converge to local stationary points, and empirically, they obtain more accurate solutions to Wyner's formulation with substantially less runtime. Moreover, they can be extended to unknown distribution settings by parameterizing the common randomness as a member of the exponential family of distributions. Our approaches apply to multi-modal clustering problems, where multiple modalities of observations come from the same cluster. Empirically, our solvers outperform state-of-the-art multi-modal clustering algorithms by a significant margin.
Risk mitigation techniques are critical to avoiding accidents associated with driving behaviour. We provide a novel Multi-Class Driver Distraction Risk Assessment (MDDRA) model that considers the vehicle, driver, and environmental data during a journey. MDDRA categorises the driver on a risk matrix as safe, careless, or dangerous. It offers flexibility in adjusting the parameters and weights to consider each event on a specific severity level. We collect real-world data using the Field Operation Test (TeleFOT), covering drivers using the same routes in the East Midlands, United Kingdom (UK). The results show that reducing road accidents caused by driver distraction is possible. We also study the correlation between distraction (driver, vehicle, and environment) and the classification severity based on a continuous distraction severity score. Furthermore, we apply machine learning techniques to classify and predict driver distraction according to severity levels to aid the transition of control from the driver to the vehicle (vehicle takeover) when a situation is deemed risky. The Ensemble Bagged Trees algorithm performed best, with an accuracy of 96.2%.
Few-shot dialogue state tracking (DST) with Large Language Models (LLMs) relies on an effective and efficient conversation retriever to find similar in-context examples for prompt learning. Previous works use raw dialogue context as search keys and queries, and a retriever is fine-tuned with annotated dialogues to achieve superior performance. However, this approach is less suited for scaling to new domains or new annotation languages, where fine-tuning data is unavailable. To address this problem, we handle the task of conversation retrieval based on text summaries of the conversations. An LLM-based conversation summarizer is adopted for query and key generation, which enables effective maximum inner product search. To avoid the extra inference cost brought by LLM-based conversation summarization, we further distill a light-weight conversation encoder which produces query embeddings without decoding summaries for test conversations. We validate our retrieval approach on MultiWOZ datasets with GPT-Neo-2.7B and LLaMA-7B/30B. The experimental results show a significant improvement over relevant baselines in real few-shot DST settings.
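The retrieval step mentioned above, maximum inner product search over query and key embeddings, can be illustrated with a brute-force sketch. The embeddings below are made-up toy vectors and the function names are hypothetical; a real system would obtain the vectors from an encoder and would likely use an approximate search index rather than an exhaustive scan.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query, keys, k):
    """Return the indices of the k keys with the largest inner product with query."""
    ranked = sorted(range(len(keys)),
                    key=lambda i: inner_product(query, keys[i]),
                    reverse=True)
    return ranked[:k]

# Usage: toy 2-D embeddings standing in for summary embeddings of stored dialogues.
key_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_embedding = [1.0, 0.2]
examples = retrieve_top_k(query_embedding, key_embeddings, 2)
```

The returned indices would select the annotated dialogues to place in the prompt as in-context examples.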
Answering questions that require reading text in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, resorting only to pre-trained word embedding models is far from enough. A desirable model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities, respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents scene texts better and clearly improves performance on two VQA tasks that require reading scene texts.
Convolutional neural networks (CNNs) have shown dramatic improvements in single image super-resolution (SISR) by using large-scale external samples. Despite their remarkable performance based on the external dataset, they cannot exploit internal information within a specific image. Another problem is that they are applicable only to the specific data condition on which they are supervised. For instance, the low-resolution (LR) image should be a "bicubic" downsampled, noise-free image from a high-resolution (HR) one. To address both issues, zero-shot super-resolution (ZSSR) has been proposed for flexible internal learning. However, it requires thousands of gradient updates, i.e., a long inference time. In this paper, we present Meta-Transfer Learning for Zero-Shot Super-Resolution (MZSR), which leverages ZSSR. Specifically, it is based on finding a generic initial parameter that is suitable for internal learning. Thus, we can exploit both external and internal information, where a single gradient update can yield considerable results (see Figure 1). With our method, the network can quickly adapt to a given image condition. In this respect, our method can be applied to a large spectrum of image conditions within a fast adaptation process.
We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector by means of a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristics of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we propose an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we use a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopt a conditional random field to refine the preliminary classification results generated by the GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieves encouraging classification accuracy using a small amount of training data.
We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second). We apply our iterative optimization-based attack end-to-end to Mozilla's implementation of DeepSpeech, and show that it has a 100% success rate. The feasibility of this attack introduces a new domain in which to study adversarial examples.