The aim of this study is to determine the effect of language variety on the spectral distribution of stressed and unstressed sonorants (nasals /m, n/, lateral approximants /l/, and rhotics /r/) and on their coarticulatory effects on adjacent sounds. To quantify the shape of the spectral distribution, we calculated spectral moments from the spectra of these sonorants as produced by Athenian Greek and Cypriot Greek speakers. To estimate the coarticulatory effects of sonorants on the F1-F4 formant frequencies of adjacent vowels, we developed polynomial models of the adjacent vowel's formant contours. We found significant effects of language variety (sociolinguistic information) on the spectral moments of each sonorant /m/, /n/, /l/, /r/ (with the exception of /m/ versus /n/) and on the formant contours of the adjacent vowel. All sonorants (including /m/ and /n/) had distinct effects on the adjacent vowel's formant contours, especially on F3 and F4. The study highlights that the combination of spectral moments and coarticulatory effects of sonorants conveys both linguistic (stress and phonemic category) and sociolinguistic (language variety) characteristics of sonorants. It also provides the first comparative acoustic analysis of Athenian Greek and Cypriot Greek sonorants.
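As a minimal sketch of the spectral-moment computation described above (not the authors' analysis pipeline; the window choice is an assumption), the four moments can be obtained by treating the normalized power spectrum as a probability distribution over frequency:

```python
import numpy as np

def spectral_moments(signal, sr):
    """Center of gravity, spread, skewness, and excess kurtosis of a spectrum."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    p = spectrum / spectrum.sum()                       # normalize to a distribution
    cog = np.sum(freqs * p)                             # 1st moment: center of gravity
    sd = np.sqrt(np.sum(((freqs - cog) ** 2) * p))      # 2nd moment: spread
    skew = np.sum(((freqs - cog) ** 3) * p) / sd ** 3   # 3rd moment: skewness
    kurt = np.sum(((freqs - cog) ** 4) * p) / sd ** 4 - 3  # 4th: excess kurtosis
    return cog, sd, skew, kurt
```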
Optimal Markov decision process policies for problems with finite state and action spaces are identified through a partial ordering obtained by comparing the value function across states. This is referred to as state-based optimality. This paper identifies when such optimality guarantees some form of system-based optimality, as measured by a scalar. Four such system-based metrics are introduced. Univariate empirical distributions of these metrics are obtained through simulation to assess whether theoretically optimal policies provide a statistically significant advantage. This assessment is conducted using a Student's $t$-test, a Welch's $t$-test, and a Mann-Whitney $U$-test. The proposed method is applied to a common problem in queuing theory: admission control.
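The statistical comparison can be sketched as follows (an illustrative sketch, not the paper's code; the metric samples and sample sizes are placeholders). All three tests are available in scipy.stats, with Welch's test obtained by disabling the equal-variance assumption of the Student's test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder simulated distributions of one system-based metric under the
# theoretically optimal policy and a baseline policy.
metric_optimal = rng.normal(loc=10.2, scale=1.0, size=500)
metric_baseline = rng.normal(loc=10.0, scale=1.2, size=500)

t_student = stats.ttest_ind(metric_optimal, metric_baseline, equal_var=True)
t_welch = stats.ttest_ind(metric_optimal, metric_baseline, equal_var=False)
u_mann = stats.mannwhitneyu(metric_optimal, metric_baseline, alternative="two-sided")

print(t_student.pvalue, t_welch.pvalue, u_mann.pvalue)
```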
Literacy assessment is an important activity for education administrators across the globe. Typically achieved in a school setting by testing a child's oral reading, it is intensive in human resources. While automatic speech recognition (ASR) is a potential solution to the problem, it tends to be computationally expensive for hand-held devices, apart from needing language- and accent-specific speech for training. In this work, we propose a system to predict the word-decoding skills of a student based on simple acoustic features derived from the recording. We first identify a meaningful categorization of word-decoding skills by analyzing a manually transcribed data set of children's oral reading recordings. Next, the automatic prediction of the category is attempted with the proposed acoustic features. Pause statistics, syllable rate, and spectral and intensity dynamics are found to be reliable indicators of specific types of oral reading deficits, providing useful feedback by discriminating the different characteristics of beginning readers. This computationally simple and language-agnostic approach is found to provide performance close to that obtained using a language-dependent ASR system that required considerable tuning of its parameters.
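To make the feature set concrete, here is a minimal sketch of one such feature, pause statistics derived from short-time energy (the frame parameters and silence threshold are assumptions, not the authors' implementation):

```python
import numpy as np

def pause_statistics(signal, sr, frame_ms=25, hop_ms=10, silence_db=-35.0):
    """Count pauses and their mean duration from short-time log energy."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, len(signal) - frame) // hop
    energy = np.array([np.mean(signal[i * hop : i * hop + frame] ** 2)
                       for i in range(n_frames)])
    db = 10 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    silent = db < silence_db                  # frames below the silence threshold
    # Find runs of consecutive silent frames and convert them to seconds.
    padded = np.concatenate(([False], silent, [False]))
    edges = np.diff(padded.astype(int))
    starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
    durations = (ends - starts) * hop / sr
    return len(durations), float(durations.mean()) if len(durations) else 0.0
```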
Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments such as meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To create a testbed that can effectively compare diarization methods on videos in the wild, we annotate speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD. This benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and completely off-screen speakers. Yet how to deal with off-screen and on-screen speakers together remains a critical challenge. To overcome it, we propose a novel Audio-Visual Relation Network (AVR-Net), which introduces an effective modality mask to capture discriminative information based on visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Ablation studies demonstrate the advantages of the proposed AVR-Net, and especially of the modality mask, for diarization. Our data and code will be made publicly available at //github.com/zcxu-eric/AVA-AVD.
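As a purely hypothetical sketch of the modality-mask idea (the released AVR-Net code should be consulted for the actual design; all names here are illustrative), a pairwise relation can fall back to audio-only evidence when a speaker's face is off-screen:

```python
import numpy as np

def pairwise_relation(audio_a, audio_b, face_a, face_b, visible_a, visible_b):
    """Cosine similarity over concatenated embeddings, with the face part
    zeroed out by a visibility mask when a speaker is off-screen."""
    mask_a = np.concatenate([np.ones_like(audio_a), np.full_like(face_a, visible_a)])
    mask_b = np.concatenate([np.ones_like(audio_b), np.full_like(face_b, visible_b)])
    ea = np.concatenate([audio_a, face_a]) * mask_a
    eb = np.concatenate([audio_b, face_b]) * mask_b
    return float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-12))
```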
Lightning is a destructive and highly visible product of severe storms, yet there is still much to be learned about the conditions under which lightning is most likely to occur. The GOES-16 and GOES-17 satellites, launched in 2016 and 2018 by NOAA and NASA, collect a wealth of data regarding individual lightning strike occurrence and potentially related atmospheric variables. The acute nature of and inherent spatial correlation in lightning data render standard regression analyses inappropriate. Further, computational considerations are foregrounded by the desire to analyze the immense and rapidly increasing volume of lightning data. We present a new computationally feasible method that combines spectral and Laplace approximations in an EM algorithm, denoted SLEM, to fit the widely popular log-Gaussian Cox process model to large spatial point pattern datasets. In simulations, we find SLEM is competitive with contemporary techniques in terms of speed and accuracy. When applied to two lightning datasets, SLEM provides better out-of-sample prediction scores and quicker runtimes, suggesting its particular usefulness for analyzing lightning data, which tend to have sparse signals.
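For readers unfamiliar with the model being fitted, the following minimal sketch simulates a log-Gaussian Cox process on a grid (illustrative grid size and covariance parameters; this shows the model, not the SLEM fitting algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 25                                               # cells per side
xy = np.stack(np.meshgrid(np.arange(n), np.arange(n)), axis=-1).reshape(-1, 2)
d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
cov = np.exp(-d / 5.0)                               # exponential covariance, range 5 cells
z = rng.multivariate_normal(np.full(n * n, -2.0), cov)  # latent Gaussian log-intensity
counts = rng.poisson(np.exp(z))                      # conditionally Poisson counts per cell
```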
This working document allows the analysis of information and cases associated with COVID-19 (SARS-CoV-2), based on the daily information generated by the Government of Mexico through the Secretariat of Health, which is responsible for the Epidemiological Surveillance System for Viral Respiratory Diseases (SVEERV). The SVEERV information is disseminated as open data at the municipal, state, and national levels. In addition, the genomic surveillance of COVID-19 (SARS-CoV-2), through the identification of variants and mutations, is registered in the database of the Global Initiative on Sharing All Influenza Data (GISAID), based in Germany. These two sources, SVEERV and GISAID, provide the information for analyzing the impact of COVID-19 (SARS-CoV-2) on the population of Mexico. The first data source provides national-level information on patients according to age, sex, comorbidities, and the presence of COVID-19 (SARS-CoV-2), among other characteristics. The data analysis is carried out by designing an algorithm that applies data mining techniques and methodology to estimate the case fatality rate and the positivity index, and to identify a typology according to the severity of infection in patients with a positive result for COVID-19 (SARS-CoV-2). From the second data source, worldwide information on new variants and mutations of COVID-19 (SARS-CoV-2) is obtained, providing valuable input for timely genomic surveillance. This study analyzes the impact of COVID-19 (SARS-CoV-2) on the indigenous-language-speaking population and provides information quickly and in a timely manner to support the design of public health policy.
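A minimal sketch of the two indicators estimated from the open data follows (the file and column names are hypothetical, since the SVEERV schema is not reproduced here):

```python
import pandas as pd

df = pd.read_csv("covid_open_data.csv")   # hypothetical file name

tested = len(df)
positive = df["result"].eq("positive").sum()                       # assumed column/values
deaths = df.loc[df["result"].eq("positive"), "date_of_death"].notna().sum()

positivity_index = positive / tested       # share of tests that come back positive
case_fatality_rate = deaths / positive     # deaths among confirmed cases
print(f"positivity: {positivity_index:.3f}, CFR: {case_fatality_rate:.3f}")
```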
We study the use of the Wave-U-Net architecture for speech enhancement, a model introduced by Stoller et al. for the separation of music vocals and accompaniment. This end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and the use of large temporal contexts. Our experiments show that the proposed method improves several metrics, namely PESQ, CSIG, CBAK, COVL, and SSNR, over the state of the art on the speech enhancement task on the Voice Bank (VCTK) corpus. We find that a reduced number of hidden layers is sufficient for speech enhancement in comparison to the original system designed for singing voice separation in music. We see this initial result as an encouraging signal to further explore speech enhancement in the time domain, both as an end in itself and as a pre-processing step for speech recognition systems.
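As a much-simplified, hypothetical sketch of the time-domain encoder-decoder idea (layer counts, kernel sizes, and channel widths are illustrative and far smaller than Stoller et al.'s model), the architecture downsamples with 1-D convolutions and restores resolution with interpolation plus skip connections:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveUNet(nn.Module):
    def __init__(self, ch=24, depth=3):
        super().__init__()
        self.down = nn.ModuleList([nn.Conv1d(1 if i == 0 else ch, ch, 15, padding=7)
                                   for i in range(depth)])
        self.bottleneck = nn.Conv1d(ch, ch, 15, padding=7)
        self.up = nn.ModuleList([nn.Conv1d(2 * ch, ch, 5, padding=2)
                                 for _ in range(depth)])
        self.out = nn.Conv1d(ch + 1, 1, 1)

    def forward(self, x):                      # x: (batch, 1, time)
        skips, inp = [], x
        for conv in self.down:
            x = torch.relu(conv(x))
            skips.append(x)                    # keep for the skip connection
            x = x[..., ::2]                    # decimate by 2 on the way down
        x = torch.relu(self.bottleneck(x))
        for conv, skip in zip(self.up, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-1], mode="linear",
                              align_corners=False)
            x = torch.relu(conv(torch.cat([x, skip], dim=1)))
        return self.out(torch.cat([x, inp], dim=1))  # enhanced waveform

y = TinyWaveUNet()(torch.randn(1, 1, 16000))   # one second at 16 kHz
```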
Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have recently been obtained with Convolutional Neural Networks (CNNs) fed directly with raw speech samples. Rather than employing standard hand-crafted features, such CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieving this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, the proposed method learns only the low and high cutoff frequencies directly from data. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
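The band-pass parametrization at the core of SincNet can be sketched as follows (a minimal numpy illustration with assumed cutoffs and filter length; SincNet itself learns the two cutoffs by backpropagation): each filter is the difference of two windowed sinc low-pass filters, so only the low and high cutoff frequencies are free parameters.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, sr, length=251):
    """Windowed ideal band-pass impulse response with cutoffs in Hz."""
    t = (np.arange(length) - (length - 1) / 2) / sr

    def lowpass(fc):
        return 2 * fc * np.sinc(2 * fc * t)   # ideal low-pass with cutoff fc

    h = (lowpass(f_high) - lowpass(f_low)) * np.hamming(length)
    return h / np.abs(h).sum()                # normalize the filter gain

filt = sinc_bandpass(300.0, 3400.0, sr=16000)  # e.g., a telephone-band filter
```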
The pervasive use of social media provides massive amounts of data about individuals' online social activities and their social relations. The building block of most existing recommendation systems is the similarity between users with social relations, i.e., friends. While friendship ensures some homophily, the similarity between a user and her friends can vary as the number of friends increases. Research from sociology suggests that friends are more similar than strangers, but friends can still have different interests. Exogenous information such as comments and ratings may help discern different degrees of agreement (i.e., congruity) among similar users. In this paper, we investigate whether users' congruity can be incorporated into recommendation systems to improve their performance. Experimental results demonstrate the effectiveness of embedding congruity-related information into recommendation systems.
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation, and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In our previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear whether they would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, all of which are shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500-hour voice search task, we find that the proposed changes improve the WER of the LAS system from 9.2% to 5.6%, while the best conventional system achieves 6.7% WER. We also test both models on a dictation dataset, where our model provides 4.1% WER while the conventional system provides 5% WER.
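To make the reported metric concrete, here is the standard word-level edit-distance computation of WER (a generic implementation, not the paper's training code):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("turn on the lights", "turn of the light"))  # 0.5: two substitutions
```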
With the ever-growing volume, complexity, and dynamicity of online information, recommender systems have been an effective solution to information overload. In recent years, deep learning's revolutionary advances in speech recognition, image analysis, and natural language processing have gained significant attention. Meanwhile, recent studies also demonstrate its effectiveness in coping with information retrieval and recommendation tasks. Applying deep learning techniques to recommender systems has been gaining momentum due to their state-of-the-art performance and high-quality recommendations. In contrast to traditional recommendation models, deep learning provides a better understanding of users' demands, items' characteristics, and the historical interactions between them. This article aims to provide a comprehensive review of recent research efforts on deep learning based recommender systems, towards fostering innovation in recommender systems research. A taxonomy of deep learning based recommendation models is presented and used to categorize the surveyed articles. Open problems are identified based on an analysis of the reviewed works, and potential solutions are discussed.