Numerous voice conversion (VC) techniques have been proposed for converting voices between different speakers. Although the converted speech is of good quality when VC is applied in a clean environment, the quality degrades drastically when the system runs in noisy conditions. To address this issue, we propose a novel speech enhancement (SE)-assisted VC system that uses SE techniques for signal pre-processing, where the VC and SE components are optimized under a joint training strategy with the aim of providing high-quality converted speech signals. We adopt a popular model, StarGAN, as the VC component and thus call the combined system EStarGAN. We test the proposed EStarGAN system on a Mandarin speech corpus. The experimental results first verify the effectiveness of the joint training strategy used in EStarGAN. Moreover, EStarGAN demonstrates robust performance in various unseen noisy environments. The subjective listening test results further show that EStarGAN can improve the sound quality of speech signals converted from noise-corrupted source utterances.

Related content

Performance · MoDELS · Attention · Reducible · HTTPS

March 13, 2023

Lite DETR : An Interleaved Multi-Scale Encoder for Efficient DETR

Feng Li,Ailing Zeng,Shilong Liu,Hao Zhang,Hongyang Li,Lei Zhang,Lionel M. Ni

from arxiv, CVPR 2023

Recent DEtection TRansformer-based (DETR) models have obtained remarkable performance. Their success cannot be achieved without the re-introduction of multi-scale feature fusion in the encoder. However, the excessive number of tokens in multi-scale features, about 75% of which come from low-level features, is computationally inefficient, which hinders real applications of DETR models. In this paper, we present Lite DETR, a simple yet efficient end-to-end object detection framework that can effectively reduce the GFLOPs of the detection head by 60% while keeping 99% of the original performance. Specifically, we design an efficient encoder block to update high-level features (corresponding to small-resolution feature maps) and low-level features (corresponding to large-resolution feature maps) in an interleaved way. In addition, to better fuse cross-scale features, we develop a key-aware deformable attention to predict more reliable attention weights. Comprehensive experiments validate the effectiveness and efficiency of the proposed Lite DETR, and the efficient encoder strategy can generalize well across existing DETR-based models. The code will be available at github.com/IDEA-Research/Lite-DETR.

Speech enhancement · Continuity · Episode · Backbone · Speech synthesis

March 13, 2023

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

Rodrigo Mira,Buye Xu,Jacob Donley,Anurag Kumar,Stavros Petridis,Vamsi Krishna Ithapu,Maja Pantic

from arxiv, accepted to ICASSP 2023

Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, often resulting in visual backbones added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that LA-VocE outperforms existing methods according to multiple metrics, particularly under very noisy scenarios.

Model evaluation · Episode · Speech enhancement · Noise · MoDELS

March 12, 2023

Improving the Intent Classification accuracy in Noisy Environment

Mohamed Nabih Ali,Alessio Brutti,Daniele Falavigna

Intent classification is a fundamental task in the spoken language understanding field that has recently gained the attention of the scientific community, mainly because of the feasibility of approaching it with end-to-end neural models. This makes it possible to avoid intermediate steps, i.e., automatic speech recognition, and thus the propagation of errors due to background noise, spontaneous speech, speaking styles of users, etc. Towards the development of solutions applicable in real scenarios, it is interesting to investigate how environmental noise and related noise reduction techniques affect the intent classification task with end-to-end neural models. In this paper, we experiment with a noisy version of the Fluent Speech Commands data set, combining the intent classifier with a time-domain speech enhancement solution based on Wave-U-Net and considering different training strategies. Experimental results reveal that, for this task, the use of speech enhancement greatly improves the classification accuracy in noisy conditions, in particular when the classification model is trained on enhanced signals.

Performer · Dropout · MoDELS · Learning · Federated learning

March 11, 2023

Stabilizing and Improving Federated Learning with Non-IID Data and Client Dropout in IoT Systems

Jian Xu,Meiling Yang,Wenbo Ding,Shao-Lun Huang

Federated learning is an emerging technique for training deep models over decentralized clients without exposing private data. However, it suffers from label distribution skew, which usually results in slow convergence and degraded model performance. This challenge can be more serious when the participating clients are in unstable circumstances and drop out frequently. Previous work and our empirical observations demonstrate that the classifier head for classification tasks is more sensitive to label skew, and that the unstable performance of FedAvg stems mainly from the imbalanced training samples across different classes. The biased classifier head also impacts the learning of feature representations. Therefore, maintaining a balanced classifier head is of significant importance for building a better global model. To tackle this issue, we propose a simple yet effective framework that introduces a prior-calibrated softmax function for computing the cross-entropy loss and a prototype-based feature augmentation scheme to re-balance the local training; both are lightweight for edge devices and can facilitate the global model aggregation. With extensive experiments performed on the FashionMNIST and CIFAR-10 datasets, we demonstrate the improved model performance of our method over existing baselines in the presence of non-IID data and client dropout.
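The abstract does not define the prior-calibrated softmax, so the following is only a minimal sketch of one plausible reading: the log of each client's local class prior is added to the logits before the softmax, so that classes rare on a client are not suppressed by local training. The function name `prior_calibrated_cross_entropy`, the smoothing constant, and the use of raw label counts as the prior are all assumptions made for illustration, not the authors' implementation.

```python
import math


def prior_calibrated_cross_entropy(logits, label, class_counts, smoothing=1e-8):
    """Cross-entropy on logits shifted by the log of the local class prior.

    Assumption: "prior-calibrated" means adding log p(class) to each logit,
    where p is estimated from the client's local label counts.
    """
    total = sum(class_counts) + smoothing * len(class_counts)
    # Smoothed local prior, so that classes absent on this client keep a finite log.
    log_priors = [math.log((count + smoothing) / total) for count in class_counts]
    adjusted = [z + lp for z, lp in zip(logits, log_priors)]

    # Numerically stable log-sum-exp over the adjusted logits.
    peak = max(adjusted)
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in adjusted))
    return log_norm - adjusted[label]


# Hypothetical client holding 9 samples of class 0 and 1 sample of class 1.
loss_major = prior_calibrated_cross_entropy([0.0, 0.0], 0, [9, 1])
loss_minor = prior_calibrated_cross_entropy([0.0, 0.0], 1, [9, 1])
```

With identical raw logits, the minority-class sample incurs the larger loss, which pushes the classifier to raise the minority logit during local training. With uniform class counts the function reduces to the ordinary softmax cross-entropy.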

Concatenation · MoDELS · Feature extraction · Latent · Reducible

March 10, 2023

Accurate Real-time Polyp Detection in Videos from Concatenation of Latent Features Extracted from Consecutive Frames

Hemin Ali Qadir,Younghak Shin,Jacob Bergsland,Ilangko Balasingham

An efficient deep learning model that can run in real time for polyp detection is crucial to reducing the polyp miss rate during screening procedures. Convolutional neural networks (CNNs) are vulnerable to small changes in the input image. A CNN-based model may miss the same polyp appearing in a series of consecutive frames and produce unsubtle detection output due to changes in camera pose, lighting condition, light reflection, etc. In this study, we attempt to tackle this problem by integrating temporal information among neighboring frames. We propose an efficient feature concatenation method for a CNN-based encoder-decoder model without adding complexity to the model. The proposed method incorporates extracted feature maps of previous frames to detect polyps in the current frame. The experimental results demonstrate that the proposed method of feature concatenation improves the overall performance of automatic polyp detection in videos. The following results are obtained on a public video dataset: sensitivity 90.94%, precision 90.53%, and specificity 92.46%.
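For readers unfamiliar with the three reported metrics, the sketch below computes them from confusion-matrix counts using the standard definitions. The helper name `detection_metrics` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard detection metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is absent from the data.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of real polyps that are detected
        "precision": ratio(tp, tp + fp),    # share of detections that are real polyps
        "specificity": ratio(tn, tn + fp),  # share of polyp-free cases left unflagged
    }


# Illustrative counts only, not the paper's data.
metrics = detection_metrics(tp=90, fp=10, tn=80, fn=20)
```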

Factorial · motivation · CASES · Design · Regularization term

March 10, 2023

Enumeration of regular fractional factorial designs with four-level and two-level factors

Alexandre Bohyn,Eric D. Schoen,Peter Goos

from arxiv, 37 pages, 1 figure, 13 tables

Designs for screening experiments usually include factors with two levels only. Adding a few four-level factors allows for the inclusion of multi-level categorical factors or quantitative factors with possible quadratic or third-order effects. Three examples motivated us to generate a large catalog of designs with two-level factors as well as four-level factors. To create the catalog, we considered three methods. In the first method, we select designs using a search table, and in the second method, we use a procedure that selects candidate designs based on the properties of their projections into fewer factors. The third method is actually a benchmark method, in which we use a general orthogonal array enumeration algorithm. We compare the efficiencies of the new methods for generating complete sets of non-isomorphic designs. Finally, we use the most efficient method to generate a catalog of designs with up to three four-level factors and up to 20 two-level factors for run sizes 16, 32, 64, and 128. In some cases, a complete enumeration was infeasible. For these cases, we used a bounded enumeration strategy instead. We demonstrate the usefulness of the catalog by revisiting the motivating examples.

3D · Tractable · state-of-the-art · Performer · Statistic

March 9, 2023

3D wind field profiles from hyperspectral sounders: revisiting optic-flow from a meteorological perspective

P. Héas,O. Hautecoeur,R. Borde

In this work, we present an efficient optic flow algorithm for the extraction of vertically resolved 3D atmospheric motion vector (AMV) fields from incomplete hyperspectral image data measured by infrared sounders. The model at the heart of the energy to be minimized is consistent with atmospheric dynamics, incorporating ingredients of thermodynamics, hydrostatic equilibrium and statistical turbulence. Modern optimization techniques are deployed to design a low-complexity solver for the energy minimization problem, which is non-convex, non-differentiable, high-dimensional and subject to physical constraints. In particular, taking advantage of the alternating direction method of multipliers (ADMM), we show how to split the original high-dimensional problem into a recursion involving a set of standard and tractable optic-flow sub-problems. By comparing with the ground truth provided by the operational numerical simulation of the European Centre for Medium-Range Weather Forecasts (ECMWF), we show that the performance of the proposed method is superior to state-of-the-art optical flow algorithms in the context of real infrared atmospheric sounding interferometer (IASI) observations.

TSE · SC · Performer · Networking · End-to-end

March 9, 2023

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Kai Liu,Ziqing Du,Xucheng Wan,Huan Zhou

from arxiv, Accepted by ICASSP 2023

Target speech extraction (TSE) systems are designed to extract target speech from a multi-talker mixture. The popular training objective for most prior TSE networks is to enhance the reconstruction performance of the extracted speech waveform. However, it has been reported that a TSE system that delivers high reconstruction performance may still suffer from low-quality experience problems in practice. One such problem is wrong speaker extraction (called speaker confusion, SC), which leads to a strongly negative experience and hampers effective conversations. To mitigate the pressing SC issue, we reformulate the training objective and propose two novel loss schemes that explore the metric of reconstruction improvement performance defined at the small-chunk level and leverage the distribution information associated with that metric. Both loss schemes aim to encourage a TSE network to pay attention to the SC chunks based on this distribution information. On this basis, we present X-SepFormer, an end-to-end TSE model with the proposed loss schemes and a SepFormer backbone. Experimental results on the benchmark WSJ0-2mix dataset validate the effectiveness of our proposals, showing consistent improvements on SC errors (by 14.8% relative). Moreover, with an SI-SDRi of 19.4 dB and a PESQ of 3.81, our best system significantly outperforms the current SOTA systems and offers the top TSE results reported to date on WSJ0-2mix.

Performer · Learned · Boosting (a way to speed up model training) · MoDELS · Identifiable

December 22, 2021

Hybrid Curriculum Learning for Emotion Recognition in Conversation

Lin Yang,Yi Shen,Yue Mao,Longjun Cai

from arxiv, Accepted by AAAI-2022

Emotion recognition in conversation (ERC) aims to detect the emotion label for each utterance. Motivated by recent studies showing that feeding training examples in a meaningful order, rather than randomly, can boost model performance, we propose an ERC-oriented hybrid curriculum learning framework. Our framework consists of two curricula: (1) conversation-level curriculum (CC); and (2) utterance-level curriculum (UC). In CC, we construct a difficulty measurer based on "emotion shift" frequency within a conversation; the conversations are then scheduled in an "easy to hard" schema according to the difficulty score returned by the difficulty measurer. UC is implemented from an emotion-similarity perspective and progressively strengthens the model's ability to identify confusing emotions. With the proposed model-agnostic hybrid curriculum learning strategy, we observe significant performance boosts over a wide range of existing ERC models and achieve new state-of-the-art results on four public ERC datasets.
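The conversation-level curriculum can be pictured with a small sketch. It assumes that "emotion shift" frequency is the fraction of consecutive utterance pairs whose emotion labels differ. That normalization, the decision to ignore speaker identity, and the function names are assumptions for illustration; the paper's exact difficulty measurer may differ.

```python
def emotion_shift_difficulty(emotions):
    """Fraction of adjacent utterance pairs whose emotion label changes.

    Assumption: difficulty is the shift count normalized by the number of
    transitions; a conversation with fewer than two utterances scores 0.0.
    """
    if len(emotions) < 2:
        return 0.0
    shifts = sum(1 for prev, cur in zip(emotions, emotions[1:]) if prev != cur)
    return shifts / (len(emotions) - 1)


def easy_to_hard(conversations):
    """Order conversations from lowest to highest difficulty score."""
    return sorted(conversations, key=emotion_shift_difficulty)


# Hypothetical label sequences, one list per conversation.
conversations = [
    ["happy", "sad", "happy", "angry"],        # shifts at every step
    ["neutral", "neutral", "neutral"],         # no shifts
    ["happy", "happy", "sad"],                 # one shift in two steps
]
schedule = easy_to_hard(conversations)
```

A training loop would then expose the model to `schedule` in order, or to progressively larger prefixes of it.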

Speech enhancement · Estimation/estimator · MoDELS · Loss function (machine learning) · Performer

March 7, 2019

Phase-aware Speech Enhancement with Deep Complex U-Net

Hyeong-Seok Choi,Jang-Hyun Kim,Jaesung Huh,Adrian Kim,Jung-Woo Ha,Kyogu Lee

from arxiv, Accepted paper at International Conference on Learning Representations (ICLR) 2019

Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of the spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin.
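The abstract names the wSDR loss without defining it. The sketch below assumes the form usually attributed to the paper: a negative cosine similarity between clean and estimated speech, plus the same term for the implied noise signals, weighted by the clean-speech share of the energy. It uses plain Python lists in place of tensors, and the function names and `eps` constant are illustrative choices rather than the authors' code.

```python
import math


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def _neg_cosine(target, estimate, eps=1e-8):
    # Negative cosine similarity, bounded in [-1, 1]; -1 means a perfect match.
    norms = math.sqrt(_dot(target, target)) * math.sqrt(_dot(estimate, estimate))
    return -_dot(target, estimate) / (norms + eps)


def wsdr_loss(noisy, clean, estimate, eps=1e-8):
    """Weighted SDR loss (assumed form) on time-domain signals of equal length."""
    noise = [x - y for x, y in zip(noisy, clean)]               # true noise
    noise_estimate = [x - e for x, e in zip(noisy, estimate)]   # implied noise
    clean_energy = _dot(clean, clean)
    noise_energy = _dot(noise, noise)
    alpha = clean_energy / (clean_energy + noise_energy + eps)  # speech energy share
    return (alpha * _neg_cosine(clean, estimate, eps)
            + (1.0 - alpha) * _neg_cosine(noise, noise_estimate, eps))


# Toy signals for illustration only.
clean = [1.0, -2.0, 0.5, 3.0]
noise = [0.1, 0.2, -0.3, 0.05]
noisy = [c + n for c, n in zip(clean, noise)]
perfect = wsdr_loss(noisy, clean, clean)   # estimate equals the clean signal
no_op = wsdr_loss(noisy, clean, noisy)     # estimate is the unprocessed input
```

Because both terms are cosine similarities, the loss is bounded and insensitive to overall signal scale, and a perfect estimate reaches the lower bound of -1.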