
The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge in video-to-speech synthesis and become more critical in unseen-speaker settings. Our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to focus on modeling the two representations independently, we can obtain highly intelligible speech from the model even when the input video is of an unseen subject. To this end, we introduce speech-visage selection, which separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style based synthesizer, which generates speech by coating the visage-styles while maintaining the speech content. Thus, the proposed framework can synthesize speech with the correct content even from a silent talking face video of an unseen subject. We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT volunteer, and LRW datasets.

Related content

Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N features from an existing set of M features so that a specific metric of the system is optimized. It is the process of picking the most effective features from the original set in order to reduce the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training the model.
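As a rough illustration of choosing N features out of M, the sketch below ranks each feature by the absolute value of its Pearson correlation with the target and keeps the top N. This filter-style criterion is only one possible choice of "metric to optimize" and is an assumption made here for illustration; the function names `pearson` and `select_top_n` and the toy data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_n(rows, target, n):
    """Return the sorted indices of the n columns most correlated with the target."""
    m = len(rows[0])  # number of candidate features (M)
    scores = []
    for j in range(m):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    # Highest score first; ties are broken by the lower column index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:n])


# Toy data (hypothetical): 4 samples, M = 3 features.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.4],
    [4.0, 5.0, 0.5],
]
target = [2.0, 4.0, 6.0, 8.0]

selected = select_top_n(rows, target, 2)  # keep N = 2 features
print(selected)
```

In this toy data the constant middle column carries no information about the target, so it is the one dropped. Wrapper or embedded methods, which score feature subsets by the performance of the learning algorithm itself, would replace the correlation score with a model-based one.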

Deep learning algorithms for predicting neuroimaging data have shown considerable promise in various applications. Prior work has demonstrated that deep learning models that take advantage of the data's 3D structure can outperform standard machine learning on several learning tasks. However, most prior research in this area has focused on neuroimaging data from adults. Within the Adolescent Brain and Cognitive Development (ABCD) dataset, a large longitudinal development study, we examine structural MRI data to predict gender and identify gender-related changes in brain structure. Results demonstrate that gender prediction accuracy is exceptionally high (>97%) with training epochs >200 and that this accuracy increases with age. Brain regions identified as the most discriminative in the task under study include predominantly frontal areas and the temporal lobe. When evaluating gender predictive changes specific to a two-year increase in age, a broader set of visual, cingulate, and insular regions are revealed. Our findings show a robust gender-related structural brain change pattern, even over a small age range. This suggests that it might be possible to study how the brain changes during adolescence by looking at how these changes are related to different behavioral and environmental factors.

Recently vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and SomethingSomething-V2) and perform thorough analysis and ablation studies to explain this observation using the predominant features of video transformer architectures. We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well. Our experiments inform our recommendation that semi-supervised learning video work should consider the use of video transformers in the future.

Recent advances in technology for hyper-realistic visual effects provoke the concern that deepfake videos of political speeches will soon be visually indistinguishable from authentic video recordings. The conventional wisdom in communications research predicts that people will fall for fake news more often when the same version of a story is presented as a video rather than text. Here, we evaluate how accurately 41,822 participants distinguish real political speeches from fabrications in an experiment where speeches are randomized to appear as permutations of text, audio, and video. We find that access to audio and visual communication modalities improves participants' accuracy. Here, human judgment relies more on how something is said, the audio-visual cues, than what is said, the speech content. However, we find that reflective reasoning moderates the degree to which participants consider visual information: low performance on the Cognitive Reflection Test is associated with an over-reliance on what is said.

Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 34.9% WER on YTDEV18 and 19.3% on LRS3-TED, relative improvements of 10% and 9% over our convolutional baseline. We achieve state-of-the-art audio-visual recognition performance on LRS3-TED after fine-tuning our model (1.6% WER). In addition, in a series of experiments on multi-person AV-ASR, we obtained an average relative WER reduction of 2% over our convolutional video frontend.
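For readers unfamiliar with the metric, word error rate (WER) is conventionally computed as the word-level edit distance between the reference and the hypothesis transcript, divided by the number of reference words. A relative improvement is the WER reduction divided by the baseline WER. The sketch below illustrates both under these standard definitions. It is not the evaluation code used in the work above, and the function names, sentences, and WER values in it are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts: one reference word ("f") is missing from the hypothesis.
wer = word_error_rate("place blue at f two now", "place blue at two now")
print(f"WER: {wer:.3f}")

# Hypothetical WER values, chosen only to show the arithmetic.
gain = relative_reduction(0.20, 0.18)
print(f"relative reduction: {gain:.1%}")
```

Relative figures can look large while the absolute change is small, which is why abstracts often report both the absolute WER and the relative improvement.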

The statistical regularities of natural images, referred to as natural scene statistics, play an important role in no-reference image quality assessment. However, it has been widely acknowledged that screen content images (SCIs), which are typically computer generated, do not hold such statistics. Here we make the first attempt to learn the statistics of SCIs, based upon which the quality of SCIs can be effectively determined. The underlying mechanism of the proposed approach is based upon the wild assumption that the SCIs, which are not physically acquired, still obey certain statistics that could be understood in a learning fashion. We empirically show that the statistics deviation could be effectively leveraged in quality assessment, and the proposed method is superior when evaluated in different settings. Extensive experimental results demonstrate the Deep Feature Statistics based SCI Quality Assessment (DFSS-IQA) model delivers promising performance compared with existing NR-IQA models and shows a high generalization capability in the cross-dataset settings. The implementation of our method is publicly available at //github.com/Baoliang93/DFSS-IQA.

Deep neural network (DNN) classifiers are vulnerable to adversarial attacks. Although existing gradient-based attacks have achieved good performance on feed-forward models and image recognition tasks, extending them to time series classification with recurrent neural networks (RNNs) remains a dilemma: the cyclical structure of an RNN prevents direct model differentiation, and the visual sensitivity of time series data to perturbations challenges the traditional local optimization objective of minimizing perturbation. In this paper, we propose TSFool, an efficient and widely applicable approach for crafting high-quality adversarial time series for RNN classifiers. We propose a novel global optimization objective named Camouflage Coefficient to consider how well the adversarial samples hide in class clusters, and accordingly redefine the high-quality adversarial attack as a multi-objective optimization problem. We also propose a new idea of using intervalized weighted finite automata (IWFA) to capture deeply embedded vulnerable samples having otherness between features and latent manifold, which guide the approximation to the optimization solution. Experiments on 22 UCR datasets confirm that TSFool is a widely effective, efficient and high-quality approach, with 93.22% less local perturbation, 32.33% better global camouflage, and a 1.12 times speedup over existing methods.

In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract. This task provides a promising direction for speech synthesis research, as the articulatory space is compact, smooth, and interpretable. Current works have highlighted the potential for deep learning models to perform articulatory synthesis. However, it remains unclear whether these models can achieve the efficiency and fidelity of the human speech production system. To help bridge this gap, we propose a time-domain articulatory synthesis methodology and demonstrate its efficacy with both electromagnetic articulography (EMA) and synthetic articulatory feature inputs. Our model is computationally efficient and achieves a transcription word error rate (WER) of 18.5% for the EMA-to-speech task, yielding an improvement of 11.6% compared to prior work. Through interpolation experiments, we also highlight the generalizability and interpretability of our approach.

Modern synthetic data generators consist of model-based methods where the focus is primarily on tuning the parameters of the model and not on specifying the structure of the data itself. Scagnostics is an exploratory graphical method, capable of encapsulating the structure of bivariate data through graph-theoretic measures. An inverse scagnostic measure would therefore provide an entry point to generate datasets based on the characteristics of instance space rather than a model-based simulation approach. scatteR is a novel data generation method with controllable characteristics based on scagnostic measurements. We have used a Generalized Simulated Annealing optimizer iteratively to discover the optimal arrangement of data points in each iteration that minimizes the distance between the current and target measurements. Generally, as a pedagogical tool, scatteR can be used to generate datasets to teach statistical methods. Based on the results of this study, scatteR is capable of generating 50 data points in under 30 seconds with a 0.05 Root Mean Squared Error on average.

Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in the speech, language, and machine learning communities and has broad applications in industry. With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and on several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS. We further summarize resources related to TTS (e.g., datasets, open-source implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.

Deep learning techniques have received much attention in the area of image denoising. However, there are substantial differences in the various types of deep learning methods dealing with image denoising. Specifically, discriminative learning based on deep learning can ably address the issue of Gaussian noise. Optimization models based on deep learning are effective in estimating the real noise. However, there has thus far been little related research to summarize the different deep learning techniques for image denoising. In this paper, we offer a comparative study of deep techniques in image denoising. We first classify the deep convolutional neural networks (CNNs) for additive white noisy images; the deep CNNs for real noisy images; the deep CNNs for blind denoising and the deep CNNs for hybrid noisy images, which represents the combination of noisy, blurred and low-resolution images. Then, we analyze the motivations and principles of the different types of deep learning methods. Next, we compare the state-of-the-art methods on public denoising datasets in terms of quantitative and qualitative analysis. Finally, we point out some potential challenges and directions of future research.
