The generation of realistic and contextually relevant co-speech gestures is a challenging yet increasingly important task in the creation of multimodal artificial agents. Prior methods focused on learning a direct correspondence between co-speech gesture representations and produced motions, which created seemingly natural but often unconvincing gestures during human assessment. We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline. The resulting codebook vectors serve as both input and output in our framework, forming the basis for the generation and reconstruction of gestures. By learning the mapping of a latent space representation as opposed to directly mapping it to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures that closely replicate human movement and behavior, while simultaneously avoiding artifacts in the generation process. We evaluate our approach by comparing it with established methods for generating co-speech gestures as well as with existing datasets of human behavior. We also perform an ablation study to assess our findings. The results show that our approach outperforms the current state of the art by a clear margin and is partially indistinguishable from human gesturing. We make our data pipeline and the generation framework publicly available.
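
To make the role of the codebook concrete, here is a minimal sketch of nearest-neighbour vector quantization, the lookup step that a quantization pipeline of this kind typically relies on. It is not the authors' implementation: the function names, the toy codebook values, and the 2-D latents are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return (index, vector) of the codebook entry closest to `latent`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, list(codebook[index])


# Hypothetical toy codebook with three 2-D entries (illustrative values only).
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical encoder outputs for three gesture frames.
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

# Each latent is replaced by the index of its nearest codebook vector;
# looking the indices up again gives the quantized reconstruction.
indices = [quantize(z, CODEBOOK)[0] for z in latents]
reconstruction = [CODEBOOK[i] for i in indices]
```

In a setup like the one the abstract describes, a generator would presumably predict such indices (or the corresponding codebook vectors) rather than raw joint positions, which keeps every output inside the set of learned gesture codes.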

Related content

21 June 2023

Mixture Encoder for Joint Speech Separation and Recognition

Simon Berger, Peter Vieting, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

From arXiv. Accepted at Interspeech 2023.

Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches separate speakers and recognize each of them with a single-speaker ASR system. End-to-end models process overlapped speech directly in a single, powerful neural network. This work proposes a middle-ground approach that leverages explicit speech separation similarly to the modular approach but also incorporates mixture speech information directly into the ASR module in order to mitigate the propagation of errors made by the speech separator. We also explore a way to exchange cross-speaker context information through a layer that combines information of the individual speakers. Our system is optimized through separate and joint training stages and achieves a relative improvement of 7% in word error rate over a purely modular setup on the SMS-WSJ task.
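
The reported metric can be illustrated with a short sketch of how word error rate (WER) and a relative improvement are usually computed. The sentences and the two WER values below are hypothetical and are not results from the paper; only the 7% relative figure comes from the abstract.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token lists."""
    previous = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def relative_improvement(baseline: float, new: float) -> float:
    """Fraction by which `new` reduces `baseline` (0.07 means 7% relative)."""
    return (baseline - new) / baseline


# Hypothetical example: one deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical WERs chosen so that the relative improvement is 7%.
example_gain = relative_improvement(baseline=0.200, new=0.186)
```

Note that a relative improvement of 7% is not an absolute drop of 7 percentage points: in the hypothetical numbers above, the WER only falls from 20.0% to 18.6%.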

21 June 2023

Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection

Phat Do, Matt Coler, Jelske Dijkstra, Esther Klabbers

From arXiv. Accepted at the Speech Synthesis Workshop 2023.

We compare a PHOIBLE-based phone mapping method with phonological features input in transfer learning for TTS in low-resource languages. We use diverse source languages (English, Finnish, Hindi, Japanese, and Russian) and target languages (Bulgarian, Georgian, Kazakh, Swahili, Urdu, and Uzbek) to test the language-independence of the methods and enhance the findings' applicability. We use Character Error Rates from automatic speech recognition and predicted Mean Opinion Scores for evaluation. Results show that both phone mapping and features input improve the output quality and that the latter performs better, but these effects also depend on the specific language combination. We also compare the recently proposed Angular Similarity of Phone Frequencies (ASPF) with a family tree-based distance measure as a criterion to select source languages in transfer learning. ASPF proves effective if label-based phone input is used, while the language distance does not have the expected effects.

20 June 2023

NeRF synthesis with shading guidance

Chenbin Li, Yu Xin, Gaoyi Liu, Xiang Zeng, Ligang Liu

From arXiv. 16 pages, 16 figures; accepted by CAD/Graphics 2023 (poster).

The emerging Neural Radiance Field (NeRF) shows great potential in representing 3D scenes and can render photo-realistic images from novel views given only sparse input views. However, utilizing NeRF to reconstruct real-world scenes requires images from different viewpoints, which limits its practical application. This problem can be even more pronounced for large scenes. In this paper, we introduce a new task called NeRF synthesis that utilizes the structural content of a NeRF patch exemplar to construct a new radiance field of large size. We propose a two-phase method for synthesizing new scenes that are continuous in geometry and appearance. We also propose a boundary constraint method to synthesize scenes of arbitrary size without artifacts. Specifically, we control the lighting effects of synthesized scenes using shading guidance instead of decoupling the scene. We have demonstrated that our method can generate high-quality results with consistent geometry and appearance, even for scenes with complex lighting. We can also synthesize new scenes on curved surfaces with arbitrary lighting effects, which enhances the practicality of our proposed NeRF synthesis approach.

20 June 2023

EMoG: Synthesizing Emotive Co-speech 3D Gesture with Diffusion Model

Lianying Yin, Yijun Wang, Tianyu He, Jinming Liu, Wei Zhao, Bohan Li, Xin Jin, Jianxin Lin

From arXiv. Under review.

Although previous co-speech gesture generation methods are able to synthesize motions in line with speech content, they still struggle to handle diverse and complicated motion distributions. The key challenges are: 1) the one-to-many relationship between speech content and gestures; 2) modeling the correlation between body joints. In this paper, we present a novel framework (EMoG) to tackle the above challenges with denoising diffusion models: 1) To alleviate the one-to-many problem, we incorporate emotion cues to guide the generation process, making the generation much easier; 2) To model joint correlation, we propose to decompose the difficult gesture generation into two sub-problems: joint correlation modeling and temporal dynamics modeling. The two sub-problems are then explicitly tackled with our proposed Joint Correlation-aware transFormer (JCFormer). Through extensive evaluations, we demonstrate that our proposed method surpasses previous state-of-the-art approaches in gesture synthesis.

20 June 2023

Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging

Siming Zheng, Xin Yuan

We consider the problem of video snapshot compressive imaging (SCI), where sequential high-speed frames are modulated by different masks and captured in a single measurement. Reconstructing multiple frames from only one measurement amounts to solving an ill-posed inverse problem. By combining optimization algorithms and neural networks, deep unfolding networks (DUNs) have achieved remarkable results in solving inverse problems. In this paper, we work within the DUN framework and propose a 3D Convolution-Transformer Mixture (CTM) module with an efficient and scalable 3D attention model plugged in, which helps fully learn the correlation between temporal and spatial dimensions by virtue of the Transformer. To the best of our knowledge, this is the first time that a Transformer has been applied to video SCI reconstruction. In addition, to further investigate the high-frequency information that has been neglected in previous studies of the reconstruction process, we introduce variance estimation to characterize the uncertainty on a pixel-by-pixel basis. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) results, with a 1.2 dB gain in PSNR over the previous SOTA algorithm. We will release the code.
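
For readers unfamiliar with the metric, the sketch below shows the standard definition of peak signal-to-noise ratio (PSNR) and what a gain of 1.2 dB implies for mean squared error. The pixel values are hypothetical, frames are represented as flat lists of intensities in [0, 1], and nothing here reproduces the paper's evaluation code.

```python
import math
from typing import Sequence


def mean_squared_error(reference: Sequence[float],
                       reconstruction: Sequence[float]) -> float:
    """Mean squared error between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - x) ** 2
               for r, x in zip(reference, reconstruction)) / len(reference)


def psnr(reference: Sequence[float],
         reconstruction: Sequence[float],
         peak: float = 1.0) -> float:
    """PSNR in dB: 10 * log10(peak^2 / MSE); infinite for identical inputs."""
    err = mean_squared_error(reference, reconstruction)
    if err == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / err)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by `gain_db`."""
    return 10 ** (-gain_db / 10)


# Hypothetical pixels: every reconstructed value is off by 0.1.
reference_pixels = [0.0, 0.5, 1.0, 0.25]
reconstructed_pixels = [0.1, 0.6, 0.9, 0.35]
example_psnr = psnr(reference_pixels, reconstructed_pixels)

# MSE factor implied by a 1.2 dB PSNR gain at a fixed peak value.
example_ratio = mse_ratio_for_gain(1.2)
```

Because PSNR is logarithmic, a 1.2 dB gain at a fixed peak value corresponds to multiplying the mean squared error by roughly 0.76, that is, about a 24% reduction.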

19 June 2023

Using Motif Transitions for Temporal Graph Generation

Penghang Liu, A. Erdem Sarıyüce

From arXiv. Accepted by the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023).

Graph generative models are highly important for sharing surrogate data and for benchmarking purposes. Real-world complex systems are often dynamic in nature, with the interactions among nodes changing over time in the form of a temporal network. Most temporal network generation models extend static graph generation models by incorporating temporality in the generation process. More recently, temporal motifs have been used to generate temporal networks with greater success. However, existing models are often restricted to a small set of predefined motif patterns due to the high computational cost of counting temporal motifs. In this work, we develop a practical temporal graph generator, Motif Transition Model (MTM), to generate synthetic temporal networks with realistic global and local features. Our key idea is modeling the arrival of new events as temporal motif transition processes. We first calculate the transition properties from the input graph and then simulate the motif transition processes based on the transition probabilities and transition rates. We demonstrate that our model consistently outperforms the baselines with respect to preserving various global and local temporal graph statistics and runtime performance.

19 June 2023

Forward LTLf Synthesis: DPLL At Work

Marco Favorito

This paper proposes a new AND-OR graph search framework for synthesis of Linear Temporal Logic on finite traces (LTLf) that overcomes some limitations of previous approaches. Within this framework, we devise a procedure inspired by the Davis-Putnam-Logemann-Loveland (DPLL) algorithm to generate the next available agent-environment moves in a truly depth-first fashion, possibly avoiding exhaustive enumeration or costly compilations. We also propose a novel equivalence check for search nodes based on syntactic equivalence of state formulas. Since the resulting procedure is not guaranteed to terminate, we identify a stopping condition to abort execution and restart the search with state-equivalence checking based on Binary Decision Diagrams (BDD), which we show to be correct. The experimental results show that in many cases the proposed techniques outperform other state-of-the-art approaches. Our implementation Nike competed in the LTLf Realizability Track in the 2023 edition of SYNTCOMP, and won the competition.

16 June 2023

EVOPOSE: A Recursive Transformer For 3D Human Pose Estimation With Kinematic Structure Priors

Yaqi Zhang, Yan Lu, Bin Liu, Zhiwei Zhao, Qi Chu, Nenghai Yu

From arXiv. 5 pages, 2 figures, 4 tables; published in the proceedings of IEEE ICASSP 2023.

Transformers are popular in recent 3D human pose estimation work, where they use long-term modeling to lift 2D keypoints into 3D space. However, current transformer-based methods do not fully exploit the prior knowledge of the human skeleton provided by the kinematic structure. In this paper, we propose a novel transformer-based model, EvoPose, to effectively introduce human body prior knowledge into 3D human pose estimation. Specifically, a Structural Priors Representation (SPR) module represents human priors as structural features carrying rich body patterns, e.g. joint relationships. The structural features interact with 2D pose sequences and help the model obtain more informative spatiotemporal features. Moreover, a Recursive Refinement (RR) module refines the 3D pose outputs by utilizing estimated results while further injecting human priors. Extensive experiments demonstrate the effectiveness of EvoPose, which achieves a new state of the art on the two most popular benchmarks, Human3.6M and MPI-INF-3DHP.

15 June 2023

Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

From arXiv. 7 pages, 2 figures; accepted at the Interspeech Speech Synthesis Workshop (SSW) 2023.

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach. For synthesised examples, please see https://shivammehta25.github.io/Diff-TTSG

27 December 2017

Conditional Random Field and Deep Feature Learning for Hyperspectral Image Segmentation

Fahim Irfan Alam, Jun Zhou, Alan Wee-Chung Liew, Xiuping Jia, Jocelyn Chanussot, Yongsheng Gao

From arXiv. Submitted for Journal (Version 2).

Image segmentation is considered to be one of the critical tasks in hyperspectral remote sensing image processing. Recently, convolutional neural networks (CNNs) have established themselves as powerful models in segmentation and classification by demonstrating excellent performance. The use of a graphical model such as a conditional random field (CRF) contributes further by capturing contextual information and thus improving the segmentation performance. In this paper, we propose a method to segment hyperspectral images by considering both spectral and spatial information via a combined framework consisting of a CNN and a CRF. We use multiple spectral cubes to learn deep features with the CNN, and then formulate a deep CRF with CNN-based unary and pairwise potential functions to effectively extract the semantic correlations between patches consisting of three-dimensional data cubes. Effective piecewise training is applied in order to avoid the computationally expensive iterative CRF inference. Furthermore, we introduce a deep deconvolution network that improves the segmentation masks. We also introduce a new dataset and evaluate our proposed method on it, along with several widely adopted benchmark datasets. By comparing our results with those from several state-of-the-art models, we show the promising potential of our method.