
Diffusion and Poisson flow models have shown impressive performance in a wide range of generative tasks, including low-dose CT image denoising. However, one limitation in general, and for clinical applications in particular, is slow sampling. Due to their iterative nature, the number of function evaluations (NFE) required is usually on the order of $10-10^3$, both for conditional and unconditional generation. In this paper, we present posterior sampling Poisson flow generative models (PPFM), a novel image denoising technique for low-dose and photon-counting CT that produces excellent image quality whilst keeping NFE=1. Updating the training and sampling processes of Poisson flow generative models (PFGM)++, we learn a conditional generator which defines a trajectory between the prior noise distribution and the posterior distribution of interest. We additionally hijack and regularize the sampling process to achieve NFE=1. Our results shed light on the benefits of the PFGM++ framework compared to diffusion models. In addition, PPFM is shown to perform favorably compared to current state-of-the-art diffusion-style models with NFE=1, consistency models, as well as popular deep learning and non-deep learning-based image denoising techniques, on clinical low-dose CT images and clinical images from a prototype photon-counting CT system.

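Related content

Image denoising

Image denoising is a technical term in image processing. In practice, digital images are often corrupted during digitization and transmission by noise from the imaging device and the external environment; such images are called noisy images. The process of reducing noise in a digital image is called image denoising, sometimes also called image noise reduction.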

MoDELS · Similarity · Transformation · Generative model · Sampling method

February 5, 2024

MobilityGPT: Enhanced Human Mobility Modeling with a GPT model

Ammar Haydari,Dongjie Chen,Zhengfeng Lai,Chen-Nee Chuah

Generative models have shown promising results in capturing human mobility characteristics and generating synthetic trajectories. However, it remains challenging to ensure that the generated geospatial mobility data is semantically realistic, including consistent location sequences, and reflects real-world characteristics, such as respecting geospatial limits. To address these issues, we reformulate human mobility modeling as an autoregressive generation task, leveraging a Generative Pre-trained Transformer (GPT). To make generation controllable and alleviate the above challenges, we propose a geospatially-aware generative model, MobilityGPT. We propose a gravity-based sampling method to train a transformer for semantic sequence similarity. We then constrain the training process via a road connectivity matrix that provides the connectivity of sequences in trajectory generation, thereby keeping generated trajectories within geospatial limits. Lastly, we construct a Reinforcement Learning from Trajectory Feedback (RLTF) mechanism to minimize the travel distance between training and the synthetically generated trajectories. Our experiments on real-world datasets demonstrate that MobilityGPT outperforms state-of-the-art methods in generating high-quality mobility trajectories that are closest to real data in terms of origin-destination similarity, trip length, travel radius, link, and gravity distributions.

Reducible · Loss · Performer · DeepFakes · Decomposed

February 5, 2024

Towards mitigating uncann(eye)ness in face swaps via gaze-centric loss terms

Ethan Wilson,Frederick Shic,Sophie Jörg,Eakta Jain

from arxiv, Accepted to Computers and Graphics Special Issue: Eye Gaze Visualization, Interaction, Synthesis, and Analysis

Advances in face swapping have enabled the automatic generation of highly realistic faces. Yet face swaps are perceived differently than when looking at real faces, with key differences in viewer behavior surrounding the eyes. Face swapping algorithms generally place no emphasis on the eyes, relying on pixel or feature matching losses that consider the entire face to guide the training process. We further investigate viewer perception of face swaps, focusing our analysis on the presence of an uncanny valley effect. We additionally propose a novel loss equation for the training of face swapping models, leveraging a pretrained gaze estimation network to directly improve representation of the eyes. We confirm that viewed face swaps do elicit uncanny responses from viewers. Our proposed improvements significantly reduce viewing angle errors between face swaps and their source material. Our method additionally reduces the prevalence of the eyes as a deciding factor when viewers perform deepfake detection tasks. Our findings have implications for face swapping for special effects, as digital avatars, as privacy mechanisms, and more; negative responses from users could limit effectiveness in said applications. Our gaze improvements are a first step towards alleviating negative viewer perceptions via a targeted approach.

Loss · Speech recognition · Discretization · Mask · Transformation

February 5, 2024

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Qian Chen,Wen Wang,Qinglin Zhang,Siqi Zheng,Shiliang Zhang,Chong Deng,Yukun Ma,Hai Yu,Jiaqing Liu,Chong Zhang

from arxiv, 5 pages, accepted by ICASSP 2024

Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
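As a rough illustration of what "a KL divergence loss with smoothed labels" can look like for a single token position, the sketch below builds a smoothed target distribution and compares it with a model's softmax output. It is not the authors' SLD implementation: the uniform smoothing scheme, the value `eps=0.1`, and all function names are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_label(target, vocab_size, eps):
    """Assumed scheme: 1 - eps on the target token, eps spread over the rest."""
    off = eps / (vocab_size - 1)
    return [1.0 - eps if i == target else off for i in range(vocab_size)]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def smoothed_kl_loss(logits, target, eps=0.1):
    """KL divergence between the smoothed target and the predicted distribution."""
    q = softmax(logits)
    p = smoothed_label(target, len(logits), eps)
    return kl_divergence(p, q)
```

With this formulation the loss is zero only when the prediction matches the smoothed target exactly, so a model is not pushed toward fully one-hot predictions on speech tokens.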

GM · MoDELS · CASES · Statistic · Graph

February 3, 2024

Graphical models for multivariate extremes

Sebastian Engelke,Manuel Hentschel,Michaël Lalancette,Frank Röttger

Graphical models in extremes have emerged as a diverse and quickly expanding research area in extremal dependence modeling. They allow for parsimonious statistical methodology and are particularly suited for enforcing sparsity in high-dimensional problems. In this work, we provide the fundamental concepts of extremal graphical models and discuss recent advances in the field. Different existing perspectives on graphical extremes are presented in a unified way through graphical models for exponent measures. We discuss the important cases of nonparametric extremal graphical models on simple graph structures, and the parametric class of Hüsler-Reiss models on arbitrary undirected graphs. In both cases, we describe model properties, methods for statistical inference on known graph structures, and structure learning algorithms when the graph is unknown. We illustrate different methods in an application to flight delay data at US airports.

MoDELS · Range · Mutually independent · Identically distributed · DirectShow

February 3, 2024

Applying different methods to model dry and wet spells at daily scale in a large range of rainfall regimes across Europe

Giorgio Baiamonte,Carmelo Agnese,Carmelo Cammalleri,Elvira Di Nardo,Stefano Ferraris,Tommaso Martini

from arxiv, 28 pages, 11 figures

The modelling of the occurrence of rainfall dry and wet spells (ds and ws, respectively) can be jointly conveyed using the inter-arrival times (it). While the modelling of it has the advantage of requiring a single fitting for the description of all rainfall time characteristics (including wet and dry chains, an extension of the concept of spells), the assumption on the independence and identical distribution of the renewal times it implicitly imposes a memoryless property on the derived ws, which may not be true in some cases. In this study, two different methods for the modelling of rainfall time characteristics at station scale have been applied: i) a direct method (DM) that fits the discrete Lerch distribution to it records, and then derives ws and ds (as well as the corresponding chains) from the it distribution; and ii) an indirect method (IM) that fits the Lerch distribution to the ws and ds records separately, relaxing the assumptions of the renewal process. The results of this application over six stations in Europe, characterized by a wide range of rainfall regimes, highlight how the geometric distribution does not always reasonably reproduce the ws frequencies, even when the it are well modelled by the Lerch distribution. Improved performance is obtained with the IM, thanks to the relaxation of the assumption on the independence and identical distribution of the renewal times. A further improvement on the fittings is obtained when the datasets are separated into two periods, suggesting that the inferences may benefit from accounting for the local seasonality.
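The link between i.i.d. inter-arrival times and a memoryless, geometric wet-spell distribution can be made concrete with a small sketch. It assumes (this reading is not spelled out in the abstract) that a wet spell of length k days corresponds to k - 1 consecutive inter-arrival times equal to one day, followed by one longer inter-arrival time. The parameter `p1`, standing for P(it = 1), and the function names are hypothetical.

```python
def wet_spell_pmf(p1, k):
    """P(ws = k) under the assumed reading: k - 1 inter-arrivals of one day, then a longer one."""
    return p1 ** (k - 1) * (1.0 - p1)

def survival(p1, k):
    """P(ws > k): the first k inter-arrival times are all equal to one day."""
    return p1 ** k

def conditional_survival(p1, m, n):
    """P(ws > m + n | ws > m), which equals P(ws > n) for a geometric law."""
    return survival(p1, m + n) / survival(p1, m)
```

Under these assumptions the wet-spell length is geometric whatever the shape of the rest of the it distribution, which is why a good Lerch fit to it does not guarantee a good fit to the observed ws frequencies.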

Speech enhancement · Performer · Reverberation · Analysis · Noise

February 2, 2024

Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

Simon Leglaive,Matthieu Fraticelli,Hend ElGhazaly,Léonie Borne,Mostafa Sadeghi,Scott Wisdom,Manuel Pariente,John R. Hershey,Daniel Pressnitzer,Jon P. Barker

Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.

Kernelization · GPU · Reducible · ReQuEST · Scenario

February 1, 2024

FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification

Wenqing Wu

from arxiv, 21 pages, 21 figures. Added a timeline figure to demonstrate low priority tasks JCT stability. Updated all multi-tasking experiments with a newer NVIDIA driver version

Highly parallelized workloads such as machine learning training, inference, and general HPC tasks are greatly accelerated by GPU devices. In a cloud computing cluster, serving a GPU's computation power through multi-task sharing is in high demand, since there are always more task requests than available GPUs. Existing GPU sharing solutions focus on reducing task-level waiting time or task-level switching costs when multiple jobs compete for a single GPU. Continuous computation requests come with different priorities and have an asymmetric impact on QoS when sharing a GPU device. Existing work has missed the kernel-level optimization opportunity brought by this setting. To address this problem, we present a novel kernel-level scheduling strategy called FIKIT: Filling Inter-kernel Idle Time. FIKIT incorporates task-level priority information, fine-grained kernel identification, and kernel measurement, allowing low-priority tasks to execute during a high-priority task's inter-kernel idle time, thereby filling the GPU's device runtime fully and reducing the overall impact of GPU sharing on cloud services. Across a set of ML models, the FIKIT-based inference system accelerated high-priority tasks by 1.32 to 16.41 times compared to the JCT in GPU sharing mode, and more than half of the cases were accelerated by more than 3.4 times. Alternatively, under preemptive sharing, the low-priority tasks have a JCT comparable to that of the default GPU sharing mode, with a ratio of 0.86 to 1. We further limit the kernel measurement and runtime fine-grained kernel scheduling overhead to less than 5%.

February 1, 2024

A goodness-of-fit test for regression models with spatially correlated errors

Andrea Meilán-Vila,Jean D. Opsomer,Mario Francisco-Fernández,Rosa M. Crujeiras

from arxiv, 49 pages, 7 figures

The problem of assessing a parametric regression model in the presence of spatial correlation is addressed in this work. For that purpose, a goodness-of-fit test based on an $L_2$-distance comparing a parametric and a nonparametric regression estimator is proposed. Asymptotic properties of the test statistic, both under the null hypothesis and under local alternatives, are derived. Additionally, a bootstrap procedure is designed to calibrate the test in practice. Finite sample performance of the test is analyzed through a simulation study, and its applicability is illustrated using a real data example.

Speech recognition · MoDELS · Transformation · Performer · Language modeling

January 31, 2024

Exploring the limits of decoder-only models trained on public speech recognition corpora

Ankit Gupta,George Saon,Brian Kingsbury

The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio-only proprietary data respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines. Unlike these models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate factors such as the choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.

MoDELS · Processing (programming language) · Vision · Continuity · HTTPS

February 20, 2023

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Xiao Wang,Guangyao Chen,Guangwu Qian,Pengcheng Gao,Xiao-Yong Wei,Yaowei Wang,Yonghong Tian,Wen Gao

from arxiv, Accepted by Machine Intelligence Research

With the urgent demand for generalized deep models, many pre-trained big models have been proposed, such as BERT, ViT, GPT, etc. Inspired by the success of these models in single domains (like computer vision and natural language processing), multi-modal pre-trained big models have also drawn more and more attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper can provide new insights and help new researchers track the most cutting-edge works. Specifically, we first introduce the background of multi-modal pre-training by reviewing conventional deep learning and pre-training works in natural language processing, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training. After that, we introduce the downstream tasks used for the validation of large-scale MM-PTMs, including generative, classification, and regression tasks. We also give visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey