2021精品一级毛片一区二区,国产日韩精品在线观看,午夜免费视频又黄又刺激,免费大片黄在线观看18片,最新在线精品国产2022

Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose the F-SAT: Frequency-Selective Adversarial Training method focusing on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.

相關內容

穩健性

關注 3

Ray · TOOLS · 通道 · 跡 · state-of-the-art ·

2024 年 12 月 13 日

Comparing Differentiable and Dynamic Ray Tracing: Introducing the Multipath Lifetime Map

Jérome Eertmans,Enrico Maria Vittuci,Vittorio Degli-Esposti,Laurent Jacques,Claude Oestges

from arxiv, 5 pages, 5 figures, 1 table, accepted at EuCAP 2025

With the increasing presence of dynamic scenarios, such as Vehicle-to-Vehicle communications, radio propagation modeling tools must adapt to the rapidly changing nature of the radio channel. Recently, both Differentiable and Dynamic Ray Tracing frameworks have emerged to address these challenges. However, there is often confusion about how these approaches differ and which one should be used in specific contexts. In this paper, we provide an overview of these two techniques and a comparative analysis against two state-of-the-art tools: 3DSCAT from UniBo and Sionna from NVIDIA. To provide a more precise characterization of the scope of these methods, we introduce a novel simulation-based metric, the Multipath Lifetime Map, which enables the evaluation of spatial and temporal coherence in radio channels only based on the geometrical description of the environment. Finally, our metrics are evaluated on a classic urban street canyon scenario, yielding similar results to those obtained from measurement campaigns.

控制器 · Learning · 設計 · SimPLe · 噪聲 ·

2024 年 12 月 13 日

SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

Sonal Kumar,Prem Seetharaman,Justin Salamon,Dinesh Manocha,Oriol Nieto

from arxiv, Website: //sonalkum.github.io/SILA/

The field of text-to-audio generation has seen significant advancements, and yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generate sound effects with control over key acoustic parameters such as loudness, pitch, reverb, fade, brightness, noise and duration, enabling creative applications in sound design and content creation. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques, incorporating learned representations that capture the subtleties of how sound characteristics can be shaped in context, enabling a richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning the disentanglement between audio semantics and its acoustic features. Our approach not only enhances the versatility and expressiveness of text-to-audio generation but also opens new avenues for creative audio production and sound design. Our objective and subjective evaluation results demonstrate the effectiveness of our approach in producing high-quality, customizable audio outputs that align closely with user specifications.

MoDELS · Automator · 可辨認的 · Extensibility · 未標記 ·

2024 年 12 月 12 日

MaskTerial: A Foundation Model for Automated 2D Material Flake Detection

Jan-Lucas Uslu,Alexey Nekrasov,Alexander Hermans,Bernd Beschoten,Bastian Leibe,Lutz Waldecker,Christoph Stampfer

from arxiv, 9 pages, 5 figures

The detection and classification of exfoliated two-dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large-scale data collection. Existing algorithms often exhibit challenges in identifying low-contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre-trained using a synthetic data generator, that generates realistic microscopy images from unlabeled data. This results in a model that can to quickly adapt to new materials with as little as 5 to 10 images. Furthermore, an uncertainty estimation model is used to finally classify the predictions based on optical contrast. We evaluate our method on eight different datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low-contrast materials such as hexagonal boron nitride.

穩健性 · state-of-the-art · MoDELS · 相關系數 · Integration ·

2024 年 12 月 11 日

ASDnB: Merging Face with Body Cues For Robust Active Speaker Detection

Tiago Roxo,Joana C. Costa,Pedro Inácio,Hugo Proen?a

State-of-the-art Active Speaker Detection (ASD) approaches mainly use audio and facial features as input. However, the main hypothesis in this paper is that body dynamics is also highly correlated to "speaking" (and "listening") actions and should be particularly useful in wild conditions (e.g., surveillance settings), where face cannot be reliably accessed. We propose ASDnB, a model that singularly integrates face with body information by merging the inputs at different steps of feature extraction. Our approach splits 3D convolution into 2D and 1D to reduce computation cost without loss of performance, and is trained with adaptive weight feature importance for improved complement of face with body data. Our experiments show that ASDnB achieves state-of-the-art results in the benchmark dataset (AVA-ActiveSpeaker), in the challenging data of WASD, and in cross-domain settings using Columbia. This way, ASDnB can perform in multiple settings, which is positively regarded as a strong baseline for robust ASD models (code available at //github.com/Tiago-Roxo/ASDnB).

控制器 · Prompt · 線性的 · 層 · 中位數 ·

2024 年 12 月 11 日

Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

Hugo Flores García,Oriol Nieto,Justin Salamon,Bryan Pardo,Prem Seetharaman

We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e.,~a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketchlike sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal imitation while retaining the adherence to an input text prompt and audio quality compared to a text-only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation. Sound examples are available at //hugofloresgarcia.art/sketch2sound/.

點云 · 3D · 縮放 · Performer · 評論員 ·

2024 年 12 月 11 日

PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis

Yifan Xie,Tao Feng,Xin Zhang,Xiangyang Luo,Zixuan Guo,Weijiang Yu,Heng Chang,Fei Ma,Fei Richard Yu

from arxiv, 9 pages, accepted by AAAI 2025

Talking head synthesis with arbitrary speech audio is a crucial challenge in the field of digital humans. Recently, methods based on radiance fields have received increasing attention due to their ability to synthesize high-fidelity and identity-consistent talking heads from just a few minutes of training video. However, due to the limited scale of the training data, these methods often exhibit poor performance in audio-lip synchronization and visual quality. In this paper, we propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio. It also incorporates an audio-driven dynamic lip point cloud as a critical component of the conditional information, thereby facilitating the effective synthesis of talking heads. Specifically, the initial step involves generating the corresponding lip point cloud from the audio signal and capturing its topological structure. The design of the dynamic difference encoder aims to capture the subtle nuances inherent in dynamic lip movements more effectively. Furthermore, we integrate the audio-point enhancement module, which not only ensures the synchronization of the audio signal with the corresponding lip point cloud within the feature space, but also facilitates a deeper understanding of the interrelations among cross-modal conditional features. Extensive experiments demonstrate that our method achieves superior high-fidelity and audio-lip synchronization in talking head synthesis compared to previous methods.

3D · 表示 · MoDELS · 值域 · Microsoft Surface ·

2024 年 12 月 11 日

LineGS : 3D Line Segment Representation on 3D Gaussian Splatting

Chenggang Yang,Yuang Shi,Wei Tsang Ooi

Abstract representations of 3D scenes play a crucial role in computer vision, enabling a wide range of applications such as mapping, localization, surface reconstruction, and even advanced tasks like SLAM and rendering. Among these representations, line segments are widely used because of their ability to succinctly capture the structural features of a scene. However, existing 3D reconstruction methods often face significant challenges. Methods relying on 2D projections suffer from instability caused by errors in multi-view matching and occlusions, while direct 3D approaches are hampered by noise and sparsity in 3D point cloud data. This paper introduces LineGS, a novel method that combines geometry-guided 3D line reconstruction with a 3D Gaussian splatting model to address these challenges and improve representation ability. The method leverages the high-density Gaussian point distributions along the edge of the scene to refine and optimize initial line segments generated from traditional geometric approaches. By aligning these segments with the underlying geometric features of the scene, LineGS achieves a more precise and reliable representation of 3D structures. The results show significant improvements in both geometric accuracy and model compactness compared to baseline methods.

FAST · Networking · 地球 · 可約的 · 優化器 ·

2024 年 12 月 11 日

Fast Beam Placement for Ultra-Dense LEO Networks

Trinh Van Chien,Nguyen Minh Quan,Tri Nhu Do,Cuong Le,Tan N. Nguyen,Symeon Chatzinotas

from arxiv, 5 pages, 3 figures. Accepted by IEEE WCL

Low Earth orbit (LEO) satellites has brought about significant improvements in wireless communications, characterized by low latency and reduced transmission loss compared to geostationary orbit (GSO) satellites. Ultra-dense LEO satellites can serve many users by generating active beams effective to their locations. The beam placement problem is challenging but important for efficiently allocating resources with a large number of users. This paper formulates and solves a fast beam placement optimization problem for ultra-dense satellite systems to enhance the link budget with a minimum number of active beams (NABs). To achieve this goal and balance load among beams within polynomial time, we propose two algorithms for large user groups exploiting the modified K-means clustering and the graph theory. Numerical results illustrate the effectiveness of the proposals in terms of the statistical channel gain-to-noise ratio and computation time over state-of-the-art benchmarks.

數據集 · GROUP · Elevate · 評論員 · 生物特征識別 ·

2022 年 11 月 3 日

Expanding Accurate Person Recognition to New Altitudes and Ranges: The BRIAR Dataset

David Cornett III,Joel Brogan,Nell Barber,Deniz Aykac,Seth Baird,Nick Burchfield,Carl Dukes,Andrew Duncan,Regina Ferrell,Jim Goddard,Gavin Jager,Matt Larson,Bart Murphy,Christi Johnson,Ian Shelley,Nisha Srinivas,Brandon Stockwell,Leanne Thompson,Matt Yohe,Robert Zhang,Scott Dolvin,Hector J. Santos-Villalobos,David S. Bolme

Face recognition technology has advanced significantly in recent years due largely to the availability of large and increasingly complex training datasets for use in deep learning models. These datasets, however, typically comprise images scraped from news sites or social media platforms and, therefore, have limited utility in more advanced security, forensics, and military applications. These applications require lower resolution, longer ranges, and elevated viewpoints. To meet these critical needs, we collected and curated the first and second subsets of a large multi-modal biometric dataset designed for use in the research and development (R&D) of biometric recognition technologies under extremely challenging conditions. Thus far, the dataset includes more than 350,000 still images and over 1,300 hours of video footage of approximately 1,000 subjects. To collect this data, we used Nikon DSLR cameras, a variety of commercial surveillance cameras, specialized long-rage R&D cameras, and Group 1 and Group 2 UAV platforms. The goal is to support the development of algorithms capable of accurately recognizing people at ranges up to 1,000 m and from high angles of elevation. These advances will include improvements to the state of the art in face recognition and will support new research in the area of whole-body recognition using methods based on gait and anthropometry. This paper describes methods used to collect and curate the dataset, and the dataset's characteristics at the current stage.

樣例 · 相似度 · 語音識別 · 端到端 · 轉錄 ·

2018 年 1 月 5 日

Audio Adversarial Examples: Targeted Attacks on Speech-to-Text

Nicholas Carlini,David Wagner

We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second). We apply our iterative optimization-based attack to Mozilla's implementation DeepSpeech end-to-end, and show it has a 100% success rate. The feasibility of this attack introduce a new domain to study adversarial examples.