We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging these hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across downstream tasks spanning both understanding and generation. We evaluate this approach on representative tasks: music tagging, music transcription, music source separation, and music mixing. Our results show that features extracted from the foundation model provide valuable enhancements when training downstream task models, highlighting their potential as a general booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity, paving the way for more effective and accessible music processing solutions.
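The following minimal sketch (our illustration, not the SoniDo implementation; the model, tapped layer names, and feature dimensions are placeholders) shows the general recipe the abstract describes: tapping intermediate layers of a frozen pretrained model and feeding the pooled features to a lightweight downstream head.

```python
# Illustrative sketch only: extract intermediate features from a frozen pretrained
# model with forward hooks and train a small downstream head on top of them.
import torch
import torch.nn as nn

class FeatureTap:
    """Collects outputs of selected layers during a forward pass."""
    def __init__(self, model, layer_names):
        self.features = {}
        for name, module in model.named_modules():
            if name in layer_names:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(_module, _inp, out):
            self.features[name] = out.detach()
        return hook

# Hypothetical frozen foundation model; layer names "1" and "3" are placeholders.
foundation = nn.Sequential(
    nn.Conv1d(1, 32, 9, padding=4), nn.ReLU(),
    nn.Conv1d(32, 64, 9, padding=4), nn.ReLU(),
    nn.Conv1d(64, 128, 9, padding=4),
)
for p in foundation.parameters():
    p.requires_grad = False

tap = FeatureTap(foundation, layer_names={"1", "3"})   # tap two intermediate layers
head = nn.Linear(32 + 64, 10)                          # e.g. a 10-class tagging head

audio = torch.randn(8, 1, 16000)                       # batch of mono waveforms
with torch.no_grad():
    foundation(audio)
pooled = [tap.features[k].mean(dim=-1) for k in ("1", "3")]  # time-average pooling
logits = head(torch.cat(pooled, dim=-1))               # only the head is trained
```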
YouTube is a rich source of cover songs. Since the platform is organized around videos rather than songs, retrieving covers is not trivial. The field of cover song identification addresses this problem with approaches that usually rely on audio content alone. However, including the user-generated video metadata available on YouTube promises improved identification results. In this paper, we propose a multi-modal approach for cover song identification on online video platforms. We combine entity-resolution models with audio-based approaches using a ranking model. Our findings indicate that leveraging user-generated metadata can stabilize cover song identification performance on YouTube.
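As a schematic illustration of the fusion step (our assumptions; the underlying entity-resolution and audio-similarity models are not shown), the sketch below combines a metadata-based score and an audio-based score with a simple learned ranker and sorts candidate videos by predicted relevance.

```python
# Schematic sketch, not the authors' pipeline: fuse a metadata-based entity-resolution
# score with an audio-similarity score and rank candidate videos for a query song.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy training data: each row is (metadata_score, audio_score) for a candidate video,
# label 1 if the candidate is a true cover of the query song.
X_train = rng.random((200, 2))
y_train = (0.6 * X_train[:, 0] + 0.4 * X_train[:, 1]
           + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

ranker = LogisticRegression().fit(X_train, y_train)

# Ranking: score unseen candidates for one query and sort by predicted relevance.
candidates = rng.random((10, 2))
relevance = ranker.predict_proba(candidates)[:, 1]
ranking = np.argsort(-relevance)          # best candidate first
print(ranking)
```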
Estimating depth from a single 2D image is a challenging task due to the lack of the stereo or multi-view data typically required for depth perception. This paper introduces a novel deep learning approach using an enhanced encoder-decoder architecture in which the Inception-ResNet-v2 model serves as the encoder. This is the first use of Inception-ResNet-v2 as an encoder for monocular depth estimation, and it demonstrates improved performance over previous models. Our model effectively captures complex objects and fine-grained details that are generally difficult to predict. It also incorporates multi-scale feature extraction to enhance depth prediction accuracy across various object sizes and distances. We propose a composite loss function comprising a depth loss, a gradient edge loss, and a Structural Similarity Index Measure (SSIM) loss, with fine-tuned weights that balance the different aspects of depth estimation. Experimental results on the NYU Depth V2 dataset show that our model achieves state-of-the-art performance, with an Absolute Relative Error (ARE) of 0.064, a Root Mean Square Error (RMSE) of 0.228, and an accuracy ($\delta < 1.25$) of 89.3%. These metrics demonstrate that our model can accurately predict depth even in challenging scenarios, providing a scalable solution for real-world applications in robotics, 3D reconstruction, and augmented reality.
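A minimal sketch of such a composite loss is given below; the weights, the 3x3 SSIM window, and the gradient formulation are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: L = w_depth * L1(depth) + w_grad * L1(gradients) + w_ssim * (1 - SSIM).
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 average-pooled local statistics."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1), F.avg_pool2d(y, 3, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1).mean()

def composite_loss(pred, target, w_depth=0.1, w_grad=1.0, w_ssim=1.0):
    depth_loss = F.l1_loss(pred, target)
    # Edge term: L1 on horizontal and vertical depth gradients.
    dx_p, dx_t = pred[..., :, 1:] - pred[..., :, :-1], target[..., :, 1:] - target[..., :, :-1]
    dy_p, dy_t = pred[..., 1:, :] - pred[..., :-1, :], target[..., 1:, :] - target[..., :-1, :]
    grad_loss = F.l1_loss(dx_p, dx_t) + F.l1_loss(dy_p, dy_t)
    ssim_loss = 1.0 - ssim(pred, target)
    return w_depth * depth_loss + w_grad * grad_loss + w_ssim * ssim_loss

pred, target = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
print(composite_loss(pred, target))
```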
This paper tackles the challenge of one-bit off-grid direction of arrival (DOA) estimation in a single-snapshot scenario using a learning-based Bayesian approach. First, we formulate the off-grid DOA estimation model with one-bit data quantization, based on the first-order off-grid approximation. We then address this problem within a sparse Bayesian framework and solve it iteratively. However, traditional sparse Bayesian methods often suffer from high computational complexity and the need for extensive hyperparameter tuning. To balance estimation accuracy and computational efficiency, we propose a novel learning-based sparse Bayesian framework that leverages an unrolled neural network architecture. This framework learns its hyperparameters through supervised learning, offering more accurate off-grid DOA estimates and improved computational efficiency compared with some state-of-the-art methods. Furthermore, the proposed approach is applicable to both uniform linear arrays and non-uniform sparse arrays. Simulation results validate the effectiveness of the proposed framework.
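For concreteness, one common way to write a single-snapshot, one-bit, first-order off-grid measurement model is (our notation, which may differ from the paper's):

\begin{equation}
  \mathbf{y} = \mathcal{Q}\big(\mathbf{A}(\boldsymbol{\theta})\,\mathbf{x} + \mathbf{n}\big),
  \qquad
  \mathbf{A}(\boldsymbol{\theta}) \approx \mathbf{A}(\boldsymbol{\theta}^{g}) + \mathbf{B}(\boldsymbol{\theta}^{g})\,\operatorname{diag}(\boldsymbol{\beta}),
  \qquad
  \mathcal{Q}(z) = \operatorname{sign}(\Re\{z\}) + \jmath\,\operatorname{sign}(\Im\{z\}),
\end{equation}

where $\boldsymbol{\theta}^{g}$ are the grid angles, $\boldsymbol{\beta}$ the off-grid offsets estimated jointly with the sparse source vector $\mathbf{x}$, $\mathbf{B}$ the derivative of the steering matrix with respect to angle, and $\mathcal{Q}(\cdot)$ the element-wise one-bit quantizer.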
Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, neither T2A nor V2A methods can generate holistic sound covering both onscreen and offscreen events: T2A cannot produce sounds aligned with onscreen objects, while V2A output is not semantically complete because offscreen sounds are missing. In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with the text and video. Previous approaches for joint text- and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual-Text Encoder and a Joint VT-SiT model. To reduce modality bias and improve generation quality, we employ pretrained uni-modal text-to-audio and video-to-audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe-Bench, a dataset of 636 video-text-audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe-Bench demonstrate that joint text and visual interaction is necessary for holistic audio generation. Furthermore, VinTAGe achieves state-of-the-art results on the VGGSound benchmark. Our source code and pre-trained models will be released. A demo is available at: //www.youtube.com/watch?v=QmqWhUjPkJI.
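As a hypothetical illustration of the additional uni-modal guidance (the function names, weights, and simple convex mixing rule are our assumptions, not the VinTAGe implementation), a flow-based sampler could blend velocity predictions from the joint model and the pretrained uni-modal teachers as follows:

```python
# Hypothetical sketch: mix the joint model's velocity field with pretrained
# text-only (T2A) and video-only (V2A) teachers during flow-based sampling.
import torch

def guided_velocity(joint_model, t2a_model, v2a_model, x_t, t, text, video,
                    w_joint=0.6, w_text=0.2, w_video=0.2):
    """Convex combination of velocity predictions at flow time t."""
    v_joint = joint_model(x_t, t, text=text, video=video)
    v_text = t2a_model(x_t, t, text=text)      # text-only teacher
    v_video = v2a_model(x_t, t, video=video)   # video-only teacher
    return w_joint * v_joint + w_text * v_text + w_video * v_video

# Toy stand-ins so the sketch runs end to end.
joint = lambda x, t, text=None, video=None: -x
t2a = lambda x, t, text=None: torch.zeros_like(x)
v2a = lambda x, t, video=None: torch.zeros_like(x)

x = torch.randn(1, 8, 128)                     # latent audio representation
for step in range(4):                          # a few Euler integration steps
    x = x + 0.25 * guided_velocity(joint, t2a, v2a, x, t=step / 4,
                                   text="rain on a window", video=None)
```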
We introduce EgoSonics, a method for generating semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality and assistive technologies, and could be used to augment existing datasets. Existing work has been limited to domains such as speech, music, or impact sounds and cannot capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strengths of latent diffusion models for conditioned audio synthesis. We first encode and process paired audio-video data to make them suitable for generation. The encoded data are then used to train a model that can generate an audio track capturing the semantics of the input video. Our proposed SyncroNet builds on ControlNet to provide control signals that enable the generation of temporally synchronized audio. Extensive evaluations and a comprehensive user study show that our model outperforms existing work in audio quality and under our proposed synchronization evaluation method. Furthermore, we demonstrate downstream applications of our model in improving video summarization.
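The sketch below illustrates ControlNet-style conditioning in general terms, with zero-initialized injection of control features into a frozen backbone block; it is our simplification, not SyncroNet's actual architecture.

```python
# Illustrative ControlNet-style conditioning: video-derived control features are injected
# into a frozen generator block through a zero-initialized projection, so training starts
# from the unmodified backbone and gradually learns synchronization cues.
import torch
import torch.nn as nn

class ZeroConv(nn.Conv1d):
    def __init__(self, ch):
        super().__init__(ch, ch, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlledBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.backbone = nn.Conv1d(ch, ch, 3, padding=1)   # pretrained generator block
        for p in self.backbone.parameters():
            p.requires_grad = False                       # ControlNet-style: backbone frozen
        self.control = nn.Conv1d(ch, ch, 3, padding=1)    # trainable control branch
        self.zero = ZeroConv(ch)                          # starts as a no-op injection

    def forward(self, audio_feat, video_feat):
        return self.backbone(audio_feat) + self.zero(self.control(video_feat))

block = ControlledBlock()
audio_feat = torch.randn(2, 64, 256)    # latent audio features over time
video_feat = torch.randn(2, 64, 256)    # per-frame video features, resampled to audio rate
print(block(audio_feat, video_feat).shape)   # torch.Size([2, 64, 256])
```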
We present a novel diffusion-based approach for coherent 3D scene reconstruction from a single RGB image. Our method utilizes an image-conditioned 3D scene diffusion model to simultaneously denoise the 3D poses and geometries of all objects within the scene. Motivated by the ill-posed nature of the task and to obtain consistent scene reconstruction results, we learn a generative scene prior by conditioning on all scene objects simultaneously to capture the scene context and by allowing the model to learn inter-object relationships throughout the diffusion process. We further propose an efficient surface alignment loss to facilitate training even in the absence of full ground-truth annotation, which is common in publicly available datasets. This loss leverages an expressive shape representation, which enables direct point sampling from intermediate shape predictions. By framing the task of single RGB image 3D scene reconstruction as a conditional diffusion process, our approach surpasses current state-of-the-art methods, achieving a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D.
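A schematic version of a surface-alignment term of this kind (an assumption-laden sketch, not necessarily the paper's exact loss) is a one-sided Chamfer distance from points sampled on the predicted surface to the available partial ground-truth points.

```python
# Schematic surface-alignment loss: nearest ground-truth point per predicted sample.
import torch

def surface_alignment_loss(pred_points, gt_points):
    """pred_points: (B, N, 3) samples from the predicted shape; gt_points: (B, M, 3)."""
    # Pairwise squared distances (B, N, M), then nearest ground-truth point per prediction.
    d2 = torch.cdist(pred_points, gt_points) ** 2
    return d2.min(dim=2).values.mean()

pred = torch.rand(4, 1024, 3, requires_grad=True)    # sampled from intermediate shape prediction
gt = torch.rand(4, 256, 3)                           # partial scan / annotation points
loss = surface_alignment_loss(pred, gt)
loss.backward()
```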
In this article, we present a resource-efficient approach for electrocardiogram (ECG)-based heartbeat classification using multi-feature fusion and a bidirectional long short-term memory (Bi-LSTM) network. The dataset comprises five original classes from the MIT-BIH Arrhythmia Database: Normal (N), Left Bundle Branch Block (LBBB), Right Bundle Branch Block (RBBB), Premature Ventricular Contraction (PVC), and Paced Beat (PB). Preprocessing methods, including the discrete wavelet transform and dual moving-average windows, are used to reduce noise and artifacts in the raw ECG signal and to extract the main points (PQRST) of the ECG waveform. Multi-feature fusion is achieved by using time intervals and the proposed under-the-curve areas, which are inherently robust against noise, as input features. Simulations demonstrate that incorporating the under-the-curve area features improves classification accuracy for the challenging RBBB and LBBB classes, from 31.4\% to 84.3\% for RBBB and from 69.6\% to 87.0\% for LBBB. Using a Bi-LSTM network rather than a conventional LSTM network results in higher accuracy for the RBBB class (33.8\% vs. 21.8\%) with a 28\% reduction in the required network parameters. Multiple neural network models with varying parameter sizes, including tiny (84k), small (150k), medium (478k), and large (1.25M) models, are developed to achieve high accuracy \textit{across all classes}, a more crucial and challenging goal than overall classification accuracy.
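A rough sketch of the feature fusion and classifier follows; the feature layout, window length, and network sizes are our assumptions, not the paper's exact configuration.

```python
# Rough sketch: per-beat features combining time intervals and under-the-curve areas,
# classified by a small bidirectional LSTM.
import torch
import torch.nn as nn

class BeatClassifier(nn.Module):
    def __init__(self, n_features=8, hidden=32, n_classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, beats_in_window, n_features)
        out, _ = self.bilstm(x)
        return self.fc(out[:, -1])        # classify the last beat in the window

def beat_features(pqrst_times, signal, fs):
    """Example fusion: four time intervals plus areas under the P, QRS and T segments."""
    p, q, r, s, t = pqrst_times           # fiducial point times in seconds
    intervals = [r - q, s - q, r - p, t - q]
    areas = [signal[int(a * fs):int(b * fs)].abs().sum() / fs
             for a, b in ((p, q), (q, s), (s, t))]
    return torch.tensor(intervals + [float(v) for v in areas] + [t - p])

feat = beat_features((0.10, 0.20, 0.25, 0.30, 0.45), torch.randn(360), fs=360)
print(feat.shape)                         # torch.Size([8])

model = BeatClassifier()
x = torch.randn(16, 4, 8)                 # 16 windows of 4 beats, 8 features each
print(model(x).shape)                     # torch.Size([16, 5])
```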
Heterogeneous data from multiple populations, sub-groups, or sources is often represented as a ``mixture model'' with a single latent class influencing all of the observed covariates. Heterogeneity can be resolved at multiple levels by grouping populations according to different notions of similarity. This paper proposes grouping with respect to the causal response of an intervention or perturbation on the system. This definition is distinct from previous notions, such as similar covariate values (e.g. clustering) or similar correlations between covariates (e.g. Gaussian mixture models). To solve the problem, we ``synthetically sample'' from a counterfactual distribution using higher-order multi-linear moments of the observable data. To understand how these ``causal mixtures'' fit in with more classical notions, we develop a hierarchy of mixture identifiability.
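For intuition, identifiability arguments of this kind often rest on low-rank structure in observable moment tensors. One generic identity (our notation, and not necessarily the construction used in the paper) applies when three groups of covariates $X^{(1)}, X^{(2)}, X^{(3)}$ are conditionally independent given the latent class $Z \in \{1,\dots,K\}$, with mixing weights $w_k$ and conditional means $\boldsymbol{\mu}^{(j)}_k = \mathbb{E}[X^{(j)} \mid Z = k]$:

\begin{equation}
  \mathbb{E}\big[X^{(1)} \otimes X^{(2)} \otimes X^{(3)}\big]
  = \sum_{k=1}^{K} w_k \, \boldsymbol{\mu}^{(1)}_k \otimes \boldsymbol{\mu}^{(2)}_k \otimes \boldsymbol{\mu}^{(3)}_k .
\end{equation}

The uniqueness of such low-rank tensor decompositions is a standard route to identifying latent classes, and the ``synthetic sampling'' from counterfactual distributions described above builds on higher-order multi-linear moments of this kind.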
Balancing weights have been widely applied to single or monotone missingness owing to their empirical advantages over likelihood-based methods and inverse probability weighting approaches. This paper considers non-monotone missing data under the complete-case missing variable (CCMV) condition, a case of missing not at random (MNAR). Using the relationship between each missing pattern and the complete-case subsample, we construct a weighted estimator in which the weight is a sum of ratios of the conditional probability of observing a particular missing pattern to that of observing the complete-case pattern, given the variables observed in the corresponding missing pattern. However, plug-in estimators of these propensity odds can be unbounded and lead to unstable estimation. Using further relations between the propensity odds and the balancing of moments across response patterns, we employ tailored loss functions, each encouraging empirical balance across patterns, to estimate the propensity odds flexibly via a functional basis expansion. We propose two penalizations to control the smoothness of the propensity-odds model and the empirical imbalance. We study the asymptotic properties of the proposed estimators and show that they are consistent under mild smoothness assumptions; asymptotic normality and efficiency results are also developed. Simulation results show the superior performance of the proposed method.
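One way to write the weighted estimator described above (our notation): with response pattern $R$, complete-case pattern $\mathbf{1}$, $X_r$ the variables observed under pattern $r$, and target $\mu = \mathbb{E}[Y]$,

\begin{equation}
  O_r(X_r) = \frac{\Pr(R = r \mid X_r)}{\Pr(R = \mathbf{1} \mid X_r)},
  \qquad
  \widehat{\mu} = \frac{1}{n} \sum_{i:\, R_i = \mathbf{1}} \Big( 1 + \sum_{r \neq \mathbf{1}} O_r(X_{r,i}) \Big) Y_i ,
\end{equation}

so unstable plug-in estimates of the propensity odds $O_r$ translate directly into unstable weights, which is what motivates estimating the odds through balancing conditions.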
Over the past two decades, the Web Ontology Language (OWL) has been instrumental in advancing the development of ontologies and knowledge graphs, providing a structured framework that enhances the semantic integration of data. However, the reliability of deductive reasoning within these systems remains challenging, as evidenced by inconsistencies among popular reasoners in recent competitions. This underscores the limitations of current testing-based methodologies, particularly in high-stakes domains such as healthcare. To mitigate these issues, in this paper we develop VEL, a formally verified EL++ reasoner equipped with machine-checkable correctness proofs that ensure the validity of its outputs across all possible inputs. The formalization, based on the algorithm of Baader et al., is turned into executable OCaml code using the Coq proof assistant's extraction capabilities. Our formalization revealed several errors in the original completeness proofs, which led to changes to the algorithm to ensure its completeness. Our work demonstrates the necessity of mechanizing reasoning algorithms to ensure their correctness at both the theoretical and implementation levels.
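As a toy illustration of the kind of completion-rule saturation underlying EL-family reasoning (written in Python for readability; the actual reasoner is Coq-verified and extracted to OCaml, and handles full EL++ rather than this two-rule fragment):

```python
# Toy fixpoint saturation for a tiny fragment of EL classification: only atomic
# subsumptions A ⊑ B and conjunctions A ⊓ B ⊑ C are handled.
def classify(concepts, subs, conj_subs):
    """subs: set of (A, B) for A ⊑ B; conj_subs: set of ((A, B), C) for A ⊓ B ⊑ C."""
    S = {a: {a, "Top"} for a in concepts}           # S[A] = known superclasses of A
    changed = True
    while changed:                                   # saturate until no rule fires
        changed = False
        for a in concepts:
            for (c, d) in subs:                      # CR1: C ∈ S(A), C ⊑ D  ⇒  D ∈ S(A)
                if c in S[a] and d not in S[a]:
                    S[a].add(d); changed = True
            for ((c1, c2), d) in conj_subs:          # CR2: C1, C2 ∈ S(A), C1 ⊓ C2 ⊑ D  ⇒  D ∈ S(A)
                if c1 in S[a] and c2 in S[a] and d not in S[a]:
                    S[a].add(d); changed = True
    return S

S = classify(
    concepts={"A", "B", "C", "D"},
    subs={("A", "B"), ("B", "C")},
    conj_subs={(("B", "C"), "D")},
)
print(sorted(S["A"]))   # ['A', 'B', 'C', 'D', 'Top']
```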