We study the problem of stereo singing voice cancellation, a subtask of music source separation, whose goal is to estimate an instrumental background from a stereo mix. We explore how to achieve performance similar to large state-of-the-art source separation networks starting from a small, efficient model for real-time speech separation. Such a model is useful when memory and compute are limited and singing voice processing has to run with limited look-ahead. In practice, this is realised by adapting an existing mono model to handle stereo input. Improvements in quality are obtained by tuning model parameters and expanding the training set. Moreover, we highlight the benefits a stereo model brings by introducing a new metric which detects attenuation inconsistencies between channels. Our approach is evaluated using objective offline metrics and a large-scale MUSHRA trial, confirming the effectiveness of our techniques in stringent listening tests.
A significant part of modern topological data analysis is concerned with the design and study of algebraic invariants of poset representations -- often referred to as multi-parameter persistence modules. One such invariant is the minimal rank decomposition, which encodes the ranks of all the structure morphisms of the persistence module by a single ordered pair of rectangle-decomposable modules, interpreted as a signed barcode. This signed barcode generalizes the concept of persistence barcode from one-parameter persistence to any number of parameters, raising the question of its bottleneck stability. We show in this paper that the minimal rank decomposition is not stable under the natural notion of signed bottleneck matching between signed barcodes. We remedy this by turning our focus to the rank exact decomposition, a related signed barcode induced by the minimal projective resolution of the module relative to the so-called rank exact structure, which we prove to be bottleneck stable under signed matchings. As part of our proof, we obtain two intermediate results of independent interest: we compute the global dimension of the rank exact structure on the category of finitely presentable multi-parameter persistence modules, and we prove a bottleneck stability result for hook-decomposable modules. We also give a bound for the size of the rank exact decomposition that is polynomial in the size of the usual minimal projective resolution, we prove a universality result for the dissimilarity function induced by the notion of signed matching, and we compute, in the two-parameter case, the global dimension of a different exact structure related to the upsets of the indexing poset. This set of results combines concepts from topological data analysis and from the representation theory of posets, and we believe is relevant to both areas.
As the development of formal proofs is a time-consuming task, it is important to devise ways of sharing the already written proofs to prevent wasting time redoing them. One of the challenges in this domain is to translate proofs written in proof assistants based on impredicative logics to proof assistants based on predicative logics, whenever impredicativity is not used in an essential way. In this paper we present a transformation for sharing proofs with a core predicative system supporting prenex universe polymorphism (like in Agda). It consists in trying to elaborate each term into a predicative universe polymorphic term as general as possible. The use of universe polymorphism is justified by the fact that mapping each universe to a fixed one in the target theory is not sufficient in most cases. During the elaboration, we need to solve unification problems in the equational theory of universe levels. In order to do this, we give a complete characterization of when a single equation admits a most general unifier. This characterization is then employed in a partial algorithm which uses a constraint-postponement strategy for trying to solve unification problems. The proposed translation is of course partial, but in practice allows one to translate many proofs that do not use impredicativity in an essential way. Indeed, it was implemented in the tool Predicativize and then used to translate semi-automatically many non-trivial developments from Matita's library to Agda, including proofs of Bertrand's Postulate and Fermat's Little Theorem, which (as far as we know) were not available in Agda yet.
We show that the first-order theory of Sturmian words over Presburger arithmetic is decidable. Using a general adder recognizing addition in Ostrowski numeration systems by Baranwal, Schaeffer and Shallit, we prove that the first-order expansions of Presburger arithmetic by a single Sturmian word are uniformly $\omega$-automatic, and then deduce the decidability of the theory of the class of such structures. Using an implementation of this decision algorithm called Pecan, we automatically reprove classical theorems about Sturmian words in seconds, and are able to obtain new results about antisquares and antipalindromes in characteristic Sturmian words.
The consistency of the maximum likelihood estimator for mixtures of elliptically-symmetric distributions for estimating its population version is shown, where the underlying distribution $P$ is nonparametric and does not necessarily belong to the class of mixtures on which the estimator is based. In a situation where $P$ is a mixture of well enough separated but nonparametric distributions it is shown that the components of the population version of the estimator correspond to the well separated components of $P$. This provides some theoretical justification for the use of such estimators for cluster analysis in case that $P$ has well separated subpopulations even if these subpopulations differ from what the mixture model assumes.
Emotion datasets used for Speech Emotion Recognition (SER) often contain acted or elicited speech, limiting their applicability in real-world scenarios. In this work, we used the Emotional Voice Messages (EMOVOME) database, including spontaneous voice messages from conversations of 100 Spanish speakers on a messaging app, labeled in continuous and discrete emotions by expert and non-expert annotators. We created speaker-independent SER models using the eGeMAPS features, transformer-based models and their combination. We compared the results with reference databases and analyzed the influence of annotators and gender fairness. The pre-trained Unispeech-L model and its combination with eGeMAPS achieved the highest results, with 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively, a 10% improvement over baseline models. For the emotion categories, 42.58% UA was obtained. EMOVOME performed lower than the acted RAVDESS database. The elicited IEMOCAP database also outperformed EMOVOME in the prediction of emotion categories, while similar results were obtained in valence and arousal. Additionally, EMOVOME outcomes varied with annotator labels, showing superior results and better fairness when combining expert and non-expert annotations. This study significantly contributes to the evaluation of SER models in real-life situations, advancing in the development of applications for analyzing spontaneous voice messages.
We address a prime counting problem across the homology classes of a graph, presenting a graph-theoretical Dirichlet-type analogue of the prime number theorem. The main machinery we have developed and employed is a spectral antisymmetry theorem, revealing that the spectra of the twisted graph adjacency matrices have an antisymmetric distribution over the character group of the graph. Additionally, we derive some trace formulas based on the twisted adjacency matrices as part of our analysis.
This work develops, for the first time, a face-centred finite volume (FCFV) solver for the simulation of laminar and turbulent viscous incompressible flows. The formulation relies on the Reynolds-averaged Navier-Stokes (RANS) equations coupled with the negative Spalart-Allmaras (SA) model and three novel convective stabilisations, inspired by Riemann solvers, are derived and compared numerically. The resulting method achieves first-order convergence of the velocity, the velocity-gradient tensor and the pressure. FCFV accurately predicts engineering quantities of interest, such as drag and lift, on unstructured meshes and, by avoiding gradient reconstruction, the method is insensitive to mesh quality, even in the presence of highly distorted and stretched cells. A monolithic and a staggered solution strategies for the RANS-SA system are derived and compared numerically. Numerical benchmarks, involving laminar and turbulent, steady and transient cases are used to assess the performance, accuracy and robustness of the proposed FCFV method.
Rejective sampling improves design and estimation efficiency of single-phase sampling when auxiliary information in a finite population is available. When such auxiliary information is unavailable, we propose to use two-phase rejective sampling (TPRS), which involves measuring auxiliary variables for the sample of units in the first phase, followed by the implementation of rejective sampling for the outcome in the second phase. We explore the asymptotic design properties of double expansion and regression estimators under TPRS. We show that TPRS enhances the efficiency of the double expansion estimator, rendering it comparable to a regression estimator. We further refine the design to accommodate varying importance of covariates and extend it to multi-phase sampling.
Heterogeneous graph neural networks (HGNNs) as an emerging technique have shown superior capacity of dealing with heterogeneous information network (HIN). However, most HGNNs follow a semi-supervised learning manner, which notably limits their wide use in reality since labels are usually scarce in real applications. Recently, contrastive learning, a self-supervised method, becomes one of the most exciting learning paradigms and shows great potential when there are no labels. In this paper, we study the problem of self-supervised HGNNs and propose a novel co-contrastive learning mechanism for HGNNs, named HeCo. Different from traditional contrastive learning which only focuses on contrasting positive and negative samples, HeCo employs cross-viewcontrastive mechanism. Specifically, two views of a HIN (network schema and meta-path views) are proposed to learn node embeddings, so as to capture both of local and high-order structures simultaneously. Then the cross-view contrastive learning, as well as a view mask mechanism, is proposed, which is able to extract the positive and negative embeddings from two views. This enables the two views to collaboratively supervise each other and finally learn high-level node embeddings. Moreover, two extensions of HeCo are designed to generate harder negative samples with high quality, which further boosts the performance of HeCo. Extensive experiments conducted on a variety of real-world networks show the superior performance of the proposed methods over the state-of-the-arts.
Recently pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as the piece of text representative of the "ideal" document. Based on this idea, we construct the representative words prediction (ROP) task for pre-training. Given an input document, we sample a pair of word sets according to the document language model, where the set with higher likelihood is deemed as more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. By further fine-tuning on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP can achieve exciting performance under both the zero- and low-resource IR settings. The code and pre-trained models are available at //github.com/Albert-Ma/PROP.