We introduce a monaural neural speaker embeddings extractor that computes an embedding for each speaker present in a speech mixture. To allow for supervised training, a teacher-student approach is employed: the teacher computes the target embeddings from each speaker's utterance before the utterances are added to form the mixture, and the student embedding extractor is then tasked to reproduce those embeddings from the speech mixture at its input. The system much more reliably verifies the presence or absence of a given speaker in a mixture than a conventional speaker embedding extractor, and even exhibits comparable performance to a multi-channel approach that exploits spatial information for embedding extraction. Further, it is shown that a speaker embedding computed from a mixture can be used to check for the presence of that speaker in another mixture.
We present novel Monte Carlo (MC) and multilevel Monte Carlo (MLMC) methods for determining the unbiased covariance of random variables using h-statistics. The advantage of this procedure lies in the unbiased construction of the estimator's mean square error in a closed form. This is in contrast to the conventional MC and MLMC covariance estimators, which are based on biased mean square errors defined solely by upper bounds, particularly within the MLMC. Finally, the numerical results of the algorithms are demonstrated by estimating the covariance of the stochastic response of a simple 1D stochastic elliptic PDE such as Poisson's model.
We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with the use of three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of labels other than speaker identities. The efficacy of the proposed framework is validated via experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor data is specifically needed, it is easily applicable in practical use.
Quadratic NURBS-based discretizations of the Galerkin method suffer from membrane locking when applied to Kirchhoff-Love shell formulations. Membrane locking causes not only smaller displacements than expected, but also large-amplitude spurious oscillations of the membrane forces. Continuous-assumed-strain (CAS) elements have been recently introduced to remove membrane locking in quadratic NURBS-based discretizations of linear plane curved Kirchhoff rods (Casquero et al., CMAME, 2022). In this work, we generalize CAS elements to vanquish membrane locking in quadratic NURBS-based discretizations of linear Kirchhoff-Love shells. CAS elements bilinearly interpolate the membrane strains at the four corners of each element. Thus, the assumed strains have C0 continuity across element boundaries. To the best of the authors' knowledge, CAS elements are the first assumed-strain treatment to effectively overcome membrane locking in quadratic NURBS-based discretizations of Kirchhoff-Love shells while satisfying the following important characteristics for computational efficiency: (1) No additional degrees of freedom are added, (2) No additional systems of algebraic equations need to be solved, (3) No matrix multiplications or matrix inversions are needed to obtain the stiffness matrix, and (4) The nonzero pattern of the stiffness matrix is preserved. The benchmark problems show that CAS elements, using either 2x2 or 3x3 Gauss-Legendre quadrature points per element, are an effective locking treatment since this element type results in more accurate displacements for coarse meshes and excises the spurious oscillations of the membrane forces. The benchmark problems also show that CAS elements outperform state-of-the-art element types based on Lagrange polynomials equipped with either assumed-strain or reduced-integration locking treatments.
Weight-sharing Neural Architecture Search (WS-NAS) provides an efficient mechanism for developing end-to-end deep recommender models. However, in complex search spaces, distinguishing between superior and inferior architectures (or paths) is challenging. This challenge is compounded by the limited coverage of the supernet and the co-adaptation of subnet weights, which restricts the exploration and exploitation capabilities inherent to weight-sharing mechanisms. To address these challenges, we introduce Farthest Greedy Path Sampling (FGPS), a new path sampling strategy that balances path quality and diversity. FGPS enhances path diversity to facilitate more comprehensive supernet exploration, while emphasizing path quality to ensure the effective identification and utilization of promising architectures. By incorporating FGPS into a Two-shot NAS (TS-NAS) framework, we derive high-performance architectures. Evaluations on three Click-Through Rate (CTR) prediction benchmarks demonstrate that our approach consistently achieves superior results, outperforming both manually designed and most NAS-based models.
Slope limiters play an essential role in maintaining the non-oscillatory behavior of high-resolution methods for nonlinear conservation laws. The family of minmod limiters serves as the prototype example. Here, we revisit the question of non-oscillatory behavior of high-resolution central schemes in terms of the slope limiter proposed by van Albada et. al. 1982. The van Albada (vA) limiter is smoother near extrema, and consequently, in many cases, it outperforms the results obtained using the standard minmod limiter. In particular, we prove that the vA limiter ensures 1D TVD stability and demonstrate that it yields noticeable improvement in computation of one- and two-dimensional systems.
Most micro- and macro-expression spotting methods in untrimmed videos suffer from the burden of video-wise collection and frame-wise annotation. Weakly-supervised expression spotting (WES) based on video-level labels can potentially mitigate the complexity of frame-level annotation while achieving fine-grained frame-level spotting. However, we argue that existing weakly-supervised methods are based on multiple instance learning (MIL) involving inter-modality, inter-sample, and inter-task gaps. The inter-sample gap is primarily from the sample distribution and duration. Therefore, we propose a novel and simple WES framework, MC-WES, using multi-consistency collaborative mechanisms that include modal-level saliency, video-level distribution, label-level duration and segment-level feature consistency strategies to implement fine frame-level spotting with only video-level labels to alleviate the above gaps and merge prior knowledge. The modal-level saliency consistency strategy focuses on capturing key correlations between raw images and optical flow. The video-level distribution consistency strategy utilizes the difference of sparsity in temporal distribution. The label-level duration consistency strategy exploits the difference in the duration of facial muscles. The segment-level feature consistency strategy emphasizes that features under the same labels maintain similarity. Experimental results on three challenging datasets -- CAS(ME)$^2$, CAS(ME)$^3$, and SAMM-LV -- demonstrate that MC-WES is comparable to state-of-the-art fully-supervised methods.
Gibbs samplers are popular algorithms to approximate posterior distributions arising from Bayesian hierarchical models. Despite their popularity and good empirical performances, however, there are still relatively few quantitative results on their convergence properties, e.g. much less than for gradient-based sampling methods. In this work we analyse the behaviour of total variation mixing times of Gibbs samplers targeting hierarchical models using tools from Bayesian asymptotics. We obtain dimension-free convergence results under random data-generating assumptions, for a broad class of two-level models with generic likelihood function. Specific examples with Gaussian, binomial and categorical likelihoods are discussed.
This paper describes two intelligibility prediction systems derived from a pretrained noise-robust automatic speech recognition (ASR) model for the second Clarity Prediction Challenge (CPC2). One system is intrusive and leverages the hidden representations of the ASR model. The other system is non-intrusive and makes predictions with derived ASR uncertainty. The ASR model is only pretrained with a simulated noisy speech corpus and does not take advantage of the CPC2 data. For that reason, the intelligibility prediction systems are robust to unseen scenarios given the accurate prediction performance on the CPC2 evaluation.
Heterogeneous graph neural networks (HGNNs) as an emerging technique have shown superior capacity of dealing with heterogeneous information network (HIN). However, most HGNNs follow a semi-supervised learning manner, which notably limits their wide use in reality since labels are usually scarce in real applications. Recently, contrastive learning, a self-supervised method, becomes one of the most exciting learning paradigms and shows great potential when there are no labels. In this paper, we study the problem of self-supervised HGNNs and propose a novel co-contrastive learning mechanism for HGNNs, named HeCo. Different from traditional contrastive learning which only focuses on contrasting positive and negative samples, HeCo employs cross-viewcontrastive mechanism. Specifically, two views of a HIN (network schema and meta-path views) are proposed to learn node embeddings, so as to capture both of local and high-order structures simultaneously. Then the cross-view contrastive learning, as well as a view mask mechanism, is proposed, which is able to extract the positive and negative embeddings from two views. This enables the two views to collaboratively supervise each other and finally learn high-level node embeddings. Moreover, two extensions of HeCo are designed to generate harder negative samples with high quality, which further boosts the performance of HeCo. Extensive experiments conducted on a variety of real-world networks show the superior performance of the proposed methods over the state-of-the-arts.