A new loss function for speaker recognition with deep neural networks is proposed, based on the Jeffreys divergence. Adding this divergence to the cross-entropy loss function makes it possible to maximize the target value of the output distribution while smoothing the non-target values. This objective function provides highly discriminative features. Beyond this effect, we propose a theoretical justification of its effectiveness and try to understand how this loss function affects the model, in particular its impact across dataset types (i.e., in-domain or out-of-domain w.r.t. the training corpus). Our experiments show that the Jeffreys loss consistently outperforms the state of the art for speaker recognition, especially on out-of-domain data, and helps limit false alarms.
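As a rough illustration of the idea above, the Jeffreys divergence is the symmetrized Kullback-Leibler divergence, J(P, Q) = KL(P || Q) + KL(Q || P). The sketch below adds it to the cross-entropy term. The choice of reference distribution (a smoothed one-hot target), the weight `alpha`, the smoothing mass `eps`, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys(p, q):
    """Jeffreys divergence: the symmetrized KL divergence."""
    return kl(p, q) + kl(q, p)

def smoothed_target(n_classes, target, eps=0.1):
    """Hypothetical reference distribution: mass 1 - eps on the target
    class and eps spread uniformly over the non-target classes."""
    assert 0.0 < eps < 1.0 and n_classes > 1
    off = eps / (n_classes - 1)
    return [1.0 - eps if i == target else off for i in range(n_classes)]

def jeffreys_loss(logits, target, alpha=1.0, eps=0.1):
    """Cross-entropy plus alpha times the Jeffreys divergence between the
    assumed reference distribution and the model output distribution."""
    p = softmax(logits)
    q = smoothed_target(len(logits), target, eps)
    cross_entropy = -math.log(p[target])
    return cross_entropy + alpha * jeffreys(q, p)
```

Under these assumptions, the divergence term pulls the non-target outputs toward a flat profile while the cross-entropy term pushes up the target probability.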
Mesh-based Graph Neural Networks (GNNs) have recently shown capabilities to simulate complex multiphysics problems with accelerated performance times. However, mesh-based GNNs require a large number of message-passing (MP) steps and suffer from over-smoothing for problems involving very fine meshes. In this work, we develop a multiscale mesh-based GNN framework mimicking a conventional iterative multigrid solver, coupled with adaptive mesh refinement (AMR), to mitigate challenges with conventional mesh-based GNNs. We use the framework to accelerate phase field (PF) fracture problems involving coupled partial differential equations with a near-singular operator due to near-zero modulus inside the crack. We define the initial graph representation using all mesh resolution levels. We perform a series of downsampling steps using Transformer MP GNNs to reach the coarsest graph, followed by upsampling steps to reach the original graph. We use skip connectors from the embeddings generated during coarsening to prevent over-smoothing. We use Transfer Learning (TL) to significantly reduce the size of the training datasets needed to simulate different crack configurations and loading conditions. The trained framework showed accelerated simulation times while maintaining high accuracy for all cases compared to the physics-based PF fracture model. Finally, this work provides a new approach to accelerate a variety of mesh-based engineering multiphysics problems.
Autoencoders (AE) are a simple yet powerful class of neural networks that compress data by projecting input into a low-dimensional latent space (LS). Whereas the LS is formed according to the loss function minimization during training, its properties and topology are not controlled directly. In this paper we focus on AE LS properties and propose two methods for obtaining an LS with a desired topology, called LS configuration. The proposed methods include loss configuration, using a geometric loss term that acts directly in the LS, and encoder configuration. We show that the former makes it possible to reliably obtain an LS with the desired configuration by defining the positions and shapes of LS clusters for a supervised AE (SAE). Knowing the LS configuration makes it possible to define a similarity measure in the LS to predict labels or estimate similarity for multiple inputs without using decoders or classifiers. We also show that this leads to more stable and interpretable training. We show that an SAE trained for clothes texture classification using the proposed method generalizes well to unseen data from the LIP, Market1501, and WildTrack datasets without fine-tuning, and even makes it possible to evaluate similarity for unseen classes. We further illustrate the advantages of pre-configured LS similarity estimation with cross-dataset searches and text-based search using a text query without language models.
We adopt the integral definition of the fractional Laplace operator and study an optimal control problem on Lipschitz domains that involves a fractional elliptic partial differential equation (PDE) as the state equation and a control variable that enters the state equation as a coefficient; pointwise constraints on the control variable are considered as well. We establish the existence of optimal solutions and analyze first-order as well as necessary and sufficient second-order optimality conditions. Regularity estimates for optimal variables are also analyzed. We develop two finite element discretization strategies: a semidiscrete scheme in which the control variable is not discretized, and a fully discrete scheme in which the control variable is discretized with piecewise constant functions. For both schemes, we analyze the convergence properties of discretizations and derive error estimates.
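For reference, the integral definition mentioned above is commonly written as follows for a sufficiently smooth function u on R^d and order s in (0, 1). The symbols d, s, u, and the normalization constant C(d, s) are notation assumed here for illustration, since the abstract does not fix any.

```latex
(-\Delta)^{s} u(x)
  = C(d,s)\, \mathrm{p.v.} \int_{\mathbb{R}^{d}}
    \frac{u(x) - u(y)}{|x - y|^{d + 2s}} \, \mathrm{d}y,
  \qquad s \in (0,1),
```

Here p.v. denotes the Cauchy principal value, and C(d, s) is a positive normalization constant depending only on d and s.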
Successfully addressing a wide variety of tasks is a core ability of autonomous agents, which requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, also adapting the underlying perception modules. An analogy is the human visual system, which uses top-down signals to focus attention according to the current task. Similarly, in this work, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the policy and visual adapters on task embeddings, which can be selected at inference if the task is known, or alternatively inferred from a set of example demonstrations. To this end, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks of the CortexBench benchmark and show that, compared to existing work, they can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given visual demonstrations.
The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.
Context: Experiment replications play a central role in the scientific method. Although software engineering experimentation has matured a great deal, the number of experiment replications is still relatively small. Software engineering experiments are composed of complex concepts, procedures and artefacts. Laboratory packages are a means of transferring knowledge among researchers to facilitate experiment replications. Objective: This paper investigates the experiment replication process to find out what information is needed to successfully replicate an experiment. Our objective is to propose the content and structure of laboratory packages for software engineering experiments. Method: We evaluated seven replications of three different families of experiments. Each replication had a different experimenter who was, at the time, unfamiliar with the experiment. During the first iterations of the study, we identified experimental incidents and then proposed a laboratory package structure that addressed these incidents, including document usability improvements. We used the later iterations to validate and generalize the laboratory package structure for use in all software engineering experiments. We aimed to solve a specific problem, while at the same time looking at how to contribute to the body of knowledge on laboratory packages. Results: We generated a laboratory package for three different experiments. These packages eased the replication of the respective experiments. The evaluation that we conducted shows that the laboratory package proposal is acceptable and reduces the effort currently required to replicate experiments in software engineering. Conclusion: We think that the content and structure that we propose for laboratory packages can be useful for other software engineering experiments.
We propose a novel neural network architecture based on the conformer transducer that adds contextual information flow to ASR systems. Our method improves the accuracy of recognizing uncommon words without harming the word error rate of regular words. We explore the improvement in uncommon-word accuracy when we use the new model and/or shallow fusion with a context language model. We found that the combination of both provides a cumulative gain in uncommon-word recognition accuracy.
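Shallow fusion, as generally understood, combines the ASR log-probability of a hypothesis with a weighted log-probability from an external language model at decoding time. The sketch below is a simplified version that rescores complete hypotheses rather than fusing scores at each beam-search step. The toy context language model, the weight `lm_weight`, and all names are hypothetical and are not taken from the paper.

```python
def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight=0.3):
    """Log-linear combination of the ASR score and the external LM score."""
    return asr_log_prob + lm_weight * lm_log_prob

def toy_context_lm(context_words, penalty=-2.0):
    """Hypothetical context LM: words in the context list cost nothing,
    every other word costs a fixed log-probability penalty."""
    def log_prob(text):
        return sum(0.0 if w in context_words else penalty for w in text.split())
    return log_prob

def rescore(hypotheses, context_lm, lm_weight=0.3):
    """Pick the best hypothesis from (text, asr_log_prob) pairs after fusion."""
    scored = [
        (text, shallow_fusion_score(lp, context_lm(text), lm_weight))
        for text, lp in hypotheses
    ]
    return max(scored, key=lambda item: item[1])[0]

# Usage: an uncommon name in the context list overtakes a more common word.
hypotheses = [("call jon", -1.0), ("call joan", -1.2)]
lm = toy_context_lm({"call", "joan"})
best_with_fusion = rescore(hypotheses, lm, lm_weight=0.3)
best_without_fusion = rescore(hypotheses, lm, lm_weight=0.0)
```

With a weight of zero the ASR ranking is unchanged, which makes the weight a convenient knob for trading uncommon-word recall against regular-word accuracy.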
When generating in-silico clinical electrophysiological outputs, such as electrocardiograms (ECGs) and body surface potential maps (BSPMs), mathematical models have relied on a single physics, namely cardiac electrophysiology (EP), neglecting the role of heart motion. Since the heart is the most powerful source of electrical activity in the human body, its motion dynamically shifts the position of the principal electrical sources in the torso, influencing the electrical potential distribution and potentially altering the EP outputs. In this work, we propose a computational model for the simulation of ECGs and BSPMs that couples a cardiac electromechanical model with a model of the propagation of the EP signal in the torso, using a flexible numerical approach that simulates the torso domain deformation induced by the myocardial displacement. Our model accounts for the major mechano-electrical feedbacks, along with unidirectional displacement and potential couplings from the heart to the surrounding body. For the numerical discretization, we employ a versatile intergrid transfer operator that allows different finite element spaces to be used in the cardiac and torso domains. Our numerical results are obtained on a realistic 3D biventricular-torso geometry and cover both sinus rhythm and ventricular tachycardia (VT), solving both the electromechanical-torso model in dynamical domains and the classical electrophysiology-torso model in static domains. By comparing standard 12-lead ECGs and BSPMs, we highlight the non-negligible effects of myocardial contraction on the EP outputs, especially in pathological conditions such as VT.
Whole slide imaging is fundamental to biomedical microscopy and computational pathology. However, whole slide images (WSIs) present a complex computer vision challenge due to their gigapixel size, diverse histopathologic features, spatial heterogeneity, and limited/absent data annotations. These challenges highlight that supervised training alone can result in suboptimal whole slide representations. Self-supervised representation learning can achieve high-quality WSI visual feature learning for downstream diagnostic tasks, such as cancer diagnosis or molecular genetic prediction. Here, we present a general self-supervised whole slide learning (S3L) framework for gigapixel-scale self-supervision of WSIs. S3L combines data transformation strategies from transformer-based vision and language modeling into a single unified framework to generate paired views for self-supervision. S3L leverages the inherent regional heterogeneity, histologic feature variability, and information redundancy within WSIs to learn high-quality whole-slide representations. We benchmark S3L visual representations on two diagnostic tasks for two biomedical microscopy modalities. S3L significantly outperforms WSI baselines for cancer diagnosis and genetic mutation prediction. Additionally, S3L achieves good performance using both in-domain and out-of-distribution patch encoders, demonstrating good flexibility and generalizability.
We hypothesize that due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain in accuracy when the model has access to it in addition to another modality. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since the conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
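The accuracy gain described above can be sketched as a simple difference between the accuracy of a model that sees both modalities and the accuracy of a model that sees only the other one. The absolute (unnormalized) form of the gain, the modality names `rgb` and `depth`, and the toy predictions are assumptions for illustration; the paper may normalize or estimate this quantity differently.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

def conditional_utilization_rate(acc_both, acc_other_only):
    """Accuracy gained by adding a modality on top of the other one
    (assumed absolute gain, not necessarily the paper's exact definition)."""
    return acc_both - acc_other_only

# Toy predictions for a hypothetical two-modality classifier.
labels = [0, 1, 1, 0, 1]
preds_both = [0, 1, 1, 0, 1]        # model sees rgb and depth
preds_rgb_only = [0, 1, 1, 0, 0]    # model sees rgb only
preds_depth_only = [0, 0, 1, 1, 0]  # model sees depth only

acc_both = accuracy(preds_both, labels)
u_depth_given_rgb = conditional_utilization_rate(
    acc_both, accuracy(preds_rgb_only, labels))
u_rgb_given_depth = conditional_utilization_rate(
    acc_both, accuracy(preds_depth_only, labels))

# A large gap between the two rates signals reliance on one modality.
imbalance = abs(u_rgb_given_depth - u_depth_given_rgb)
```

In this toy case adding depth on top of rgb helps far less than adding rgb on top of depth, which is the kind of imbalance the abstract describes.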