In this paper, we introduce an alternative approach to enhancing Multi-Agent Reinforcement Learning (MARL) through the integration of domain knowledge and attention-based policy mechanisms. Our methodology incorporates domain-specific expertise into the learning process, which simplifies the development of collaborative behaviors. This approach aims to reduce the complexity and learning overhead typically associated with MARL by enabling agents to concentrate on essential aspects of complex tasks, thus accelerating learning. Attention mechanisms play a key role in our model: they allow for the effective processing of dynamic context data and nuanced agent interactions, leading to more refined decision-making. Applied to standard MARL scenarios, such as the Stanford Intelligent Systems Laboratory (SISL) Pursuit and Multi-Particle Environments (MPE) Simple Spread, our method improves both learning efficiency and the effectiveness of collaborative behaviors. The results indicate that our attention-based approach is a viable way to improve the efficiency of the MARL training process by integrating domain-specific knowledge at the action level.
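A minimal sketch of how such an attention-based policy with action-level domain knowledge might look (the module names, dimensions, and masking scheme below are our own illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class AttentionPolicy(nn.Module):
    """Sketch: an agent attends over teammate observations before producing
    action logits; a domain-knowledge mask rules out actions at the action level."""
    def __init__(self, obs_dim: int, n_actions: int, embed_dim: int = 64):
        super().__init__()
        self.encode = nn.Linear(obs_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(embed_dim, n_actions)

    def forward(self, own_obs, team_obs, action_mask):
        # own_obs: (B, obs_dim); team_obs: (B, n_agents, obs_dim); action_mask: (B, n_actions) bool
        q = self.encode(own_obs).unsqueeze(1)   # (B, 1, E) query from own observation
        kv = self.encode(team_obs)              # (B, N, E) keys/values from teammates
        ctx, _ = self.attn(q, kv, kv)           # attend over teammate context
        logits = self.head(ctx.squeeze(1))      # (B, n_actions)
        # domain knowledge enters here: disallowed actions get -inf logits
        return logits.masked_fill(~action_mask, float("-inf"))

policy = AttentionPolicy(obs_dim=8, n_actions=5)
logits = policy(torch.randn(2, 8), torch.randn(2, 3, 8),
                torch.ones(2, 5, dtype=torch.bool))
```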
In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We find that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach that neither increases model complexity nor alters the inference pipeline. We also propose language-based soft prompting to further improve performance with frozen LLMs. Empirical analysis on a real-time test set covering 10 Indic languages demonstrates that the proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Averaged over the 10 Indic languages, our recognition results show that the proposed prefix-tuning with RNNT loss yields a 12\% relative improvement in WER over the baseline with a fine-tuned LLM, and our proposed approaches with the frozen LLM lead to a 31\% relative improvement over basic soft-prompting prefixLM.
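To make the prefix-tuning idea concrete, here is a hedged sketch of a prefixLM-style input with an auxiliary RNNT loss over the speech prefix, using torchaudio's transducer loss; all shapes, the stand-in prediction network, and the toy joint network are illustrative assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn
from torchaudio.functional import rnnt_loss  # transducer (RNNT) loss

# Hypothetical sizes: batch B, speech frames T, target tokens U, vocab V, width D.
B, T, U, V, D = 2, 50, 10, 100, 256

speech_prefix = torch.randn(B, T, D)   # speech encoder output, used as the LLM prefix
text_embeds = torch.randn(B, U, D)     # embedded transcript tokens
llm_inputs = torch.cat([speech_prefix, text_embeds], dim=1)  # prefixLM input (B, T+U, D)

# Auxiliary RNNT loss applied to the speech prefix via a toy joint network.
joint = nn.Linear(2 * D, V + 1)        # +1 output class for the blank symbol
enc = speech_prefix.unsqueeze(2).expand(B, T, U + 1, D)
pred = torch.randn(B, 1, U + 1, D).expand(B, T, U + 1, D)  # stand-in predictor output
logits = joint(torch.cat([enc, pred], dim=-1))             # (B, T, U+1, V+1)
targets = torch.randint(0, V, (B, U), dtype=torch.int32)
loss = rnnt_loss(
    logits,
    targets,
    logit_lengths=torch.full((B,), T, dtype=torch.int32),
    target_lengths=torch.full((B,), U, dtype=torch.int32),
    blank=V,
)
```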
In this paper, we introduce QuST-LLM, an innovative extension of QuPath that utilizes the capabilities of large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. This tool simplifies the intricate and high-dimensional nature of ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functional annotation. QuST-LLM employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby significantly improving the interpretability of ST data. Consequently, users can interact with their own ST data using natural language. QuST-LLM thus provides researchers with powerful functionality for unraveling the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research.
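As a rough illustration of this kind of workflow (the helper and data below are hypothetical, not QuST-LLM's actual API), one can imagine turning a selected region's genes and their gene ontology terms into an LLM prompt:

```python
# Hypothetical sketch: build a natural-language prompt from a region's
# top genes and their Gene Ontology annotations, to be sent to an LLM.
def build_prompt(region_name: str, gene_to_go_terms: dict[str, list[str]]) -> str:
    lines = [f"Region: {region_name}. Summarize the biology implied by:"]
    for gene, terms in gene_to_go_terms.items():
        lines.append(f"- {gene}: " + "; ".join(terms))
    return "\n".join(lines)

prompt = build_prompt("tumor_margin",
                      {"CD8A": ["T cell mediated immunity"],
                       "COL1A1": ["extracellular matrix organization"]})
# The resulting prompt would then be passed to an LLM, which returns
# the biological narrative presented to the user.
```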
In this paper, we consider two fundamental cut approximation problems on large graphs. We prove new lower bounds for both problems that are optimal up to logarithmic factors. The first problem is to approximate cuts in balanced directed graphs. In this problem, the goal is to build a data structure that $(1 \pm \epsilon)$-approximates cut values in graphs with $n$ vertices. For arbitrary directed graphs, such a data structure requires $\Omega(n^2)$ bits even for constant $\epsilon$. To circumvent this, recent works study $\beta$-balanced graphs, meaning that for every directed cut, the total weight of edges in one direction is at most $\beta$ times that in the other direction. We consider two models: the {\em for-each} model, where the goal is to approximate each cut with constant probability, and the {\em for-all} model, where all cuts must be preserved simultaneously. We improve the previous $\Omega(n \sqrt{\beta/\epsilon})$ lower bound to $\tilde{\Omega}(n \sqrt{\beta}/\epsilon)$ in the for-each model, and we improve the previous $\Omega(n \beta/\epsilon)$ lower bound to $\Omega(n \beta/\epsilon^2)$ in the for-all model. This resolves the main open questions of (Cen et al., ICALP, 2021). The second problem is to approximate the global minimum cut in a local query model, where we can only access the graph via degree, edge, and adjacency queries. We improve the previous $\Omega\bigl(\frac{m}{k}\bigr)$ query complexity lower bound to $\Omega\bigl(\min\{m, \frac{m}{\epsilon^2 k}\}\bigr)$ for this problem, where $m$ is the number of edges, $k$ is the size of the minimum cut, and we seek a $(1+\epsilon)$-approximation. In addition, we show that existing upper bounds with slight modifications match our lower bound up to logarithmic factors.
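For concreteness, the balance condition can be restated as follows (notation ours): a directed graph is $\beta$-balanced when, for every vertex set $S$,
\[
w(S \to V \setminus S) \;\le\; \beta \cdot w(V \setminus S \to S),
\]
where $w(A \to B)$ denotes the total weight of edges directed from $A$ to $B$.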
In this work, we extend our previously proposed offline SpatialNet to long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, to discriminate between target speech and interferences, and has achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamics of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t. signal length while maintaining the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of the different networks, namely testing on signals that are much longer than the training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, and for both static and moving speakers. The proposed method is open-sourced at https://github.com/Audio-WestlakeU/NBSS.
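The masked self-attention variant can be illustrated with a small sketch (dimensions hypothetical, not the paper's configuration): a causal mask restricts each frame to past context, which is what makes the attention usable online:

```python
import torch
import torch.nn as nn

# Sketch: causal (masked) self-attention over one narrow-band sequence,
# so each frame attends only to current and past frames.
T, D = 100, 96                                   # frames, feature dimension
x = torch.randn(1, T, D)                         # one narrow-band feature sequence
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True = blocked
y, _ = attn(x, x, x, attn_mask=causal)           # online: no future context used
```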
In this position paper, we explore the potential of generative AI (GenAI) tools in supporting HIV prevention initiatives among LGBTQ+ adolescents. GenAI offers opportunities to bridge information gaps and enhance healthcare access, yet it also risks exacerbating existing inequities through biased AI outputs reflecting heteronormative and cisnormative values. We advocate for the importance of queer adolescent-centered interventions, contend with the promise of GenAI tools while addressing concerns of bias, and position participatory frameworks for empowering queer youth in the design and development of AI tools. Viewing LGBTQ+ adolescents as designers, we propose a community-engaged approach to enable a group of queer teens with sexual health education expertise to design their own GenAI health tools. Through this collaborative effort, we put forward participatory ways to develop processes minimizing the potential iatrogenic harms of biased AI models, while harnessing AI benefits for LGBTQ+ teens. In this workshop, we offer specialized community-engaged knowledge in designing equitable AI tools to improve LGBTQ+ well-being.
In this paper, we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions. Subsequently, we use CLIP's text encoder to obtain the text embeddings of both the generic semantic descriptions and the LMM-elicited emotions. Finally, we use these text embeddings along with the corresponding CLIP image embeddings to perform the DID task. The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
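A hedged sketch of the embedding step (the checkpoint name and example texts are our own assumptions, and the downstream classifier is omitted):

```python
from transformers import CLIPModel, CLIPProcessor

# Sketch: embed LMM-produced descriptions and elicited emotions with
# CLIP's text encoder; these would be combined with CLIP image embeddings
# as input features for the DID classifier.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

descriptions = ["a chaotic street scene", "elicited emotion: fear"]  # from an LMM
text_in = proc(text=descriptions, return_tensors="pt", padding=True)
text_emb = model.get_text_features(**text_in)        # (2, 512) text embeddings
# image_emb = model.get_image_features(**proc(images=img, return_tensors="pt"))
```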
In this paper, we extend the Generalized Moving Least-Squares (GMLS) method in two different ways to solve vector-valued PDEs on unknown smooth 2D manifolds without boundaries embedded in $\mathbb{R}^{3}$, identified with randomly sampled point cloud data. The two approaches are referred to as the intrinsic method and the extrinsic method. For the intrinsic method, which relies on local approximations of metric tensors, we simplify the formulas for Laplacians and covariant derivatives acting on vector fields at the base point by calculating them in a local Monge coordinate system. The extrinsic method, on the other hand, formulates tangential derivatives on a submanifold as the projection of the directional derivative in the ambient Euclidean space onto the tangent space of the submanifold. One challenge of this method is that the discretization of vector Laplacians yields a matrix whose size depends on the ambient dimension. To overcome this issue, we reduce the dimension of the vector Laplacian matrices by employing an appropriate projection. The complexity of both methods scales with the dimension of the manifold rather than the ambient dimension. We also present supporting numerical examples, including eigenvalue problems, linear Poisson equations, and nonlinear Burgers' equations, to examine the numerical accuracy of the proposed methods on various smooth manifolds.
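The extrinsic idea can be illustrated at a single point of the point cloud: with unit normal $n$, the projection $P = I - nn^{\top}$ maps an ambient gradient onto the tangent plane. A minimal numerical sketch:

```python
import numpy as np

# Sketch of the extrinsic formulation at one point: project an ambient
# gradient onto the tangent space of the surface via P = I - n n^T.
def tangential_gradient(grad_ambient: np.ndarray, n: np.ndarray) -> np.ndarray:
    P = np.eye(3) - np.outer(n, n)      # orthogonal projection onto tangent plane
    return P @ grad_ambient

n = np.array([0.0, 0.0, 1.0])           # e.g. normal of the unit sphere at the pole
g = np.array([1.0, 2.0, 3.0])           # ambient gradient at that point
print(tangential_gradient(g, n))        # -> [1. 2. 0.], normal component removed
```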
In this paper, we introduce UniBridge (Cross-Lingual Transfer Learning with Optimized Embeddings and Vocabulary), a comprehensive approach developed to improve the effectiveness of cross-lingual transfer learning, particularly for languages with limited resources. Our approach tackles two essential elements of a language model: the initialization of embeddings and the optimal vocabulary size. Specifically, we propose a novel embedding initialization method that leverages both lexical and semantic alignment for a language. In addition, we present a method for systematically searching for the optimal vocabulary size, ensuring a balance between model complexity and linguistic coverage. Our experiments across multilingual datasets show that our approach greatly improves the F1 score in several languages. UniBridge is a robust and adaptable solution for cross-lingual systems in various languages, highlighting the significance of embedding initialization and vocabulary size selection in cross-lingual environments.
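One plausible reading of the embedding initialization (an illustrative guess on our part, not UniBridge's exact procedure) is a similarity-weighted average of aligned source embeddings that mixes a lexical and a semantic score:

```python
import numpy as np

# Hypothetical sketch: initialize a new target-language token's embedding
# as a softmax-weighted average of source embeddings, where the weights
# blend lexical and semantic similarity to the new token.
def init_embedding(src_embs, lex_sim, sem_sim, alpha=0.5):
    # src_embs: (N, D); lex_sim, sem_sim: (N,) similarities to the new token
    w = alpha * lex_sim + (1 - alpha) * sem_sim
    w = np.exp(w) / np.exp(w).sum()          # softmax over source tokens
    return w @ src_embs                      # (D,) initialized embedding

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 768))             # source-model embedding table
print(init_embedding(E, rng.random(1000), rng.random(1000)).shape)  # (768,)
```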
In this research, we introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our methodology utilises the Wav2Vec 2.0 architecture, pre-trained on an extensive podcast dataset, to capture a wide array of audio features that include both linguistic and paralinguistic components. We refine our feature extraction process by employing a fusion technique that combines individual features with a global mean vector, thereby embedding a broader contextual understanding into our analysis. A key aspect of our approach is a multi-task fusion strategy that not only leverages these features but also incorporates a pre-trained Valence-Arousal-Dominance (VAD) model, refining emotion intensity prediction by concurrently processing multiple emotional dimensions. For the temporal analysis of audio data, our feature fusion process utilises a Long Short-Term Memory (LSTM) network. This approach, which relies solely on the provided audio data, shows marked improvement over the existing baseline, offers a more comprehensive understanding of emotional mimicry in naturalistic settings, and achieved second place in the EMI challenge.
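A minimal sketch of the described fusion (dimensions and the regression head are illustrative assumptions): frame-level features are concatenated with their global mean and summarized over time by an LSTM:

```python
import torch
import torch.nn as nn

# Sketch: fuse frame-level audio features with their global mean vector,
# then run an LSTM over time for emotion intensity regression.
B, T, D = 4, 200, 768                        # batch, frames, feature dimension
feats = torch.randn(B, T, D)                 # e.g. Wav2Vec 2.0 hidden states
global_mean = feats.mean(dim=1, keepdim=True).expand(-1, T, -1)
fused = torch.cat([feats, global_mean], dim=-1)   # (B, T, 2D) local + global context

lstm = nn.LSTM(2 * D, 128, batch_first=True)
head = nn.Linear(128, 6)                     # e.g. six mimicry intensity targets
out, _ = lstm(fused)
pred = head(out[:, -1])                      # predict from the final time step
```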
This paper introduces the Visual Inverse Kinematics problem (VIK) to fill the gap between robot Inverse Kinematics (IK) and visual servo control. Different from the IK problem, the VIK problem seeks to find robot configurations subject to vision-based constraints, in addition to kinematic constraints. In this work, we develop a formulation of the VIK problem with a Field of View (FoV) constraint, enforcing the visibility of an object from a camera on the robot. Our proposed solution is based on the idea of adding a virtual kinematic chain connecting the physical robot and the object; the FoV constraint is then equivalent to a joint angle kinematic constraint. Along the way, we introduce multiple vision-based cost functions to fulfill different objectives. We solve this formulation of the VIK problem using a method that involves a semidefinite program (SDP) constraint followed by a rank minimization algorithm. The performance of this method for solving the VIK problem is validated through simulations.
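Before any SDP machinery, the FoV constraint itself reduces to a simple geometric test; the sketch below (our own simplification, not the paper's virtual-chain formulation) checks whether the camera-to-object direction lies within half the field of view of the optical axis:

```python
import numpy as np

# Sketch: the object at p_obj is visible from the camera at p_cam if the
# angle between the optical axis and the camera-to-object direction is
# at most half the field of view.
def in_fov(p_cam: np.ndarray, axis: np.ndarray, p_obj: np.ndarray,
           fov_deg: float) -> bool:
    d = p_obj - p_cam
    cos_angle = axis @ d / (np.linalg.norm(axis) * np.linalg.norm(d))
    return cos_angle >= np.cos(np.deg2rad(fov_deg / 2))

print(in_fov(np.zeros(3), np.array([0.0, 0.0, 1.0]),
             np.array([0.1, 0.0, 1.0]), fov_deg=60.0))   # True: ~5.7 deg off-axis
```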