Word embeddings have become ubiquitous and are widely used in various text mining and natural language processing (NLP) tasks, such as information retrieval, semantic analysis, and machine translation, among many others. Unfortunately, it is prohibitively expensive to train word embeddings on a relatively large corpus. We propose a graph-based word embedding algorithm, called Word-Graph2vec, which converts the large corpus into a word co-occurrence graph, samples word sequences from this graph via random walks, and finally trains the word embedding on the sampled corpus. We posit that, owing to the stable vocabulary, recurring idioms, and fixed expressions of English, the size and density of the word co-occurrence graph change only slightly as the training corpus grows. Consequently, Word-Graph2vec has a stable runtime on large-scale datasets, and its performance advantage becomes increasingly pronounced as the training corpus grows. Extensive experiments conducted on real-world datasets show that the proposed algorithm outperforms traditional Skip-Gram by four to five times in terms of efficiency, while the error introduced by random walk sampling remains small.
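The graph-construction and sampling pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the co-occurrence window, walk length, and walks-per-node values are assumptions chosen for readability.

```python
import random
from collections import defaultdict

def build_cooccurrence_graph(sentences, window=2):
    """Weighted word co-occurrence graph: edge weight = co-occurrence count."""
    graph = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(i + 1, min(i + window + 1, len(sent))):
                graph[w][sent[j]] += 1
                graph[sent[j]][w] += 1
    return graph

def random_walk(graph, start, length, rng):
    """One weighted random walk; the walk serves as a pseudo-sentence for training."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = graph[walk[-1]]
        if not nbrs:
            break
        words = list(nbrs)
        weights = [nbrs[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk

def sample_corpus(graph, walks_per_node, walk_length, seed=0):
    """Sampled corpus on which a standard embedding model (e.g. Skip-Gram) is trained."""
    rng = random.Random(seed)
    corpus = []
    for node in list(graph):
        for _ in range(walks_per_node):
            corpus.append(random_walk(graph, node, walk_length, rng))
    return corpus
```

Because the walks, not the raw corpus, feed the embedding trainer, the sampling budget (walks per node times walk length) rather than the corpus size drives the training cost.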
The newly released Segment Anything Model (SAM) is a popular tool in image processing due to its superior segmentation accuracy, variety of input prompts, training capabilities, and efficient model design. However, its current model is trained on a diverse dataset not tailored to medical images, particularly ultrasound images. Ultrasound images tend to contain considerable noise, making it difficult to segment out important structures. In this project, we developed ClickSAM, which fine-tunes the Segment Anything Model using click prompts for ultrasound images. ClickSAM has two stages of training: the first stage is trained on single-click prompts centered in the ground-truth contours, and the second stage focuses on improving model performance through additional positive and negative click prompts. By comparing the first-stage predictions to the ground-truth masks, true-positive, false-positive, and false-negative segments are calculated. Positive clicks are generated from the true-positive and false-negative segments, and negative clicks are generated from the false-positive segments. The Centroidal Voronoi Tessellation algorithm is then employed to place positive and negative click prompts in each segment, which are used to enhance model performance during the second stage of training. With this click-based training scheme, ClickSAM exhibits superior performance compared to existing models for ultrasound image segmentation.
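The Centroidal Voronoi Tessellation step for placing clicks inside a segment can be sketched with Lloyd's algorithm restricted to the mask's pixels. This is a simplified sketch under assumed parameters (number of clicks `k`, iteration count), not the paper's exact procedure:

```python
import random

def cvt_clicks(mask, k, iters=10, seed=0):
    """Pick k well-spread click points inside a binary segment mask via
    Lloyd's algorithm (a discrete Centroidal Voronoi Tessellation):
    alternately assign mask pixels to their nearest center, then move
    each center to its cluster's centroid, snapped back onto the mask."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    rng = random.Random(seed)
    centers = rng.sample(pts, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:
                cy = sum(p[0] for p in cl) / len(cl)
                cx = sum(p[1] for p in cl) / len(cl)
                # snap the centroid to the nearest pixel inside the mask
                centers[i] = min(cl, key=lambda p: (p[0] - cy) ** 2
                                                 + (p[1] - cx) ** 2)
    return centers
```

Running this on a true-positive or false-negative segment yields positive clicks, and on a false-positive segment yields negative clicks, as the abstract describes.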
Gaussian binomial coefficients are q-analogues of the binomial coefficients of integers. On the other hand, binomial coefficients have been extended to finite words, i.e., elements of finitely generated free monoids. In this paper we bring together these two notions by introducing q-analogues of binomial coefficients of words. We study their basic properties, e.g., by extending classical formulas such as the q-Vandermonde identity and the identity of Manvel et al. to our setting. As a consequence, we get information about the structure of the considered words: these q-deformations of binomial coefficients of words contain much richer information than the original coefficients. From an algebraic perspective, we introduce a q-shuffle product and a family of q-infiltration products for non-commutative formal power series. Finally, we apply our results to generalize a theorem of Eilenberg characterizing so-called p-group languages. We show that a language is of this type if and only if it is a Boolean combination of specific languages defined through q-binomial coefficients seen as polynomials over $\mathbb{F}_p$.
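For readers unfamiliar with the integer case, the Gaussian binomial coefficient $\binom{n}{k}_q$ can be computed as a polynomial in $q$ from the classical q-Pascal rule $\binom{n}{k}_q = \binom{n-1}{k-1}_q + q^k\binom{n-1}{k}_q$. A small illustrative sketch, representing polynomials as coefficient lists (index = power of $q$):

```python
def poly_add(a, b):
    """Add two polynomials given as coefficient lists."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(n)]

def poly_shift(a, s):
    """Multiply a polynomial by q^s."""
    return [0] * s + a

def gauss_binom(n, k):
    """Gaussian binomial [n choose k]_q via the q-Pascal rule
    [n,k]_q = [n-1,k-1]_q + q^k [n-1,k]_q."""
    if k < 0 or k > n:
        return [0]
    if k == 0 or k == n:
        return [1]
    return poly_add(gauss_binom(n - 1, k - 1),
                    poly_shift(gauss_binom(n - 1, k), k))
```

For instance, $\binom{4}{2}_q = 1 + q + 2q^2 + q^3 + q^4$, which specializes to the ordinary binomial coefficient $6$ at $q = 1$.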
We have introduced a q-deformation, i.e., a polynomial in q with natural coefficients, of the binomial coefficient of two finite words u and v counting the number of occurrences of v as a subword of u. In this paper, we examine the q-deformation of Parikh matrices as introduced by Eğecioğlu in 2004. Many classical results concerning Parikh matrices generalize to this new framework: Our first important observation is that the elements of such a matrix are in fact q-deformations of binomial coefficients of words. We also study their inverses and, as an application, we obtain new identities about q-binomials. For a finite word z and for the sequence $(p_n)_{n\ge 0}$ of prefixes of an infinite word, we show that the polynomial sequence $\binom{p_n}{z}_q$ converges to a formal series. We present links with additive number theory and k-regular sequences. In the case of a periodic word $u^\omega$, we generalize a result of Salomaa: the sequence $\binom{u^n}{z}_q$ satisfies a linear recurrence relation with polynomial coefficients. In connection with the theory of integer partitions, we describe the growth and the zero set of the coefficients of the series associated with $u^\omega$. Finally, we show that the minors of a q-Parikh matrix are polynomials with natural coefficients and consider a generalization of Cauchy's inequality. We also compare q-Parikh matrices associated with an arbitrary word with those associated with a canonical word $12\cdots k$ made of pairwise distinct symbols.
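The classical (undeformed) binomial coefficient of words that this paper q-deforms — the number of occurrences of v as a scattered subword of u — is computable by a standard dynamic program based on the Pascal-like recursion $\binom{ua}{vb} = \binom{u}{vb} + [a = b]\binom{u}{v}$. A small sketch of the q = 1 case only; the paper's q-weighted version is not reproduced here:

```python
def word_binomial(u, v):
    """Number of occurrences of v as a (scattered) subword of u,
    i.e. the classical binomial coefficient of words (u choose v).
    dp[j] = number of occurrences of v[:j] in the prefix of u read so far."""
    dp = [1] + [0] * len(v)
    for a in u:
        # traverse j downwards so each letter of u is used at most once per occurrence
        for j in range(len(v), 0, -1):
            if v[j - 1] == a:
                dp[j] += dp[j - 1]
    return dp[len(v)]
```

For example, "ab" occurs three times as a subword of "abab" (positions 1-2, 1-4, and 3-4), so $\binom{\mathtt{abab}}{\mathtt{ab}} = 3$.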
Instruction tuning of Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks from different sources and formats leads to inevitable task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address this, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization capabilities of MoCLE for novel instructions. Extensive experiments on 10 zero-shot tasks demonstrate the effectiveness of MoCLE.
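The cluster-conditional routing idea can be illustrated schematically: route each instruction to the LoRA expert of its nearest cluster, then mix that expert's output with a universal expert. Everything below (embeddings, the nearest-centroid router, the mixing weight `alpha`) is an illustrative assumption, not the paper's exact formulation:

```python
def route(instruction_emb, centroids):
    """Pick the expert whose cluster centroid is nearest to the
    instruction embedding (top-1 cluster-conditional routing)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda i: dist2(instruction_emb, centroids[i]))

def mocle_output(x, experts, universal, expert_idx, alpha=0.5):
    """Mix the selected cluster expert with the universal expert
    (the mixing scheme here is a placeholder)."""
    e = experts[expert_idx](x)
    u = universal(x)
    return [alpha * a + (1 - alpha) * b for a, b in zip(e, u)]
```

The universal expert receives every input regardless of routing, which is what lets the model handle novel instructions that fall outside the training clusters.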
Current state-of-the-art 6D pose estimation is too compute-intensive to be deployed on edge devices, such as the Microsoft HoloLens (2) or Apple iPad, both used for an increasing number of augmented reality applications. The quality of AR depends greatly on the system's ability to detect and overlay geometry within the scene. We propose a synthetically trained client-server-based augmented reality application, demonstrating state-of-the-art object pose estimation of metallic and texture-less industrial objects on edge devices. Synthetic data enables training without real photographs, e.g., for yet-to-be-manufactured objects. Our qualitative evaluation on an AR-assisted sorting task, and our quantitative evaluation on both renderings and real-world data recorded on a HoloLens 2, shed light on its real-world applicability.
Advances in AI invite misuse of language models as replacements for human participants. We argue that treating their responses as glimpses into an average human mind fundamentally mischaracterizes these statistical algorithms and that language models should be embraced as flexible simulation tools, able to mimic diverse behaviors without possessing human traits themselves.
Lateral flow immunoassays (LFIA) are widely used worldwide for the detection of different analytes because they combine multiple advantages such as low production cost, simplicity, and portability, which allow biomarker detection without requiring infrastructure or highly trained personnel. Here we propose solutions for the laboratory-scale manufacturing of LFIA, particularly for the controlled and active dispensing of reagents in the form of Test Lines (TL) and Control Lines (CL). To accomplish this task, we adapted a 3D printer to also control Syringe Pumps (SP): the proposed adaptation is easy and free, and many laboratories already have a 3D printer in their infrastructure. In turn, the standard function of the 3D printer can easily be restored by disconnecting the SPs and reconnecting the extruder. Additionally, the unified control of the 3D printer enables dual, active, regulable, and simultaneous dispensing, four features that are typically found only in certain high-cost commercial equipment. With the proposed setup, we addressed the challenge of simultaneously dispensing at least two lines (CL and TL) with SPs controlled by a 3D printer, including regulation of the width of the dispensed lines within experimental limits. Finally, the construction of an LFIA for the detection of leptospirosis is shown as a practical example of automated reagent dispensing.
Research on gender and language is closely tied to social debates on gender equality and non-discriminatory language use. Psycholinguistic scholars have made significant contributions in this field. However, corpus-based studies that investigate these matters within the context of language use are still rare. In our study, we address the question of how much textual material would actually have to be changed if non-gender-inclusive texts were rewritten to be gender-inclusive. This quantitative measure is an important empirical insight, as a recurring argument against the use of gender-inclusive German is that it supposedly makes written texts too long and complicated. It is also argued that gender-inclusive language has negative effects on language learners. However, such effects are only likely if gender-inclusive texts are very different from those that are not gender-inclusive. In our corpus-linguistic study, we manually annotated German press texts to identify the parts that would have to be changed. Our results show that, on average, less than 1% of all tokens would be affected by gender-inclusive language. This small proportion calls into question whether gender-inclusive German presents a substantial barrier to understanding and learning the language, particularly when we take into account the potential complexities of interpreting masculine generics.
Sarcasm Explanation in Dialogue (SED) is a new yet challenging task, which aims to generate a natural language explanation for a given sarcastic dialogue that involves multiple modalities (i.e., utterance, video, and audio). Although existing studies have achieved great success based on the generative pretrained language model BART, they overlook the sentiments residing in the utterance, video, and audio, which are vital clues for sarcasm explanation. In fact, it is non-trivial to incorporate sentiments for boosting SED performance, due to three main challenges: 1) the diverse effects of utterance tokens on sentiments; 2) the gap between video-audio sentiment signals and the embedding space of BART; and 3) the various relations among utterances, utterance sentiments, and video-audio sentiments. To tackle these challenges, we propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, in which a heuristic utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip. Thereafter, we devise a context-sentiment graph to comprehensively model the semantic relations among the utterances, utterance sentiments, and video-audio sentiments, so as to facilitate sarcasm explanation generation. Extensive experiments on the publicly released dataset WITS verify the superiority of our model over cutting-edge methods.
Speech intelligibility can be degraded by multiple factors, such as noisy environments, technical difficulties, or biological conditions. This work focuses on the development of an automatic non-intrusive system for predicting the speech intelligibility level in the latter case. The main contribution of our research on this topic is the use of Long Short-Term Memory (LSTM) networks with log-mel spectrograms as input features for this purpose. In addition, this LSTM-based system is further enhanced by the incorporation of a simple attention mechanism that is able to determine the most relevant frames for this task. The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. Results show that the attention LSTM architecture outperforms both a reference Support Vector Machine (SVM)-based system with hand-crafted features and an LSTM-based system with mean pooling.
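The attention mechanism replacing mean pooling can be sketched as a softmax-weighted average of the per-frame LSTM outputs, where a learned scoring vector decides which frames matter. This is a generic illustration with an assumed dot-product scorer, not the paper's exact architecture:

```python
import math

def attention_pool(frames, w):
    """Attention pooling over a sequence of frame features (e.g. per-frame
    LSTM outputs computed from log-mel spectrogram frames).
    Scores each frame by its dot product with a learned vector w,
    softmaxes the scores, and returns the weighted mean plus the weights."""
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frames]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    attn = [e / z for e in exps]         # attention weights, sum to 1
    dim = len(frames[0])
    pooled = [sum(attn[t] * frames[t][d] for t in range(len(frames)))
              for d in range(dim)]
    return pooled, attn
```

With uniform weights this reduces to the mean-pooling baseline; the learned scorer is what allows the model to emphasize the frames most indicative of intelligibility level.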