Traditional Chinese medicine (TCM) prescription is the most critical form of TCM treatment, and uncovering the complex nonlinear relationship between symptoms and TCM is of great significance for clinical practice and assisting physicians in diagnosis and treatment. Although there have been some studies on TCM prescription generation, these studies consider a single factor and directly model the symptom-prescription generation problem mainly based on symptom descriptions, lacking guidance from TCM knowledge. To this end, we propose a RoBERTa and Knowledge Enhancement model for Prescription Generation of Traditional Chinese Medicine (RoKEPG). RoKEPG is firstly pre-trained by our constructed TCM corpus, followed by fine-tuning the pre-trained model, and the model is guided to generate TCM prescriptions by introducing four classes of knowledge of TCM through the attention mask matrix. Experimental results on the publicly available TCM prescription dataset show that RoKEPG improves the F1 metric by about 2% over the baseline model with the best results.
We systematically analyze the accuracy of Physics-Informed Neural Networks (PINNs) in approximating solutions to the critical Surface Quasi-Geostrophic (SQG) equation on two-dimensional periodic boxes. The critical SQG equation involves advection and diffusion described by nonlocal periodic operators, posing challenges for neural network-based methods that do not commonly exhibit periodic boundary conditions. In this paper, we present a novel approximation of these operators using their nonperiodic analogs based on singular integral representation formulas and use it to perform error estimates. This idea can be generalized to a larger class of nonlocal partial differential equations whose solutions satisfy prescribed boundary conditions, thereby initiating a new PINNs theory for equations with nonlocalities.
With the steady rise of the use of AI in bio-technical applications and the widespread adoption of genomics sequencing, an increasing amount of AI-based algorithms and tools is entering the research and production stage affecting critical decision-making streams like drug discovery and clinical outcomes. This paper demonstrates the vulnerability of AI models often utilized downstream tasks on recognized public genomics datasets. We undermine model robustness by deploying an attack that focuses on input transformation while mimicking the real data and confusing the model decision-making, ultimately yielding a pronounced deterioration in model performance. Further, we enhance our approach by generating poisoned data using a variational autoencoder-based model. Our empirical findings unequivocally demonstrate a decline in model performance, underscored by diminished accuracy and an upswing in false positives and false negatives. Furthermore, we analyze the resulting adversarial samples via spectral analysis yielding conclusions for countermeasures against such attacks.
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in Zero-Shot Egocentric Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this paper, we introduce GPT4Ego, a straightforward yet remarkably potent VLM framework for ZS-EAR, designed to enhance the fine-grained alignment of concept and description between vision and language. Extensive experiments demonstrate GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4%), EGTEA (39.6%, +5.5%), and CharadesEgo (31.5%, +2.6%).
A standard treatment protocol for breast cancer entails administering neoadjuvant therapy followed by surgical removal of the tumor and surrounding tissue. Pathologists typically rely on cabinet X-ray radiographs, known as Faxitron, to examine the excised breast tissue and diagnose the extent of residual disease. However, accurately determining the location, size, and focality of residual cancer can be challenging, and incorrect assessments can lead to clinical consequences. The utilization of automated methods can improve the histopathology process, allowing pathologists to choose regions for sampling more effectively and precisely. Despite the recognized necessity, there are currently no such methods available. Training such automated detection models require accurate ground truth labels on ex-vivo radiology images, which can be acquired through registering Faxitron and histopathology images and mapping the extent of cancer from histopathology to x-ray images. This study introduces a deep learning-based image registration approach trained on mono-modal synthetic image pairs. The models were trained using data from 50 women who received neoadjuvant chemotherapy and underwent surgery. The results demonstrate that our method is faster and yields significantly lower average landmark error ($2.1\pm1.96$ mm) over the state-of-the-art iterative ($4.43\pm4.1$ mm) and deep learning ($4.02\pm3.15$ mm) approaches. Improved performance of our approach in integrating radiology and pathology information facilitates generating large datasets, which allows training models for more accurate breast cancer detection.
ChatGPT explores a strategic blueprint of question answering (QA) in delivering medical diagnosis, treatment recommendations, and other healthcare support. This is achieved through the increasing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms. By transitioning the distribution of text, images, videos, and other modalities from the general domain to the medical domain, these techniques have expedited the progress of medical domain question answering (MDQA). They bridge the gap between human natural language and sophisticated medical domain knowledge or expert manual annotations, handling large-scale, diverse, unbalanced, or even unlabeled data analysis scenarios in medical contexts. Central to our focus is the utilizing of language models and multimodal paradigms for medical question answering, aiming to guide the research community in selecting appropriate mechanisms for their specific medical research requirements. Specialized tasks such as unimodal-related question answering, reading comprehension, reasoning, diagnosis, relation extraction, probability modeling, and others, as well as multimodal-related tasks like vision question answering, image caption, cross-modal retrieval, report summarization, and generation, are discussed in detail. Each section delves into the intricate specifics of the respective method under consideration. This paper highlights the structures and advancements of medical domain explorations against general domain methods, emphasizing their applications across different tasks and datasets. It also outlines current challenges and opportunities for future medical domain research, paving the way for continued innovation and application in this rapidly evolving field.
Nuclei instance segmentation in histopathological images is of great importance for biological analysis and cancer diagnosis but remains challenging for two reasons. (1) Similar visual presentation of intranuclear and extranuclear regions of chromophobe nuclei often causes under-segmentation, and (2) current methods lack the exploration of nuclei structure, resulting in fragmented instance predictions. To address these problems, this paper proposes a structure encoding and interaction network, termed SEINE, which develops the structure modeling scheme of nuclei and exploits the structure similarity between nuclei to improve the integrality of each segmented instance. Concretely, SEINE introduces a contour-based structure encoding (SE) that considers the correlation between nuclei structure and semantics, realizing a reasonable representation of the nuclei structure. Based on the encoding, we propose a structure-guided attention (SGA) that takes the clear nuclei as prototypes to enhance the structure learning for the fuzzy nuclei. To strengthen the structural learning ability, a semantic feature fusion (SFF) is presented to boost the semantic consistency of semantic and structure branches. Furthermore, a position enhancement (PE) method is applied to suppress incorrect nuclei boundary predictions. Extensive experiments demonstrate the superiority of our approaches, and SEINE achieves state-of-the-art (SOTA) performance on four datasets. The code is available at \href{//github.com/zhangye-zoe/SEINE}{//github.com/zhangye-zoe/SEINE}.
Theory of Mind (ToM), the ability to understand people's minds, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data, which can include visual cues, linguistic narratives, or both. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.
Understanding causality helps to structure interventions to achieve specific goals and enables predictions under interventions. With the growing importance of learning causal relationships, causal discovery tasks have transitioned from using traditional methods to infer potential causal structures from observational data to the field of pattern recognition involved in deep learning. The rapid accumulation of massive data promotes the emergence of causal search methods with brilliant scalability. Existing summaries of causal discovery methods mainly focus on traditional methods based on constraints, scores and FCMs, there is a lack of perfect sorting and elaboration for deep learning-based methods, also lacking some considers and exploration of causal discovery methods from the perspective of variable paradigms. Therefore, we divide the possible causal discovery tasks into three types according to the variable paradigm and give the definitions of the three tasks respectively, define and instantiate the relevant datasets for each task and the final causal model constructed at the same time, then reviews the main existing causal discovery methods for different tasks. Finally, we propose some roadmaps from different perspectives for the current research gaps in the field of causal discovery and point out future research directions.
Recently, Mutual Information (MI) has attracted attention in bounding the generalization error of Deep Neural Networks (DNNs). However, it is intractable to accurately estimate the MI in DNNs, thus most previous works have to relax the MI bound, which in turn weakens the information theoretic explanation for generalization. To address the limitation, this paper introduces a probabilistic representation of DNNs for accurately estimating the MI. Leveraging the proposed MI estimator, we validate the information theoretic explanation for generalization, and derive a tighter generalization bound than the state-of-the-art relaxations.
Many natural language processing tasks solely rely on sparse dependencies between a few tokens in a sentence. Soft attention mechanisms show promising performance in modeling local/global dependencies by soft probabilities between every two tokens, but they are not effective and efficient when applied to long sentences. By contrast, hard attention mechanisms directly select a subset of tokens but are difficult and inefficient to train due to their combinatorial nature. In this paper, we integrate both soft and hard attention into one context fusion model, "reinforced self-attention (ReSA)", for the mutual benefit of each other. In ReSA, a hard attention trims a sequence for a soft self-attention to process, while the soft attention feeds reward signals back to facilitate the training of the hard one. For this purpose, we develop a novel hard attention called "reinforced sequence sampling (RSS)", selecting tokens in parallel and trained via policy gradient. Using two RSS modules, ReSA efficiently extracts the sparse dependencies between each pair of selected tokens. We finally propose an RNN/CNN-free sentence-encoding model, "reinforced self-attention network (ReSAN)", solely based on ReSA. It achieves state-of-the-art performance on both Stanford Natural Language Inference (SNLI) and Sentences Involving Compositional Knowledge (SICK) datasets.