When building a new application we are increasingly confronted with the need of reusing and integrating pre-existing knowledge. Nevertheless, it is a fact that this prior knowledge is virtually impossible to reuse as-is. This is true also in domains, e.g., eHealth, where a lot of effort has been put into developing high-quality standards and reference ontologies, e.g. FHIR1. In this paper, we propose an integrated methodology, called iTelos, which enables data and knowledge reuse towards the construction of Interoperable Electronic Health Records (iEHR). The key intuition is that the data level and the schema level of an application should be developed independently, thus allowing for maximum flexibility in the reuse of the prior knowledge, but under the overall guidance of the needs to be satisfied, formalized as competence queries. This intuition is implemented by codifying all the requirements, including those concerning reuse, as part of a purpose defined a priori, which is then used to drive a middle-out development process where the application schema and data are continuously aligned. The proposed methodology is validated through its application to a large-scale case study.
We present a new text-to-SQL dataset for electronic health records (EHRs). The utterances were collected from 222 hospital staff members, including physicians, nurses, and insurance review and health records teams. To construct the QA dataset on structured EHR data, we conducted a poll at a university hospital and used the responses to create seed questions. We then manually linked these questions to two open-source EHR databases, MIMIC-III and eICU, and included various time expressions and held-out unanswerable questions in the dataset, which were also collected from the poll. Our dataset poses a unique set of challenges: the model needs to 1) generate SQL queries that reflect a wide range of needs in the hospital, including simple retrieval and complex operations such as calculating survival rate, 2) understand various time expressions to answer time-sensitive questions in healthcare, and 3) distinguish whether a given question is answerable or unanswerable. We believe our dataset, EHRSQL, can serve as a practical benchmark for developing and assessing QA models on structured EHR data and take a step further towards bridging the gap between text-to-SQL research and its real-life deployment in healthcare. EHRSQL is available at //github.com/glee4810/EHRSQL.
The Electronic Health Record (EHR) is an essential part of the modern medical system and impacts healthcare delivery, operations, and research. Unstructured text is attracting much attention despite structured information in the EHRs and has become an exciting research field. The success of the recent neural Natural Language Processing (NLP) method has led to a new direction for processing unstructured clinical notes. In this work, we create a python library for clinical texts, EHRKit. This library contains two main parts: MIMIC-III-specific functions and tasks specific functions. The first part introduces a list of interfaces for accessing MIMIC-III NOTEEVENTS data, including basic search, information retrieval, and information extraction. The second part integrates many third-party libraries for up to 12 off-shelf NLP tasks such as named entity recognition, summarization, machine translation, etc.
Marine litter poses significant threats to marine and coastal environments, with its impacts ever-growing. Remote sensing provides an advantageous supplement to traditional mitigation techniques, such as local cleaning operations and trawl net surveys, due to its capabilities for extensive coverage and frequent observation. In this study, we used Radiative Transfer Model (RTM) simulated data and data from the Multispectral Instrument (MSI) of the Sentinel-2 mission in combination with machine learning algorithms. Our aim was to study the spectral behavior of marine plastic pollution and evaluate the applicability of RTMs within this research area. The results from the exploratory analysis and unsupervised classification using the KMeans algorithm indicate that the spectral behavior of pollutants is influenced by factors such as the type of polymer and pixel coverage percentage. The findings also reveal spectral characteristics and trends of association and differentiation among elements. The applied methodology is strongly dependent on the data, and if reapplied in new, more diverse, and detailed datasets, it can potentially generate even better results. These insights can guide future research in remote sensing applications for detecting marine plastic pollution.
We develop a new model for spatial random field reconstruction of a binary-valued spatial phenomenon. In our model, sensors are deployed in a wireless sensor network across a large geographical region. Each sensor measures a non-Gaussian inhomogeneous temporal process which depends on the spatial phenomenon. Two types of sensors are employed: one collects point observations at specific time points, while the other collects integral observations over time intervals. Subsequently, the sensors transmit these time-series observations to a Fusion Center (FC), and the FC infers the spatial phenomenon from these observations. We show that the resulting posterior predictive distribution is intractable and develop a tractable two-step procedure to perform inference. Firstly, we develop algorithms to perform approximate Likelihood Ratio Tests on the time-series observations, compressing them to a single bit for both point sensors and integral sensors. Secondly, once the compressed observations are transmitted to the FC, we utilize a Spatial Best Linear Unbiased Estimator (S-BLUE) to reconstruct the binary spatial random field at any desired spatial location. The performance of the proposed approach is studied using simulation. We further illustrate the effectiveness of our method using a weather dataset from the National Environment Agency (NEA) of Singapore with fields including temperature and relative humidity.
Sensitive information is intrinsically tied to interactions in healthcare, and its protection is of paramount importance for achieving high-quality patient outcomes. Research in healthcare privacy and security is predominantly focused on understanding the factors that increase the susceptibility of users to privacy and security breaches. To understand further, we systematically review 26 research papers in this domain to explore the existing user studies in healthcare privacy and security. Following the review, we conducted a card-sorting exercise, allowing us to identify 12 themes integral to this subject such as "Data Sharing," "Risk Awareness," and "Privacy." Further to the identification of these themes, we performed an in-depth analysis of the 26 research papers report on the insights into the discourse within the research community about healthcare privacy and security, particularly from the user perspective.
Gravitational lensing is the relativistic effect generated by massive bodies, which bend the space-time surrounding them. It is a deeply investigated topic in astrophysics and allows validating theoretical relativistic results and studying faint astrophysical objects that would not be visible otherwise. In recent years Machine Learning methods have been applied to support the analysis of the gravitational lensing phenomena by detecting lensing effects in data sets consisting of images associated with brightness variation time series. However, the state-of-art approaches either consider only images and neglect time-series data or achieve relatively low accuracy on the most difficult data sets. This paper introduces DeepGraviLens, a novel multi-modal network that classifies spatio-temporal data belonging to one non-lensed system type and three lensed system types. It surpasses the current state of the art accuracy results by $\approx 3\%$ to $\approx 11\%$, depending on the considered data set. Such an improvement will enable the acceleration of the analysis of lensed objects in upcoming astrophysical surveys, which will exploit the petabytes of data collected, e.g., from the Vera C. Rubin Observatory.
Transformer-based language models have achieved impressive success in various natural language processing tasks due to their ability to capture complex dependencies and contextual information using self-attention mechanisms. However, they are not without limitations. These limitations include hallucinations, where they produce incorrect outputs with high confidence, and alignment issues, where they generate unhelpful and unsafe outputs for human users. These limitations stem from the absence of implicit and missing context in the data alone. To address this, researchers have explored augmenting these models with external knowledge from knowledge graphs to provide the necessary additional context. However, the ad-hoc nature of existing methods makes it difficult to properly analyze the effects of knowledge infusion on the many moving parts or components of a transformer. This paper introduces a systematic method for infusing knowledge into different components of a transformer-based model. A modular framework is proposed to identify specific components within the transformer architecture, such as the self-attention mechanism, encoder layers, or the input embedding layer, where knowledge infusion can be applied. Additionally, extensive experiments are conducted on the General Language Understanding Evaluation (GLUE) benchmark tasks, and the findings are reported. This systematic approach aims to facilitate more principled approaches to incorporating knowledge into language model architectures.
Time-to-event (TTE) models are used in medicine and other fields for estimating the probability distribution of the time until a specific event occurs. TTE models provide many advantages over classification using fixed time horizons, including naturally handling censored observations, but require more parameters and are challenging to train in settings with limited labeled data. Existing approaches, e.g. proportional hazards or accelerated failure time, employ distributional assumptions to reduce parameters but are vulnerable to model misspecification. In this work, we address these challenges with MOTOR (Many Outcome Time Oriented Representations), a self-supervised model that leverages temporal structure found in collections of timestamped events in electronic health records (EHR) and health insurance claims. MOTOR uses a TTE pretraining objective that predicts the probability distribution of times when events occur, making it well-suited to transfer learning for medical prediction tasks. Having pretrained on EHR and claims data of up to 55M patient records (9B clinical events), we evaluate performance after finetuning for 19 tasks across two datasets. Task-specific models built using MOTOR improve time-dependent C statistics by 4.6% over state-of-the-art while greatly improving sample efficiency, achieving comparable performance to existing methods using only 5% of available task data.
Data in Knowledge Graphs often represents part of the current state of the real world. Thus, to stay up-to-date the graph data needs to be updated frequently. To utilize information from Knowledge Graphs, many state-of-the-art machine learning approaches use embedding techniques. These techniques typically compute an embedding, i.e., vector representations of the nodes as input for the main machine learning algorithm. If a graph update occurs later on -- specifically when nodes are added or removed -- the training has to be done all over again. This is undesirable, because of the time it takes and also because downstream models which were trained with these embeddings have to be retrained if they change significantly. In this paper, we investigate embedding updates that do not require full retraining and evaluate them in combination with various embedding models on real dynamic Knowledge Graphs covering multiple use cases. We study approaches that place newly appearing nodes optimally according to local information, but notice that this does not work well. However, we find that if we continue the training of the old embedding, interleaved with epochs during which we only optimize for the added and removed parts, we obtain good results in terms of typical metrics used in link prediction. This performance is obtained much faster than with a complete retraining and hence makes it possible to maintain embeddings for dynamic Knowledge Graphs.
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph-structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.