For IoT to reach its full potential, the sharing and reuse of information across different applications and verticals is of paramount importance. However, there is a plethora of IoT platforms using different representations, protocols and interaction patterns. To address this issue, the Fed4IoT project has developed an IoT virtualization platform that, on the one hand, integrates information from many different source platforms and, on the other hand, makes the information required by the respective users available in the target platform of choice. To enable this, information is translated into a common, neutral exchange format. The format of choice is NGSI-LD, which is being standardized by the ETSI Industry Specification Group on Context Information Management (ETSI ISG CIM). ThingVisors are the components that translate the source information to NGSI-LD, which is then delivered to the target platform and translated into the target format. ThingVisors can be implemented by hand, but this requires significant human effort, especially considering the heterogeneity of low-level information produced by a multitude of sensors. Thus, supporting the human developer and, ideally, fully automating the process of extracting and enriching data and translating it to NGSI-LD is a crucial step. Machine learning is a promising approach for this, but it typically requires large amounts of hand-labelled data for training, an effort that is unrealistic in many IoT scenarios. Instead, a programmatic labelling approach called knowledge infusion, which encodes expert knowledge, is used to match a schema or ontology extracted from the data with a target schema or ontology, providing the basis for annotating the data and facilitating the translation to NGSI-LD.
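As a concrete illustration of the neutral exchange format, the sketch below constructs a minimal NGSI-LD entity of the kind a ThingVisor might emit after translating a raw sensor reading; the entity id, type, attribute names and values are illustrative assumptions, not taken from Fed4IoT.

    import json

    # A minimal, illustrative NGSI-LD entity such as a ThingVisor might emit
    # after translating a raw sensor reading into the neutral exchange format.
    # The id, type, attribute names and values are hypothetical.
    entity = {
        "id": "urn:ngsi-ld:ParkingSpot:downtown-042",
        "type": "ParkingSpot",
        "status": {
            "type": "Property",
            "value": "occupied",
            "observedAt": "2021-06-01T12:00:00Z"
        },
        "location": {
            "type": "GeoProperty",
            "value": {"type": "Point", "coordinates": [13.35, 52.51]}
        },
        "@context": [
            "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"
        ]
    }

    print(json.dumps(entity, indent=2))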
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a process commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work, beginning in the 1940s and 1950s, that has led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks. Finally, we discuss current research topics of practical importance.
Industry 4.0 uses a subset of the IoT, named the Industrial IoT (IIoT), to achieve connectivity, interoperability, and decentralization. The deployment of industrial networks rarely considers security by design, but this becomes imperative in smart manufacturing as connectivity increases. The combination of OT and IT infrastructures in Industry 4.0 adds new security threats beyond those of traditional industrial networks. Defence-in-Depth (DiD) strategies tackle the complexity of this problem by providing multiple defence layers, each focusing on a particular set of threats. Additionally, the strict requirements of IIoT networks demand lightweight encryption algorithms. Nevertheless, these ciphers must provide End-to-End (E2E) security, as data passes through intermediate entities or middleboxes before reaching its destination. If compromised, middleboxes could expose sensitive information to potential attackers if it is not encrypted throughout this path. This paper presents an analysis of the most relevant security strategies in Industry 4.0, focusing primarily on DiD. With these in mind, it proposes a combination of DiD, Attribute-Based Encryption (ABE), and object security (i.e., OSCORE) to achieve an E2E security approach. This analysis is a critical first step towards developing more complex and lightweight security frameworks suitable for Industry 4.0.
The Key Information Extraction (KIE) task is becoming increasingly important in natural language processing, but there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved F1-scores of 81.77% on Kleister NDA and 83.57% on Kleister Charity. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.
Commonsense knowledge (CSK) about concepts and their properties is useful for AI applications such as robust chatbots. Prior works like ConceptNet, TupleKB and others compiled large CSK collections, but these are restricted in their expressiveness to subject-predicate-object (SPO) triples with simple concepts for S and monolithic strings for P and O. Also, these projects have prioritized either precision or recall, but hardly reconcile these complementary goals. This paper presents a methodology, called Ascent, to automatically build a large-scale knowledge base (KB) of CSK assertions, with advanced expressiveness and both better precision and recall than prior works. Ascent goes beyond triples by capturing composite concepts with subgroups and aspects, and by refining assertions with semantic facets. The latter are important for expressing temporal and spatial validity of assertions as well as further qualifiers. Ascent combines open information extraction with judicious cleaning using language models. Intrinsic evaluation shows the superior size and quality of the Ascent KB, and an extrinsic evaluation on QA-support tasks underlines the benefits of Ascent.
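To make the added expressiveness concrete, the following sketch models a faceted CSK assertion with a composite subject as a plain data structure; the field names and the example content are hypothetical illustrations in the spirit of Ascent, not entries from the Ascent KB.

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    # Illustrative data structure for a faceted CSK assertion: the subject can
    # be refined into a subgroup (a composite concept), and the core statement
    # carries semantic facets such as temporal validity. Example content is
    # hypothetical, not taken from the Ascent KB.
    @dataclass
    class Assertion:
        subject: str                      # base concept, e.g. "elephant"
        predicate: str
        obj: str
        subgroup: Optional[str] = None    # composite concept, e.g. "baby elephant"
        facets: Dict[str, str] = field(default_factory=dict)

    a = Assertion(
        subject="elephant",
        predicate="drink",
        obj="milk",
        subgroup="baby elephant",
        facets={"duration": "for several months"},
    )
    print(a)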
Knowledge is a formal way of understanding the world, providing human-level cognition and intelligence for next-generation artificial intelligence (AI). One representation of knowledge is the structural relations between entities. Relation Extraction (RE), a sub-task of information extraction and an effective way to automatically acquire this important knowledge, plays a vital role in Natural Language Processing (NLP). Its purpose is to identify semantic relations between entities in natural language text. Numerous previous studies on RE have documented that techniques based on Deep Neural Networks (DNNs) have become the prevailing approach in this research area. In particular, supervised and distantly supervised methods based on DNNs are the most popular and reliable solutions for RE. This article 1) introduces some general concepts, 2) gives a comprehensive overview of DNNs in RE from two points of view: supervised RE, which attempts to improve the standard RE systems, and distant supervision RE, which adopts DNNs to design the sentence encoder and the de-noising method, and 3) covers some novel methods, describes recent trends, and discusses possible future research directions for this task.
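As a brief illustration of the distant supervision setting surveyed here, the sketch below applies the standard heuristic of labelling any sentence that mentions both entities of a knowledge-base triple with that triple's relation; the toy KB and sentences are invented, and real systems add DNN sentence encoders and de-noising on top of such noisy labels.

    # Toy knowledge base of (head, relation, tail) triples and toy sentences.
    kb = [
        ("Barack Obama", "born_in", "Honolulu"),
        ("Paris", "capital_of", "France"),
    ]

    sentences = [
        "Barack Obama was born in Honolulu, Hawaii.",
        "Barack Obama gave a speech in Honolulu last week.",   # noisy positive
        "Paris is the capital and largest city of France.",
    ]

    def distant_label(sentences, kb):
        """Label a sentence with a relation whenever it mentions both entities
        of a KB triple; the labels are noisy, which is why DNN-based distant
        supervision RE needs a de-noising component."""
        labelled = []
        for sent in sentences:
            for head, relation, tail in kb:
                if head in sent and tail in sent:
                    labelled.append((sent, head, tail, relation))
        return labelled

    for example in distant_label(sentences, kb):
        print(example)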
With the rapid development of online education, knowledge tracing (KT) has become a fundamental problem: tracing students' knowledge status and predicting their performance on new questions. Questions are often numerous in online education systems and are always associated with far fewer skills. However, the previous literature fails to involve question information together with high-order question-skill correlations, mostly because of data sparsity and multi-skill problems. From the model perspective, previous models can hardly capture the long-term dependencies in a student's exercise history, and cannot model student-question and student-skill interactions in a consistent way. In this paper, we propose a Graph-based Interaction model for Knowledge Tracing (GIKT) to tackle the above problems. More specifically, GIKT utilizes a graph convolutional network (GCN) to substantially incorporate question-skill correlations via embedding propagation. Besides, considering that relevant questions are usually scattered throughout the exercise history, and that questions and skills are just different instantiations of knowledge, GIKT generalizes the degree of a student's mastery of a question to the interactions between the student's current state, the student's related historical exercises, the target question, and related skills. Experiments on three datasets demonstrate that GIKT achieves new state-of-the-art performance, with at least 1% absolute AUC improvement.
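A minimal sketch of the kind of GCN embedding propagation GIKT builds on, applied to a toy question-skill bipartite graph; the adjacency matrix, embedding sizes and single-layer update are illustrative assumptions, not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy bipartite question-skill graph: 4 questions x 3 skills.
    # rel[q, s] = 1 if question q is tagged with skill s.
    rel = np.array([[1, 0, 0],
                    [1, 1, 0],
                    [0, 1, 1],
                    [0, 0, 1]], dtype=float)

    n_q, n_s = rel.shape
    n = n_q + n_s

    # Build the full adjacency with self-loops and symmetric normalisation,
    # then apply one GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
    A = np.zeros((n, n))
    A[:n_q, n_q:] = rel
    A[n_q:, :n_q] = rel.T
    A_hat = A + np.eye(n)
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    dim = 8
    H = rng.normal(size=(n, dim))      # initial question/skill embeddings
    W = rng.normal(size=(dim, dim))    # layer weights

    H_next = np.maximum(A_norm @ H @ W, 0.0)  # propagated embeddings
    print(H_next.shape)  # (7, 8)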
In relation extraction for knowledge-based question answering, searching from one entity to another entity via a single relation is called "one hop". In related work, an exhaustive search over all one-hop, two-hop, and so on up to max-hop relations in the knowledge graph is necessary but expensive. Therefore, the number of hops is generally restricted to two or three. In this paper, we propose UHop, an unrestricted-hop framework which relaxes this restriction by using a transition-based search framework in place of the relation-chain-based one. We conduct experiments on conventional 1- and 2-hop questions as well as lengthy questions, using datasets such as WebQSP, PathQuestion, and Grid World. Results show that the proposed framework provides the ability to halt, works well with state-of-the-art models, achieves competitive performance without exhaustive searches, and opens up a performance gap for long relation paths.
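The sketch below illustrates the transition-based idea behind an unrestricted-hop search: at each step the best-scoring outgoing relation is chosen and a separate decision is made on whether to halt, so no fixed hop limit or exhaustive enumeration of relation chains is needed; the toy graph and scoring functions are stand-ins, not the UHop models.

    # Toy knowledge graph: entity -> list of (relation, next_entity) edges.
    kg = {
        "Obama":    [("born_in", "Honolulu"), ("spouse", "Michelle Obama")],
        "Honolulu": [("located_in", "Hawaii")],
        "Hawaii":   [],
    }

    def score_relation(question, relation):
        # Stand-in for a learned matcher between the question and a candidate
        # relation; here we simply count overlapping tokens.
        return sum(tok in question.lower() for tok in relation.split("_"))

    def uhop_search(question, start, kg, max_steps=10):
        entity, path = start, []
        for _ in range(max_steps):          # safety bound only, not a hop limit
            candidates = kg.get(entity, [])
            if not candidates:
                break
            relation, nxt = max(candidates,
                                key=lambda c: score_relation(question, c[0]))
            # Termination decision: halt once extending the path no longer
            # matches the question better than stopping.
            if score_relation(question, relation) <= 0:
                break
            path.append((relation, nxt))
            entity = nxt
        return path

    print(uhop_search("where was Obama born, and in which state is that city located",
                      "Obama", kg))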
In this paper, we propose a span-based model combined with syntactic information for n-ary open information extraction. The advantage of the span model is that it can leverage span-level features, which are difficult to use in token-based BIO tagging methods. We also improve the previous bootstrapping method for constructing the training corpus. Experiments show that our model outperforms previous open information extraction systems. Our code and data are publicly available at https://github.com/zhanjunlang/Span_OIE
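The following sketch shows the span enumeration and span-level feature construction that distinguish a span-based extractor from token-level BIO tagging; the embeddings, feature layout and scoring head are illustrative assumptions, not the released Span_OIE code.

    import numpy as np

    rng = np.random.default_rng(0)

    tokens = ["the", "company", "acquired", "a", "small", "startup"]
    dim = 16
    emb = rng.normal(size=(len(tokens), dim))   # stand-in token embeddings
    max_span_len = 3

    def span_feature(i, j):
        # Span-level feature: boundary embeddings plus the mean of the span,
        # something a token-level BIO tagger cannot express directly.
        return np.concatenate([emb[i], emb[j], emb[i:j + 1].mean(axis=0)])

    spans, feats = [], []
    for i in range(len(tokens)):
        for j in range(i, min(i + max_span_len, len(tokens))):
            spans.append((i, j))
            feats.append(span_feature(i, j))

    feats = np.stack(feats)
    w = rng.normal(size=feats.shape[1])         # stand-in scoring head
    scores = feats @ w                          # one score per candidate span
    best = spans[int(scores.argmax())]
    print(f"{len(spans)} candidate spans; best-scoring span: {tokens[best[0]:best[1] + 1]}")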
In recent years, with the rise of Cloud Computing (CC), many companies providing services in the cloud have added a new series of services to their catalogues, such as data mining (DM) and data processing, taking advantage of the vast computing resources available to them. Several service definition proposals have been put forward to address the problem of describing CC services in a comprehensive way. Bearing in mind that each provider has its own definition of the logic of its services, and specifically of DM services, the ability to describe services in a flexible way across providers is fundamental to maintaining the usability and portability of this type of CC service. The use of semantic technologies based on the Linked Data (LD) proposal for the definition of services allows DM services to be designed and modelled with a high degree of interoperability. In this article, a schema for the definition of DM services on CC is presented; it also considers all key aspects of a CC service, such as prices, interfaces, Service Level Agreements, instances and experimentation workflows, among others. The proposal is based on LD, so it reuses other schemata to obtain a better definition of the service. To validate the schema, a series of DM services has been created in which some of the best-known algorithms, such as Random Forest and KMeans, are modelled as services.
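A minimal sketch, using rdflib, of how a DM service might be described as Linked Data; the namespace, class and property names below are hypothetical placeholders rather than the schema actually proposed in the article.

    from rdflib import Graph, Literal, Namespace, RDF, RDFS, XSD

    # Hypothetical vocabulary standing in for the article's schema; the class
    # and property names below are placeholders, not the proposed terms.
    EX = Namespace("http://example.org/dm-service#")

    g = Graph()
    g.bind("ex", EX)

    svc = EX["kmeans-service"]
    g.add((svc, RDF.type, EX.DataMiningService))
    g.add((svc, RDFS.label, Literal("KMeans clustering service", lang="en")))
    g.add((svc, EX.algorithm, Literal("KMeans")))
    g.add((svc, EX.interface, Literal("REST")))
    g.add((svc, EX.pricePerHour, Literal("0.05", datatype=XSD.decimal)))

    print(g.serialize(format="turtle"))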
Internet of Things (IoT) infrastructure within the physical library environment is the basis for an integrative, hybrid approach to digital resource recommenders. The IoT infrastructure provides mobile, dynamic wayfinding support for items in the collection, including features for location-based recommendations. The evaluation and analysis herein clarified the nature of users' requests for recommendations based on their location, and described the subject areas of the library for which users request recommendations. The results indicated that users of IoT-based recommendations are interested in a broad distribution of subjects, with a short-head distribution in this collection in American and English Literature. A long-tail finding showed a diversity of topics recommended to users in the library book stacks through IoT-powered recommendations.