There is no unified definition of Data anomalies, which refers to the specific data operation mode that may violate the consistency of the database. Known data anomalies include Dirty Write, Dirty Read, Non-repeatable Read, Phantom, Read Skew and Write Skew, etc. In order to improve the efficiency of concurrency control algorithms, data anomalies are also used to define the isolation levels, because the weak isolation level can improve the efficiency of transaction processing systems. This paper systematically studies the data anomalies and the corresponding isolation levels. We report twenty-two new data anomalies that other papers have not reported, and all data anomalies are classified miraculously. Based on the classification of data anomalies, two new isolation levels systems with different granularity are proposed, which reveals the rule of defining isolation levels based on data anomalies and makes the cognition of data anomalies and isolation levels more concise.
We propose and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
Out-of-distribution (OOD) detection is critical to ensuring the reliability and safety of machine learning systems. For instance, in autonomous driving, we would like the driving system to issue an alert and hand over the control to humans when it detects unusual scenes or objects that it has never seen before and cannot make a safe decision. This problem first emerged in 2017 and since then has received increasing attention from the research community, leading to a plethora of methods developed, ranging from classification-based to density-based to distance-based ones. Meanwhile, several other problems are closely related to OOD detection in terms of motivation and methodology. These include anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). Despite having different definitions and problem settings, these problems often confuse readers and practitioners, and as a result, some existing studies misuse terms. In this survey, we first present a generic framework called generalized OOD detection, which encompasses the five aforementioned problems, i.e., AD, ND, OSR, OOD detection, and OD. Under our framework, these five problems can be seen as special cases or sub-tasks, and are easier to distinguish. Then, we conduct a thorough review of each of the five areas by summarizing their recent technical developments. We conclude this survey with open challenges and potential research directions.
One-Class Classification (OCC) is a special case of multi-class classification, where data observed during training is from a single positive class. The goal of OCC is to learn a representation and/or a classifier that enables recognition of positively labeled queries during inference. This topic has received considerable amount of interest in the computer vision, machine learning and biometrics communities in recent years. In this article, we provide a survey of classical statistical and recent deep learning-based OCC methods for visual recognition. We discuss the merits and drawbacks of existing OCC approaches and identify promising avenues for research in this field. In addition, we present a discussion of commonly used datasets and evaluation metrics for OCC.
Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems.
Edge intelligence refers to a set of connected systems and devices for data collection, caching, processing, and analysis in locations close to where data is captured based on artificial intelligence. The aim of edge intelligence is to enhance the quality and speed of data processing and protect the privacy and security of the data. Although recently emerged, spanning the period from 2011 to now, this field of research has shown explosive growth over the past five years. In this paper, we present a thorough and comprehensive survey on the literature surrounding edge intelligence. We first identify four fundamental components of edge intelligence, namely edge caching, edge training, edge inference, and edge offloading, based on theoretical and practical results pertaining to proposed and deployed systems. We then aim for a systematic classification of the state of the solutions by examining research results and observations for each of the four components and present a taxonomy that includes practical problems, adopted techniques, and application goals. For each category, we elaborate, compare and analyse the literature from the perspectives of adopted techniques, objectives, performance, advantages and drawbacks, etc. This survey article provides a comprehensive introduction to edge intelligence and its application areas. In addition, we summarise the development of the emerging research field and the current state-of-the-art and discuss the important open issues and possible theoretical and technical solutions.
Corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds) is a critical task in knowledge discovery. It may facilitate numerous downstream applications, such as information extraction, taxonomy induction, question answering, and web search. To discover new entities in an expanded set, previous approaches either make one-time entity ranking based on distributional similarity, or resort to iterative pattern-based bootstrapping. The core challenge for these methods is how to deal with noisy context features derived from free-text corpora, which may lead to entity intrusion and semantic drifting. In this study, we propose a novel framework, SetExpan, which tackles this problem, with two techniques: (1) a context feature selection method that selects clean context features for calculating entity-entity distributional similarity, and (2) a ranking-based unsupervised ensemble method for expanding entity set based on denoised context features. Experiments on three datasets show that SetExpan is robust and outperforms previous state-of-the-art methods in terms of mean average precision.
Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation but the underlying properties of datasets are discovered on an ad-hoc basis as errors occur. However, understanding the properties of the data is crucial in perfecting models. In this paper we analyse exactly which characteristics of a dataset best determine how difficult that dataset is for the task of text classification. We then propose an intuitive measure of difficulty for text classification datasets which is simple and fast to calculate. We show that this measure generalises to unseen data by comparing it to state-of-the-art datasets and results. This measure can be used to analyse the precise source of errors in a dataset and allows fast estimation of how difficult a dataset is to learn. We searched for this measure by training 12 classical and neural network based models on 78 real-world datasets, then use a genetic algorithm to discover the best measure of difficulty. Our difficulty-calculating code ( //github.com/Wluper/edm ) and datasets ( //data.wluper.com ) are publicly available.
In recent years with the rise of Cloud Computing (CC), many companies providing services in the cloud, are empowered a new series of services to their catalog, such as data mining (DM) and data processing, taking advantage of the vast computing resources available to them. Different service definition proposals have been proposed to address the problem of describing services in CC in a comprehensive way. Bearing in mind that each provider has its own definition of the logic of its services, and specifically of DM services, it should be pointed out that the possibility of describing services in a flexible way between providers is fundamental in order to maintain the usability and portability of this type of CC services. The use of semantic technologies based on the proposal offered by Linked Data (LD) for the definition of services, allows the design and modelling of DM services, achieving a high degree of interoperability. In this article a schema for the definition of DM services on CC is presented, in addition are considered all key aspects of service in CC, such as prices, interfaces, Software Level Agreement, instances or workflow of experimentation, among others. The proposal presented is based on LD, so that it reuses other schemata obtaining a best definition of the service. For the validation of the schema, a series of DM services have been created where some of the best known algorithms such as \textit{Random Forest} or \textit{KMeans} are modeled as services.
Sentiment analysis is essential in many real-world applications such as stance detection, review analysis, recommendation system, and so on. Sentiment analysis becomes more difficult when the data is noisy and collected from social media. India is a multilingual country; people use more than one languages to communicate within themselves. The switching in between the languages is called code-switching or code-mixing, depending upon the type of mixing. This paper presents overview of the shared task on sentiment analysis of code-mixed data pairs of Hindi-English and Bengali-English collected from the different social media platform. The paper describes the task, dataset, evaluation, baseline and participant's systems.
We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk's latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk's representations can provide $F_1$ scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk's representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection.