To represent the complex individual interactions in the dynamics of disease spread informed by data, the coupling of an epidemiological agent-based model with the ensemble Kalman filter is proposed. The statistical inference of the propagation of a disease by means of ensemble-based data assimilation systems has been studied in previous work. The models used are mostly compartmental models representing the mean field evolution through ordinary differential equations. These techniques allow to monitor the propagation of the infections from data and to estimate several parameters of epidemiological interest. However, there are many important features which are based on the individual interactions that cannot be represented in the mean field equations, such as social network and bubbles, contact tracing, isolating individuals in risk, and social network-based distancing strategies. Agent-based models can describe contact networks at an individual level, including demographic attributes such as age, neighbourhood, household, workplaces, schools, entertainment places, among others. Nevertheless, these models have several unknown parameters which are thus difficult to estimate. In this work, we propose the use of ensemble-based data assimilation techniques to calibrate an agent-based model using daily epidemiological data. This raises the challenge of having to adapt the agent populations to incorporate the information provided by the coarse-grained data. To do this, two stochastic strategies to correct the model predictions are developed. The ensemble Kalman filter with perturbed observations is used for the joint estimation of the state and some key epidemiological parameters. We conduct experiments with an agent based-model designed for COVID-19 and assess the proposed methodology on synthetic data and on COVID-19 daily reports from Ciudad Aut\'onoma de Buenos Aires, Argentina.
Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features by machine and deep learning models for malware classification as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships between API calls. On the other hand, the transformer-based models process sequences as a whole and learn relationships between API calls due to multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that the transformer model with one transformer block layer surpassed the widely used base architecture, LSTM. Moreover, BERT or CANINE, pre-trained transformer models, outperformed in classifying highly imbalanced malware families according to evaluation metrics, F1-score, and AUC score. Furthermore, the proposed bagging-based random transformer forest (RTF), an ensemble of BERT or CANINE, has reached the state-of-the-art evaluation scores on three out of four datasets, particularly state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark dataset.
Linear dynamical systems are canonical models for learning-based control of plants with uncertain dynamics. The setting consists of a stochastic differential equation that captures the state evolution of the plant understudy, while the true dynamics matrices are unknown and need to be learned from the observed data of state trajectory. An important issue is to ensure that the system is stabilized and destabilizing control actions due to model uncertainties are precluded as soon as possible. A reliable stabilization procedure for this purpose that can effectively learn from unstable data to stabilize the system in a finite time is not currently available. In this work, we propose a novel Bayesian learning algorithm that stabilizes unknown continuous-time stochastic linear systems. The presented algorithm is flexible and exposes effective stabilization performance after a remarkably short time period of interacting with the system.
Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have "spurious" instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems which we call competency problems. For example, the word "amazing" on its own should not give information about a sentiment label independent of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.
The literature on the relationship between environmental factors such as climatic changes and natural hazards and human mobility (both internal and international) is characterized by heterogeneous results: some contributions highlight the role of climate changes as a driver of migratory flows, while others underline how this impact is mediated by geographical, economic and the features of the environmental shock. This paper attempts to map this literature, focusing on economics and empirical essays. The paper improves on the existing literature: (a) providing systematic research of the literature through main bibliographic databases, followed by a review and bibliometric analysis of all resulting papers; (b) building a citation-based network of contributions, that hollows to identify four separate clusters of paper; (c) applying meta-analysis methods on the sample of 96 papers released between 2003 and 2020, published in an academic journal, working papers series or unpublished studies, providing 3,904 point estimates of the effect of slow-onset events and 2,065 point estimates of the effect of fast-onset events. Overall, the meta-analytic average effect estimates a small impact of slow- and rapid-onset variables on migration, however positive and significant. When the clustering of the literature is accounted for, however, a significant heterogeneity emerges among the four clusters of papers, giving rise to new evidence on the formation of club-like convergence of literature outcomes.
The task of Knowledge Graph Completion (KGC) aims to automatically infer the missing fact information in Knowledge Graph (KG). In this paper, we take a new perspective that aims to leverage rich user-item interaction data (user interaction data for short) for improving the KGC task. Our work is inspired by the observation that many KG entities correspond to online items in application systems. However, the two kinds of data sources have very different intrinsic characteristics, and it is likely to hurt the original performance using simple fusion strategy. To address this challenge, we propose a novel adversarial learning approach by leveraging user interaction data for the KGC task. Our generator is isolated from user interaction data, and serves to improve the performance of the discriminator. The discriminator takes the learned useful information from user interaction data as input, and gradually enhances the evaluation capacity in order to identify the fake samples generated by the generator. To discover implicit entity preference of users, we design an elaborate collaborative learning algorithms based on graph neural networks, which will be jointly optimized with the discriminator. Such an approach is effective to alleviate the issues about data heterogeneity and semantic complexity for the KGC task. Extensive experiments on three real-world datasets have demonstrated the effectiveness of our approach on the KGC task.
This paper presents our semantic parsing system for the evaluation task of open domain semantic parsing in NLPCC 2019. Many previous works formulate semantic parsing as a sequence-to-sequence(seq2seq) problem. Instead, we treat the task as a sketch-based problem in a coarse-to-fine(coarse2fine) fashion. The sketch is a high-level structure of the logical form exclusive of low-level details such as entities and predicates. In this way, we are able to optimize each part individually. Specifically, we decompose the process into three stages: the sketch classification determines the high-level structure while the entity labeling and the matching network fill in missing details. Moreover, we adopt the seq2seq method to evaluate logical form candidates from an overall perspective. The co-occurrence relationship between predicates and entities contribute to the reranking as well. Our submitted system achieves the exactly matching accuracy of 82.53% on full test set and 47.83% on hard test subset, which is the 3rd place in NLPCC 2019 Shared Task 2. After optimizations for parameters, network structure and sampling, the accuracy reaches 84.47% on full test set and 63.08% on hard test subset(Our code and data are available at //github.com/zechagl/NLPCC2019-Semantic-Parsing).
The main obstacle to weakly supervised semantic image segmentation is the difficulty of obtaining pixel-level information from coarse image-level annotations. Most methods based on image-level annotations use localization maps obtained from the classifier, but these only focus on the small discriminative parts of objects and do not capture precise boundaries. FickleNet explores diverse combinations of locations on feature maps created by generic deep neural networks. It selects hidden units randomly and then uses them to obtain activation scores for image classification. FickleNet implicitly learns the coherence of each location in the feature maps, resulting in a localization map which identifies both discriminative and other parts of objects. The ensemble effects are obtained from a single network by selecting random hidden unit pairs, which means that a variety of localization maps are generated from a single image. Our approach does not require any additional training steps and only adds a simple layer to a standard convolutional neural network; nevertheless it outperforms recent comparable techniques on the Pascal VOC 2012 benchmark in both weakly and semi-supervised settings.
In recent years with the rise of Cloud Computing (CC), many companies providing services in the cloud, are empowered a new series of services to their catalog, such as data mining (DM) and data processing, taking advantage of the vast computing resources available to them. Different service definition proposals have been proposed to address the problem of describing services in CC in a comprehensive way. Bearing in mind that each provider has its own definition of the logic of its services, and specifically of DM services, it should be pointed out that the possibility of describing services in a flexible way between providers is fundamental in order to maintain the usability and portability of this type of CC services. The use of semantic technologies based on the proposal offered by Linked Data (LD) for the definition of services, allows the design and modelling of DM services, achieving a high degree of interoperability. In this article a schema for the definition of DM services on CC is presented, in addition are considered all key aspects of service in CC, such as prices, interfaces, Software Level Agreement, instances or workflow of experimentation, among others. The proposal presented is based on LD, so that it reuses other schemata obtaining a best definition of the service. For the validation of the schema, a series of DM services have been created where some of the best known algorithms such as \textit{Random Forest} or \textit{KMeans} are modeled as services.
Using the 6,638 case descriptions of societal impact submitted for evaluation in the Research Excellence Framework (REF 2014), we replicate the topic model (Latent Dirichlet Allocation or LDA) made in this context and compare the results with factor-analytic results using a traditional word-document matrix (Principal Component Analysis or PCA). Removing a small fraction of documents from the sample, for example, has on average a much larger impact on LDA than on PCA-based models to the extent that the largest distortion in the case of PCA has less effect than the smallest distortion of LDA-based models. In terms of semantic coherence, however, LDA models outperform PCA-based models. The topic models inform us about the statistical properties of the document sets under study, but the results are statistical and should not be used for a semantic interpretation - for example, in grant selections and micro-decision making, or scholarly work-without follow-up using domain-specific semantic maps.
An interactive image retrieval system learns which images in the database belong to a user's query concept, by analyzing the example images and feedback provided by the user. The challenge is to retrieve the relevant images with minimal user interaction. In this work, we propose to solve this problem by posing it as a binary classification task of classifying all images in the database as being relevant or irrelevant to the user's query concept. Our method combines active learning with graph-based semi-supervised learning (GSSL) to tackle this problem. Active learning reduces the number of user interactions by querying the labels of the most informative points and GSSL allows to use abundant unlabeled data along with the limited labeled data provided by the user. To efficiently find the most informative point, we use an uncertainty sampling based method that queries the label of the point nearest to the decision boundary of the classifier. We estimate this decision boundary using our heuristic of adaptive threshold. To utilize huge volumes of unlabeled data we use an efficient approximation based method that reduces the complexity of GSSL from $O(n^3)$ to $O(n)$, making GSSL scalable. We make the classifier robust to the diversity and noisy labels associated with images in large databases by incorporating information from multiple modalities such as visual information extracted from deep learning based models and semantic information extracted from the WordNet. High F1 scores within few relevance feedback rounds in our experiments with concepts defined on AnimalWithAttributes and Imagenet (1.2 million images) datasets indicate the effectiveness and scalability of our approach.