Mobile apps are extensively involved in cyber-crimes. Some apps are malware that compromises users' devices, while others lead to privacy leakage. Beyond these, there also exist apps that directly profit from victims through deception, threats, or other criminal actions. We name such apps CULPRITWARE. They have become an emerging threat in recent years. However, the characteristics and the ecosystem of CULPRITWARE remain mysterious. This paper takes the first step towards systematically studying CULPRITWARE and its ecosystem. Specifically, we first characterize CULPRITWARE by categorizing them and comparing them with benign apps and malware. The result shows that CULPRITWARE have unique features; e.g., their usage of app generators (25.27%) deviates from that of benign apps (5.08%) and malware (0.43%). Such a discrepancy can be used to distinguish CULPRITWARE from benign apps and malware. We then examine the structure of the ecosystem by revealing its four participating entities (i.e., developer, agent, operator and reaper) and its workflow. After that, we further characterize the ecosystem by studying these participating entities. Our investigation shows that the majority of CULPRITWARE (at least 52.08%) are propagated through social media rather than official app markets, and most CULPRITWARE (96%) indirectly rely on covert fourth-party payment services to transfer the profits. Our findings shed light on the ecosystem and can help the community and law enforcement authorities mitigate these threats. We will release the source code of our tools to engage the community.
AutoDRIVE is envisioned to be an integrated research and education platform for scaled autonomous vehicles and related applications. This work is a stepping stone towards realizing such a platform. In particular, it introduces the AutoDRIVE Simulator, a high-fidelity simulator for scaled autonomous vehicles. The proposed simulation ecosystem is developed atop the Unity game engine and exploits its features to simulate realistic system dynamics and render photorealistic graphics. It comprises a scaled vehicle model equipped with a comprehensive sensor suite for redundant perception, a set of actuators for constrained motion control, and a fully functional lighting system for illumination and signaling. It also provides a modular environment development kit, which comprises various environment modules that support reconfigurable construction of the scene. Additionally, the simulator features a communication bridge that exposes an interface to autonomous driving software stacks developed independently by users. This work describes some of the prominent components of the simulation system along with some of the key features it offers in order to accelerate education and research aimed at autonomous driving.
Although many software development projects have moved their developer discussion forums to generic platforms such as Stack Overflow, Eclipse has been steadfast in hosting its self-supported community forums. While recent studies show that forums share similarities with generic communication channels, it is unknown how project-specific forums are utilized. In this paper, we analyze 832,058 forum threads and their linkages to four systems with 2,170 connected contributors to understand participation, content and sentiment. Results show that Seniors are the most active participants in responding to bug-related and non-bug-related threads in the forums (66.1% and 45.5%, respectively), and that sentiment among developers is inconsistent during knowledge sharing within Eclipse. We recommend that users identify appropriate topics and ask questions in a positive, procedural way when joining forums. For developers, preparing a project-specific forum could be an option to bridge communication between members. Irrespective of the popularity of Stack Overflow, we argue that project-specific forum initiatives, such as GitHub Discussions, are needed to cultivate a community and its ecosystem.
Reasoning about the future -- understanding how decisions made in the present affect outcomes in the future -- is one of the central challenges for reinforcement learning (RL), especially in highly stochastic or partially observable environments. While predicting the future directly is hard, in this work we introduce a method that allows an agent to "look into the future" without explicitly predicting it. Namely, we propose to allow an agent, during its training on past experience, to observe what \emph{actually} happened in the future at that time, while enforcing an information bottleneck to prevent the agent from relying too heavily on this privileged information. This gives our agent the opportunity to utilize rich and useful information about the future trajectory dynamics in addition to the present. Our method, Policy Gradients Incorporating the Future (PGIF), is easy to implement and versatile, being applicable to virtually any policy gradient algorithm. We apply the proposed method to a number of off-the-shelf RL algorithms and show that PGIF achieves higher reward faster in a variety of online and offline RL domains, as well as in sparse-reward and partially observable environments.
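To make the idea more concrete, below is a minimal PyTorch sketch of how a PGIF-style bottleneck could be wired into a vanilla policy gradient: a hindsight encoder sees the future segment of a trajectory during training and produces a latent, while a KL term against a present-only prior discourages over-reliance on that privileged signal. The module names, network sizes and the bottleneck weight `beta` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Categorical, kl_divergence

class FutureEncoder(nn.Module):
    """q(z | s_t, future): used only during training, when the future is available."""
    def __init__(self, obs_dim, fut_dim, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + fut_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 2 * z_dim))
    def forward(self, obs, future):
        mu, log_std = self.net(torch.cat([obs, future], dim=-1)).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5, 2).exp())

class Prior(nn.Module):
    """p(z | s_t): conditioned only on the present, so it remains usable at test time."""
    def __init__(self, obs_dim, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 2 * z_dim))
    def forward(self, obs):
        mu, log_std = self.net(obs).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5, 2).exp())

class Policy(nn.Module):
    """pi(a | s_t, z): acts on the present state plus the bottlenecked latent."""
    def __init__(self, obs_dim, z_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + z_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))
    def forward(self, obs, z):
        return Categorical(logits=self.net(torch.cat([obs, z], dim=-1)))

def pgif_loss(obs, future, actions, returns, enc, prior, policy, beta=1e-2):
    q = enc(obs, future)                      # posterior sees what actually happened
    p = prior(obs)                            # prior sees only the present
    z = q.rsample()                           # reparameterized latent sample
    logp = policy(obs, z).log_prob(actions)
    pg = -(logp * returns).mean()             # REINFORCE-style policy-gradient term
    ib = kl_divergence(q, p).sum(-1).mean()   # information bottleneck on the future
    return pg + beta * ib
```

At evaluation time only the prior and the policy would be used, since the future is unavailable; the coefficient `beta` controls how much hindsight information is allowed to leak into the latent.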
Sharing sensitive data is vital in enabling many modern data analysis and machine learning tasks. However, current methods for data release are insufficiently accurate or granular to provide meaningful utility, and they carry a high risk of deanonymization or membership inference attacks. In this paper, we propose a differentially private synthetic data generation solution with a focus on the compelling domain of location data. We present two methods with high practical utility for generating synthetic location data from real locations, both of which protect the existence and true location of each individual in the original dataset. Our first, partitioning-based approach introduces a novel method for privately generating point data using kernel density estimation, in addition to employing private adaptations of classic statistical techniques, such as clustering, for private partitioning. Our second, network-based approach incorporates public geographic information, such as the road network of a city, to constrain the bounds of synthetic data points and hence improve the accuracy of the synthetic data. Both methods satisfy the requirements of differential privacy while also enabling accurate generation of synthetic data that aims to preserve the distribution of the real locations. We conduct experiments using three large-scale location datasets to show that the proposed solutions generate synthetic location data with high utility and strong similarity to the real datasets. We highlight some practical applications for our work by applying our synthetic data to a range of location analytics queries, and we demonstrate that our synthetic data produces answers nearly identical to those obtained with real data. Our results show that the proposed approaches are practical solutions for sharing and analyzing sensitive location data privately.
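As a rough illustration of this flavor of private point generation (hedged, and not the paper's exact algorithm), the sketch below builds a Laplace-noised 2D histogram of the real locations, applies Gaussian smoothing as a KDE-like step, and samples synthetic points from the resulting private density. The grid size, privacy budget `epsilon`, bandwidth and function name are placeholder choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dp_synthetic_points(lat, lon, epsilon=1.0, bins=64, bandwidth=1.0, n_out=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Histogram of real points; each individual contributes one count,
    # so the L1 sensitivity of the histogram is 1.
    hist, lat_edges, lon_edges = np.histogram2d(lat, lon, bins=bins)
    noisy = hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)  # Laplace mechanism
    noisy = np.clip(noisy, 0, None)
    density = gaussian_filter(noisy, sigma=bandwidth)  # KDE-style smoothing (post-processing)
    density = density / density.sum()
    # Sample grid cells proportionally to the private density, then jitter
    # each synthetic point uniformly within its cell.
    idx = rng.choice(density.size, size=n_out, p=density.ravel())
    i, j = np.unravel_index(idx, density.shape)
    syn_lat = lat_edges[i] + rng.random(n_out) * np.diff(lat_edges)[i]
    syn_lon = lon_edges[j] + rng.random(n_out) * np.diff(lon_edges)[j]
    return np.column_stack([syn_lat, syn_lon])
```

Because the smoothing and sampling operate only on the noised histogram, they are post-processing steps and do not consume additional privacy budget under standard differential-privacy composition.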
Debugging is an essential process that accounts for a large share of the development effort, being a relentless quest for the offending code through tracing, inspection and iterative running sessions. Probably every developer has been in a situation with a clear wish to rewind time just for a while, only to retry some actions differently, instead of restarting the entire session. Well, the genie that fulfills such a wish is known as a reverse debugger. The inherent technical complexity of reverse debuggers makes them very hard to implement, while the execution overhead they impose makes them less attractive for adoption. Only a few are available, most of them offline tools that work on recorded sessions from previous runs. We consider live reverse debuggers both challenging and promising, since they can fit into existing forward debuggers, and we developed the first live reverse debugger on top of LLDB, discussing our implementation approach in detail.
Merging at highway on-ramps while interacting with other human-driven vehicles is challenging for autonomous vehicles (AVs). An efficient route to addressing this challenge is to explore and exploit knowledge of the interaction process from human demonstrations. However, it is unclear what information (or environmental states) human drivers utilize to guide their behavior throughout the merging process. This paper provides a quantitative analysis and evaluation of merging behavior at highway on-ramps with congested traffic over a volume of time and space. Two types of social interaction scenarios are considered based on the social preferences of surrounding vehicles: courteous and rude. The significance of environmental states in characterizing the interactive merging process is empirically analyzed based on the real-world INTERACTION dataset. Experimental results reveal two fundamental mechanisms in the merging process: 1) human drivers select different states to make sequential decisions at different moments of task execution, and 2) the social preference of surrounding vehicles impacts which variables are selected for making decisions. This implies that an efficient decision-making design should filter out irrelevant information while considering social preference to achieve comparable human-level performance. These findings shed light on developing new decision-making approaches for AVs.
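One simple way to approximate this kind of per-phase state analysis (a hedged sketch, not the paper's method) is to score each candidate state variable by its mutual information with the driver's action, separately for each phase of the merge. The column names and phase labels below are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical candidate state variables extracted per timestep.
STATE_COLS = ["gap_front", "gap_rear", "rel_speed_lag", "rel_speed_lead", "ttc_lag"]

def rank_states_per_phase(df: pd.DataFrame):
    """df: one row per timestep with the state columns, an 'action' label
    (e.g., accelerate / yield / merge) and a 'phase' label of the merge."""
    ranking = {}
    for phase, group in df.groupby("phase"):
        scores = mutual_info_classif(group[STATE_COLS], group["action"], random_state=0)
        ranking[phase] = sorted(zip(STATE_COLS, scores), key=lambda kv: -kv[1])
    return ranking
```

Running such a ranking separately for courteous and rude surrounding vehicles would expose how social preference shifts which states matter at each decision moment.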
With the availability of large datasets and ever-increasing computing power, there has been a growing use of data-driven artificial intelligence systems, which have shown their potential for successful application in diverse areas. However, many of these systems are not able to provide information about the rationale behind their decisions to their users. Lack of understanding of such decisions can be a major drawback, especially in critical domains such as those related to cybersecurity. In light of this problem, in this paper we make three contributions: (i) proposal and discussion of desiderata for the explanation of outputs generated by AI-based cybersecurity systems; (ii) a comparative analysis of approaches in the literature on Explainable Artificial Intelligence (XAI) under the lens of both our desiderata and further dimensions that are typically used for examining XAI approaches; and (iii) a general architecture that can serve as a roadmap for guiding research efforts towards the development of explainable AI-based cybersecurity systems -- at its core, this roadmap proposes combinations of several research lines in a novel way towards tackling the unique challenges that arise in this context.
In the past few decades, artificial intelligence (AI) technology has experienced swift development, changing everyone's daily life and profoundly altering the course of human society. The intention behind developing AI is to benefit humans by reducing human labor, bringing everyday convenience to human lives, and promoting social good. However, recent research and AI applications show that AI can cause unintentional harm to humans, such as making unreliable decisions in safety-critical scenarios or undermining fairness by inadvertently discriminating against one group. Thus, trustworthy AI has attracted immense attention recently; it requires careful consideration to avoid the adverse effects that AI may bring to humans, so that humans can fully trust and live in harmony with AI technologies. Recent years have witnessed a tremendous amount of research on trustworthy AI. In this survey, we review trustworthy AI from a computational perspective to help readers understand the latest technologies for achieving it. Trustworthy AI is a large and complex area involving various dimensions. In this work, we focus on six of the most crucial dimensions for achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being. For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems. We also discuss the harmonious and conflicting interactions among different dimensions and outline potential directions for future research on trustworthy AI.
As data are increasingly stored in different silos and societies become more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models faces efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continues to thrive in this new reality. Existing FL protocol designs have been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL and a unique taxonomy covering: 1) threat models; 2) poisoning attacks against robustness and the corresponding defenses; 3) inference attacks against privacy and the corresponding defenses, we provide an accessible review of this important topic. We highlight the intuitions, key techniques and fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.
Methods proposed in the literature towards continual deep learning typically operate in a task-based sequential learning setup. A sequence of tasks is learned, one at a time, with all data of the current task available but not of previous or future tasks. Task boundaries and identities are known at all times. This setup, however, is rarely encountered in practical applications. We therefore investigate how to transform continual learning to an online setup. We develop a system that keeps learning over time in a streaming fashion, with data distributions gradually changing and without the notion of separate tasks. To this end, we build on the work on Memory Aware Synapses and show how this method can be made online by providing a protocol to decide i) when to update the importance weights, ii) which data to use to update them, and iii) how to accumulate the importance weights at each update step. Experimental results show the validity of the approach in the context of two applications: (self-)supervised learning of a face recognition model by watching soap series, and teaching a robot to avoid collisions.
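A minimal sketch of such an online protocol, assuming a PyTorch model, is given below: importance weights are the accumulated gradients of the squared L2 norm of the network output (the MAS recipe), recomputed from a small buffer of recent samples whenever the loss plateaus, and accumulated with a decaying average. The plateau test, buffer size and decay factor are illustrative assumptions rather than the exact choices of the paper.

```python
import torch

def mas_importance(model, samples):
    """Accumulate |d ||f(x)||^2 / d theta| over a batch of unlabeled samples."""
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x in samples:
        model.zero_grad()
        out = model(x.unsqueeze(0))
        out.pow(2).sum().backward()          # gradient of the squared L2 output norm
        for n, p in model.named_parameters():
            if p.grad is not None:
                importance[n] += p.grad.abs()
    model.zero_grad()
    return {n: v / max(len(samples), 1) for n, v in importance.items()}

class OnlineMAS:
    """Decides when to recompute importance (loss plateau), which data to use
    (a small buffer of recent samples) and how to accumulate (decaying average)."""
    def __init__(self, model, window=100, decay=0.5, reg_strength=1.0):
        self.model, self.window, self.decay, self.lam = model, window, decay, reg_strength
        self.losses, self.buffer = [], []
        self.omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        self.anchor = {n: p.detach().clone() for n, p in model.named_parameters()}

    def observe(self, x, loss_value):
        self.losses.append(float(loss_value))
        self.buffer = (self.buffer + [x.detach()])[-self.window:]
        if len(self.losses) >= self.window and self._plateau():
            new = mas_importance(self.model, self.buffer)          # which data: the buffer
            self.omega = {n: self.decay * self.omega[n] + new[n]   # how: decaying average
                          for n in new}
            self.anchor = {n: p.detach().clone() for n, p in self.model.named_parameters()}
            self.losses = []

    def _plateau(self):
        # When: the recent half of the loss window is no longer improving.
        recent = torch.tensor(self.losses[-self.window:])
        half = self.window // 2
        return bool(recent[half:].mean() >= recent[:half].mean() - 1e-3)

    def penalty(self):
        # Quadratic penalty keeping important parameters close to their anchors.
        return self.lam * sum((self.omega[n] * (p - self.anchor[n]).pow(2)).sum()
                              for n, p in self.model.named_parameters())
```

In a streaming training loop, `observe` would be called after every step with the current sample and loss, and `penalty()` would be added to the task loss before backpropagation.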