Speech synthesis, voice cloning, and voice conversion techniques present severe privacy and security threats to users of voice user interfaces (VUIs). These techniques transform one or more elements of a speech signal, e.g., identity and emotion, while preserving linguistic information. Adversaries may use advanced transformation tools to mount a spoofing attack, presenting fraudulent biometrics for a legitimate speaker. Conversely, the same techniques have been used to generate privacy-transformed speech by suppressing personally identifiable attributes in the voice signal, thereby achieving anonymization. Prior work has studied the security and privacy vectors in parallel, which raises the alarming possibility that if a benign user can achieve privacy through a transformation, a malicious user can use the same transformation to break security by bypassing the anti-spoofing mechanism. In this paper, we take a step towards balancing two seemingly conflicting requirements: security and privacy. It remains unclear what the vulnerabilities in one domain imply for the other, and what dynamic interactions exist between them. A better understanding of these aspects is crucial for assessing and mitigating the vulnerabilities inherent in VUIs and for building effective defenses. Specifically, (i) we investigate the applicability of current voice anonymization methods by deploying a tandem framework that jointly combines anti-spoofing and authentication models, and we evaluate the performance of these methods; (ii) examining analytical and empirical evidence, we reveal a duality between the two mechanisms, as they offer different ways to achieve the same objective, and we show that leveraging one vector significantly amplifies the effectiveness of the other; and (iii) we demonstrate that, to effectively defend VUIs against potential attacks, it is necessary to investigate the attacks from multiple complementary perspectives (security and privacy).
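As a minimal sketch of how such a tandem decision pipeline might gate access, the toy code below thresholds a spoofing countermeasure (CM) score and a speaker verification (ASV) score and accepts a trial only if both stages accept. The scoring distributions, thresholds, and trial categories are illustrative placeholders, not the models or data used in the paper.

```python
import numpy as np

def tandem_decision(cm_score: float, asv_score: float,
                    cm_threshold: float = 0.5, asv_threshold: float = 0.5) -> bool:
    """Accept a trial only if it passes BOTH the spoofing countermeasure (CM)
    and the speaker verification (ASV) stage.

    cm_score:  higher means "more likely bona fide speech" (placeholder scale).
    asv_score: higher means "more likely the claimed target speaker".
    """
    is_bona_fide = cm_score >= cm_threshold
    is_target = asv_score >= asv_threshold
    return is_bona_fide and is_target

# Toy evaluation over hypothetical trials: bona-fide target speech, spoofed
# speech, and anonymized (privacy-transformed) speech from the legitimate speaker.
rng = np.random.default_rng(0)
trials = {
    "bona_fide_target": (rng.normal(0.8, 0.1, 100), rng.normal(0.8, 0.1, 100)),
    "spoofed":          (rng.normal(0.3, 0.1, 100), rng.normal(0.8, 0.1, 100)),
    "anonymized":       (rng.normal(0.7, 0.1, 100), rng.normal(0.3, 0.1, 100)),
}
for name, (cm, asv) in trials.items():
    accept_rate = np.mean([tandem_decision(c, a) for c, a in zip(cm, asv)])
    print(f"{name:18s} accept rate: {accept_rate:.2f}")
```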
Machine learning (ML) is increasingly being adopted in a wide variety of application domains. Usually, a well-performing ML model relies on a large volume of training data and high-powered computational resources. Such a need for, and use of, huge volumes of data raise serious privacy concerns because of the potential risk of leakage of highly privacy-sensitive information; further, evolving regulatory environments that increasingly restrict access to and use of privacy-sensitive data add significant challenges to fully benefiting from the power of ML in data-driven applications. A trained ML model may also be vulnerable to adversarial attacks such as membership, attribute, or property inference attacks and model inversion attacks. Hence, well-designed privacy-preserving ML (PPML) solutions are critically needed for many emerging applications. Increasingly, significant research efforts from both academia and industry can be seen in PPML, aiming to integrate privacy-preserving techniques into the ML pipeline or into specific algorithms, or to design various PPML architectures. In particular, existing PPML research cross-cuts ML, systems and application design, as well as security and privacy; hence, there is a critical need to understand the state of the art, the related challenges, and a roadmap for future research in the PPML area. In this paper, we systematically review and summarize existing privacy-preserving approaches and propose a Phase, Guarantee, and Utility (PGU) triad-based model to understand and guide the evaluation of various PPML solutions by decomposing their privacy-preserving functionalities. We discuss the unique characteristics and challenges of PPML and outline possible research directions that leverage, as well as benefit, multiple research communities such as ML, distributed systems, and security and privacy.
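As a minimal sketch of how a Phase/Guarantee/Utility decomposition could be encoded to catalogue PPML solutions, the snippet below represents each solution by the pipeline phase it protects, the kind of guarantee it offers, and a qualitative utility impact. The enumeration values and example entries are labels inferred from the abstract for illustration only, not the paper's exact taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):           # where in the ML pipeline protection is applied
    DATA_PREPARATION = "data preparation"
    TRAINING = "model training"
    SERVING = "model serving / inference"

class Guarantee(Enum):       # what kind of privacy guarantee is offered
    DIFFERENTIAL_PRIVACY = "differential privacy"
    CRYPTOGRAPHIC = "cryptographic (HE/MPC/TEE)"
    EMPIRICAL = "empirical / heuristic"

@dataclass
class PPMLSolution:
    name: str
    phase: Phase
    guarantee: Guarantee
    utility_loss: str        # qualitative utility impact, e.g. "low", "moderate"

catalogue = [
    PPMLSolution("DP-SGD training", Phase.TRAINING, Guarantee.DIFFERENTIAL_PRIVACY, "moderate"),
    PPMLSolution("Encrypted inference", Phase.SERVING, Guarantee.CRYPTOGRAPHIC, "low"),
]
for s in catalogue:
    print(f"{s.name}: phase={s.phase.value}, guarantee={s.guarantee.value}, "
          f"utility loss={s.utility_loss}")
```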
Smart Meters (SMs) are a fundamental component of smart grids, but they carry sensitive information about users, such as the occupancy status of houses, and have therefore raised serious concerns about the leakage of consumers' private information. In particular, we focus on real-time privacy threats, i.e., potential attackers that try to infer sensitive data from SM readings in an online fashion. We adopt an information-theoretic privacy measure and show that it effectively limits the performance of any real-time attacker. Using this privacy measure, we propose a general formulation to design a privatization mechanism that provides a target level of privacy by adding a minimal amount of distortion to the SM measurements. In addition, to cope with different applications, a flexible distortion measure is considered. This formulation leads to a general loss function, which is optimized using a deep learning adversarial framework in which two neural networks, referred to as the releaser and the adversary, are trained with opposite goals. An exhaustive empirical study is then performed to validate the performance of the proposed approach on the occupancy detection privacy problem, assuming the attacker has either limited or full access to the training dataset.
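The following is a minimal PyTorch sketch of the releaser/adversary setup described above: the adversary is trained to infer the sensitive attribute (here, a binary occupancy stand-in) from the released signal, while the releaser is trained to defeat that inference subject to a distortion penalty on the measurements. The network sizes, the distortion weight, and the synthetic data are placeholders, not the paper's architecture or dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for SM data: x = consumption window, s = binary occupancy.
N, T = 512, 24
s = torch.randint(0, 2, (N, 1)).float()
x = torch.randn(N, T) + 1.5 * s           # occupancy shifts the load profile

releaser = nn.Sequential(nn.Linear(T, 64), nn.ReLU(), nn.Linear(64, T))
adversary = nn.Sequential(nn.Linear(T, 64), nn.ReLU(), nn.Linear(64, 1))

opt_r = torch.optim.Adam(releaser.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
lam = 1.0                                  # weight of the distortion penalty (illustrative)

for step in range(2000):
    # Adversary step: maximize inference accuracy on the released signal.
    y = releaser(x).detach()
    loss_a = bce(adversary(y), s)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # Releaser step: fool the adversary while limiting distortion ||y - x||^2.
    y = releaser(x)
    distortion = ((y - x) ** 2).mean()
    loss_r = -bce(adversary(y), s) + lam * distortion
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

adv_acc = ((torch.sigmoid(adversary(releaser(x))) > 0.5).float() == s).float().mean().item()
print(f"adversary accuracy on released data: {adv_acc:.2f}")
```

Raising the distortion weight pushes the released signal toward the raw measurements (more utility, less privacy); lowering it lets the releaser distort more aggressively to hide occupancy.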
The explosion of data collection has raised serious privacy concerns among users, owing to the possibility that sharing data may also reveal sensitive information. The main goal of a privacy-preserving mechanism is to prevent a malicious third party from inferring sensitive information while keeping the shared data useful. In this paper, we study this problem in the context of time series data, and of smart meter (SM) power consumption measurements in particular. Although the Mutual Information (MI) between private and released variables has been used as a common information-theoretic privacy measure, it fails to capture the causal time dependencies present in power consumption time series data. To overcome this limitation, we introduce the Directed Information (DI) as a more meaningful privacy measure in the considered setting and propose a novel loss function. The optimization is then performed using an adversarial framework in which two Recurrent Neural Networks (RNNs), referred to as the releaser and the adversary, are trained with opposite goals. Our empirical studies on real-world SM measurement datasets, in the worst-case scenario where the attacker has access to the entire training dataset used by the releaser, validate the proposed method and show the existing trade-offs between privacy and utility.
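Below is a minimal sketch of how a causal, directed-information-style adversary can be instantiated with recurrent networks: at each time step the adversary predicts the sensitive variable from the released sequence observed so far, and the per-step log-loss serves as a tractable proxy for the causally conditioned term in the DI objective. The architectures, loss weighting, and synthetic data are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, T = 256, 48
s = torch.randint(0, 2, (N, 1)).float()                 # sensitive variable
x = torch.randn(N, T, 1) + 1.0 * s.unsqueeze(-1)         # observed time series

class CausalRNN(nn.Module):
    """GRU followed by a linear head; output at step t depends only on inputs up to t."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(1, 32, batch_first=True)
        self.out = nn.Linear(32, 1)
    def forward(self, seq):
        h, _ = self.rnn(seq)
        return self.out(h)

releaser, adversary = CausalRNN(), CausalRNN()
opt_r = torch.optim.Adam(releaser.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
lam = 1.0                                                # distortion weight (illustrative)
s_rep = s.unsqueeze(1).expand(-1, T, -1)                 # same label at every step

for step in range(500):
    # Adversary: per-step prediction of s from the released prefix (DI proxy).
    y = releaser(x).detach()
    loss_a = bce(adversary(y), s_rep)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # Releaser: degrade the causal adversary while bounding distortion.
    y = releaser(x)
    loss_r = -bce(adversary(y), s_rep) + lam * ((y - x) ** 2).mean()
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

print(f"final adversary log-loss (DI proxy): {loss_a.item():.3f}")
```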
Differential privacy (DP) has been widely used to protect the privacy of confidential cyber-physical energy system (CPES) data. However, applying DP without analyzing the utility, privacy, and security requirements can degrade data utility and can also help an attacker conduct integrity attacks (e.g., False Data Injection (FDI)) that leverage the differentially private data. Existing anomaly-detection-based defense strategies against data integrity attacks in DP-based smart grids fail to minimize the attack impact while maximizing data privacy and utility. Addressing this challenge requires incorporating the defense into the design process, which is nontrivial. In this paper, we formulate and develop the defense strategy as part of the design process to investigate data privacy, security, and utility in a DP-based smart grid network. We propose a provable relationship among the DP parameters that enables the defender to design a fault-tolerant system against FDI attacks. To experimentally evaluate and demonstrate the effectiveness of the proposed design approach, we simulate an FDI attack in a DP-based grid. The evaluation indicates that the attack impact can be minimized if the designer calibrates the privacy level according to the proposed correlation of the DP parameters when designing the grid network. Moreover, we analyze the feasibility of the DP mechanism and the QoS of the smart grid network in an adversarial setting. Our analysis suggests that the DP mechanism is feasible compared with existing privacy-preserving mechanisms in the smart grid domain, and that the QoS of differentially private grid applications remains satisfactory in the presence of an adversary.
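As a minimal sketch of the kind of calibration the abstract alludes to, the code below adds Laplace noise scaled by sensitivity/epsilon to meter readings and performs a designer-side check that the noise magnitude the mechanism injects (with high probability) stays within the deviation tolerated by an FDI anomaly detector. The sensitivity, epsilon values, and detector tolerance are illustrative numbers, not the relationship derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(readings: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    """Standard Laplace mechanism: noise scale b = sensitivity / epsilon."""
    b = sensitivity / epsilon
    return readings + rng.laplace(loc=0.0, scale=b, size=readings.shape)

def noise_bound(sensitivity: float, epsilon: float, prob: float = 0.99) -> float:
    """Magnitude t with P(|noise| <= t) = prob for Laplace(scale=b): t = b*ln(1/(1-prob))."""
    b = sensitivity / epsilon
    return b * np.log(1.0 / (1.0 - prob))

sensitivity = 1.0            # kWh change one household can cause (illustrative)
detector_tolerance = 5.0     # deviation the FDI anomaly detector tolerates (illustrative)

for epsilon in (0.1, 0.5, 1.0, 2.0):
    bound = noise_bound(sensitivity, epsilon)
    ok = bound <= detector_tolerance
    print(f"epsilon={epsilon:4.1f}  99%-noise bound={bound:6.2f}  within detector tolerance: {ok}")
```

The trade-off is visible directly: smaller epsilon (stronger privacy) widens the noise envelope, giving an FDI attacker more room to hide injected deviations beneath the detector's tolerance.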
We develop an assisted learning framework for helping organization-level learners improve their learning performance with limited and imbalanced data. Learners at the organization level usually have sufficient computational resources but are subject to stringent collaboration policies and information privacy constraints; their limited and imbalanced data often cause biased inference and sub-optimal decision-making. In our assisted learning framework, an organizational learner purchases an assistance service from a service provider and aims to enhance its model performance within a few assistance rounds. We develop effective stochastic training algorithms for assisted deep learning and assisted reinforcement learning. Unlike existing distributed algorithms that need to frequently transmit gradients or models, our framework allows the learner to share information with the service provider only occasionally, and still achieve a near-oracle model as if all the data were centralized.
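The following is a minimal sketch, under assumptions not stated in the abstract, of what an assistance round could look like: the learner fits a local model, sends only its current residuals (rather than gradients, models, or raw features) to the provider, receives a correction fitted on the provider's own features, and stacks that correction onto its model. This is an illustrative protocol skeleton with toy linear models and synthetic, sample-aligned data, not the paper's algorithm.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Learner's limited data (organization level); provider holds different features
# on the same, aligned samples and never sees the learner's features or labels.
n = 200
X_learner = rng.normal(size=(n, 3))
X_provider = rng.normal(size=(n, 5))
y = (X_learner @ np.array([1.0, -2.0, 0.5])
     + X_provider @ np.array([0.7, 0.0, 0.0, 1.2, -0.3])
     + rng.normal(0, 0.1, n))

learner_model = Ridge().fit(X_learner, y)
prediction = learner_model.predict(X_learner)

provider_models = []
for assistance_round in range(3):          # only a few rounds, no gradient exchange
    residuals = y - prediction             # the only information sent to the provider
    m = Ridge().fit(X_provider, residuals) # provider fits a correction on its features
    provider_models.append(m)
    prediction = prediction + m.predict(X_provider)
    print(f"round {assistance_round}: learner-side MSE = {np.mean((y - prediction) ** 2):.4f}")
```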
Industry 4.0 embodies one of the significant technological changes of this decade. Cyber-physical systems and the Internet of Things are two central technologies in this change; they embed or connect with sensors and actuators and interact with the physical environment. However, such systems-of-systems are subject to additional restrictions in the endeavor to maintain reliability and security when building and interconnecting components into a heterogeneous, multi-domain \textit{Smart-*} systems architecture. This paper presents an application-specific, layer-based approach to offline security analysis, inspired by design science, that merges prior expertise from the relevant domains. Using the example of a Smart-lighting system, we create a dedicated unified taxonomy for the use case and analyze its distributed Smart-* architecture with multiple layer-based models. We derive potential attacks from the system specifications in an iterative and incremental process and discuss the resulting threats and vulnerabilities. Finally, we suggest immediate countermeasures for these potential multi-domain security concerns.
The proliferation of harmful content on online social media platforms has necessitated empirical understandings of experiences of harm online and the development of practices for harm mitigation. Both understandings of harm and approaches to mitigating that harm, often through content moderation, have implicitly embedded frameworks of prioritization: what forms of harm should be researched, how policy on harmful content should be implemented, and how harmful content should be moderated. To aid efforts to better understand the variety of online harms, how they relate to one another, and how to prioritize harms relevant to research, policy, and practice, we present a theoretical framework of severity for harmful online content. Employing a grounded theory approach, we developed a framework of severity based on interviews and card-sorting activities conducted with 52 participants over the course of ten months. Through our analysis, we identified four Types of Harm (physical, emotional, relational, and financial) and eight Dimensions along which the severity of harm can be understood (perspectives, intent, agency, experience, scale, urgency, vulnerability, sphere). We describe how our framework can be applied in both research and policy settings towards deeper understandings of specific forms of harm (e.g., harassment) and of prioritization frameworks when implementing policies encompassing many forms of harm.
In order to preserve the possibility of an Internet that is free at the point of use, attention is turning to new solutions that would allow targeted advertisement delivery based on behavioral information, such as user preferences, without compromising user privacy. Recent explorations in devising such systems either rely on syntactic guarantees like $k$-anonymity, which can be easily subverted by combining the released data with auxiliary information and which ignore the possibility that knowledge of such clusters is itself privacy-invasive, or provide full privacy by moving all data and processing logic to clients, which is prohibitively expensive for both clients and servers. In this work, we devise a new framework called PrivateFetch for building practical ad-delivery pipelines that rely on cryptographic hardness and best-case privacy, rather than syntactic privacy guarantees or reliance on real-world anonymization tools. PrivateFetch utilizes local computation of preferences followed by high-performance single-server private information retrieval (PIR) to ensure that clients can pre-fetch ad content from servers without revealing any of their inherent characteristics to the content provider. For a database of $>1,000,000$ ads, we show that we can deliver $30$ ads to a client in 40 seconds, with total communication costs of 192KB. We also demonstrate the feasibility of PrivateFetch by showing that the monetary cost of running it is less than 1% of average ad revenue. As such, our system is capable of pre-fetching ads for clients based on behavioral and contextual user information before displaying them during a typical browsing session. In addition, while we evaluate PrivateFetch as a private ad-delivery pipeline, the generality of our approach means that it could also be used for other content types.
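Below is a minimal sketch of the client-side prefetch flow described above: preferences are scored locally against ad metadata the client already holds, the top ad indices are selected, and the ad bodies are fetched through a single-server PIR interface. The PIR primitive itself is left as a stub that only models the interface (a real deployment would use an efficient cryptographic scheme), and all names, sizes, and functions here are illustrative, not PrivateFetch's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Server-side ad database: feature vectors (assumed publicly distributable as
# metadata) and the ad payloads that must be fetched privately.
NUM_ADS, DIM = 1000, 16
ad_features = rng.normal(size=(NUM_ADS, DIM))
ad_payloads = [f"ad-payload-{i}" for i in range(NUM_ADS)]

class PIRClientStub:
    """Placeholder for a single-server PIR client. In a real system the query
    would be encrypted so the server learns nothing about `index`; here we only
    model the query/decode interface."""
    def query(self, index: int):
        return {"encrypted_index": index}      # stand-in for a ciphertext
    def decode(self, response):
        return response

def server_answer(query):
    # In real PIR the server computes over ciphertexts without seeing the index.
    return ad_payloads[query["encrypted_index"]]

def prefetch_ads(user_profile: np.ndarray, k: int = 30):
    # 1) Local preference computation: nothing about the profile leaves the client.
    scores = ad_features @ user_profile
    top_indices = np.argsort(scores)[-k:]
    # 2) Fetch each selected ad body via the PIR interface.
    pir = PIRClientStub()
    return [pir.decode(server_answer(pir.query(int(i)))) for i in top_indices]

ads = prefetch_ads(rng.normal(size=DIM))
print(len(ads), "ads prefetched, e.g.", ads[0])
```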
A key challenge of big data analytics is how to collect a large volume of (labeled) data. Crowdsourcing aims to address this challenge by aggregating and estimating high-quality data (e.g., sentiment labels for text) from pervasive clients/users. Existing studies on crowdsourcing focus on designing new methods to improve the quality of the data aggregated from unreliable/noisy clients. However, the security aspects of such crowdsourcing systems remain under-explored to date. We aim to bridge this gap in this work. Specifically, we show that crowdsourcing is vulnerable to data poisoning attacks, in which malicious clients provide carefully crafted data to corrupt the aggregated data. We formulate the proposed data poisoning attacks as an optimization problem that maximizes the error of the aggregated data. Our evaluation results on one synthetic and two real-world benchmark datasets demonstrate that the proposed attacks can substantially increase the estimation errors of the aggregated data. We also propose two defenses to reduce the impact of malicious clients. Our empirical results show that the proposed defenses can substantially reduce the estimation errors caused by the data poisoning attacks.
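The following is a minimal numerical illustration of the threat model: honest workers report noisy versions of a true value, malicious workers report values chosen to pull the aggregate away from it, and a robust aggregator (here, the median, as a generic stand-in for the paper's defenses) limits the damage. The attack below is a simple one-shot heuristic for illustration, not the paper's optimization-based attack.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 10.0
n_honest, n_malicious = 80, 20

honest = true_value + rng.normal(0, 1.0, n_honest)
# Simple poisoning heuristic: malicious clients all report an extreme value
# to maximize the error of a mean-style aggregator.
malicious = np.full(n_malicious, 100.0)
reports = np.concatenate([honest, malicious])

mean_estimate = reports.mean()               # vulnerable aggregation
median_estimate = np.median(reports)         # robust aggregation (defense stand-in)

print(f"true value:          {true_value:.2f}")
print(f"mean under attack:   {mean_estimate:.2f}  (error {abs(mean_estimate - true_value):.2f})")
print(f"median under attack: {median_estimate:.2f}  (error {abs(median_estimate - true_value):.2f})")
```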
As data are increasingly stored in different silos and societies become more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models faces efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continues to thrive in this new reality. Existing FL protocol designs have been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL and a unique taxonomy covering: 1) threat models; 2) poisoning attacks against robustness and the corresponding defenses; 3) inference attacks against privacy and the corresponding defenses, we provide an accessible review of this important topic. We highlight the intuitions, key techniques, and fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.
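As a minimal sketch of the setting this survey covers, the toy code below runs FedAvg-style rounds on a linear-regression task in which one client submits a poisoned model update, and contrasts plain averaging with coordinate-wise median aggregation as one representative robustness defense from the taxonomy. Models, data, and the attack are placeholders chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """One client's local gradient steps on linear regression, starting from the global model."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

dim, n_clients = 5, 10
true_w = rng.normal(size=dim)
clients = []
for _ in range(n_clients):
    X = rng.normal(size=(50, dim))
    clients.append((X, X @ true_w + rng.normal(0, 0.1, 50)))

global_w = np.zeros(dim)
for rnd in range(20):
    updates = [local_update(global_w, X, y) for X, y in clients]
    updates[0] = -10 * np.ones(dim)                # model-poisoning client
    fedavg = np.mean(updates, axis=0)              # vulnerable aggregation
    robust = np.median(updates, axis=0)            # coordinate-wise median defense
    global_w = robust                              # keep the robust model

print("model error with median aggregation:", np.linalg.norm(global_w - true_w).round(3))
print("error a single FedAvg round would have left:", np.linalg.norm(fedavg - true_w).round(3))
```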