While the entire field of privacy preserving data analytics is focused on the privacy-utility tradeoff, recent work has shown that privacy preserving data publishing can introduce different levels of utility across different population groups. It is important to understand this new tradeoff between privacy and equity as privacy technology is being deployed in situations where the data products will be used for research and policy making. Will marginal populations see disproportionately less utility from privacy technology? If there is an inequity how can we address it?
Garg, Goldwasser and Vasudevan (Eurocrypt 2020) invented the notion of deletion-compliance to formally model the "right to be forgotten", a concept that confers individuals more control over their digital data. A requirement of deletion-compliance is strong privacy for the deletion requesters since no outside observer must be able to tell if deleted data was ever present in the first place. Naturally, many real world systems where information can flow across users are automatically ruled out. The main thesis of this paper is that deletion-compliance is a standalone notion, distinct from privacy. We present an alternative definition that meaningfully captures deletion-compliance without any privacy implications. This allows broader class of data collectors to demonstrate compliance to deletion requests and to be paired with various notions of privacy. Our new definition has several appealing properties: - It is implied by the stronger definition of Garg et al. under natural conditions, and is equivalent when we add a privacy requirement. - It is naturally composable with minimal assumptions. - Its requirements are met by data structure implementations that do not reveal the order of operations, a concept known as history-independence. Along the way, we discuss the many challenges that remain in providing a universal definition of compliance to the "right to be forgotten."
We consider a platform's problem of collecting data from privacy sensitive users to estimate an underlying parameter of interest. We formulate this question as a Bayesian-optimal mechanism design problem, in which an individual can share her (verifiable) data in exchange for a monetary reward or services, but at the same time has a (private) heterogeneous privacy cost which we quantify using differential privacy. We consider two popular differential privacy settings for providing privacy guarantees for the users: central and local. In both settings, we establish minimax lower bounds for the estimation error and derive (near) optimal estimators for given heterogeneous privacy loss levels for users. Building on this characterization, we pose the mechanism design problem as the optimal selection of an estimator and payments that will elicit truthful reporting of users' privacy sensitivities. Under a regularity condition on the distribution of privacy sensitivities we develop efficient algorithmic mechanisms to solve this problem in both privacy settings. Our mechanism in the central setting can be implemented in time $\mathcal{O}(n \log n)$ where $n$ is the number of users and our mechanism in the local setting admits a Polynomial Time Approximation Scheme (PTAS).
Embedded systems demand on-device processing of data using Neural Networks (NNs) while conforming to the memory, power and computation constraints, leading to an efficiency and accuracy tradeoff. To bring NNs to edge devices, several optimizations such as model compression through pruning, quantization, and off-the-shelf architectures with efficient design have been extensively adopted. These algorithms when deployed to real world sensitive applications, requires to resist inference attacks to protect privacy of users training data. However, resistance against inference attacks is not accounted for designing NN models for IoT. In this work, we analyse the three-dimensional privacy-accuracy-efficiency tradeoff in NNs for IoT devices and propose Gecko training methodology where we explicitly add resistance to private inferences as a design objective. We optimize the inference-time memory, computation, and power constraints of embedded devices as a criterion for designing NN architecture while also preserving privacy. We choose quantization as design choice for highly efficient and private models. This choice is driven by the observation that compressed models leak more information compared to baseline models while off-the-shelf efficient architectures indicate poor efficiency and privacy tradeoff. We show that models trained using Gecko methodology are comparable to prior defences against black-box membership attacks in terms of accuracy and privacy while providing efficiency.
The applicability of process mining techniques hinges on the availability of event logs capturing the execution of a business process. In some use cases, particularly those involving customer-facing processes, these event logs may contain private information. Data protection regulations restrict the use of such event logs for analysis purposes. One way of circumventing these restrictions is to anonymize the event log to the extent that no individual can be singled out using the anonymized log. This article addresses the problem of anonymizing an event log in order to guarantee that, upon release of the anonymized log, the probability that an attacker may single out any individual represented in the original log does not increase by more than a threshold. The article proposes a differentially private release mechanism, which samples the cases in the log and adds noise to the timestamps to the extent required to achieve the above privacy guarantee. The article reports on an empirical comparison of the proposed approach against the state-of-the-art approaches using 14 real-life event logs in terms of data utility loss and computational efficiency.
Historically, machine learning methods have not been designed with security in mind. In turn, this has given rise to adversarial examples, carefully perturbed input samples aimed to mislead detection at test time, which have been applied to attack spam and malware classification, and more recently to attack image classification. Consequently, an abundance of research has been devoted to designing machine learning methods that are robust to adversarial examples. Unfortunately, there are desiderata besides robustness that a secure and safe machine learning model must satisfy, such as fairness and privacy. Recent work by Song et al. (2019) has shown, empirically, that there exists a trade-off between robust and private machine learning models. Models designed to be robust to adversarial examples often overfit on training data to a larger extent than standard (non-robust) models. If a dataset contains private information, then any statistical test that separates training and test data by observing a model's outputs can represent a privacy breach, and if a model overfits on training data, these statistical tests become easier. In this work, we identify settings where standard models will overfit to a larger extent in comparison to robust models, and as empirically observed in previous works, settings where the opposite behavior occurs. Thus, it is not necessarily the case that privacy must be sacrificed to achieve robustness. The degree of overfitting naturally depends on the amount of data available for training. We go on to characterize how the training set size factors into the privacy risks exposed by training a robust model on a simple Gaussian data task, and show empirically that our findings hold on image classification benchmark datasets, such as CIFAR-10 and CIFAR-100.
We study the accuracy of differentially private mechanisms in the continual release model. A continual release mechanism receives a sensitive dataset as a stream of $T$ inputs and produces, after receiving each input, an accurate output on the obtained inputs.In contrast, a batch algorithm receives the data as one batch and produces a single output. We provide the first strong lower bounds on the error of continual release mechanisms. In particular, for two fundamental problems that are widely studied and used in the batch model, we show that the worst case error of every continual release algorithm is $\tilde \Omega(T^{1/3})$ times larger than that of the best batch algorithm. Previous work shows only a polylogarithimic (in $T$) gap between the worst case error achievable in these two models; further, for many problems, including the summation of binary attributes, the polylogarithmic gap is tight (Dwork et al., 2010; Chan et al., 2010). Our results show that problems closely related to summation -- specifically, those that require selecting the largest of a set of sums -- are fundamentally harder in the continual release model than in the batch model. Our lower bounds assume only that privacy holds for streams fixed in advance (the "nonadaptive" setting). However, we provide matching upper bounds that hold in a model where privacy is required even for adaptively selected streams. This model may be of independent interest.
Decentralized learning involves training machine learning models over remote mobile devices, edge servers, or cloud servers while keeping data localized. Even though many studies have shown the feasibility of preserving privacy, enhancing training performance or introducing Byzantine resilience, but none of them simultaneously considers all of them. Therefore we face the following problem: \textit{how can we efficiently coordinate the decentralized learning process while simultaneously maintaining learning security and data privacy?} To address this issue, in this paper we propose SPDL, a blockchain-secured and privacy-preserving decentralized learning scheme. SPDL integrates blockchain, Byzantine Fault-Tolerant (BFT) consensus, BFT Gradients Aggregation Rule (GAR), and differential privacy seamlessly into one system, ensuring efficient machine learning while maintaining data privacy, Byzantine fault tolerance, transparency, and traceability. To validate our scheme, we provide rigorous analysis on convergence and regret in the presence of Byzantine nodes. We also build a SPDL prototype and conduct extensive experiments to demonstrate that SPDL is effective and efficient with strong security and privacy guarantees.
Deep learning has achieved great success in many applications. However, its deployment in practice has been hurdled by two issues: the privacy of data that has to be aggregated centrally for model training and high communication overhead due to transmission of a large amount of data usually geographically distributed. Addressing both issues is challenging and most existing works could not provide an efficient solution. In this paper, we develop FedPC, a Federated Deep Learning Framework for Privacy Preservation and Communication Efficiency. The framework allows a model to be learned on multiple private datasets while not revealing any information of training data, even with intermediate data. The framework also minimizes the amount of data exchanged to update the model. We formally prove the convergence of the learning model when training with FedPC and its privacy-preserving property. We perform extensive experiments to evaluate the performance of FedPC in terms of the approximation to the upper-bound performance (when training centrally) and communication overhead. The results show that FedPC maintains the performance approximation of the models within $8.5\%$ of the centrally-trained models when data is distributed to 10 computing nodes. FedPC also reduces the communication overhead by up to $42.20\%$ compared to existing works.
As data are increasingly being stored in different silos and societies becoming more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models is facing efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continue to thrive in this new reality. Existing FL protocol design has been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL, and a unique taxonomy covering: 1) threat models; 2) poisoning attacks and defenses against robustness; 3) inference attacks and defenses against privacy, we provide an accessible review of this important topic. We highlight the intuitions, key techniques as well as fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.
Federated learning has been showing as a promising approach in paving the last mile of artificial intelligence, due to its great potential of solving the data isolation problem in large scale machine learning. Particularly, with consideration of the heterogeneity in practical edge computing systems, asynchronous edge-cloud collaboration based federated learning can further improve the learning efficiency by significantly reducing the straggler effect. Despite no raw data sharing, the open architecture and extensive collaborations of asynchronous federated learning (AFL) still give some malicious participants great opportunities to infer other parties' training data, thus leading to serious concerns of privacy. To achieve a rigorous privacy guarantee with high utility, we investigate to secure asynchronous edge-cloud collaborative federated learning with differential privacy, focusing on the impacts of differential privacy on model convergence of AFL. Formally, we give the first analysis on the model convergence of AFL under DP and propose a multi-stage adjustable private algorithm (MAPA) to improve the trade-off between model utility and privacy by dynamically adjusting both the noise scale and the learning rate. Through extensive simulations and real-world experiments with an edge-could testbed, we demonstrate that MAPA significantly improves both the model accuracy and convergence speed with sufficient privacy guarantee.