Recent advances in generating synthetic data with principled privacy protections -- such as Differential Privacy -- are a crucial step towards sharing statistical information in a privacy-preserving way. But while the focus has been on privacy guarantees, the resulting private synthetic data is only useful if it still carries statistical information from the original data. To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter. What is it that data analysts want? Acknowledging that data quality is a subjective concept, we develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective. Data quality can be measured along two dimensions. First, the quality of synthetic data can be evaluated against the training data or against an underlying population. Second, quality can be assessed in terms of general similarity of distributions or with respect to specific tasks such as inference or prediction. Accommodating all of these goals at once is a formidable challenge. We invite the academic community to jointly advance the privacy-quality frontier.
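To make the two quality dimensions concrete, the sketch below shows one plausible way to instantiate them: a distributional-similarity check on a single column and a task-specific check comparing the downstream accuracy of models trained on real versus synthetic data. The Wasserstein distance and the logistic-regression task are illustrative assumptions, not the evaluation framework proposed above.

```python
# Illustrative sketch only (not the framework described above): two quality checks
# for differentially private synthetic data.
from scipy.stats import wasserstein_distance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def marginal_similarity(real_col, synth_col):
    """General similarity: 1-D Wasserstein distance between a real and a synthetic column."""
    return wasserstein_distance(real_col, synth_col)

def downstream_utility_gap(real_X, real_y, synth_X, synth_y, test_X, test_y):
    """Task-specific quality: accuracy gap between models trained on real vs. synthetic data."""
    model_real = LogisticRegression(max_iter=1000).fit(real_X, real_y)
    model_synth = LogisticRegression(max_iter=1000).fit(synth_X, synth_y)
    return (accuracy_score(test_y, model_real.predict(test_X))
            - accuracy_score(test_y, model_synth.predict(test_X)))
```

Evaluating against the training data corresponds to passing the training sample itself as the "real" arguments; evaluating against a population would instead use a held-out sample drawn from that population.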
In statistical learning and analysis from shared data, which is increasingly widely adopted in platforms such as federated learning and meta-learning, there are two major concerns: privacy and robustness. Each participating individual should be able to contribute without fear of leaking their sensitive information. At the same time, the system should be robust in the presence of malicious participants inserting corrupted data. Recent algorithmic advances in learning from shared data focus on only one of these threats, leaving the system vulnerable to the other. We bridge this gap for the canonical problem of estimating the mean from i.i.d. samples. We introduce PRIME, the first efficient algorithm that achieves both privacy and robustness for a wide range of distributions. We further complement this result with a novel exponential-time algorithm that improves the sample complexity of PRIME, achieving a near-optimal guarantee and matching a known lower bound for (non-robust) private mean estimation. This proves that there is no extra statistical cost to simultaneously guaranteeing privacy and robustness.
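For context, the sketch below shows a standard $(\epsilon,\delta)$-DP Gaussian-mechanism baseline for mean estimation with an assumed clipping norm; it is not the PRIME algorithm, and it contains no robustness filtering against corrupted samples.

```python
# A minimal baseline sketch, not PRIME: (epsilon, delta)-DP mean estimation by clipping
# each sample to an assumed norm bound and adding calibrated Gaussian noise.
import numpy as np

def dp_mean(samples, clip_norm, epsilon, delta, rng=np.random.default_rng()):
    n, d = samples.shape
    norms = np.linalg.norm(samples, axis=1, keepdims=True)
    clipped = samples * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    sensitivity = 2.0 * clip_norm / n                 # L2 sensitivity of the clipped mean
    # Standard Gaussian-mechanism calibration (valid for epsilon <= 1).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```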
Modern car technologies are evolving quickly. They collect a variety of personal data and process it on behalf of the car manufacturer to improve the drivers' experience. The precise terms of such processing are stated in the privacy policies that the user accepts when buying a car or through the infotainment system when it is first started. This paper uses a double lens to assess people's privacy while they drive a car. The first approach is objective and studies the readability of the privacy policies that come with cars. We analyse the privacy policies of twelve car brands and apply well-known readability indices to evaluate the extent to which privacy policies are comprehensible to all drivers. The second approach targets drivers' opinions to elicit their privacy concerns and trust perceptions. We design a questionnaire to collect the opinions of 88 participants and draw essential statistics about them. Our combined findings indicate that privacy is insufficiently understood at present as an issue arising from driving a car; hence, future technologies should be tailored to make people more aware of the issue and to enable them to express their preferences.
In this paper, we provide an optimal additive noise mechanism for database queries with discrete answers on a finite support. The noise provides the minimum error rate for a given $(\epsilon,\delta)$ pair. Popular schemes apply random additive noise with infinite support and then clamp the resulting query response to the desired range; clamping, unfortunately, compromises the privacy guarantees. Using modulo addition rather than clamping, we show that, for any $(\epsilon,\delta)$ pair, the optimal additive noise distribution can be obtained by solving a mixed integer linear program (MILP). Having introduced our optimal noise design formulation, we derive closed-form solutions for the optimal noise probability mass function (PMF) and the probability of error for two special cases. In our performance studies, we show that the proposed optimal mechanism outperforms the state of the art for a given probability of error and any privacy budget $\epsilon > 0$.
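As a simplified illustration of the optimization viewpoint, the sketch below handles the pure $\epsilon$-DP special case ($\delta = 0$), where the per-output privacy constraints are linear and the problem reduces to an LP rather than an MILP; the support size, unit sensitivity, and the choice of error metric (the probability that the added noise is nonzero) are assumptions, not the paper's exact formulation.

```python
# Simplified sketch: modulo-additive noise PMF for pure epsilon-DP (delta = 0),
# an LP relaxation of the setting above, not the paper's exact MILP.
import numpy as np
from scipy.optimize import linprog

def optimal_noise_pmf(n, epsilon, sensitivity=1):
    c = np.zeros(n)
    c[0] = -1.0                                    # maximise p[0]  <=>  minimise error 1 - p[0]
    A_ub, b_ub = [], []
    for i in range(n):                             # DP constraint for every shift within the sensitivity
        for d in range(1, sensitivity + 1):
            for j in {(i + d) % n, (i - d) % n}:
                row = np.zeros(n)
                row[i] += 1.0
                row[j] -= np.exp(epsilon)          # p[i] <= exp(eps) * p[j]
                A_ub.append(row)
                b_ub.append(0.0)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.ones((1, n)), b_eq=np.array([1.0]))  # PMF sums to one, p >= 0 by default
    return res.x

pmf = optimal_noise_pmf(n=16, epsilon=1.0)         # example: 16-valued query answers
print(pmf[0], 1.0 - pmf[0])                        # probability of exact answer vs. probability of error
```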
Knowledge graph embedding plays an important role in knowledge representation, reasoning, and data mining applications. However, for multiple cross-domain knowledge graphs, state-of-the-art embedding models cannot make full use of the data from different knowledge domains while preserving the privacy of exchanged data. In addition, a centralized embedding model may not scale to extensive real-world knowledge graphs. Therefore, we propose a novel decentralized scalable learning framework, \emph{Federated Knowledge Graphs Embedding} (FKGE), where embeddings from different knowledge graphs can be learnt in an asynchronous and peer-to-peer manner while being privacy-preserving. FKGE exploits adversarial generation between pairs of knowledge graphs to translate identical entities and relations of different domains into nearby embedding spaces. To protect the privacy of the training data, FKGE further implements a privacy-preserving neural network structure to guarantee no raw data leakage. We conduct extensive experiments to evaluate FKGE on 11 knowledge graphs, demonstrating a significant and consistent improvement in model quality, with performance gains of up to 17.85\% and 7.90\% on triple classification and link prediction tasks, respectively.
Adversarial training is among the most effective techniques for improving the robustness of models against adversarial perturbations. However, the full effect of this approach on models is not well understood. For example, while adversarial training can reduce the adversarial risk (prediction error against an adversary), it sometimes increases standard risk (generalization error when there is no adversary). Moreover, such behavior is affected by various elements of the learning problem, including the size and quality of training data, specific forms of adversarial perturbations in the input, model overparameterization, and the adversary's power, among others. In this paper, we focus on the \emph{distribution-perturbing} adversary framework, wherein the adversary can change the test distribution within a neighborhood of the training data distribution. The neighborhood is defined via the Wasserstein distance between distributions, and the radius of the neighborhood is a measure of the adversary's manipulative power. We study the tradeoff between standard risk and adversarial risk and derive the Pareto-optimal tradeoff, achievable over specific classes of models, in the infinite data limit with the feature dimension kept fixed. We consider three learning settings: 1) regression with the class of linear models; 2) binary classification under the Gaussian mixtures data model, with the class of linear classifiers; 3) regression with the class of random features models (which can be equivalently represented as two-layer neural networks with random first-layer weights). We show that a tradeoff between standard and adversarial risk is manifested in all three settings. We further characterize the Pareto-optimal tradeoff curves and discuss how a variety of factors, such as feature correlation, the adversary's power, or the width of the two-layer neural network, affect this tradeoff.
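To make the tradeoff tangible, here is a small numerical sketch under simplifying assumptions (squared loss, a linear model, and Wasserstein-2 perturbations applied to the features only); under these assumptions the adversarial risk of a weight vector $w$ has the closed form $(\sqrt{\text{standard risk}} + r\,\lVert w\rVert)^2$, so shrinking the least-squares solution trades standard risk for adversarial risk. This illustrates the phenomenon only; it is not the paper's exact setting or its Pareto-optimal characterization.

```python
# Illustrative sketch: standard vs. adversarial risk of a linear model under
# Wasserstein-2 feature perturbations of radius `radius` (squared loss assumed).
import numpy as np

def risks(w, X, y, radius):
    residual = y - X @ w
    std_risk = np.mean(residual ** 2)
    adv_risk = (np.sqrt(std_risk) + radius * np.linalg.norm(w)) ** 2  # worst-case W2 risk
    return std_risk, adv_risk

# Trading off the two risks by shrinking the least-squares solution toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(scale=0.5, size=1000)
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
for alpha in (1.0, 0.5, 0.0):                     # alpha = 0 ignores the signal entirely
    print(alpha, risks(alpha * w_ls, X, y, radius=0.3))
```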
As data are increasingly being stored in different silos and societies become more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models is facing efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continues to thrive in this new reality. Existing FL protocol designs have been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL and a unique taxonomy covering: 1) threat models; 2) poisoning attacks on robustness and the corresponding defenses; 3) inference attacks on privacy and the corresponding defenses, we provide an accessible review of this important topic. We highlight the intuitions, key techniques, and fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.
Training machine learning models on sensitive user data has raised increasing privacy concerns in many areas. Federated learning is a popular approach for privacy protection that collects local gradient information instead of raw data. One way to achieve a strict privacy guarantee is to apply local differential privacy to federated learning. However, previous works do not give a practical solution due to three issues. First, the noisy data is close to its original value with high probability, increasing the risk of information exposure. Second, a large variance is introduced into the estimated average, causing poor accuracy. Last, the privacy budget explodes due to the high dimensionality of weights in deep learning models. In this paper, we propose a novel local differential privacy mechanism for federated learning that addresses the issues above. It makes the perturbed data more distinct from its original value while introducing lower variance. Moreover, the proposed mechanism bypasses the curse of dimensionality by splitting and shuffling model updates. A series of empirical evaluations on three commonly used datasets, MNIST, Fashion-MNIST, and CIFAR-10, demonstrates that our solution not only achieves superior deep learning performance but also provides a strong privacy guarantee at the same time.
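A minimal sketch of the splitting-and-shuffling idea is given below, under the assumption of a per-coordinate Laplace perturbation on updates clipped to $[-c, c]$; the paper's actual perturbation mechanism and its privacy accounting are not reproduced here.

```python
# Minimal sketch (assumptions: per-dimension Laplace perturbation, clipped updates);
# not the paper's exact mechanism.
import numpy as np

def perturb_update(update, clip, epsilon, rng):
    """Each client clips and perturbs every coordinate of its model update independently."""
    clipped = np.clip(update, -clip, clip)
    noise = rng.laplace(scale=2.0 * clip / epsilon, size=update.shape)  # epsilon-LDP per coordinate
    return clipped + noise

def split_and_shuffle(client_updates, clip, epsilon, rng=np.random.default_rng()):
    """Split every noisy update into (dimension, value) reports and shuffle them across clients."""
    reports = []
    for update in client_updates:
        noisy = perturb_update(update, clip, epsilon, rng)
        reports.extend((j, v) for j, v in enumerate(noisy))
    rng.shuffle(reports)                      # breaks the link between a client and its full vector
    dim = len(client_updates[0])
    sums, counts = np.zeros(dim), np.zeros(dim)
    for j, v in reports:
        sums[j] += v
        counts[j] += 1
    return sums / counts                      # server-side estimate of the average update
```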
Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Alternating Direction Method of Multipliers (ADMM) is a widely used tool for machine learning in distributed settings, where a machine learning model is trained over distributed data sources through an iterative process of local computation and message passing. Such an iterative process could raise privacy concerns for data owners. The goal of this paper is to provide differential privacy for ADMM-based distributed machine learning. Prior approaches to differentially private ADMM exhibit low utility under a high privacy guarantee and often assume the objective functions of the learning problems to be smooth and strongly convex. To address these concerns, we propose a novel differentially private ADMM-based distributed learning algorithm called DP-ADMM, which combines an approximate augmented Lagrangian function with time-varying Gaussian noise addition in the iterative process to achieve higher utility for general objective functions under the same differential privacy guarantee. We also apply the moments accountant method to bound the end-to-end privacy loss. The theoretical analysis shows that DP-ADMM can be applied to a wider class of distributed learning problems, is provably convergent, and offers an explicit utility-privacy tradeoff. To our knowledge, this is the first paper to provide explicit convergence and utility guarantees for differentially private ADMM-based distributed learning algorithms. The evaluation results demonstrate that our approach can achieve good convergence and model accuracy under a strong end-to-end differential privacy guarantee.
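The schematic below sketches one noisy consensus-ADMM round under assumptions (a single gradient step approximating each local subproblem, and Gaussian noise with an iteration-dependent scale sigma_t added to the messages workers share); it only illustrates where the noise enters, and is not the DP-ADMM algorithm as specified in the paper.

```python
# Schematic sketch of one noisy consensus-ADMM round; not the paper's DP-ADMM.
import numpy as np

def noisy_admm_round(x, u, z, grads, rho, lr, sigma_t, rng=np.random.default_rng()):
    """x, u: per-worker primal/dual variables (k x d); grads: local gradients at x; z: consensus (d,)."""
    # One gradient step approximates each local minimisation of f_i(x_i) + (rho/2)||x_i - z + u_i||^2.
    x_new = x - lr * (grads + rho * (x - z + u))
    # Each worker perturbs its message with Gaussian noise before sharing it.
    noisy = x_new + rng.normal(0.0, sigma_t, size=x_new.shape)
    z_new = noisy.mean(axis=0)                 # server-side consensus update from noisy messages
    u_new = u + x_new - z_new                  # dual updates stay local
    return x_new, u_new, z_new
```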
Machine learning is a widely used method for generating predictions. These predictions are more accurate when the model is trained on a larger dataset. On the other hand, the data is usually divided amongst different entities. For privacy reasons, the training can be done locally and the resulting models can then be securely aggregated amongst the participants. However, if there are only two participants in \textit{Collaborative Learning}, secure aggregation loses its power, since the output of the training already reveals much information about the participants. To resolve this issue, the participants must employ privacy-preserving mechanisms, which inevitably affect the accuracy of the model. In this paper, we model the training process as a two-player game where each player aims to achieve higher accuracy while preserving its privacy. We introduce the notion of the \textit{Price of Privacy}, a novel approach to measuring the effect of privacy protection on the accuracy of the model. We develop a theoretical model for different player types, and, under certain assumptions, we either find a Nash Equilibrium or prove that one exists. Moreover, we confirm these assumptions via a Recommendation Systems use case: for a specific learning algorithm, we apply three privacy-preserving mechanisms on two real-world datasets. Finally, as complementary work to the designed game, we interpolate the relationship between privacy and accuracy for this use case and present three other methods to approximate it in a real-world scenario.
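As a purely hypothetical illustration (the paper's formal definition of the Price of Privacy is not reproduced here), one such quantity could be computed as the relative accuracy lost when a privacy-preserving mechanism is applied:

```python
# Hypothetical illustration only; not the paper's definition of the Price of Privacy.
def price_of_privacy(acc_no_privacy: float, acc_with_privacy: float) -> float:
    """Relative accuracy lost when training with a privacy-preserving mechanism."""
    return (acc_no_privacy - acc_with_privacy) / acc_no_privacy

print(price_of_privacy(0.90, 0.84))   # example inputs: a relative drop of about 6.7%
```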