In the era of big data and the Internet of Things (IoT), data owners need to share a large amount of data with intended receivers in an insecure environment, which poses a trade-off between user privacy and data utility. This privacy-utility trade-off has previously been formulated as a privacy funnel based on mutual information. Nevertheless, it is challenging to characterize mutual information accurately with small sample sizes or unknown distribution functions. In this article, we propose a privacy funnel based on the mutual information neural estimator (MINE) to optimize the privacy-utility trade-off by estimating mutual information. Instead of computing mutual information in the traditional way, we estimate it with MINE, which obtains the estimate by training a neural network and thereby keeps the estimation as precise as possible. We employ the estimated mutual information as a measure of both privacy and utility, and then formulate a problem that optimizes data utility by training a neural network while keeping the estimated privacy leakage below a threshold. Simulation results demonstrate that the MINE estimate approximates the true mutual information well even with a limited number of samples, allowing us to quantify privacy leakage and data utility retention and to optimize the privacy-utility trade-off.
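To make the estimator concrete, the sketch below is a minimal MINE implementation in PyTorch (our own illustration, not the article's code), assuming a toy pair of correlated Gaussian variables; it maximizes the Donsker-Varadhan lower bound $I(X;Y) \ge \mathbb{E}_{P}[T(x,y)] - \log \mathbb{E}_{P_X \otimes P_Y}[e^{T(x,y)}]$ over a small statistics network $T$.

```python
# Minimal MINE sketch (illustrative, not the article's code): train a statistics
# network T to maximize the Donsker-Varadhan lower bound on mutual information.
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1)).squeeze(1)

def mine_lower_bound(T, x, y):
    # Joint term uses paired samples; the marginal term shuffles y to break the pairing.
    y_shuffled = y[torch.randperm(y.size(0))]
    joint = T(x, y).mean()
    marginal = torch.logsumexp(T(x, y_shuffled), dim=0) - math.log(y.size(0))
    return joint - marginal

# Toy data: correlated Gaussians, whose true mutual information has a closed form.
torch.manual_seed(0)
x = torch.randn(2000, 1)
y = x + 0.5 * torch.randn(2000, 1)
T = StatisticsNetwork(1, 1)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = -mine_lower_bound(T, x, y)   # ascend the bound
    loss.backward()
    opt.step()
print("estimated mutual information (nats):", mine_lower_bound(T, x, y).item())
```

In a privacy-funnel setting, the same estimator would be plugged into the training objective twice: once for the utility term and once for the leakage constraint.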
Recent research in differential privacy demonstrated that (sub)sampling can amplify the level of protection. For example, for $\epsilon$-differential privacy and simple random sampling with sampling rate $r$, the actual privacy guarantee is approximately $r\epsilon$ if a value of $\epsilon$ is used to protect the output computed from the sample. In this paper, we study whether this amplification effect can be exploited systematically to improve the accuracy of the privatized estimate. Specifically, assuming the agency has information for the full population, we ask under which circumstances accuracy gains can be expected if the privatized estimate is computed on a random sample instead of the full population. We find that accuracy gains can be achieved for certain regimes. However, gains can typically only be expected if the sensitivity of the output with respect to small changes in the database does not depend too strongly on the size of the database. We focus only on algorithms that achieve differential privacy by adding noise to the final output and illustrate the accuracy implications for two commonly used statistics: the mean and the median. We see our research as a first step towards understanding the conditions required for accuracy gains in practice, and we hope that these findings will stimulate further research broadening the scope of differential privacy algorithms and outputs considered.
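As a small illustration of the regime described above (a toy simulation under our own assumptions, not the paper's experiments), the sketch below compares the mean-squared error of a Laplace-noised mean computed on the full population with one computed on a random subsample, where the sample-based release uses a budget of $\epsilon/r$ to exploit the approximate amplification $r\cdot(\epsilon/r)\approx\epsilon$:

```python
# Toy comparison: Laplace-noised mean on the full population vs. on a subsample.
import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(0, 1, size=10_000)   # values assumed bounded in [0, 1]
eps_target, r = 1.0, 0.1                      # target privacy level, sampling rate
n = population.size

def dp_mean(values, eps, bound=1.0):
    sensitivity = bound / values.size          # sensitivity of the mean of bounded data
    return values.mean() + rng.laplace(scale=sensitivity / eps)

true_mean = population.mean()
errs_full, errs_sample = [], []
for _ in range(5_000):
    errs_full.append(dp_mean(population, eps_target) - true_mean)
    sample = rng.choice(population, size=int(r * n), replace=False)
    # Amplification: running with eps_target / r on the sample gives roughly
    # eps_target overall; note the sensitivity of the mean scales as 1/(r*n),
    # so the noise scale matches the full-population case while sampling error
    # is added on top.
    errs_sample.append(dp_mean(sample, eps_target / r) - true_mean)

print("MSE full population:", np.mean(np.square(errs_full)))
print("MSE random sample:  ", np.mean(np.square(errs_sample)))
```

For the mean, whose sensitivity shrinks with the database size, the subsample therefore brings no noise reduction and only adds sampling error, which matches the condition on sensitivity stated in the abstract.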
Deep Neural Networks (DNNs) have achieved remarkable progress in various real-world applications, especially when abundant training data are available. However, data isolation has become a serious problem. Existing works build privacy-preserving DNN models from either an algorithmic or a cryptographic perspective. The former mainly splits the DNN computation graph between data holders or between data holders and a server, which demonstrates good scalability but suffers from accuracy loss and potential privacy risks. In contrast, the latter leverages time-consuming cryptographic techniques, which provide strong privacy guarantees but poor scalability. In this paper, we propose SPNN, a Scalable and Privacy-preserving deep Neural Network learning framework, designed from a combined algorithmic and cryptographic perspective. From the algorithmic perspective, we split the computation graph of DNN models into two parts: the computations related to private data, which are performed by the data holders, and the remaining heavy computations, which are delegated to a server with high computational capacity. From the cryptographic perspective, we propose using two types of cryptographic techniques, namely secret sharing and homomorphic encryption, so that the isolated data holders can conduct the private-data-related computations privately and cooperatively. Furthermore, we implement SPNN in a decentralized setting and introduce user-friendly APIs. Experimental results on real-world datasets demonstrate the superiority of SPNN.
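To make the cryptographic side tangible, the sketch below shows additive secret sharing over a prime field, one of the two primitives named above (a minimal illustration under our own assumptions, not SPNN's implementation); each data holder's value is split into shares so that parties can jointly compute sums without seeing each other's inputs.

```python
# Minimal additive secret sharing sketch (illustrative, not SPNN's code).
import secrets

P = 2**61 - 1  # a Mersenne prime used as the field modulus

def share(value, n_parties=2):
    """Split an integer into n additive shares that sum to value mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two data holders secret-share their private inputs; a single share reveals
# nothing about the input, yet sums can be computed share-wise.
a_shares = share(42)
b_shares = share(100)
sum_shares = [(a + b) % P for a, b in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 142
```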
Pufferfish privacy achieves $\epsilon$-indistinguishability over a set of secret pairs in the disclosed data. This paper studies how to attain $\epsilon$-pufferfish privacy by the exponential mechanism, an additive noise scheme that generalizes the Laplace noise. It is shown that the disclosed data is $\epsilon$-pufferfish private if the noise is calibrated to the sensitivity of the Kantorovich optimal transport plan. Such a plan can be obtained directly from the data statistics conditioned on the secret, i.e., the prior knowledge of the system. The sufficient condition is further relaxed to reduce the noise power. It is also proved that the Gaussian mechanism based on the Kantorovich approach attains a $\delta$-approximation of $\epsilon$-pufferfish privacy.
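For context, the standard definition being attained here (recalled from Kifer and Machanavajjhala's pufferfish framework, with notation of our own choosing) requires that, for every prior distribution $\theta \in \Theta$, every secret pair $(s_i, s_j)$ with positive probability under $\theta$, and every released value $w$,
\[
e^{-\epsilon} \;\le\; \frac{\Pr\left[\,M(X)=w \mid s_i,\theta\,\right]}{\Pr\left[\,M(X)=w \mid s_j,\theta\,\right]} \;\le\; e^{\epsilon},
\]
which is the likelihood ratio that the Kantorovich-calibrated noise above is designed to bound.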
In this work, we focus on solving a decentralized consensus problem in a private manner. Specifically, we consider a setting in which a group of nodes, connected through a network, aims to compute the mean of their local values without revealing those values to each other. The distributed consensus problem is a classic problem that has been extensively studied, and its convergence characteristics are well known. Alas, state-of-the-art consensus methods build on the idea of exchanging local information with neighboring nodes, which leaks information about the users' local values. We propose an algorithmic framework that is capable of achieving the convergence limit and rate of classic consensus algorithms while keeping the users' local values private. The key idea of our proposed method is to carefully design noisy messages that are passed from each node to its neighbors such that the consensus algorithm still converges precisely to the average of the local values, while a minimum amount of information about the local values is leaked. We formalize this by precisely characterizing the mutual information between the private value of a node and all the messages that an adversary collects over time. We prove that our method is capable of preserving users' privacy for any network without a so-called "generalized leaf", and we formalize the trade-off between privacy and convergence time. Unlike many private algorithms, any desired accuracy is achievable by our method, and the required level of privacy only affects the convergence time.
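For reference, the sketch below implements the classic (non-private) average-consensus iteration whose limit and rate the framework above matches, on a toy 5-node ring of our own choosing; the private variant would additionally perturb the exchanged messages with carefully designed noise.

```python
# Classic average consensus on a ring (baseline sketch, not the private method):
# nodes repeatedly average with neighbors using a doubly stochastic weight
# matrix, so every node converges to the mean of the initial values.
import numpy as np

n = 5
W = np.zeros((n, n))
for i in range(n):
    for j in (i, (i - 1) % n, (i + 1) % n):
        W[i, j] = 1 / 3          # equal weights on self and two ring neighbors

x = np.array([3.0, 7.0, 1.0, 5.0, 4.0])   # private local values
print("true mean:", x.mean())
for _ in range(200):
    x = W @ x                    # each node averages the values it receives
print("consensus values:", np.round(x, 4))
```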
This paper presents a crowd monitoring system based on the passive detection of probe requests. The system meets strict privacy requirements and is suited to monitoring events or buildings with at least a few hundred attendees. We present our counting process and an associated mathematical model. From this model, we derive a concentration inequality that highlights the accuracy of our crowd count estimator. Then, we describe our system. We present and discuss our sensor hardware, our computing system architecture, and an efficient implementation of our counting algorithm, as well as its space and time complexity. We also show how our system ensures the privacy of people in the monitored area. Finally, we validate our system using nine weeks of data from a public library equipped with a camera-based counting system, whose counts we compare against those of our own system. This comparison empirically quantifies the accuracy of our counting system, thereby showing it to be suitable for monitoring public areas. Similarly, the concentration inequality provides a theoretical validation of the system.
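As a rough illustration of the kind of privacy protection such probe-request counting systems commonly employ (a generic sketch under our own assumptions, not this paper's counting algorithm or mathematical model), device identifiers can be reduced to keyed, truncated hashes before any counting is done:

```python
# Illustrative sketch: count distinct devices from probe requests while storing
# only keyed, truncated hashes of MAC addresses, so raw identifiers never persist.
import hashlib, hmac, secrets

SECRET_KEY = secrets.token_bytes(16)   # would be rotated periodically in practice

def anonymize(mac: str) -> str:
    digest = hmac.new(SECRET_KEY, mac.encode(), hashlib.sha256).hexdigest()
    return digest[:16]                  # truncation further limits re-identification

seen = set()
for mac in ["aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02", "aa:bb:cc:dd:ee:01"]:
    seen.add(anonymize(mac))
print("estimated device count:", len(seen))
```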
We study the problem of selling information to a data buyer who faces a decision problem under uncertainty. We consider the classic Bayesian decision-theoretic model pioneered by [Blackwell, 1951, 1953]. Initially, the data buyer has only partial information about the payoff-relevant state of the world. A data seller offers additional information about the state of the world. The information is revealed through signaling schemes, also referred to as experiments. In the single-agent setting, any mechanism can be represented as a menu of experiments. [Bergemann et al., 2018] present a complete characterization of the revenue-optimal mechanism in a binary state and binary action environment. By contrast, no characterization is known for the case with more actions. In this paper, we consider more general environments and study arguably the simplest mechanism, which only sells the fully informative experiment. In the environment with binary state and $m\geq 3$ actions, we provide an $O(m)$-approximation to the optimal revenue by selling only the fully informative experiment and show that the approximation ratio is tight up to an absolute constant factor. An important corollary of our lower bound is that the size of the optimal menu must grow at least linearly in the number of available actions, so no universal upper bound exists for the size of the optimal menu in the general single-dimensional setting. For multi-dimensional environments, we prove that even in arguably the simplest matching utility environment with 3 states and 3 actions, the ratio between the optimal revenue and the revenue from selling only the fully informative experiment can already grow polynomially in the number of agent types. Nonetheless, if the distribution is uniform, we show that selling only the fully informative experiment is indeed the optimal mechanism.
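For background (a standard decision-theoretic identity stated in our own notation, not a result of the paper above), a buyer with prior $\mu$ over states, action set $A$, and payoff $u$ values the fully informative experiment at the gap between acting with full knowledge of the state and acting on the prior alone:
\[
V_{\text{full}}(\mu) \;=\; \mathbb{E}_{\theta\sim\mu}\!\left[\max_{a\in A} u(a,\theta)\right] \;-\; \max_{a\in A}\, \mathbb{E}_{\theta\sim\mu}\!\left[u(a,\theta)\right],
\]
so selling only the fully informative experiment at a posted price $p$ yields revenue $p \cdot \Pr\!\left[V_{\text{full}} \ge p\right]$ over the distribution of buyer types.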
In this paper, we propose two new definitions of local differential privacy for belief functions. One is based on Shafer's semantics of randomly coded messages, and the other on the perspective of imprecise probabilities. We show that basic properties such as composition and post-processing also hold for our new definitions. Moreover, we provide a hypothesis testing framework for these definitions and study the effect of the "don't know" option on the trade-off between privacy and utility in discrete distribution estimation.
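For context, the classical probabilistic definition that these belief-function variants generalize can be recalled as follows (standard material, notation ours): a randomized mechanism $Q$ satisfies $\epsilon$-local differential privacy if
\[
\Pr\left[\,Q(x)=y\,\right] \;\le\; e^{\epsilon}\,\Pr\left[\,Q(x')=y\,\right] \quad \text{for all inputs } x, x' \text{ and all outputs } y.
\]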
Training machine learning models on sensitive user data has raised increasing privacy concerns in many areas. Federated learning is a popular approach to privacy protection that collects local gradient information instead of raw data. One way to achieve a strict privacy guarantee is to apply local differential privacy to federated learning. However, previous works do not give a practical solution due to three issues. First, the noisy data is close to its original value with high probability, increasing the risk of information exposure. Second, a large variance is introduced into the estimated average, causing poor accuracy. Last, the privacy budget explodes due to the high dimensionality of the weights in deep learning models. In this paper, we propose a novel local differential privacy mechanism for federated learning that addresses the aforementioned issues. It makes the perturbed data more distinct from its original value while introducing lower variance. Moreover, the proposed mechanism bypasses the curse of dimensionality by splitting and shuffling model updates. A series of empirical evaluations on three commonly used datasets, MNIST, Fashion-MNIST and CIFAR-10, demonstrates that our solution can not only achieve superior deep learning performance but also provide a strong privacy guarantee at the same time.
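To illustrate splitting and shuffling (a hedged sketch using a generic Duchi-style one-dimensional LDP perturbation of our own choosing, not the paper's specific mechanism), each client clips and perturbs its model update coordinate by coordinate, and the coordinate-level reports are shuffled so the server can only aggregate per dimension:

```python
# Hedged sketch: coordinate-wise LDP perturbation plus splitting and shuffling.
import random
import numpy as np

rng = np.random.default_rng(0)

def ldp_perturb(x, eps, C=1.0):
    """Unbiased epsilon-LDP perturbation of a scalar clipped to [-C, C]."""
    x = np.clip(x, -C, C) / C
    t = (np.exp(eps) - 1) / (np.exp(eps) + 1)
    p = 0.5 + 0.5 * x * t                      # probability of reporting +C/t
    return (C / t) if rng.random() < p else (-C / t)

def split(update, eps):
    """Split a model update into anonymous (dimension, perturbed value) pairs."""
    return [(j, ldp_perturb(v, eps)) for j, v in enumerate(update)]

# Three clients with 4-dimensional updates (toy sizes).
updates = [rng.uniform(-1, 1, size=4) for _ in range(3)]
reports = [pair for u in updates for pair in split(u, eps=1.0)]
random.shuffle(reports)            # shuffling detaches coordinates of the same client

# Server: average the perturbed values per dimension.
sums, counts = np.zeros(4), np.zeros(4)
for j, v in reports:
    sums[j] += v
    counts[j] += 1
print("estimated mean update:", np.round(sums / counts, 3))
print("true mean update:     ", np.round(np.mean(updates, axis=0), 3))
```

With so few clients the estimate is noisy; the point of the sketch is only that each coordinate is reported on its own privacy budget and the shuffled reports cannot be reassembled into a full per-client update.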
Federated learning has shown promise in paving the last mile of artificial intelligence, due to its great potential for solving the data isolation problem in large-scale machine learning. In particular, considering the heterogeneity of practical edge computing systems, asynchronous edge-cloud collaborative federated learning can further improve learning efficiency by significantly reducing the straggler effect. Despite no raw data being shared, the open architecture and extensive collaborations of asynchronous federated learning (AFL) still give malicious participants ample opportunities to infer other parties' training data, leading to serious privacy concerns. To achieve a rigorous privacy guarantee with high utility, we investigate how to secure asynchronous edge-cloud collaborative federated learning with differential privacy (DP), focusing on the impact of DP on the model convergence of AFL. Formally, we give the first analysis of the model convergence of AFL under DP and propose a multi-stage adjustable private algorithm (MAPA) to improve the trade-off between model utility and privacy by dynamically adjusting both the noise scale and the learning rate. Through extensive simulations and real-world experiments on an edge-cloud testbed, we demonstrate that MAPA significantly improves both model accuracy and convergence speed with a sufficient privacy guarantee.
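The sketch below illustrates the general idea of stage-wise adjustment of noise scale and learning rate in a Gaussian-noise private gradient update (a toy illustration under our own assumptions, not the MAPA algorithm or its schedule):

```python
# Stage-wise noisy gradient descent on a toy quadratic objective (illustrative).
import numpy as np

rng = np.random.default_rng(0)

def private_update(w, grad, lr, noise_scale, clip=1.0):
    norm = np.linalg.norm(grad)
    grad = grad * min(1.0, clip / max(norm, 1e-12))       # gradient clipping
    noisy = grad + rng.normal(scale=noise_scale * clip, size=grad.shape)
    return w - lr * noisy

# Toy objective f(w) = 0.5 * ||w - w_star||^2
w_star = np.array([1.0, -2.0])
w = np.zeros(2)
stages = [(0.5, 1.0), (0.2, 0.5), (0.1, 0.25)]            # (learning rate, noise multiplier)
for lr, noise_scale in stages:
    for _ in range(200):
        grad = w - w_star
        w = private_update(w, grad, lr, noise_scale)
print("final iterate:", np.round(w, 3), "target:", w_star)
```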
The Alternating Direction Method of Multipliers (ADMM) is a widely used tool for machine learning in distributed settings, where a machine learning model is trained over distributed data sources through an iterative process of local computation and message passing. Such an iterative process can raise privacy concerns for data owners. The goal of this paper is to provide differential privacy for ADMM-based distributed machine learning. Prior approaches to differentially private ADMM exhibit low utility under high privacy guarantees and often assume the objective functions of the learning problems to be smooth and strongly convex. To address these concerns, we propose a novel differentially private ADMM-based distributed learning algorithm called DP-ADMM, which combines an approximate augmented Lagrangian function with time-varying Gaussian noise addition in the iterative process to achieve higher utility for general objective functions under the same differential privacy guarantee. We also apply the moments accountant method to bound the end-to-end privacy loss. The theoretical analysis shows that DP-ADMM can be applied to a wider class of distributed learning problems, is provably convergent, and offers an explicit utility-privacy trade-off. To our knowledge, this is the first paper to provide explicit convergence and utility properties for differentially private ADMM-based distributed learning algorithms. The evaluation results demonstrate that our approach achieves good convergence and model accuracy under a high end-to-end differential privacy guarantee.
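To ground the setting, the sketch below shows Gaussian-noise perturbation inside consensus ADMM for a toy distributed ridge regression (a generic illustration under our own assumptions, not the DP-ADMM algorithm, which additionally uses an approximate augmented Lagrangian and the moments accountant):

```python
# Noisy consensus ADMM for distributed ridge regression (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d, rho, lam, sigma = 5, 1.0, 0.1, 0.05

# Toy data split across 3 parties (an assumption for illustration).
w_true = rng.normal(size=d)
parties = []
for _ in range(3):
    A = rng.normal(size=(50, d))
    b = A @ w_true + 0.01 * rng.normal(size=50)
    parties.append((A, b))

z = np.zeros(d)
u = [np.zeros(d) for _ in parties]
for it in range(100):
    shared = []
    for i, (A, b) in enumerate(parties):
        # Local primal update (closed form for the ridge subproblem).
        x = np.linalg.solve(A.T @ A + rho * np.eye(d), A.T @ b + rho * (z - u[i]))
        shared.append(x + rng.normal(scale=sigma, size=d))   # perturb before sharing
    # Aggregator updates the consensus variable from the noisy iterates.
    z = rho * sum(x + u[i] for i, x in enumerate(shared)) / (lam + rho * len(parties))
    for i, x in enumerate(shared):
        u[i] = u[i] + x - z
print("estimation error:", np.linalg.norm(z - w_true))
```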