The emerging public awareness and government regulations of data privacy motivate new paradigms of collecting and analyzing data transparent and acceptable to data owners. We present a new concept of privacy and corresponding data formats, mechanisms, and theories for privatizing data during data collection. The privacy, named Interval Privacy, enforces the raw data conditional distribution on the privatized data to be the same as its unconditional distribution over a nontrivial support set. Correspondingly, the proposed privacy mechanism will record each data value as a random interval (or, more generally, a range) containing it. The proposed interval privacy mechanisms can be easily deployed through survey-based data collection interfaces, e.g., by asking a respondent whether its data value is within a randomly generated range. Another unique feature of interval mechanisms is that they obfuscate the truth but not perturb it. Using narrowed range to convey information is complementary to the popular paradigm of perturbing data. Also, the interval mechanisms can generate progressively refined information at the discretion of individuals, naturally leading to privacy-adaptive data collection. We develop different aspects of theory such as composition, robustness, distribution estimation, and regression learning from interval-valued data. Interval privacy provides a new perspective of human-centric data privacy where individuals have a perceptible, transparent, and simple way of sharing sensitive data.
Although robust learning and local differential privacy are both widely studied fields of research, combining the two settings is just starting to be explored. We consider the problem of estimating a discrete distribution in total variation from $n$ contaminated data batches under a local differential privacy constraint. A fraction $1-\epsilon$ of the batches contain $k$ i.i.d. samples drawn from a discrete distribution $p$ over $d$ elements. To protect the users' privacy, each of the samples is privatized using an $\alpha$-locally differentially private mechanism. The remaining $\epsilon n $ batches are an adversarial contamination. The minimax rate of estimation under contamination alone, with no privacy, is known to be $\epsilon/\sqrt{k}+\sqrt{d/kn}$, up to a $\sqrt{\log(1/\epsilon)}$ factor. Under the privacy constraint alone, the minimax rate of estimation is $\sqrt{d^2/\alpha^2 kn}$. We show that combining the two constraints leads to a minimax estimation rate of $\epsilon\sqrt{d/\alpha^2 k}+\sqrt{d^2/\alpha^2 kn}$ up to a $\sqrt{\log(1/\epsilon)}$ factor, larger than the sum of the two separate rates. We provide a polynomial-time algorithm achieving this bound, as well as a matching information theoretic lower bound.
Large scale adoption of large language models has introduced a new era of convenient knowledge transfer for a slew of natural language processing tasks. However, these models also run the risk of undermining user trust by exposing unwanted information about the data subjects, which may be extracted by a malicious party, e.g. through adversarial attacks. We present an empirical investigation into the extent of the personal information encoded into pre-trained representations by a range of popular models, and we show a positive correlation between the complexity of a model, the amount of data used in pre-training, and data leakage. In this paper, we present the first wide coverage evaluation and comparison of some of the most popular privacy-preserving algorithms, on a large, multi-lingual dataset on sentiment analysis annotated with demographic information (location, age and gender). The results show since larger and more complex models are more prone to leaking private information, use of privacy-preserving methods is highly desirable. We also find that highly privacy-preserving technologies like differential privacy (DP) can have serious model utility effects, which can be ameliorated using hybrid or metric-DP techniques.
Face privacy-preserving is one of the hotspots that arises dramatic interests of research. However, the existing face privacy-preserving methods aim at causing the missing of semantic information of face and cannot preserve the reusability of original facial information. To achieve the naturalness of the processed face and the recoverability of the original protected face, this paper proposes face privacy-preserving method based on Invertible "Mask" Network (IMN). In IMN, we introduce a Mask-net to generate "Mask" face firstly. Then, put the "Mask" face onto the protected face and generate the masked face, in which the masked face is indistinguishable from "Mask" face. Finally, "Mask" face can be put off from the masked face and obtain the recovered face to the authorized users, in which the recovered face is visually indistinguishable from the protected face. The experimental results show that the proposed method can not only effectively protect the privacy of the protected face, but also almost perfectly recover the protected face from the masked face.
Privacy protection is an essential issue in personalized news recommendation, and federated learning can potentially mitigate the privacy concern by training personalized news recommendation models over decentralized user data.For a theoretical privacy guarantee, differential privacy is necessary. However, applying differential privacy to federated recommendation training and serving conventionally suffers from the unsatisfactory trade-off between privacy and utility due to the high-dimensional characteristics of model gradients and hidden representations. In addition, there is no formal privacy guarantee for both training and serving in federated recommendation. In this paper, we propose a unified federated news recommendation method for effective and privacy-preserving model training and online serving with differential privacy guarantees. We first clarify the notion of differential privacy over users' behavior data for both model training and online serving in the federated recommendation scenario. Next, we propose a privacy-preserving online serving mechanism under this definition with differentially private user interest decomposition. More specifically, it decomposes the high-dimensional and privacy-sensitive user embedding into a combination of public basic vectors and adds noise to the combination coefficients. In this way, it can avoid the dimension curse and improve the utility by reducing the required noise intensity for differential privacy. Besides, we design a federated recommendation model training method with differential privacy, which can avoid the dimension-dependent noise for large models via label permutation and differentially private attention modules. Experiments on real-world news recommendation datasets validate the effectiveness of our method in achieving a good trade-off between privacy protection and utility for federated news recommendations.
The concept of federated learning (FL) was first proposed by Google in 2016. Thereafter, FL has been widely studied for the feasibility of application in various fields due to its potential to make full use of data without compromising the privacy. However, limited by the capacity of wireless data transmission, the employment of federated learning on mobile devices has been making slow progress in practical. The development and commercialization of the 5th generation (5G) mobile networks has shed some light on this. In this paper, we analyze the challenges of existing federated learning schemes for mobile devices and propose a novel cross-device federated learning framework, which utilizes the anonymous communication technology and ring signature to protect the privacy of participants while reducing the computation overhead of mobile devices participating in FL. In addition, our scheme implements a contribution-based incentive mechanism to encourage mobile users to participate in FL. We also give a case study of autonomous driving. Finally, we present the performance evaluation of the proposed scheme and discuss some open issues in federated learning.
With the increasing adoption of NLP models in real-world products, it becomes more and more important to protect these models from privacy leakage. Because private information in language data is sparse, previous research formalized a Selective-Differential-Privacy (SDP) notion to provide protection for sensitive tokens detected by policy functions, and prove its effectiveness on RNN-based models. But the previous mechanism requires separating the private and public model parameters and thus cannot be applied on large attention-based models. In this paper, we propose a simple yet effective just-fine-tune-twice privacy mechanism to first fine-tune on in-domain redacted data and then on in-domain private data, to achieve SDP for large Transformer-based language models. We also design explicit and contextual policy functions to provide protections at different levels. Experiments show that our models achieve strong performance while staying robust to the canary insertion attack. We further show that even under low-resource settings with a small amount of in-domain data, SDP can still improve the model utility. We will release the code, data and models to facilitate future research.
Bayesian model selection provides a powerful framework for objectively comparing models directly from observed data, without reference to ground truth data. However, Bayesian model selection requires the computation of the marginal likelihood (model evidence), which is computationally challenging, prohibiting its use in many high-dimensional Bayesian inverse problems. With Bayesian imaging applications in mind, in this work we present the proximal nested sampling methodology to objectively compare alternative Bayesian imaging models for applications that use images to inform decisions under uncertainty. The methodology is based on nested sampling, a Monte Carlo approach specialised for model comparison, and exploits proximal Markov chain Monte Carlo techniques to scale efficiently to large problems and to tackle models that are log-concave and not necessarily smooth (e.g., involving l_1 or total-variation priors). The proposed approach can be applied computationally to problems of dimension O(10^6) and beyond, making it suitable for high-dimensional inverse imaging problems. It is validated on large Gaussian models, for which the likelihood is available analytically, and subsequently illustrated on a range of imaging problems where it is used to analyse different choices of dictionary and measurement model.
Background. From information theory, surprisal is a measurement of how unexpected an event is. Statistical language models provide a probabilistic approximation of natural languages, and because surprisal is constructed with the probability of an event occuring, it is therefore possible to determine the surprisal associated with English sentences. The issues and pull requests of software repository issue trackers give insight into the development process and likely contain the surprising events of this process. Objective. Prior works have identified that unusual events in software repositories are of interest to developers, and use simple code metrics-based methods for detecting them. In this study we will propose a new method for unusual event detection in software repositories using surprisal. With the ability to find surprising issues and pull requests, we intend to further analyse them to determine if they actually hold importance in a repository, or if they pose a significant challenge to address. If it is possible to find bad surprises early, or before they cause additional troubles, it is plausible that effort, cost and time will be saved as a result. Method. After extracting the issues and pull requests from 5000 of the most popular software repositories on GitHub, we will train a language model to represent these issues. We will measure their perceived importance in the repository, measure their resolution difficulty using several analogues, measure the surprisal of each, and finally generate inferential statistics to describe any correlations.
As data are increasingly being stored in different silos and societies becoming more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models is facing efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continue to thrive in this new reality. Existing FL protocol design has been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL, and a unique taxonomy covering: 1) threat models; 2) poisoning attacks and defenses against robustness; 3) inference attacks and defenses against privacy, we provide an accessible review of this important topic. We highlight the intuitions, key techniques as well as fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.
State-of-the-art recommendation algorithms -- especially the collaborative filtering (CF) based approaches with shallow or deep models -- usually work with various unstructured information sources for recommendation, such as textual reviews, visual images, and various implicit or explicit feedbacks. Though structured knowledge bases were considered in content-based approaches, they have been largely neglected recently due to the availability of vast amount of data, and the learning power of many complex models. However, structured knowledge bases exhibit unique advantages in personalized recommendation systems. When the explicit knowledge about users and items is considered for recommendation, the system could provide highly customized recommendations based on users' historical behaviors. A great challenge for using knowledge bases for recommendation is how to integrated large-scale structured and unstructured data, while taking advantage of collaborative filtering for highly accurate performance. Recent achievements on knowledge base embedding sheds light on this problem, which makes it possible to learn user and item representations while preserving the structure of their relationship with external knowledge. In this work, we propose to reason over knowledge base embeddings for personalized recommendation. Specifically, we propose a knowledge base representation learning approach to embed heterogeneous entities for recommendation. Experimental results on real-world dataset verified the superior performance of our approach compared with state-of-the-art baselines.