We present PORT, a software platform for local data extraction and analysis of digital trace data. While digital trace data collected by private and public parties hold a huge potential for social-scientific discovery, their most useful parts have been unattainable for academic researchers due to privacy concerns and prohibitive API access. However, the EU General Data Protection Regulation (GDPR) grants all citizens the right to an electronic copy of their personal data. All major data controllers, such as social media platforms, banks, online shops, loyalty card systems and public transportation cards comply with this right by providing their clients with a `Data Download Package' (DDP). Previously, a conceptual workflow was introduced allowing citizens to donate their data to scientific- researchers. In this workflow, citizens' DDPs are processed locally on their machines before they are asked to provide informed consent to share a subset of the processed data with the researchers. In this paper, we present the newly developed software PORT that implements the local processing part of this workflow, protecting privacy by shielding sensitive data from any contact with outside observers -- including the researchers themselves. Thus, PORT enables a host of potential applications of social data science to hitherto unobtainable data.
Broadcast/multicast communication systems are typically designed to optimize the outage rate criterion, which neglects the performance of the fraction of clients with the worst channel conditions. Targeting ultra-reliable communication scenarios, this paper takes a complementary approach by introducing the conditional value-at-risk (CVaR) rate as the expected rate of a worst-case fraction of clients. To support differential quality-of-service (QoS) levels in this class of clients, layered division multiplexing (LDM) is applied, which enables decoding at different rates. Focusing on a practical scenario in which the transmitter does not know the fading distribution, layer allocation is optimized based on a dataset sampled during deployment. The optimality gap caused by the availability of limited data is bounded via a generalization analysis, and the sample complexity is shown to increase as the designated fraction of worst-case clients decreases. Considering this theoretical result, meta-learning is introduced as a means to reduce sample complexity by leveraging data from previous deployments. Numerical experiments demonstrate that LDM improves spectral efficiency even for small datasets; that, for sufficiently large datasets, the proposed mirror-descent-based layer optimization scheme achieves a CVaR rate close to that achieved when the transmitter knows the fading distribution; and that meta-learning can significantly reduce data requirements.
Making an updated and as-built model plays an important role in the life-cycle of a process plant. In particular, Digital Twin models must be precise to guarantee the efficiency and reliability of the systems. Data-driven models can simulate the latest behavior of the sub-systems by considering uncertainties and life-cycle related changes. This paper presents a step-by-step concept for hybrid Digital Twin models of process plants using an early implemented prototype as an example. It will detail the steps for updating the first-principles model and Digital Twin of a brownfield process system using data-driven models of the process equipment. The challenges for generation of an as-built hybrid Digital Twin will also be discussed. With the help of process history data to teach Machine Learning models, the implemented Digital Twin can be continually improved over time and this work in progress can be further optimized.
Digital vaccine passports are one of the main solutions which would allow the restart of travel in a post COVID-19 world. Trust, scalability and security are all key challenges one must overcome in implementing a vaccine passport. Initial approaches attempt to solve this problem by using centralised systems with trusted authorities. However, sharing vaccine passport data between different organisations, regions and countries has become a major challenge. This paper designs a new platform architecture for creating, storing and verifying digital COVID-19 vaccine certifications. The platform makes use of the InterPlanetary File System (IPFS) to guarantee there is no single point of failure and allow data to be securely distributed globally. Blockchain and smart contracts are also integrated into the platform to define policies and log access rights to vaccine passport data while ensuring all actions are audited and verifiably immutable. Our proposed platform realises General Data Protection Regulation (GDPR) requirements in terms of user consent, data encryption, data erasure and accountability obligations. We assess the scalability and performance of the platform using IPFS and Blockchain test networks.
We propose and analyze algorithms to solve a range of learning tasks under user-level differential privacy constraints. Rather than guaranteeing only the privacy of individual samples, user-level DP protects a user's entire contribution ($m \ge 1$ samples), providing more stringent but more realistic protection against information leaks. We show that for high-dimensional mean estimation, empirical risk minimization with smooth losses, stochastic convex optimization, and learning hypothesis classes with finite metric entropy, the privacy cost decreases as $O(1/\sqrt{m})$ as users provide more samples. In contrast, when increasing the number of users $n$, the privacy cost decreases at a faster $O(1/n)$ rate. We complement these results with lower bounds showing the minimax optimality of our algorithms for mean estimation and stochastic convex optimization. Our algorithms rely on novel techniques for private mean estimation in arbitrary dimension with error scaling as the concentration radius $\tau$ of the distribution rather than the entire range.
Many video classification applications require access to personal data, thereby posing an invasive security risk to the users' privacy. We propose a privacy-preserving implementation of single-frame method based video classification with convolutional neural networks that allows a party to infer a label from a video without necessitating the video owner to disclose their video to other entities in an unencrypted manner. Similarly, our approach removes the requirement of the classifier owner from revealing their model parameters to outside entities in plaintext. To this end, we combine existing Secure Multi-Party Computation (MPC) protocols for private image classification with our novel MPC protocols for oblivious single-frame selection and secure label aggregation across frames. The result is an end-to-end privacy-preserving video classification pipeline. We evaluate our proposed solution in an application for private human emotion recognition. Our results across a variety of security settings, spanning honest and dishonest majority configurations of the computing parties, and for both passive and active adversaries, demonstrate that videos can be classified with state-of-the-art accuracy, and without leaking sensitive user information.
As data are increasingly being stored in different silos and societies becoming more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models is facing efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continue to thrive in this new reality. Existing FL protocol design has been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL, and a unique taxonomy covering: 1) threat models; 2) poisoning attacks and defenses against robustness; 3) inference attacks and defenses against privacy, we provide an accessible review of this important topic. We highlight the intuitions, key techniques as well as fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.
News recommendation aims to display news articles to users based on their personal interest. Existing news recommendation methods rely on centralized storage of user behavior data for model training, which may lead to privacy concerns and risks due to the privacy-sensitive nature of user behaviors. In this paper, we propose a privacy-preserving method for news recommendation model training based on federated learning, where the user behavior data is locally stored on user devices. Our method can leverage the useful information in the behaviors of massive number users to train accurate news recommendation models and meanwhile remove the need of centralized storage of them. More specifically, on each user device we keep a local copy of the news recommendation model, and compute gradients of the local model based on the user behaviors in this device. The local gradients from a group of randomly selected users are uploaded to server, which are further aggregated to update the global model in the server. Since the model gradients may contain some implicit private information, we apply local differential privacy (LDP) to them before uploading for better privacy protection. The updated global model is then distributed to each user device for local model update. We repeat this process for multiple rounds. Extensive experiments on a real-world dataset show the effectiveness of our method in news recommendation model training with privacy protection.
Train machine learning models on sensitive user data has raised increasing privacy concerns in many areas. Federated learning is a popular approach for privacy protection that collects the local gradient information instead of real data. One way to achieve a strict privacy guarantee is to apply local differential privacy into federated learning. However, previous works do not give a practical solution due to three issues. First, the noisy data is close to its original value with high probability, increasing the risk of information exposure. Second, a large variance is introduced to the estimated average, causing poor accuracy. Last, the privacy budget explodes due to the high dimensionality of weights in deep learning models. In this paper, we proposed a novel design of local differential privacy mechanism for federated learning to address the abovementioned issues. It is capable of making the data more distinct from its original value and introducing lower variance. Moreover, the proposed mechanism bypasses the curse of dimensionality by splitting and shuffling model updates. A series of empirical evaluations on three commonly used datasets, MNIST, Fashion-MNIST and CIFAR-10, demonstrate that our solution can not only achieve superior deep learning performance but also provide a strong privacy guarantee at the same time.
In recent years, mobile devices have gained increasingly development with stronger computation capability and larger storage. Some of the computation-intensive machine learning and deep learning tasks can now be run on mobile devices. To take advantage of the resources available on mobile devices and preserve users' privacy, the idea of mobile distributed machine learning is proposed. It uses local hardware resources and local data to solve machine learning sub-problems on mobile devices, and only uploads computation results instead of original data to contribute to the optimization of the global model. This architecture can not only relieve computation and storage burden on servers, but also protect the users' sensitive information. Another benefit is the bandwidth reduction, as various kinds of local data can now participate in the training process without being uploaded to the server. In this paper, we provide a comprehensive survey on recent studies of mobile distributed machine learning. We survey a number of widely-used mobile distributed machine learning methods. We also present an in-depth discussion on the challenges and future directions in this area. We believe that this survey can demonstrate a clear overview of mobile distributed machine learning and provide guidelines on applying mobile distributed machine learning to real applications.
Machine Learning is a widely-used method for prediction generation. These predictions are more accurate when the model is trained on a larger dataset. On the other hand, the data is usually divided amongst different entities. For privacy reasons, the training can be done locally and then the model can be safely aggregated amongst the participants. However, if there are only two participants in \textit{Collaborative Learning}, the safe aggregation loses its power since the output of the training already contains much information about the participants. To resolve this issue, they must employ privacy-preserving mechanisms, which inevitably affect the accuracy of the model. In this paper, we model the training process as a two-player game where each player aims to achieve a higher accuracy while preserving its privacy. We introduce the notion of \textit{Price of Privacy}, a novel approach to measure the effect of privacy protection on the accuracy of the model. We develop a theoretical model for different player types, and we either find or prove the existence of a Nash Equilibrium with some assumptions. Moreover, we confirm these assumptions via a Recommendation Systems use case: for a specific learning algorithm, we apply three privacy-preserving mechanisms on two real-world datasets. Finally, as a complementary work for the designed game, we interpolate the relationship between privacy and accuracy for this use case and present three other methods to approximate it in a real-world scenario.