The synthpop package for R (//www.synthpop.org.uk) provides tools to allow data custodians to create synthetic versions of confidential microdata that can be distributed with fewer restrictions than the original. The synthesis can be customized to ensure that relationships evident in the real data are reproduced in the synthetic data. A number of measures have been proposed to assess this aspect, commonly known as the utility of the synthetic data. We show that all these measures, including those calculated from tabulations, can be derived from a propensity score model. The measures will be reviewed and compared, and relations between them illustrated. All the measures compared are highly correlated and some are shown to be identical. The method used to define the propensity score model is more important than the choice of measure. These measures and methods are incorporated into utility modules in the synthpop package that include methods to visualize the results and thus provide immediate feedback to allow the person creating the synthetic data to improve its quality. The utility functions were originally designed to be used for synthetic data objects of class synds, created by the synthpop function syn() or syn.strata(), but they can now be used to compare one or more synthesised data sets with the original records, where the records are R data frames or lists of data frames.
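To make the link between tabulations and propensity scores concrete, here is a minimal Python sketch (not synthpop's R implementation). It assumes the commonly used propensity score mean squared error, pMSE = (1/N) Σ (p_i − c)², where c is the share of synthetic records in the combined data. It also assumes a saturated propensity model, so that each record's estimated propensity is simply the synthetic share within its table cell. The function name `pmse_from_table` and the toy records are hypothetical.

```python
from collections import Counter


def pmse_from_table(observed, synthetic):
    """pMSE where the propensity model is cell membership in a cross-tabulation.

    `observed` and `synthetic` are lists of hashable records (e.g. tuples of
    categorical values). Each distinct record value is treated as one table cell.
    """
    n_obs, n_syn = len(observed), len(synthetic)
    total = n_obs + n_syn
    c = n_syn / total  # overall share of synthetic records

    obs_counts = Counter(observed)
    syn_counts = Counter(synthetic)

    pmse = 0.0
    for cell in set(obs_counts) | set(syn_counts):
        n_cell = obs_counts[cell] + syn_counts[cell]
        p_cell = syn_counts[cell] / n_cell  # estimated propensity in this cell
        pmse += n_cell * (p_cell - c) ** 2  # every record in the cell shares p_cell
    return pmse / total


# Toy data: two categorical variables per record.
observed = [("F", "urban"), ("F", "rural"), ("M", "urban"), ("M", "rural")]
good_syn = [("M", "rural"), ("F", "urban"), ("M", "urban"), ("F", "rural")]
bad_syn = [("X", "none")] * 4

print(pmse_from_table(observed, good_syn))  # same cell counts as observed
print(pmse_from_table(observed, bad_syn))   # no overlap with observed cells
```

A value near zero means the cell counts give no help in telling synthetic records from original ones. Larger values indicate poorer utility under this particular choice of propensity model.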

Related content

MICRO · Dataset · Continuity · binary · Statistic

January 19, 2022

Bayesian Data Synthesis and the Utility-Risk Trade-Off for Mixed Epidemiological Data

Joseph Feldman, Daniel Kowal

from arxiv, 24 pages, 4 figures, 3 tables, accepted at The Annals of Applied Statistics

Much of the micro data used for epidemiological studies contain sensitive measurements on real individuals. As a result, such micro data cannot be published out of privacy concerns, rendering any published statistical analyses on them nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic, high dimensional micro datasets of mixed categorical, binary, count, and continuous variables. This process centers around a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.

COVID-19 · Wearable devices · Use Case · Continuity · Seven

January 19, 2022

Visualization and Analysis of Wearable Health Data From COVID-19 Patients

Susanne K. Suter, Georg R. Spinner, Bianca Hoelz, Sofia Rey, Sujeanthraa Thanabalasingam, Jens Eckstein, Sven Hirsch

from arxiv, 17 pages, 9 figures, conference

Effective visualizations were evaluated to reveal relevant health patterns from multi-sensor real-time wearable devices that recorded vital signs from patients admitted to hospital with COVID-19. Furthermore, specific challenges associated with wearable health data visualizations, such as fluctuating data quality resulting from compliance problems, time needed to charge the device and technical problems are described. As a primary use case, we examined the detection and communication of relevant health patterns visible in the vital signs acquired by the technology. Customized heat maps and bar charts were used to specifically highlight medically relevant patterns in vital signs. A survey of two medical doctors, one clinical project manager and seven health data science researchers was conducted to evaluate the visualization methods. From a dataset of 84 hospitalized COVID-19 patients, we extracted one typical COVID-19 patient history and based on the visualizations showcased the health history of two noteworthy patients. The visualizations were shown to be effective, simple and intuitive in deducing the health status of patients. For clinical staff who are time-constrained and responsible for numerous patients, such visualization methods can be an effective tool to enable continuous acquisition and monitoring of patients' health statuses even remotely.

Inference · MoDELS · Dataset · Performer · state-of-the-art

January 18, 2022

On Utility and Privacy in Synthetic Genomic Data

Bristena Oprisanu, Georgi Ganev, Emiliano De Cristofaro

from arxiv, Published in the Proceedings of the 29th Network and Distributed System Security Symposium (NDSS 2022)

The availability of genomic data is essential to progress in biomedical research, personalized medicine, etc. However, its extreme sensitivity makes it problematic, if not outright impossible, to publish or share it. As a result, several initiatives have been launched to experiment with synthetic genomic data, e.g., using generative models to learn the underlying distribution of the real data and generate artificial datasets that preserve its salient characteristics without exposing it. This paper provides the first evaluation of both utility and privacy protection of six state-of-the-art models for generating synthetic genomic data. We assess the performance of the synthetic data on several common tasks, such as allele population statistics and linkage disequilibrium. We then measure privacy through the lens of membership inference attacks, i.e., inferring whether a record was part of the training data. Our experiments show that no single approach to generate synthetic genomic data yields both high utility and strong privacy across the board. Also, the size and nature of the training dataset matter. Moreover, while some combinations of datasets and models produce synthetic data with distributions close to the real data, there often are target data points that are vulnerable to membership inference. Looking forward, our techniques can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild and serve as a benchmark for future work.

Statistic · MoDELS · Performer · Biased · Heteroscedasticity

January 17, 2022

Simulation Models for Aggregated Data Meta-Analysis: Evaluation of Pooling Effect Sizes and Publication Biases

Edwin R. van den Heuvel, Osama Almalik, Zhuozhao Zhan

from arxiv, 26 pages, 6 tables

Simulation studies are commonly used to evaluate the performance of newly developed meta-analysis methods. For methodology that is developed for an aggregated data meta-analysis, researchers often resort to simulation of the aggregated data directly, instead of simulating individual participant data from which the aggregated data would be calculated in reality. Clearly, distributional characteristics of the aggregated data statistics may be derived from distributional assumptions of the underlying individual data, but they are often not made explicit in publications. This paper provides the distribution of the aggregated data statistics that were derived from a heteroscedastic mixed effects model for continuous individual data. As a result, we provide a procedure for directly simulating the aggregated data statistics. We also compare our distributional findings with other simulation approaches of aggregated data used in the literature by describing their theoretical differences and by conducting a simulation study for three meta-analysis methods: DerSimonian and Laird's pooled estimate and the Trim & Fill and PET-PEESE methods for adjustment of publication bias. We demonstrate that the choice of simulation model for aggregated data may have a relevant impact on (the conclusions of) the performance of the meta-analysis method. We recommend the use of multiple aggregated data simulation models for investigation of new methodology to determine sensitivity, or otherwise making explicit the individual participant data model that would lead to the distributional choices of the aggregated data statistics used in the simulation.
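For readers unfamiliar with the first of the three methods named above, the following Python sketch shows the standard DerSimonian and Laird random-effects pooled estimate computed from study-level effects and their within-study variances. It illustrates the textbook estimator only, not the simulation procedure proposed in the paper; the function name `dersimonian_laird` and the toy inputs are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Return (pooled estimate, standard error, tau^2) for a random-effects model."""
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q heterogeneity statistic
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))

    # Method-of-moments estimate of between-study variance, truncated at zero
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each within-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    return pooled, se, tau2


# Toy aggregated data: two studies with very different effects.
pooled, se, tau2 = dersimonian_laird([0.0, 4.0], [1.0, 1.0])
print(pooled, se, tau2)
```

When the studies agree closely, the truncation at zero sets the between-study variance to zero and the result coincides with the fixed-effect inverse-variance estimate.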

Big data · Storage · Processing (programming language) · INFORMS · Scenario

January 15, 2022

Characterizing Big Data Management

Rogerio Rossi, Kechi Hirama

from arxiv, volume 12, 2015

Big data management is a reality for an increasing number of organizations in many areas and represents a set of challenges involving big data modeling, storage and retrieval, analysis and visualization. However, technological resources, people and processes are crucial to facilitate the management of big data in any kind of organization, allowing information and knowledge from a large volume of data to support decision-making. Big data management can be supported by these three dimensions: technology, people and processes. Hence, this article discusses these dimensions: the technological dimension, which is related to storage, analytics and visualization of big data; the human aspects of big data; and the process management dimension, which addresses big data management from both a technological and a business perspective.

INFORMS · Performer · TOOLS · Example · Scenario

January 14, 2022

Probabilistic Counters for Privacy Preserving Data Aggregation

Dominik Bojko, Krzysztof Grining, Marek Klonowski

Probabilistic counters are well-known tools often used for space-efficient set cardinality estimation. In this paper we investigate probabilistic counters from the perspective of preserving privacy. We use the standard, rigid notion of differential privacy. The intuition is that probabilistic counters do not reveal too much information about individuals, but provide only general information about the population. Thus they can be used safely without violating the privacy of individuals. It turns out, however, that providing a precise, formal analysis of the privacy parameters of probabilistic counters is surprisingly difficult and needs advanced techniques and a very careful approach. We also demonstrate that probabilistic counters can be used as a privacy protection mechanism without any extra randomization. That is, the inherent randomization of the protocol is sufficient for protecting privacy, even if the probabilistic counter is used many times. In particular, we present a specific privacy-preserving data aggregation protocol based on a probabilistic counter. Our results can be used, for example, in performing distributed surveys.
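To give a feel for the built-in randomness the abstract refers to, the Python sketch below implements one classic probabilistic counter, the Morris approximate counter, which counts events rather than distinct set elements. Whether it matches the counters analysed in the paper is an assumption, and the sketch makes no privacy claims of its own. The class name `MorrisCounter` is hypothetical.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only an exponent instead of the full count."""

    def __init__(self, rng=None):
        self.exponent = 0
        self.rng = rng or random.Random()

    def increment(self):
        # Bump the exponent with probability 2^(-exponent); otherwise do nothing.
        if self.rng.random() < 2.0 ** (-self.exponent):
            self.exponent += 1

    def estimate(self):
        # 2^exponent - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.exponent - 1


counter = MorrisCounter(random.Random(0))  # seeded only to make the run repeatable
for _ in range(10_000):
    counter.increment()
print(counter.exponent, counter.estimate())
```

Because the stored state is a small exponent updated at random, many different input sequences lead to the same counter value. This is the kind of inherent randomization the abstract describes.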

Frequentist school · Inference · Almost sure convergence · Sample · Almost surely

January 13, 2022

The Effect of Sample Size and Missingness on Inference with Missing Data

Julian Morimoto

from arxiv, Submitted as of January 12, 2022

When are inferences (whether Direct-Likelihood, Bayesian, or Frequentist) obtained from partial data valid? This paper answers this question by offering a new asymptotic theory about inference with missing data that is more general than existing theories. By using more powerful tools from real analysis and probability theory than those used in previous research, it proves that as the sample size increases and the extent of missingness decreases, the mean-loglikelihood function generated by partial data and that ignores the missingness mechanism will almost surely converge uniformly to that which would have been generated by complete data; and if the data are Missing at Random, this convergence depends only on sample size. Thus, inferences from partial data, such as posterior modes, uncertainty estimates, confidence intervals, likelihood ratios, test statistics, and indeed, all quantities or features derived from the partial-data loglikelihood function, will be consistently estimated. They will approximate their complete-data analogues. This adds to previous research which has only proved the consistency and asymptotic normality of the posterior mode, and developed separate theories for Direct-Likelihood, Bayesian, and Frequentist inference. Practical implications of this result are discussed, and the theory is verified using a previous study of International Human Rights Law.

Comprehensibility · COVID-19 · Biased · entity · CASES

January 10, 2022

Understanding COVID-19 Effects on Mobility: A Community-Engaged Approach

Arun Sharma, Majid Farhadloo, Yan Li, Aditya Kulkarni, Jayant Gupta, Shashi Shekhar

Given aggregated mobile device data, the goal is to understand the impact of COVID-19 policy interventions on mobility. This problem is vital due to important societal use cases, such as safely reopening the economy. Challenges include understanding and interpreting questions of interest to policymakers, cross-jurisdictional variability in choice and time of interventions, the large data volume, and unknown sampling bias. The related work has explored the COVID-19 impact on travel distance, time spent at home, and the number of visitors at different points of interest. However, many policymakers are interested in long-duration visits to high-risk business categories and understanding the spatial selection bias to interpret summary reports. We provide an Entity Relationship diagram, system architecture, and implementation to support queries on long-duration visits in addition to fine resolution device count maps to understand spatial bias. We closely collaborated with policymakers to derive the system requirements and evaluate the system components, the summary reports, and visualizations.

BERT · Networking · INFORMS · Performer · MoDELS

October 28, 2019

Visualizing and Measuring the Geometry of BERT

Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, Martin Wattenberg

from arxiv, 8 pages, 5 figures

Transformer architectures show significant promise for natural language processing. Given that a single pretrained model can be fine-tuned to perform well on many different tasks, these networks appear to extract generally useful linguistic features. A natural question is how such networks represent this information internally. This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.

Text classification · Comprehensibility · Dataset · FAST · MoDELS

November 5, 2018

Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Edward Collins, Nikolai Rozanov, Bingbing Zhang

from arxiv, 27 pages, 6 tables, 3 figures (submitted for publication in June 2018), CoNLL 2018

Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation, but the underlying properties of datasets are discovered on an ad-hoc basis as errors occur. However, understanding the properties of the data is crucial in perfecting models. In this paper we analyse exactly which characteristics of a dataset best determine how difficult that dataset is for the task of text classification. We then propose an intuitive measure of difficulty for text classification datasets which is simple and fast to calculate. We show that this measure generalises to unseen data by comparing it to state-of-the-art datasets and results. This measure can be used to analyse the precise source of errors in a dataset and allows fast estimation of how difficult a dataset is to learn. We searched for this measure by training 12 classical and neural network based models on 78 real-world datasets, then used a genetic algorithm to discover the best measure of difficulty. Our difficulty-calculating code ( //github.com/Wluper/edm ) and datasets ( //data.wluper.com ) are publicly available.