The California Consumer Privacy Act (CCPA) provides California residents with a range of enhanced privacy protections and rights. Our research investigated the extent to which Android app developers comply with the provisions of the CCPA that require them to provide consumers with accurate privacy notices and respond to "verifiable consumer requests" (VCRs) by disclosing personal information that they have collected, used, or shared about consumers for a business or commercial purpose. We compared the actual network traffic of 109 apps that we believe must comply with the CCPA to the data that the apps state they collect in their privacy policies and the data contained in responses to "right to know" requests that we submitted to the apps' developers. Of the 69 app developers who substantively replied to our requests, all but one provided specific pieces of personal data (as opposed to only categorical information). However, a significant percentage of apps collected information that was not disclosed, including identifiers (55 apps, 80%), geolocation data (21 apps, 30%), and sensory data (18 apps, 26%), among other categories. We discuss improvements to the CCPA that could help app developers comply with "right to know" requests and other related regulations.
We collected data on the basis of the Google AudioSet database by selecting a subset of the samples annotated with \textit{laughter}. The selection criterion was the presence of a communicative act with a clear connotation of being either positive (laughing with) or negative (being laughed at). On the basis of this annotated data, we performed two experiments: on the one hand, we manually extracted and analyzed phonetic features; on the other hand, we conducted several machine learning experiments by systematically combining several automatically extracted acoustic feature sets with machine learning algorithms. These experiments show that the best performing models achieve an unweighted average recall of .7.
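As a pointer for reproducing the evaluation, the sketch below computes the unweighted average recall (UAR), i.e. the mean of per-class recalls, for a binary laughing-with vs. being-laughed-at classifier. The features, labels, and classifier are synthetic placeholders, not the AudioSet-derived acoustic feature sets used in the experiments.
\begin{verbatim}
# Minimal sketch of the UAR metric on a toy binary laughter task.
# Features, labels, and the classifier are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))     # stand-in for an acoustic feature set
y = rng.integers(0, 2, size=200)   # 0 = laughing with, 1 = being laughed at

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)

# UAR = mean of per-class recalls = macro-averaged recall.
uar = recall_score(y_te, clf.predict(X_te), average="macro")
print(f"UAR: {uar:.2f}")
\end{verbatim}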
Many recent technological advances (e.g. ChatGPT and search engines) are possible only because of massive amounts of user-generated data produced through user interactions with computing systems or scraped from the web (e.g. behavior logs, user-generated content, and artwork). However, data producers have little say in what data is captured, how it is used, or who it benefits. Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape. By synthesizing related literature that reconceptualizes the production of data for computing as ``data labor'', we outline opportunities for researchers, policymakers, and activists to empower data producers in their relationship with tech companies, e.g. advocating for transparency about data reuse, creating feedback channels between data producers and companies, and potentially developing mechanisms to share data's revenue more broadly. In doing so, we characterize data labor along six important dimensions (legibility, end-use awareness, collaboration requirement, openness, replaceability, and livelihood overlap), based on the parallels between data labor and various other types of labor in the computing literature.
Large Language Models (LLMs) have demonstrated human-like intelligence and are widely used in various applications. However, LLMs still exhibit various kinds of inconsistency problems. Existing works mainly focus on inconsistency issues within a single LLM, while we investigate the inter-consistency among multiple LLMs, which is critical for collaborating to solve a complex task. To examine whether LLMs can collaborate to ultimately reach a consensus on the shared goal and whether LLMs easily change their viewpoints, we introduce a Formal Debate framework (FORD). With FORD, we conduct a three-stage debate aligned with real-world scenarios: fair debate, mismatched debate, and roundtable debate. Through extensive experiments on the commonsense reasoning task, LLMs not only become more inter-consistent but also achieve higher performance. Moreover, we observe that stronger LLMs tend to dominate the debates by adhering to their perspectives, while weaker ones are more likely to change viewpoints. Additionally, we highlight the importance of a competent judge, such as GPT-4, in drawing proper conclusions. Our work contributes to understanding the inter-consistency among LLMs and lays the foundation for the development of future collaboration methods.
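To make the debate procedure concrete, the sketch below shows one way such a multi-LLM debate could be orchestrated: debaters take turns responding to the shared transcript, and a judge model draws the final conclusion. The prompts, stopping criteria, and the query_llm client are hypothetical placeholders, not FORD's actual design.
\begin{verbatim}
# Hypothetical sketch of a multi-LLM debate loop; not FORD's actual
# prompts or judging protocol. `query_llm` is a placeholder client.
from typing import List

def query_llm(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its answer text."""
    raise NotImplementedError

def debate(question: str, debaters: List[str], judge: str, rounds: int = 3) -> str:
    transcript = [f"Question: {question}"]
    for r in range(rounds):
        for model in debaters:
            # Each debater sees the full transcript before answering.
            answer = query_llm(model, "\n".join(transcript) + "\nYour stance and reasoning:")
            transcript.append(f"[{model}, round {r + 1}] {answer}")
    # A competent judge (e.g. GPT-4) reads the transcript and concludes.
    return query_llm(judge, "\n".join(transcript) + "\nFinal verdict:")
\end{verbatim}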
Federated learning (FL) shines in the internet of things (IoT) with its ability to realize collaborative learning and improve learning efficiency by sharing client model parameters trained on local data. Although FL has been successfully applied to various domains, including driver monitoring applications (DMAs) on the internet of vehicles (IoV), its use still faces open issues such as data and system heterogeneity, the communication resources required for large-scale parallelism, malicious attacks, and data poisoning. This paper proposes a federated transfer-ordered-personalized learning (FedTOP) framework to address these problems and tests it on two real-world datasets with and without system heterogeneity. The three extensions, transfer, ordered, and personalized, are compared in an ablation study, and the framework achieves 92.32% and 95.96% accuracy on the test clients of the two datasets, respectively. Compared to the baseline, there is a 462% improvement in accuracy and a 37.46% reduction in communication resource consumption. The results demonstrate that the proposed FedTOP can be used as a highly accurate, streamlined, privacy-preserving, cybersecurity-oriented, and personalized framework for DMAs.
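For readers unfamiliar with the parameter-sharing step that FL frameworks such as FedTOP build on, the sketch below shows a plain federated-averaging round in which clients update a model locally and the server only aggregates weights. The transfer, ordered, and personalized extensions are not reproduced, and the local update is a toy placeholder rather than real training.
\begin{verbatim}
# Plain federated-averaging round for illustration; FedTOP's transfer,
# ordered, and personalized extensions are not reproduced here.
import numpy as np

def local_update(weights, data, lr=0.1):
    # Toy placeholder for local training: nudge the weights toward the
    # local data mean instead of running real gradient steps.
    return weights + lr * (data.mean(axis=0) - weights)

def federated_round(global_w, client_data):
    client_ws = [local_update(global_w.copy(), d) for d in client_data]
    return np.mean(client_ws, axis=0)  # server aggregates, never sees raw data

rng = np.random.default_rng(1)
global_w = np.zeros(8)
clients = [rng.normal(loc=i, size=(50, 8)) for i in range(4)]
for _ in range(10):
    global_w = federated_round(global_w, clients)
\end{verbatim}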
Taxi-demand prediction is an important application of machine learning that enables taxi-providing facilities to optimize their operations and city planners to improve transportation infrastructure and services. However, the use of sensitive data in these systems raises concerns about privacy and security. In this paper, we propose the use of federated learning for taxi-demand prediction, which allows multiple parties to train a machine learning model on their own data while keeping the data private and secure. This can enable organizations to build models on data they otherwise would not be able to access. Evaluation with real-world data collected from 16 taxi service providers in Japan over a period of six months showed that the proposed system predicts demand levels with an error within 1\% of that of a single model trained on the integrated data.
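As a concrete illustration of this comparison, the toy sketch below fits one model per provider, averages their parameters, and compares the resulting error against a single model trained on the pooled data. The linear model, synthetic data, and error metric are placeholder choices, not the paper's setup.
\begin{verbatim}
# Toy federated-vs-centralized comparison; data, model, and metric are
# placeholders, not the paper's setup.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
true_w = rng.normal(size=5)

def make_provider(n=200):
    X = rng.normal(size=(n, 5))
    return X, X @ true_w + rng.normal(scale=0.1, size=n)

providers = [make_provider() for _ in range(16)]

# Federated: average the coefficients of per-provider models.
local_models = [LinearRegression().fit(X, y) for X, y in providers]
fed_coef = np.mean([m.coef_ for m in local_models], axis=0)
fed_bias = np.mean([m.intercept_ for m in local_models])

# Centralized: one model trained on the pooled ("integrated") data.
X_all = np.vstack([X for X, _ in providers])
y_all = np.concatenate([y for _, y in providers])
central = LinearRegression().fit(X_all, y_all)

X_test = rng.normal(size=(500, 5))
y_test = X_test @ true_w
fed_mae = np.mean(np.abs(X_test @ fed_coef + fed_bias - y_test))
cen_mae = np.mean(np.abs(central.predict(X_test) - y_test))
print(f"federated MAE={fed_mae:.3f}, centralized MAE={cen_mae:.3f}")
\end{verbatim}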
Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study of their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes from a balance between intra- and inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application of predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application through an improvement of around 2 points on the downstream tasks when using ImageNet as the pre-training dataset.
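To make the scale-invariance concrete, the sketch below derives the optimal number of classes from a fixed sample budget and a class-to-sample ratio estimated at a smaller scale. The ratio value used here is a hypothetical placeholder, not a number from the paper.
\begin{verbatim}
# With N = C * m total samples and an optimal ratio r* = C / m that is
# invariant to N, the optimal class count is C = sqrt(r* * N).
import math

def optimal_num_classes(total_samples, ratio_star):
    return round(math.sqrt(ratio_star * total_samples))

r_star = 0.001  # hypothetical ratio estimated on a small pilot dataset
print(optimal_num_classes(1_200_000, r_star))  # -> about 35 classes
\end{verbatim}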
The food supply chain faces a number of challenges, including a lack of transparency and disengagement among stakeholders. By providing a transparent and traceable digital ledger of transactions and movements for all supply chain actors, blockchain technology can help resolve these problems. We propose a blockchain-based system for tracking a product's full path, from its raw components to the finished item in the store. The proposed system offers many advantages, including improved quality assessment, increased product transparency and traceability, and sophisticated fraud detection capabilities. By reinventing the way transactions are carried out and enabling stakeholders to obtain a complete record of each product's journey, the system has the potential to completely alter the food supply chain. In addition, by minimising the inefficiencies, waste, and fraudulent activities that negatively affect the supply chain, its deployment can remove limits imposed by the current supply chain. Overall, the suggested blockchain-based system has the potential to significantly increase the efficiency, transparency, and traceability of the food supply chain.
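As a minimal illustration of the traceability idea, the sketch below hash-chains each movement of a product so that its farm-to-store path becomes tamper-evident. This is illustrative only; it is neither the paper's implementation nor any particular blockchain platform.
\begin{verbatim}
# Hash-chained ledger of product movements; illustrative only.
import hashlib, json, time

def add_record(chain, product_id, event):
    # Each record references the hash of the previous one, so altering
    # any past entry breaks the chain.
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    record = {"product_id": product_id, "event": event,
              "timestamp": time.time(), "prev_hash": prev_hash}
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    chain.append(record)

ledger = []
add_record(ledger, "batch-42", "harvested at farm")
add_record(ledger, "batch-42", "processed at plant")
add_record(ledger, "batch-42", "arrived at store")
\end{verbatim}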
Markov switching models have had increasing success in time series analysis due to their ability to capture the existence of unobserved discrete states in the dynamics of the variables under study. This result is generally obtained thanks to the inference on states derived from the so-called Hamilton filter. One of the open problems in this framework is the identification of the number of states, which is generally fixed a priori; it is in fact impossible to apply classical tests because of the nuisance parameters that are present only under the alternative hypothesis. In this work we show, by Monte Carlo simulations, that fuzzy clustering is able to reproduce the parametric state inference derived from the Hamilton filter and that the indices typically used in clustering to determine the number of groups can be used to identify the number of states in this framework. The procedure is very simple to apply, considering that it is performed (in a nonparametric way) independently of the data generation process and that the indicators we use are available in most statistical packages. A final application on real data completes the analysis.
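A minimal sketch of the selection idea follows: fuzzy c-means is run on a simulated two-regime series for several candidate numbers of states, and a standard clustering validity index (here the fuzzy partition coefficient) picks the winner. It assumes the scikit-fuzzy package and a toy data-generating process, not the paper's simulation design or its specific indices.
\begin{verbatim}
# Choosing the number of states with fuzzy c-means and a validity index.
# Assumes the scikit-fuzzy package; toy two-regime series for illustration.
import numpy as np
import skfuzzy as fuzz

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 300)])
data = y.reshape(1, -1)  # skfuzzy expects shape (features, N)

best_c, best_fpc = None, -np.inf
for c in range(2, 6):
    cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
        data, c, 2.0, error=1e-5, maxiter=1000, seed=0)
    if fpc > best_fpc:
        best_c, best_fpc = c, fpc
print(f"selected number of states: {best_c} (FPC = {best_fpc:.3f})")
\end{verbatim}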
Evidence retrieval is a core part of automatic fact-checking. Prior work makes simplifying assumptions in retrieval that depart from real-world use cases: either no access to evidence, access to evidence curated by a human fact-checker, or access to evidence available long after the claim has been made. In this work, we present the first fully automated pipeline to check real-world claims by retrieving raw evidence from the web. We restrict our retriever to only search documents available prior to the claim's making, modeling the realistic scenario where an emerging claim needs to be checked. Our pipeline includes five components: claim decomposition, raw document retrieval, fine-grained evidence retrieval, claim-focused summarization, and veracity judgment. We conduct experiments on complex political claims in the ClaimDecomp dataset and show that the aggregated evidence produced by our pipeline improves veracity judgments. Human evaluation finds the evidence summary produced by our system is reliable (it does not hallucinate information) and relevant to answering key questions about a claim, suggesting that it can assist fact-checkers even when it cannot surface a complete evidence set.
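To show how the five components fit together, the skeleton below wires them into a single checking function, restricting document retrieval to material available before the claim date. Every stage is a stub to be replaced by the corresponding model or retriever; the names are illustrative, not the paper's code.
\begin{verbatim}
# Skeleton of the five-stage pipeline; every stage is a stub placeholder.
from datetime import date
from typing import List

def decompose(claim: str) -> List[str]:
    return [claim]                      # stub: break the claim into subquestions
def retrieve_documents(question: str, before: date) -> List[str]:
    return []                           # stub: time-restricted web/corpus search
def retrieve_evidence(question: str, docs: List[str]) -> List[str]:
    return docs                         # stub: select fine-grained passages
def summarize(claim: str, evidence: List[str]) -> str:
    return " ".join(evidence)           # stub: claim-focused summarization
def judge_veracity(claim: str, summary: str) -> str:
    return "not enough evidence"        # stub: veracity judgment

def check_claim(claim: str, claim_date: date) -> str:
    parts = []
    for question in decompose(claim):
        docs = retrieve_documents(question, before=claim_date)
        evidence = retrieve_evidence(question, docs)
        parts.append(summarize(claim, evidence))
    return judge_veracity(claim, "\n".join(parts))
\end{verbatim}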
The massive streams of Internet of Things (IoT) data require timely analysis to retain their usefulness. Stream processing systems (SPSs) enable this task, deriving knowledge from IoT data in real time. Such real-time analytics benefits many applications but can also be used to violate user privacy, as the IoT data collected from users or their vicinity is inherently sensitive. In this paper, we present our systematic look into the privacy issues arising at the intersection of SPSs and IoT, identifying key research challenges towards achieving holistic privacy protection in SPSs and proposing solutions.