Data protection regulations, such as GDPR and CCPA, require websites and embedded third parties, especially advertisers, to seek user consent before collecting and processing user data. Only when users opt in can these entities collect, process, and share user data. Websites typically incorporate Consent Management Platforms (CMPs), such as OneTrust and CookieBot, to solicit and convey user consent to embedded advertisers, with the expectation that the consent will be respected. However, neither websites nor regulators currently have any mechanism to audit advertisers' compliance with user consent, i.e., to determine whether advertisers indeed refrain from collecting, processing, and sharing user data when the user opts out. In this paper, we propose an auditing framework that leverages advertisers' bidding behavior to empirically assess violations of data protection regulations. Using our framework, we conduct a measurement study to evaluate two of the most widely deployed CMPs, i.e., OneTrust and CookieBot, as well as advertiser-offered opt-out controls, i.e., the National Advertising Initiative's opt-out, under GDPR and CCPA -- arguably two of the most mature data protection regulations. Our results indicate that user data is unfortunately still collected, processed, and shared even when users opt out. Our findings suggest that several prominent advertisers (e.g., AppNexus, PubMatic) might be in potential violation of GDPR and CCPA. Overall, our work casts doubt on whether these regulations are effective at protecting users' online privacy.
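The core intuition behind such a bidding-based audit can be sketched simply: if an advertiser honors an opt-out, its bids should fall back toward non-personalized levels. The sketch below is an assumed simplification, not the paper's actual framework; the function name, sample CPM values, and choice of statistical test are all illustrative.

```python
# Illustrative sketch (not the paper's pipeline): compare an advertiser's
# header-bidding CPMs observed under opt-in vs. opt-out sessions. If bids
# do not drop after opt-out, that is a signal of potential non-compliance.
from statistics import mean
from scipy.stats import mannwhitneyu  # non-parametric two-sample test

def flag_potential_violation(optin_cpms, optout_cpms, alpha=0.05):
    """Return (suspicious, mean opt-in CPM, mean opt-out CPM)."""
    # Under compliance we expect opt-out bids to be stochastically lower;
    # a one-sided test checks whether that expected drop is absent.
    _, p = mannwhitneyu(optout_cpms, optin_cpms, alternative="less")
    return p >= alpha, mean(optin_cpms), mean(optout_cpms)

# Hypothetical bid data for one advertiser, in dollars per thousand impressions.
suspicious, m_in, m_out = flag_potential_violation(
    optin_cpms=[1.9, 2.4, 2.1, 2.8], optout_cpms=[2.0, 2.3, 2.2, 2.6])
print(f"opt-in mean={m_in:.2f}, opt-out mean={m_out:.2f}, suspicious={suspicious}")
```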
Large-scale adoption of large language models has introduced a new era of convenient knowledge transfer for a slew of natural language processing tasks. However, these models also risk undermining user trust by exposing unwanted information about data subjects, which may be extracted by a malicious party, e.g., through adversarial attacks. We present an empirical investigation into the extent of the personal information encoded in pre-trained representations by a range of popular models, and we show a positive correlation between the complexity of a model, the amount of data used in pre-training, and data leakage. In this paper, we present the first wide-coverage evaluation and comparison of some of the most popular privacy-preserving algorithms on a large, multi-lingual sentiment analysis dataset annotated with demographic information (location, age, and gender). The results show that, since larger and more complex models are more prone to leaking private information, the use of privacy-preserving methods is highly desirable. We also find that strongly privacy-preserving technologies like differential privacy (DP) can seriously degrade model utility, which can be ameliorated using hybrid or metric-DP techniques.
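One standard way to quantify such leakage is attribute probing: train a simple classifier to recover a demographic attribute from frozen embeddings, where accuracy above the majority-class baseline indicates the attribute is encoded. The sketch below is a generic example of this technique under stated assumptions, not the paper's exact protocol; the dimensions and labels are synthetic.

```python
# Illustrative leakage probe (an assumption, not the paper's setup):
# recover a demographic attribute from frozen pre-trained embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def leakage_score(embeddings: np.ndarray, attributes: np.ndarray) -> float:
    """Probe accuracy on held-out data; compare against the chance baseline."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, attributes, test_size=0.3, random_state=0, stratify=attributes)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Synthetic demo: random embeddings carry no signal, so accuracy ~ chance.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 768))    # e.g., 768-dim BERT-style vectors
attr = rng.integers(0, 2, size=500)  # binary demographic label
print(f"probe accuracy: {leakage_score(emb, attr):.2f} (chance = 0.50)")
```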
The human footprint has a unique set of ridges unmatched by any other human being, and it can therefore be used in various identity documents, for example birth certificates, India's AADHAR biometric identification card, driving licenses, PAN cards, and passports. At many crime scenes, an accused must walk around, leaving footwear impressions as well as barefoot prints; recovering these footprints is therefore crucial for identifying criminals. Footprint-based biometrics is a comparatively new technique for personal identification. Fingerprint, retina, iris, and face recognition are the methods most commonly used for recording a person's attendance. The world currently faces the problem of global terrorism. It is challenging to identify terrorists because they live like ordinary citizens. Their soft targets include industries of special interest, such as defence, silicon and nanotechnology chip manufacturing units, and the pharmaceutical sector. They pose as religious persons, so temples and other holy places, and even markets, are among their targets. These are places where one can obtain their footprints easily. Gait itself is sufficient to predict the behaviour of suspects. The present research investigates the usefulness of footprint and gait as alternative means of personal identification.
Modern web services routinely provide REST APIs for clients to access their functionality. These APIs present unique challenges and opportunities for automated testing, driving the recent development of many techniques and tools that generate test cases for API endpoints using various strategies. Understanding how these techniques compare to one another is difficult, as they have been evaluated on different benchmarks and using different metrics. To fill this gap, we performed an empirical study aimed at understanding the landscape of automated REST API testing and guiding future research in this area. We first identified, through a systematic selection process, a set of 10 state-of-the-art REST API testing tools that included tools developed by both researchers and practitioners. We then applied these tools to a benchmark of 20 real-world open-source RESTful services and analyzed their performance in terms of code coverage achieved and unique failures triggered. This analysis allowed us to identify strengths, weaknesses, and limitations of the tools considered and of their underlying strategies, as well as implications of our findings for future research in this area.
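For intuition, the simplest strategy such tools automate is black-box fuzzing of an endpoint while counting unique server-side failures. The sketch below is a deliberately minimal, assumed example (the URL, parameter names, and failure criterion are hypothetical), not a reimplementation of any surveyed tool.

```python
# Minimal black-box REST fuzzing sketch: send random parameter values
# to an endpoint and record unique 5xx failures.
import random, string
import requests

def random_value():
    """A crude value generator mixing integers, random strings, and nulls."""
    return random.choice([
        random.randint(-10**6, 10**6),
        "".join(random.choices(string.printable, k=random.randint(0, 32))),
        None,  # requests drops None-valued params, exercising missing inputs
    ])

def fuzz_endpoint(url: str, params: list[str], trials: int = 100) -> set:
    failures = set()
    for _ in range(trials):
        payload = {p: random_value() for p in params}
        resp = requests.get(url, params=payload, timeout=5)
        if resp.status_code >= 500:  # server-side failure triggered
            failures.add((resp.status_code, resp.text[:80]))
    return failures

# Hypothetical service under test.
print(fuzz_endpoint("http://localhost:8080/api/search", ["q", "limit"]))
```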
Evaluating keyword spotting (KWS) systems that detect keywords in speech is a challenging task under realistic privacy constraints. A KWS system is designed to collect data only when the keyword is present, which limits the availability of hard samples that may contain false negatives and prevents direct estimation of model recall from production data. Alternatively, complementary data collected from other sources may not be fully representative of the real application. In this work, we propose an evaluation technique that we call AB/BA analysis. Our framework evaluates a candidate KWS model B against a baseline model A, using cross-dataset offline decoding for relative recall estimation, without requiring negative examples. Moreover, we propose a formulation with assumptions that allows estimating the relative false positive rate between models with low variance, even when the number of false positives is small. Finally, we propose leveraging machine-generated soft labels, in a technique we call Semi-Supervised AB/BA analysis, which improves the analysis time, privacy, and cost. Experiments with both simulated and real data show that AB/BA analysis successfully measures recall improvement in conjunction with the trade-off in relative false positive rate.
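To make the cross-decoding idea concrete, here is a hedged sketch of one plausible way to form a relative-recall estimate from acceptance counts: model B decodes the keyword-positive audio collected by model A, and vice versa. The exact estimator and its variance treatment in the paper may differ; the function and the figures below are illustrative assumptions.

```python
# Illustrative AB/BA-style estimate (assumed formulation): the ratio of
# cross-decoding acceptance rates estimates recall(B)/recall(A) without
# requiring any negative examples.
def relative_recall(b_accepts_on_a: int, n_a: int,
                    a_accepts_on_b: int, n_b: int) -> float:
    """Estimate recall(B)/recall(A) from cross-decoding acceptance counts."""
    p_b_given_a = b_accepts_on_a / n_a  # B fires on A-collected positives
    p_a_given_b = a_accepts_on_b / n_b  # A fires on B-collected positives
    return p_b_given_a / p_a_given_b

# Example: B recovers 92% of A's detections while A recovers only 80% of
# B's, suggesting candidate B has ~15% higher recall than baseline A.
print(f"relative recall ~ {relative_recall(920, 1000, 800, 1000):.2f}")
```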
With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data mined from software projects, however, requires extensive processing and needs to be handled with utmost care to ensure valid conclusions. Since software development practices and tools have changed over the past two decades, we aim to understand state-of-the-art research workflows and to highlight potential challenges. We employ a systematic literature review, sampling over one thousand papers from leading conferences and analyzing the 286 most relevant papers from the perspective of data workflows, methodologies, reproducibility, and tools. We found that an important part of the research workflow, dataset selection, was particularly problematic, which raises questions about the generality of the results in the existing literature. Furthermore, a considerable number of papers provide little or no reproducibility information -- a substantial deficiency for a data-intensive field. In fact, 33% of papers provide no information on how their data was retrieved. Based on these findings, we propose ways to address these shortcomings via existing tools and provide recommendations to improve research workflows and the reproducibility of research.
We study the performance of a phase-noise-impaired double reconfigurable intelligent surface (RIS)-aided multiuser (MU) multiple-input single-output (MISO) system under spatial correlation at both RISs and the base station (BS). The downlink achievable rate is derived in closed form under maximum ratio transmission (MRT) precoding. In addition, we obtain the optimal phase-shift design at both RISs in closed form for the considered channel and phase-noise models. Numerical results validate the analytical expressions and highlight the effects of different system parameters on the achievable rate. Our analysis shows that phase noise can severely degrade performance when users do not have direct links to both RISs and can only be served via the double-reflection link. We also show that high spatial correlation at the RISs is essential for high achievable rates.
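As a rough illustration of why phase noise hurts the double-reflection link, consider the following Monte-Carlo sketch of a simplified version of the system model. It assumes rank-one (LoS) BS-RIS1 and inter-RIS channels so that the co-phasing design separates per element; all parameter values are arbitrary, and nothing here reproduces the paper's closed-form expressions or correlated-channel model.

```python
# Monte-Carlo sketch (illustration only): BS -> RIS1 -> RIS2 -> user
# double-reflection link. With rank-one H1 = c d^H and G = b a^H, the
# effective channel factors as (h2^H T2 b)(a^H T1 c) d^H, so each RIS
# can be co-phased element-wise. Phase noise breaks coherent combining.
import numpy as np

rng = np.random.default_rng(1)
M, N1, N2 = 8, 64, 64                  # BS antennas, RIS1/RIS2 elements
P, sigma2 = 1.0, 1e-2                  # transmit power, noise power

crandn = lambda *s: (rng.normal(size=s) + 1j*rng.normal(size=s)) / np.sqrt(2)
d, c = crandn(M), crandn(N1)           # rank-one H1 = c d^H (BS -> RIS1)
a, b = crandn(N1), crandn(N2)          # rank-one G  = b a^H (RIS1 -> RIS2)
h2 = crandn(N2)                        # RIS2 -> user channel

theta1 = -np.angle(np.conj(a) * c)     # co-phase a^H T1 c
theta2 = -np.angle(np.conj(h2) * b)    # co-phase h2^H T2 b

for std in (0.0, 0.3, 0.6, 1.0):       # phase-noise standard deviation (rad)
    rates = []
    for _ in range(500):
        t1 = np.exp(1j * (theta1 + std * rng.normal(size=N1)))
        t2 = np.exp(1j * (theta2 + std * rng.normal(size=N2)))
        gain = (np.conj(h2) @ (t2 * b)) * (np.conj(a) @ (t1 * c))
        h_eff = gain * np.conj(d)      # effective 1 x M channel
        # MRT precoder w = h_eff^H / ||h_eff|| yields SNR = P||h_eff||^2/sigma2
        snr = P * np.linalg.norm(h_eff)**2 / sigma2
        rates.append(np.log2(1 + snr))
    print(f"phase-noise std {std:.1f} rad: mean rate {np.mean(rates):5.2f} bit/s/Hz")
```

Running the sketch shows the mean rate shrinking monotonically as the phase-noise standard deviation grows, consistent with the degradation the abstract describes for users served only via the double-reflection link.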
Despite its technological benefits, the Internet of Things (IoT) has cyber weaknesses due to vulnerabilities in the wireless medium. Machine learning (ML)-based methods are widely used against cyber threats in IoT networks, with promising performance. Advanced persistent threats (APTs) are a prominent means for cybercriminals to compromise networks, and they are notable for their long-term and harmful characteristics. However, it is difficult for ML-based approaches to identify APT attacks with promising detection performance, because such attacks constitute an extremely small percentage of normal traffic. Few surveys fully investigate APT attacks in IoT networks, partly due to the lack of public datasets covering all types of APT attacks. It is therefore worthwhile to bridge the state of the art in network attack detection with APT attack detection in a comprehensive review article. This survey article reviews the security challenges in IoT networks and presents well-known attacks, APT attacks, and threat models in IoT systems. In addition, signature-based, anomaly-based, and hybrid intrusion detection systems for IoT networks are summarized. The article highlights statistical insights regarding frequently applied ML-based methods against network intrusion, alongside the number of attack types detected. Finally, open issues and challenges for common network intrusion and APT attacks are presented for future research.
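As a concrete, generic instance of the anomaly-based ML methods such surveys cover (not any specific system from the literature; the flow features and values below are invented for illustration), an isolation forest trained on benign flows can flag rare, low-and-slow traffic of the kind associated with APTs.

```python
# Illustrative anomaly-based IDS sketch with hypothetical flow features.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# Columns: [bytes/s, packets/s, mean inter-arrival time (s)]
benign = rng.normal(loc=[500.0, 40.0, 0.05],
                    scale=[80.0, 6.0, 0.01], size=(2000, 3))
# APT-style traffic: tiny, infrequent beacons spread over long periods.
apt = rng.normal(loc=[20.0, 0.5, 30.0],
                 scale=[5.0, 0.1, 5.0], size=(10, 3))

# Fit on benign traffic only; predict() returns -1 for anomalies.
detector = IsolationForest(contamination=0.01, random_state=0).fit(benign)
flagged = int((detector.predict(apt) == -1).sum())
print(f"APT flows flagged: {flagged}/{len(apt)}")
```

The class-imbalance problem the abstract highlights shows up directly here: with so few malicious flows, supervised training is impractical, which is why anomaly-based detectors fit only to benign traffic are a common choice.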
Blended learning (BL) is a recent trend among the many options that can best fit learners' needs, regardless of time and place. This study aimed to discover students' perceptions of BL and the challenges they face while using technology. This quantitative study used data gathered from 300 students enrolled in four public universities in the Sindh province of Pakistan. The findings show that students were comfortable with the use of technology and that it had a positive effect on their academic experience. The study also showed that the use of technology encourages peer collaboration. The challenges found include: neither teacher support nor a training program was provided to students for courses that needed to shift from a traditional face-to-face paradigm to a blended format; a lack of laboratory assistants with the skills required for courses in a blended format; and a shortage of high-tech computer laboratories/computer units to run these courses. Therefore, it is recommended that the authorities develop and incorporate a comprehensive mechanism for the effective implementation of BL in the teaching-learning process; heads of departments should also provide additional computing infrastructure to their departments.
Fast-developing artificial intelligence (AI) technology has enabled various applied systems deployed in the real world, impacting people's everyday lives. However, many current AI systems have been found vulnerable to imperceptible attacks, biased against underrepresented groups, lacking in user privacy protection, etc., which not only degrades user experience but also erodes society's trust in all AI systems. In this review, we strive to provide AI practitioners with a comprehensive guide to building trustworthy AI systems. We first introduce the theoretical framework of important aspects of AI trustworthiness, including robustness, generalization, explainability, transparency, reproducibility, fairness, privacy preservation, alignment with human values, and accountability. We then survey leading industry approaches in these aspects. To unify the currently fragmented approaches toward trustworthy AI, we propose a systematic approach that considers the entire lifecycle of AI systems, ranging from data acquisition to model development and deployment, and finally to continuous monitoring and governance. Within this framework, we offer concrete action items for practitioners and societal stakeholders (e.g., researchers and regulators) to improve AI trustworthiness. Finally, we identify key opportunities and challenges in the future development of trustworthy AI systems, including the need for a paradigm shift toward comprehensively trustworthy AI systems.
As data are increasingly stored in different silos and societies become more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models faces efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continues to thrive in this new reality. Existing FL protocol designs have been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL and a unique taxonomy covering: 1) threat models; 2) poisoning attacks on robustness and their defenses; 3) inference attacks on privacy and their defenses, we provide an accessible review of this important topic. We highlight the intuitions, key techniques, and fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions toward robust and privacy-preserving federated learning.
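To ground the taxonomy, here is a deliberately tiny sketch of a model-poisoning attack against federated averaging together with one standard defense, coordinate-wise median aggregation. It is a generic illustration under assumptions (toy update vectors, no actual training), not an example taken from the survey.

```python
# Toy model-poisoning demo: one malicious client submits a scaled update
# that dominates plain FedAvg, while median aggregation blunts the attack.
import numpy as np

rng = np.random.default_rng(0)
honest = [rng.normal(0.0, 0.1, size=5) for _ in range(9)]  # benign updates
poison = [np.full(5, 50.0)]                                # scaled malicious update
updates = np.stack(honest + poison)

fedavg = updates.mean(axis=0)        # vulnerable: the mean is unbounded
robust = np.median(updates, axis=0)  # defense: coordinate-wise median

print("FedAvg aggregate:", np.round(fedavg, 2))  # dragged toward the attacker
print("Median aggregate:", np.round(robust, 2))  # stays near benign updates
```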