Test flakiness is a major testing concern. Flaky tests exhibit non-deterministic outcomes that cripple continuous integration and lead developers to investigate false alerts. Industrial reports indicate that, at scale, the accrual of flaky tests erodes trust in test suites and incurs significant computational cost. To alleviate this problem, practitioners must identify flaky tests and investigate their impact. To shed light on such mitigation mechanisms, we interview 14 practitioners with the aim of identifying (i) the sources of flakiness within the testing ecosystem, (ii) the impacts of flakiness, (iii) the measures adopted by practitioners when addressing flakiness, and (iv) the automation opportunities for these measures. Our analysis shows that, besides the tests and the code, flakiness stems from interactions between the system components, the testing infrastructure, and external factors. We also highlight the impact of flakiness on testing practices and product quality, and show that the adoption of guidelines together with a stable infrastructure are key measures for mitigating the problem.
Compact binary systems emit gravitational radiation that is potentially detectable by current Earth-bound detectors. Extracting these signals from the instruments' background noise is a complex problem, and the computational cost of most current searches depends on the complexity of the source model. Deep learning may be capable of finding signals where current algorithms hit computational limits. Here we restrict our analysis to signals from non-spinning binary black holes and systematically test different strategies by which training data are presented to the networks. To assess the impact of the training strategies, we re-analyze the first published networks and directly compare them to an equivalent matched-filter search. We find that the deep learning algorithms can generalize from low signal-to-noise ratio (SNR) signals to high-SNR ones, but not vice versa. As such, it is not beneficial to provide high-SNR signals during training, and the fastest convergence is achieved when low-SNR samples are provided early on. During testing we found that the networks are sometimes unable to recover any signals when a false alarm probability $<10^{-3}$ is required. We resolve this restriction by applying a modification we call unbounded Softmax replacement (USR) after training. With this alteration we find that the machine learning search retains $\geq 91.5\%$ of the sensitivity of the matched-filter search down to a false-alarm rate of 1 per month.
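To make the USR idea concrete, here is a minimal sketch in Python/PyTorch, assuming a two-logit binary classifier; the layer sizes and names are illustrative and not taken from the published networks. During training the bounded Softmax output is used, while after training the Softmax is dropped and the raw logit difference serves as an unbounded ranking statistic that can be thresholded at very low false-alarm rates.

# Illustrative sketch of the unbounded Softmax replacement (USR) idea.
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    def __init__(self, n_features=2048):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                                  nn.Linear(128, 2))   # two logits: signal, noise

    def forward(self, x):
        return self.body(x)                            # raw, unbounded logits

def training_score(logits):
    # Bounded score used during training: P(signal) saturates near 1,
    # which limits how finely high-confidence events can be ranked.
    return torch.softmax(logits, dim=-1)[..., 0]

def usr_score(logits):
    # Unbounded ranking statistic used after training: the Softmax is
    # removed and the logit difference is thresholded directly, so very
    # low false-alarm rates remain reachable.
    return logits[..., 0] - logits[..., 1]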
In this paper, we assess existing technical proposals for content moderation in End-to-End Encryption (E2EE) services. First, we explain the various tools in the content moderation toolbox, how they are used, and the different phases of the moderation cycle, including detection of unwanted content. We then lay out a definition of encryption and E2EE, which includes privacy and security guarantees for end-users, before assessing current technical proposals for the detection of unwanted content in E2EE services against those guarantees. We find that technical approaches based on user reporting and metadata analysis are the most likely to preserve privacy and security guarantees for end-users. Both provide effective tools that can detect significant amounts of different types of problematic content on E2EE services, including abusive and harassing messages, spam, mis- and disinformation, and CSAM, although more research is required to improve these tools and better measure their effectiveness. Conversely, we find that other techniques that purport to facilitate content detection in E2EE systems have the effect of undermining key security guarantees of E2EE systems.
Log-structured merge (LSM) trees offer efficient ingestion by appending incoming data, and thus are widely used as the storage layer of production NoSQL data stores. To enable competitive read performance, LSM-trees periodically re-organize data to form a tree with levels of exponentially increasing capacity, through iterative compactions. Compactions fundamentally influence the performance of an LSM-engine in terms of write amplification, write throughput, point and range lookup performance, space amplification, and delete performance. Hence, choosing the appropriate compaction strategy is crucial yet hard, as the LSM-compaction design space is vast, largely unexplored, and has not been formally defined in the literature. As a result, most LSM-based engines use a fixed compaction strategy, typically hand-picked by an engineer, which decides how and when to compact data. In this paper, we present the design space of LSM-compactions and evaluate state-of-the-art compaction strategies with respect to key performance metrics. Toward this goal, our first contribution is to introduce a set of four design primitives that can formally define any compaction strategy: (i) the compaction trigger, (ii) the data layout, (iii) the compaction granularity, and (iv) the data movement policy. Together, these primitives can synthesize both existing and completely new compaction strategies. Our second contribution is to experimentally analyze 10 compaction strategies. We present 12 observations and 7 high-level takeaway messages, which show how LSM systems can navigate the compaction design space.
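To make the primitives concrete, a compaction strategy can be viewed as a small record over the four dimensions; the sketch below is a hypothetical encoding of that idea (the field names and example values are ours, not the paper's formal notation).

# Hypothetical encoding of a compaction strategy via the four primitives.
from dataclasses import dataclass

@dataclass
class CompactionStrategy:
    trigger: str          # when to compact, e.g. "level_saturation", "space_amplification"
    data_layout: str      # how data is organized, e.g. "leveling", "tiering", "hybrid"
    granularity: str      # unit of a compaction job, e.g. "whole_level", "single_file"
    movement_policy: str  # which data to move, e.g. "round_robin", "coldest_first", "least_overlap"

# Two classical strategies expressed with the same primitives:
leveling = CompactionStrategy("level_saturation", "leveling", "whole_level", "round_robin")
tiering  = CompactionStrategy("level_saturation", "tiering",  "whole_level", "round_robin")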
Most governments employ a set of quasi-standard measures to fight COVID-19, including wearing masks, social distancing, virus testing, contact tracing, and vaccination. However, combining these measures into an efficient, holistic pandemic response instrument is even more involved than anticipated. We argue that some non-trivial factors behind the varying effectiveness of these measures are selfish decision making and the differing national implementations of the response mechanism. In this paper, through simple games, we show the effect of individual incentives on decisions about mask wearing, social distancing, and vaccination, and how these may result in sub-optimal outcomes. We also demonstrate the responsibility of national authorities in designing these games properly with respect to data transparency, the chosen policies, and their influence on the preferred outcome. We promote a mechanism design approach: it is in the best interest of every government to carefully balance social good and response costs when implementing its pandemic response mechanism; moreover, there is no one-size-fits-all approach to designing an effective solution.
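As an illustration of the kind of incentive problem such games capture, consider the toy two-player vaccination game below; the payoff numbers are invented for exposition and are not taken from the paper, but they show how mutual free-riding can be the unique equilibrium even though mutual vaccination yields the higher total payoff.

# Toy 2-player vaccination game; payoffs are purely illustrative.
# Each entry is (payoff_row, payoff_col) for actions V (vaccinate) / N (not vaccinate).
payoffs = {
    ("V", "V"): (3, 3),   # both protected, both pay the small cost of vaccination
    ("V", "N"): (1, 4),   # the lone vaccinator pays the cost and still faces exposure
    ("N", "V"): (4, 1),   # the non-vaccinator free-rides on the other's protection
    ("N", "N"): (2, 2),   # neither pays the cost, but both face higher infection risk
}

def best_response(opponent_action):
    # Row player's best reply; by symmetry this also gives the column player's.
    return max(("V", "N"), key=lambda a: payoffs[(a, opponent_action)][0])

# A profile is a Nash equilibrium if each action is a best response to the other.
for mine, theirs in payoffs:
    if best_response(theirs) == mine and best_response(mine) == theirs:
        print("Nash equilibrium:", (mine, theirs), "payoffs:", payoffs[(mine, theirs)])

Running this prints only ("N", "N"): both players shirk, with a total payoff of 4 instead of the 6 achievable under mutual vaccination.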
The increasing dependency of modern society on IT systems and infrastructures for essential services (e.g., internet banking, vehicular networks, and health IT), coupled with the growing number of cyber incidents and security vulnerabilities, has made Cyber Security Operations Centres (CSOCs) undoubtedly vital. As such, security operations monitoring is now an integral part of most business operations. SOCs (used interchangeably with CSOCs) are responsible for continuously and proactively monitoring business services, IT systems, and infrastructures to identify vulnerabilities, detect cyber-attacks, security breaches, and policy violations, and to respond to cyber incidents swiftly. They must also ensure that security events and alerts are triaged and analysed, while coordinating and managing cyber incidents to resolution. Because SOCs are vital, it is also necessary that they are effective. Unfortunately, the effectiveness of SOCs is a widespread concern and the focus of much debate. In this paper, we identify and discuss some of the pertinent challenges to building an effective SOC. We investigate some of the factors contributing to inefficiencies in SOCs and explain some of the challenges they face. Further, we provide and prioritise recommendations for addressing the identified issues.
The COVID-19 pandemic continues to affect the conduct of clinical trials globally. Complications may arise from pandemic-related operational challenges, such as site closures, travel limitations, and interruptions to the supply chain for the investigational product, or from health-related challenges such as COVID-19 infections. Some of these complications lead to unforeseen intercurrent events in the sense that they affect either the interpretation or the existence of the measurements associated with the clinical question of interest. In this article, we demonstrate how the ICH E9(R1) Addendum on estimands and sensitivity analyses provides a rigorous basis for discussing potential pandemic-related trial disruptions and for embedding these disruptions in the context of study objectives and design elements. We introduce several hypothetical estimand strategies and review various causal inference and missing data methods, as well as a statistical method that combines unbiased and possibly biased estimators. To illustrate, we describe the features of a stylized trial and how it may have been impacted by the pandemic. We then revisit this stylized trial, discussing the changes to the estimand and the estimator needed to account for pandemic disruptions. Finally, we outline considerations for designing future trials in the context of unforeseen disruptions.
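For intuition only, one generic way to combine an unbiased estimator with a possibly biased but more precise one is to weight them so as to minimize an estimated mean squared error; the sketch below illustrates that idea and is not the specific method discussed in the article.

# Illustrative combination of an unbiased and a possibly biased estimator.
def combine_estimates(theta_u, var_u, theta_b, var_b):
    # theta_u, var_u: unbiased but noisier estimate and its variance
    # theta_b, var_b: possibly biased, more precise estimate and its variance
    # Crude bias^2 estimate from the discrepancy between the two estimates.
    bias_sq = max((theta_b - theta_u) ** 2 - var_u - var_b, 0.0)
    mse_b = var_b + bias_sq
    w = mse_b / (mse_b + var_u)          # MSE-minimizing weight on the unbiased estimator
    return w * theta_u + (1 - w) * theta_b, w

# Example: a precise pre-disruption estimate that may be biased, combined with a
# noisier estimate restricted to unaffected data (numbers are invented).
print(combine_estimates(theta_u=1.8, var_u=0.25, theta_b=1.2, var_b=0.04))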
Fast-developing artificial intelligence (AI) technology has enabled various applied systems deployed in the real world, impacting people's everyday lives. However, many current AI systems have been found vulnerable to imperceptible attacks, biased against underrepresented groups, lacking in user privacy protection, and so on, which not only degrades user experience but also erodes society's trust in all AI systems. In this review, we strive to provide AI practitioners with a comprehensive guide to building trustworthy AI systems. We first introduce the theoretical framework of important aspects of AI trustworthiness, including robustness, generalization, explainability, transparency, reproducibility, fairness, privacy preservation, alignment with human values, and accountability. We then survey leading approaches to these aspects in industry. To unify the current fragmented approaches towards trustworthy AI, we propose a systematic approach that considers the entire lifecycle of AI systems, ranging from data acquisition to model development, to system deployment, and finally to continuous monitoring and governance. In this framework, we offer concrete action items for practitioners and societal stakeholders (e.g., researchers and regulators) to improve AI trustworthiness. Finally, we identify key opportunities and challenges in the future development of trustworthy AI systems, and argue for a paradigm shift towards comprehensive trustworthy AI systems.
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, in the last few years a large research effort has been devoted to image captioning, i.e., the task of describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoding step and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, and relationships, and the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results obtained, research in image captioning has not yet reached a conclusive answer. This work aims at providing a comprehensive overview and categorization of image captioning approaches, from visual encoding and text generation to training strategies, used datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in image captioning architectures and training strategies. Moreover, many variants of the problem and its open challenges are analyzed and discussed. The final goal of this work is to serve as a tool for understanding the existing state of the art and highlighting future directions for an area of research where Computer Vision and Natural Language Processing can find an optimal synergy.
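In heavily reduced form, the visual-encoding-plus-language-model pipeline can be sketched as follows; this is a generic encoder-decoder captioner with invented dimensions, not any particular surveyed model.

# Minimal visual-encoder + language-model captioning pipeline (illustrative only).
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=2048, hidden=512):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, hidden)            # visual encoding step
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)   # language model
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, image_features, tokens):
        # Condition the decoder on the image by using the projected features
        # as the initial hidden state, then predict each next caption token.
        h0 = torch.tanh(self.visual_proj(image_features)).unsqueeze(0)
        out, _ = self.decoder(self.embed(tokens), h0)
        return self.head(out)                                     # next-token logits

model = TinyCaptioner()
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
print(logits.shape)   # torch.Size([4, 12, 10000])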
When we humans look at a video of human-object interaction, we can not only infer what is happening but also extract actionable information and imitate those interactions. Current recognition and geometric approaches, on the other hand, lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects and enforce that the estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the observed motions, (b) by jointly optimizing for contact point and force prediction, we can improve performance on both tasks compared to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few-shot examples.
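At a high level, the simulator-as-supervision idea can be sketched as follows; this is illustrative Python, where simulate stands in for a physics simulator and all names and shapes are assumptions rather than the authors' implementation.

# Supervise force prediction through a physics simulator (illustrative sketch).
import numpy as np

def effect_loss(predicted_forces, predicted_contacts, observed_trajectory, simulate):
    # predicted_forces:    (T, 3) force vectors estimated from the video
    # predicted_contacts:  (T, 3) contact-point locations on the object
    # observed_trajectory: (T, 3) object keypoint positions extracted from the video
    # simulate:            callable rolling the forces forward to object positions
    simulated_trajectory = simulate(predicted_forces, predicted_contacts)
    # The estimated forces are only accepted if they reproduce the observed effect.
    return np.mean(np.sum((simulated_trajectory - observed_trajectory) ** 2, axis=-1))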
The task of {\em data fusion} is to identify the true values of data items (e.g., the true date of birth for {\em Tom Cruise}) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey~\cite{LDL+12} provides a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: {\em knowledge fusion}. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from the noise traditionally considered in the data fusion literature, which focuses only on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show the great promise of data fusion approaches in solving the knowledge fusion problem, and suggest interesting research directions through a detailed error analysis of the methods.
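For flavour, a basic accuracy-weighted vote over (source, extractor) provenance pairs, in the spirit of classical data fusion, might look like the following sketch; this is a generic baseline, not the specific adaptation evaluated in the paper, and all names and numbers are illustrative.

# Generic accuracy-weighted voting baseline for fusing extracted triples.
import math
from collections import defaultdict

def fuse_triples(observations, accuracy):
    # observations: list of (subject, predicate, object, source, extractor)
    # accuracy: estimated accuracy of each (source, extractor) pair in (0, 1)
    votes = defaultdict(lambda: defaultdict(float))
    for s, p, o, src, ext in observations:
        acc = accuracy.get((src, ext), 0.5)
        # Log-odds weighting, as in classical accuracy-based data fusion.
        votes[(s, p)][o] += math.log(acc / (1.0 - acc))
    # For each (subject, predicate), keep the object with the highest total vote.
    return {sp: max(objs, key=objs.get) for sp, objs in votes.items()}

obs = [
    ("Tom_Cruise", "date_of_birth", "1962-07-03", "site_a", "extractor_1"),
    ("Tom_Cruise", "date_of_birth", "1962-07-03", "site_b", "extractor_2"),
    ("Tom_Cruise", "date_of_birth", "1963-07-03", "site_c", "extractor_1"),
]
acc = {("site_a", "extractor_1"): 0.8, ("site_b", "extractor_2"): 0.7,
       ("site_c", "extractor_1"): 0.6}
print(fuse_triples(obs, acc))   # {("Tom_Cruise", "date_of_birth"): "1962-07-03"}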