Behaviour-Driven Development (BDD) has emerged in recent years as a powerful methodology for specifying testable and executable user requirements through stories and scenarios. With the support of external testing frameworks, BDD stories can be used to automatically assess the behavior of a fully functional software system. This article describes a toolset that extends BDD to also provide automated assessment of user interface design artifacts, ensuring their consistency with the user requirements from the beginning of a software project. The approach has been evaluated using previously specified user requirements for a web system for booking business trips. These requirements gave rise to a set of BDD stories that were refined and used to automatically assess the consistency of task models, graphical user interface (GUI) prototypes, and final GUIs of the system. The results show that our approach was able to identify different types of inconsistencies in the set of analyzed artifacts and to consistently maintain the semantic traces between them.
The advent of Cloud Computing enabled the proliferation of IoT applications for smart environments. However, the distance of Cloud resources from end devices makes them unsuitable for delay-sensitive applications. Hence, Fog Computing has emerged to provide such capabilities in proximity to end devices through distributed resources. These limited resources can collaborate to serve distributed IoT application workflows using the concept of stateless micro Fog service replicas, which provides resiliency and maintains service availability in the face of failures. Load balancing supports this collaboration by optimally assigning workloads to appropriate services, i.e., distributing the load among Fog nodes to fairly utilize compute and network resources and minimize execution delays. In this paper, we propose using ELECTRE, a Multi-Criteria Decision Analysis (MCDA) approach, to efficiently balance the load in Fog environments. We consider multiple objectives when making service selection decisions, including compute and network load information. We evaluate our approach in a realistic unbalanced topological setup with heterogeneous workload requirements. To the best of our knowledge, this is the first time ELECTRE-based methods have been used to balance the load in Fog environments. Through simulations, we compare the performance of our proposed approach with traditional baseline methods that are commonly used in practice, namely random, Round-Robin, nearest node, and fastest service selection algorithms. In terms of overall system performance, our approach outperforms these methods with up to 67% improvement.
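To make the idea of outranking-based service selection concrete, the following is a minimal sketch of an ELECTRE I-style pairwise test applied to replica selection. It is not the method evaluated in the paper: the criteria names, the weights, the thresholds `c_hat` and `d_hat`, and the net-outranking tie-break are all assumptions chosen for illustration.

```python
from typing import Dict

# Hypothetical criteria (lower is better for all) with weights summing to 1.
WEIGHTS: Dict[str, float] = {"cpu_load": 0.4, "net_load": 0.3, "delay_ms": 0.3}

Replica = Dict[str, float]


def outranks(a: Replica, b: Replica, ranges: Dict[str, float],
             c_hat: float = 0.6, d_hat: float = 0.4) -> bool:
    """ELECTRE I-style test: does replica a outrank replica b?"""
    # Concordance: total weight of criteria on which a is at least as good as b.
    concordance = sum(w for c, w in WEIGHTS.items() if a[c] <= b[c])
    # Discordance: largest range-normalised amount by which b beats a.
    discordance = 0.0
    for c in WEIGHTS:
        if a[c] > b[c] and ranges[c] > 0:
            discordance = max(discordance, (a[c] - b[c]) / ranges[c])
    return concordance >= c_hat and discordance <= d_hat


def select_replica(replicas: Dict[str, Replica]) -> str:
    """Pick the replica with the best net outranking count (an assumed rule)."""
    names = list(replicas)
    ranges = {
        c: max(r[c] for r in replicas.values()) - min(r[c] for r in replicas.values())
        for c in WEIGHTS
    }
    score = {}
    for a in names:
        others = [b for b in names if b != a]
        wins = sum(outranks(replicas[a], replicas[b], ranges) for b in others)
        losses = sum(outranks(replicas[b], replicas[a], ranges) for b in others)
        score[a] = wins - losses
    return max(names, key=lambda name: score[name])


# Toy data, invented for illustration only.
replicas = {
    "fog-a": {"cpu_load": 0.9, "net_load": 0.8, "delay_ms": 5.0},
    "fog-b": {"cpu_load": 0.3, "net_load": 0.4, "delay_ms": 12.0},
    "fog-c": {"cpu_load": 0.5, "net_load": 0.6, "delay_ms": 20.0},
}
chosen = select_replica(replicas)
```

The discordance threshold acts as a veto: a replica that is much worse on a single criterion cannot outrank another, however well it scores on the rest. This is one reason an outranking method can behave differently from a plain weighted sum.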
IQUAFLOW is a new image quality framework that provides a set of tools to assess image quality. Users can add custom metrics, which are easily integrated. Furthermore, iquaflow allows quality to be measured by using the performance of AI models trained on the images as a proxy. This also makes it easy to study how performance degrades under several modifications of the original dataset, for instance, with images reconstructed after different levels of lossy compression. Satellite images are an example use case, since they are commonly compressed before being downloaded to the ground. In this situation, the optimization problem consists in finding the smallest images that still provide sufficient quality to meet the required performance of the deep learning algorithms, and a study with iquaflow is well suited to such a case. All of this is wrapped in MLflow, an interactive tool used to visualize and summarize the results. This document describes different use cases and provides links to their respective repositories. To ease the creation of new studies, we include a cookie-cutter repository. The source code, issue tracker and aforementioned repositories are all hosted on GitHub: https://github.com/satellogic/iquaflow.
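The optimization problem mentioned above can be illustrated with a few lines of plain Python. This sketch does not use the iquaflow API; the `Variant` record, its fields, and the numbers are hypothetical and stand in for the results a study would produce for each compression level.

```python
from typing import NamedTuple, Optional, Sequence


class Variant(NamedTuple):
    """One modified dataset, e.g. images at a given compression level."""
    name: str
    mean_size_kb: float   # average image size after compression
    model_score: float    # downstream model performance, higher is better


def smallest_sufficient(variants: Sequence[Variant],
                        required_score: float) -> Optional[Variant]:
    """Return the smallest variant whose model score meets the requirement."""
    acceptable = [v for v in variants if v.model_score >= required_score]
    if not acceptable:
        return None  # no compression level satisfies the requirement
    return min(acceptable, key=lambda v: v.mean_size_kb)


# Invented numbers, for illustration only.
study = [
    Variant("quality-90", 410.0, 0.91),
    Variant("quality-70", 260.0, 0.89),
    Variant("quality-50", 180.0, 0.84),
    Variant("quality-30", 120.0, 0.71),
]
best = smallest_sufficient(study, required_score=0.85)
```

Here the model score plays the role of the quality proxy: the search never inspects the images themselves, only how well a model trained on them performs.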
Code pre-trained models (CodePTMs) have recently demonstrated significant success in code intelligence. To interpret these models, some probing methods have been applied. However, these methods fail to consider the inherent characteristics of code. In this paper, to address this problem, we propose a novel probing method, CAT-probing, to quantitatively interpret how CodePTMs attend to code structure. We first denoise the input code sequences based on the token types pre-defined by the compilers, filtering out those tokens whose attention scores are too small. After that, we define a new metric, CAT-score, to measure the commonality between the token-level attention scores generated in CodePTMs and the pair-wise distances between corresponding AST nodes. The higher the CAT-score, the stronger the ability of CodePTMs to capture code structure. We conduct extensive experiments to integrate CAT-probing with representative CodePTMs for different programming languages. Experimental results show the effectiveness of CAT-probing in CodePTM interpretation. Our code and data are publicly available at https://github.com/nchen909/CodeAttention.
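As an illustration of what "commonality" between attention scores and AST distances could mean, the sketch below computes a Jaccard-style overlap between token pairs that receive high attention and token pairs that are close in the AST. This is an assumed formulation for intuition only, not necessarily the CAT-score defined in the paper; the function name, both thresholds, and the toy matrices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def overlap_score(attention: Matrix, ast_distance: Matrix,
                  attn_threshold: float, dist_threshold: float) -> float:
    """Agreement between 'high attention' and 'close in the AST' token pairs."""
    both = 0    # pairs that are attended to AND structurally close
    either = 0  # pairs that are attended to OR structurally close
    n = len(attention)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # ignore self-attention
            attends = attention[i][j] > attn_threshold
            close = ast_distance[i][j] < dist_threshold
            both += int(attends and close)
            either += int(attends or close)
    return both / either if either else 0.0


# Toy 3-token example with invented values.
attention = [
    [0.0, 0.6, 0.1],
    [0.5, 0.0, 0.05],
    [0.1, 0.4, 0.0],
]
ast_distance = [
    [0, 1, 4],
    [1, 0, 2],
    [4, 2, 0],
]
score = overlap_score(attention, ast_distance, attn_threshold=0.3, dist_threshold=3)
```

A score near 1 would indicate that the pairs the model attends to are largely the pairs that are neighbours in the syntax tree; a score near 0 would indicate little relation between the two.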
Context: Developing software-intensive products or services usually involves a plethora of software artefacts. Assets are artefacts intended to be used more than once and have value for organisations; examples include test cases, code, requirements, and documentation. During the development process, assets might degrade, affecting the effectiveness and efficiency of the development process. Therefore, assets are an investment that requires continuous management. Identifying assets is the first step for their effective management. However, there is a lack of awareness of what assets and types of assets are common in software-developing organisations. Most types of assets are understudied, and their state of quality and how they degrade over time are not well understood. Method: We perform a systematic literature review and a field study at five companies to identify and study assets and fill this research gap. The results were analysed qualitatively and summarised in a taxonomy. Results: We create the first comprehensive, structured, yet extendable taxonomy of assets, containing 57 types of assets. Conclusions: The taxonomy serves as a foundation for identifying assets that are relevant for an organisation and enables the study of asset management and asset degradation concepts.
We explore the features of a user interface where formal proofs can be built through gestural actions. In particular, we show how proof construction steps can be associated with drag-and-drop actions. We argue that this can provide quick and intuitive proof construction steps. This work builds on theoretical tools coming from deep inference. It also takes up and integrates some ideas of the former proof-by-pointing project.
We study the effectiveness of information design in reducing congestion in social services catering to users with varied levels of need. In the absence of price discrimination and centralized admission, the provider relies on sharing information about wait times to improve welfare. We consider a stylized model with heterogeneous users who differ in their private outside options: low-need users have an acceptable outside option to the social service, whereas high-need users have no viable outside option. Upon arrival, a user decides either to wait for the service by joining an unobservable first-come-first-served queue, or to leave and seek her outside option. To reduce congestion and improve social outcomes, the service provider seeks to persuade more low-need users to avail themselves of their outside option, and thus better serve high-need users. We characterize the Pareto-optimal signaling mechanisms and compare their welfare outcomes against several benchmarks. We show that if either type is the overwhelming majority of the population, information design provides no improvement over sharing full information or no information. On the other hand, when the population is a mixture of the two types, information design not only Pareto dominates full-information and no-information mechanisms but, in some regimes, also achieves the same welfare as the "first-best", i.e., the Pareto-optimal centralized admission policy with knowledge of users' types.
In machine translation (MT) evaluation, pairwise comparisons are conducted to identify the better system. In conducting the comparison, the experimenter must allocate a budget to collect Direct Assessment (DA) judgments. We provide a cost-effective way to spend the budget, but show that typical budget sizes often do not allow for a solid comparison. Taking the perspective that the basis of a solid comparison is achieving statistical significance, we study the power (rate of achieving significance) on a large collection of pairwise DA comparisons. Due to the nature of statistical estimation, power is low for detecting differences of less than 1-2 DA points, and achieving a notable increase in power requires at least 2-3x more samples. Applying variance reduction alone will not yield these gains, so we must face the reality of undetectable differences and spending increases. In this context, we propose interim testing, an "early stopping" collection procedure that yields more power per judgment collected and adaptively focuses the budget on pairs that are borderline significant. Interim testing can achieve up to a 27% efficiency gain when spending 3x the current budget, or 18% savings at the current evaluation power.
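The relationship between sample size and power can be sketched with a textbook normal approximation. The function below computes the approximate power of a two-sided z-test for a difference in mean DA score between two systems. It assumes independent samples of equal size and a known, shared standard deviation, which is a simplification and not the testing procedure used in the paper; the values of `delta`, `sigma`, and `n` are illustrative only.

```python
from statistics import NormalDist


def power_two_sample(delta: float, sigma: float, n: int,
                     alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two system means.

    delta: true difference in mean DA score between the systems
    sigma: per-judgment standard deviation (assumed equal for both systems)
    n:     number of judgments collected per system
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference of two independent sample means.
    std_err = sigma * (2 / n) ** 0.5
    shift = delta / std_err
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Illustrative values only: a 1.5-point true difference, sigma of 25 points.
power_base = power_two_sample(delta=1.5, sigma=25.0, n=500)
power_tripled = power_two_sample(delta=1.5, sigma=25.0, n=1500)
```

Because the standard error shrinks only with the square root of `n`, tripling the number of judgments reduces it by a factor of about 1.7, which is why small differences remain hard to detect even after a substantial budget increase.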
Explainable AI (XAI) is widely viewed as a sine qua non for ever-expanding AI research. A better understanding of the needs of XAI users, as well as human-centered evaluations of explainable models, are both a necessity and a challenge. In this paper, we explore how HCI and AI researchers conduct user studies in XAI applications based on a systematic literature review. After identifying and thoroughly analyzing 85 core papers with human-based XAI evaluations over the past five years, we categorize them along the measured characteristics of explanatory methods, namely trust, understanding, fairness, usability, and human-AI team performance. Our research shows that XAI is spreading more rapidly in certain application domains, such as recommender systems, than in others, but that user evaluations are still rather sparse and incorporate hardly any insights from cognitive or social sciences. Based on a comprehensive discussion of best practices, i.e., common models, design choices, and measures in user studies, we propose practical guidelines on designing and conducting user studies for XAI researchers and practitioners. Lastly, this survey also highlights several open research directions, particularly linking psychological science and human-centered XAI.
We describe ACE0, a lightweight platform for evaluating the suitability and viability of AI methods for behaviour discovery in multi-agent simulations. Specifically, ACE0 was designed to explore AI methods for multi-agent simulations used in operations research studies related to new technologies such as autonomous aircraft. Simulation environments used in production are often high-fidelity and complex, require significant domain knowledge, and as a result have high R&D costs. Minimal and lightweight simulation environments can help researchers and engineers evaluate the viability of new AI technologies for behaviour discovery in a more agile and potentially cost-effective manner. In this paper we describe the motivation for the development of ACE0. We provide a technical overview of the system architecture, describe a case study of behaviour discovery in the aerospace domain, and provide a qualitative evaluation of the system. The evaluation includes a brief description of collaborative research projects with academic partners, exploring different AI behaviour discovery methods.
To address the sparsity and cold-start problems of collaborative filtering, researchers usually make use of side information, such as social networks or item attributes, to improve recommendation performance. This paper considers the knowledge graph as the source of side information. To address the limitations of existing embedding-based and path-based methods for knowledge-graph-aware recommendation, we propose Ripple Network, an end-to-end framework that naturally incorporates the knowledge graph into recommender systems. Similar to actual ripples propagating on the surface of water, Ripple Network stimulates the propagation of user preferences over the set of knowledge entities by automatically and iteratively extending a user's potential interests along links in the knowledge graph. The multiple "ripples" activated by a user's historically clicked items are thus superposed to form the preference distribution of the user with respect to a candidate item, which can be used for predicting the final clicking probability. Through extensive experiments on real-world datasets, we demonstrate that Ripple Network achieves substantial gains in a variety of scenarios, including movie, book and news recommendation, over several state-of-the-art baselines.
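The iterative extension of a user's interests along knowledge-graph links can be sketched as a hop-by-hop expansion starting from the clicked items. The code below covers only this set-construction step, under the assumption that the graph is given as (head, relation, tail) triples; it omits the learned weighting and the superposition into a preference distribution. The function name `ripple_sets` and the toy graph are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def ripple_sets(triples: List[Triple], clicked: Set[str],
                hops: int) -> List[List[Triple]]:
    """For each hop, collect the triples whose head was reached in the previous hop."""
    # Index triples by head entity for fast expansion.
    by_head: Dict[str, List[Triple]] = {}
    for head, relation, tail in triples:
        by_head.setdefault(head, []).append((head, relation, tail))

    frontier = set(clicked)  # hop 0: the user's historically clicked items
    result: List[List[Triple]] = []
    for _hop in range(hops):
        # Sort the frontier so the output order is deterministic.
        hop_triples = [t for entity in sorted(frontier)
                       for t in by_head.get(entity, [])]
        result.append(hop_triples)
        # The tails reached in this hop seed the next, wider ripple.
        frontier = {tail for (_h, _r, tail) in hop_triples}
    return result


# Toy knowledge graph, invented for illustration.
graph = [
    ("movie_a", "directed_by", "director_x"),
    ("movie_a", "has_genre", "drama"),
    ("director_x", "directed", "movie_b"),
    ("movie_b", "has_genre", "comedy"),
]
sets = ripple_sets(graph, clicked={"movie_a"}, hops=2)
```

In this toy graph the first hop reaches the director and genre of the clicked movie, and the second hop reaches another movie by the same director, which is how a candidate item that was never clicked can still be connected to the user's history.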