Advancements in Artificial Intelligence, particularly with ChatGPT, have significantly impacted software development. Using novel data from the GitHub Innovation Graph, we hypothesize that ChatGPT enhances software production efficiency. Exploiting natural experiments in which some governments banned ChatGPT, we employ Difference-in-Differences (DID), Synthetic Control (SC), and Synthetic Difference-in-Differences (SDID) methods to estimate its effects. Our findings indicate a significant positive impact on the number of git pushes, repositories, and unique developers per 100,000 people, particularly for high-level, general-purpose, and shell scripting languages. These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low-quality code and privacy concerns.
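As a concrete illustration of the estimation strategy, the sketch below fits a minimal two-way fixed-effects Difference-in-Differences regression on a synthetic country-quarter panel. The column names, the synthetic data, and the ban timing are assumptions for illustration only; this does not reproduce the paper's SC or SDID analyses.

```python
# Minimal two-way fixed-effects DID sketch (illustrative; the column names and
# the synthetic panel below are assumptions, not the paper's actual data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
economies = [f"country_{i}" for i in range(20)]
quarters = pd.period_range("2022Q1", "2023Q4", freq="Q").astype(str)

rows = []
for e in economies:
    banned = int(e in {"country_0", "country_1"})  # two hypothetical "banned" economies
    for q in quarters:
        post = int(q >= "2023Q1")                  # ban assumed to start in 2023Q1
        effect = -5.0 * banned * post              # planted treatment effect
        rows.append({"economy": e, "quarter": q, "banned": banned, "post": post,
                     "pushes_per_100k": 100 + effect + rng.normal(0, 2)})
df = pd.DataFrame(rows)

# The coefficient on banned:post is the DID estimate of the ban's effect,
# with country and quarter fixed effects and errors clustered by country.
fit = smf.ols(
    "pushes_per_100k ~ banned:post + C(economy) + C(quarter)", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["economy"].astype("category").cat.codes})

print(fit.params["banned:post"])
```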
This study examines the influence of perceived AI features on user motivation in virtual interactions. In user-AI interactions, AI companions may appear as avatars, be disclosed as AI, or embody a specific gender. Leveraging insights from AI and avatar research, we explore how AI disclosure and gender affect user motivation. We conducted a game-based experiment involving over 72,500 participants who solved search problems alone or with an AI companion. Different groups experienced varying AI appearances and disclosures, and we measured play intensity. Results revealed that the presence of another avatar led to less intense play compared to solo play. Disclosing the avatar as an AI heightened effort intensity compared to non-disclosed AI companions. Additionally, a masculine AI appearance reduced effort intensity.
With the recent advancement of Artificial Intelligence (AI) and Large Language Models (LLMs), AI-based code generation tools have become a practical solution for software development. GitHub Copilot, the AI pair programmer, utilizes machine learning models trained on a large corpus of code snippets to generate code suggestions using natural language processing. Despite its popularity in software development, there is limited empirical evidence on the actual experiences of practitioners who work with Copilot. To this end, we conducted an empirical study to understand the problems that practitioners face when using Copilot, as well as their underlying causes and potential solutions. We collected data from 473 GitHub issues, 706 GitHub discussions, and 142 Stack Overflow posts. Our results reveal that (1) Operation Issue and Compatibility Issue are the most common problems faced by Copilot users, (2) Copilot Internal Error, Network Connection Error, and Editor/IDE Compatibility Issue are identified as the most frequent causes, and (3) Bug Fixed by Copilot, Modify Configuration/Setting, and Use Suitable Version are the predominant solutions. Based on these results, we discuss potential areas where Copilot can be enhanced, and provide implications for Copilot users, the Copilot team, and researchers.
Contributors to open source software packages often describe feeling discouraged by the lack of positive feedback from users. This paper describes a technology probe, Hug Reports, that provides users a communication affordance within their code editors, through which users can convey appreciation to contributors of packages they use. In our field study, 18 users interacted with the probe for 3 weeks, resulting in messages of appreciation to 550 contributors, 26 of whom participated in subsequent research. Our findings show how locating a communication affordance within the code editor, and allowing users to express appreciation in terms of the abstractions they are exposed to (packages, modules, functions), can support exchanges of appreciation that are meaningful to users and contributors. Findings also revealed the moments in which users expressed appreciation, the two meanings that appreciation took on -- as a measure of utility and as an act of expressive communication -- and how contributors' reactions to appreciation were influenced by their perceived level of contribution. Based on these findings, we discuss opportunities and challenges for designing appreciation systems for open source in particular, and peer production communities more generally.
Phishing attacks attempt to deceive users into revealing sensitive information, posing a significant cybersecurity threat. Advances in machine learning (ML) and deep learning (DL) have led to the development of numerous phishing webpage detection solutions, but these models remain vulnerable to adversarial attacks. Evaluating their robustness against adversarial phishing webpages is essential. Existing tools contain datasets of pre-designed phishing webpages for a limited number of brands, and lack diversity in phishing features. To address these challenges, we develop PhishOracle, a tool that generates adversarial phishing webpages by embedding diverse phishing features into legitimate webpages. We evaluate the robustness of two existing models, Stack model and Phishpedia, in classifying PhishOracle-generated adversarial phishing webpages. Additionally, we study a commercial large language model, Gemini Pro Vision, in the context of adversarial attacks. We conduct a user study to determine whether PhishOracle-generated adversarial phishing webpages deceive users. Our findings reveal that many PhishOracle-generated phishing webpages evade current phishing webpage detection models and deceive users, but Gemini Pro Vision is robust to the attack. We also develop the PhishOracle web app, allowing users to input a legitimate URL, select relevant phishing features and generate a corresponding phishing webpage. All resources are publicly available on GitHub.
Federated Learning (FL) in Internet of Things (IoT) environments can enhance machine learning by utilising decentralised data, but at the same time, it might introduce significant privacy and security concerns due to the constrained nature of IoT devices. This represents a research challenge that we aim to address in this paper. We systematically analysed recent literature to identify privacy threats in FL within IoT environments, and evaluate the defensive measures that can be employed to mitigate these threats. Using a Systematic Literature Review (SLR) approach, we searched five publication databases (Scopus, IEEE Xplore, Wiley, ACM, and Science Direct), collating relevant papers published between 2017 and April 2024, a period spanning from the introduction of FL to the present. Guided by the PRISMA protocol, we selected 49 papers as the focus of our systematic review. We analysed these papers, paying special attention to the privacy threats and defensive measures -- specifically within the context of IoT -- using inclusion and exclusion criteria tailored to highlight recent advances and critical insights. We identified various privacy threats, including inference attacks, poisoning attacks, and eavesdropping, along with defensive measures such as Differential Privacy and Secure Multi-Party Computation. These defences were evaluated for their effectiveness in protecting privacy without compromising the functional integrity of FL in IoT settings. Our review underscores the necessity for robust and efficient privacy-preserving strategies tailored to IoT environments. Notably, there is a need for strategies against replay, evasion, and model stealing attacks. Exploring lightweight defensive measures and emerging technologies such as blockchain may help improve the privacy of FL in IoT, leading to the creation of FL models that can operate under variable network conditions.
Large Language Models (LLMs) have demonstrated remarkable capabilities in executing tasks based on natural language queries. However, these models, trained on curated datasets, inherently embody biases ranging from racial to national and gender biases. It remains uncertain whether these biases impact the performance of LLMs on certain tasks. In this study, we investigate the political biases of LLMs within the stance classification task, specifically examining whether these models exhibit a tendency to more accurately classify politically charged stances. Utilizing three datasets, seven LLMs, and four distinct prompting schemes, we analyze the performance of LLMs on politically oriented statements and targets. Our findings reveal a statistically significant difference in the performance of LLMs across various politically oriented stance classification tasks. Furthermore, we observe that this difference primarily manifests at the dataset level, with models and prompting schemes showing statistically similar performance across different stance classification datasets. Lastly, we observe that LLMs have poorer stance classification accuracy when the target toward which a statement is directed is more ambiguous.
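For readers unfamiliar with the task setup, the following is a minimal zero-shot stance-classification prompting sketch. The prompt template, label set, and parsing rule are illustrative assumptions and do not correspond to the paper's four prompting schemes or the seven LLMs it evaluates.

```python
# Minimal zero-shot stance-classification prompting sketch (illustrative only;
# the template and label set are assumptions, not the paper's schemes).

LABELS = ("favor", "against", "neutral")

def build_prompt(statement: str, target: str) -> str:
    """Format a single stance-classification query for an LLM."""
    return (
        "Classify the stance of the statement toward the target.\n"
        f"Target: {target}\n"
        f"Statement: {statement}\n"
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )

def parse_label(llm_reply: str) -> str:
    """Map a free-form LLM reply onto the closed label set."""
    reply = llm_reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "neutral"  # fall back when the reply is ambiguous

if __name__ == "__main__":
    prompt = build_prompt("Tax cuts mostly help the wealthy.", "tax reform")
    print(prompt)                      # send this to the LLM of your choice
    print(parse_label("  Against. "))  # -> "against"
```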
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth. While QA datasets are plentiful in areas like the general domain and biomedicine, academic chemistry is less explored. Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into a readily understandable format. Addressing this gap, we introduce ScholarChemQA, a large-scale QA dataset constructed from chemical papers. This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful. Correspondingly, we introduce QAMatch, a model specifically designed to effectively answer chemical questions by fully leveraging our collected data. We first address the issue of imbalanced label distribution by re-weighting the instance-wise loss based on the inverse frequency of each class, ensuring minority classes are not dominated by majority ones during optimization. Next, we utilize the unlabeled data to enrich the learning process, generating a variety of augmentations based on a SoftMix operation and ensuring their predictions align with the same target, i.e., pseudo-labels. To ensure the quality of the pseudo-labels, we propose a calibration procedure aimed at closely aligning the pseudo-label estimates of individual samples with a desired ground truth distribution. Experiments show that QAMatch significantly outperforms recent similar-scale baselines and Large Language Models (LLMs), not only on our ScholarChemQA dataset but also on four benchmark datasets. We hope our benchmark and model can facilitate and promote more research on chemical QA.
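The inverse-frequency re-weighting described above can be sketched in a few lines. The snippet below is a simplified, assumed implementation that uses a class-weighted cross-entropy loss in PyTorch; it omits the SoftMix augmentation and pseudo-label calibration steps, and the toy label distribution is invented for illustration.

```python
# Inverse-frequency class re-weighting sketch (a simplification of the
# re-weighting described in the abstract; SoftMix and calibration omitted).
import torch
import torch.nn as nn

def inverse_frequency_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Weight each class by the inverse of its frequency, normalized to sum to num_classes."""
    counts = torch.bincount(labels, minlength=num_classes).float().clamp(min=1)
    weights = 1.0 / counts
    return weights * num_classes / weights.sum()

# Toy imbalanced label distribution: class 0 dominates.
labels = torch.tensor([0] * 90 + [1] * 8 + [2] * 2)
weights = inverse_frequency_weights(labels, num_classes=3)

# Weighted cross-entropy: mistakes on minority classes now cost more.
criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(100, 3)
loss = criterion(logits, labels)
print(weights, loss.item())
```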
Quantum multiprover interactive proof systems with entanglement (MIP*) are much more powerful than their classical counterpart MIP (Babai et al. '91, Ji et al. '20): while MIP = NEXP, the quantum class MIP* is equal to RE, a class that includes the halting problem. This is because the provers in MIP* can share unbounded quantum entanglement. However, recent works of Qin and Yao '21 and '23 have shown that this advantage is significantly reduced if the provers' shared state contains noise. This paper attempts to exactly characterize the effect of noise on the computational power of quantum multiprover interactive proof systems. We investigate the quantum two-prover one-round interactive system MIP*[poly, O(1)], where the verifier sends polynomially many bits to the provers and the provers send back constantly many bits. We show that noise completely destroys the computational advantage given by shared entanglement in this model. Specifically, we show that if the provers are allowed to share arbitrarily many noisy EPR states, where each EPR state is affected by an arbitrarily small constant amount of noise, the resulting complexity class is equivalent to NEXP = MIP. This improves significantly on the previous best-known bound of NEEEXP (nondeterministic triply exponential time) by Qin and Yao '21. We also show that this collapse in power is due to the noise, rather than the O(1) answer size, by showing that allowing for noiseless EPR states gives the class the full power of RE = MIP*[poly, poly]. Along the way, we develop two technical tools of independent interest. First, we give a new, deterministic tester for the positivity of an exponentially large matrix, provided it has a low-degree Fourier decomposition in terms of Pauli matrices. Second, we develop a new invariance principle for smooth matrix functions that have bounded third-order Fréchet derivatives or are Lipschitz continuous.
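For reference, the class relationships stated in this abstract can be summarized compactly (this is a restatement of the claims above, not a new result):

```latex
% Complexity-class relationships as stated in the abstract (requires amsmath).
\begin{align*}
\mathrm{MIP} &= \mathrm{NEXP}, \qquad \mathrm{MIP}^{*} = \mathrm{RE},\\
\mathrm{MIP}^{*}[\mathrm{poly}, O(1)] \;\text{(noisy EPR states)} &= \mathrm{NEXP} = \mathrm{MIP},\\
\mathrm{MIP}^{*}[\mathrm{poly}, O(1)] \;\text{(noiseless EPR states)} &= \mathrm{RE} = \mathrm{MIP}^{*}[\mathrm{poly}, \mathrm{poly}].
\end{align*}
```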
Scaling to larger systems, with current levels of reliability, requires cost-effective methods to mitigate hardware failures. One of the main causes of hardware failure is an uncorrected error in memory, which terminates the current job and wastes all computation since the last checkpoint. This paper presents the first adaptive method for triggering uncorrected error mitigation. It uses a prediction approach that considers the likelihood of an uncorrected error and its current potential cost. The method is based on reinforcement learning, and the only user-defined parameters are the mitigation cost and whether the job can be restarted from a mitigation point. We evaluate our method using classical machine learning metrics together with a cost-benefit analysis, which compares the cost of mitigation actions with the benefits from mitigating some of the errors. On two years of production logs from the MareNostrum supercomputer, our method reduces lost compute time by 54% compared with no mitigation and is just 6% below the optimal Oracle method. All source code is open source.
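To illustrate the cost-benefit idea behind triggering mitigation, the sketch below applies a simple expected-cost threshold rule. Note that the paper's actual method learns this trade-off with reinforcement learning; the function, parameter names, and numbers here are hypothetical.

```python
# Expected-cost sketch of the mitigation decision (illustrative only; the
# paper's method learns this trade-off with reinforcement learning).

def should_mitigate(p_error: float,
                    time_since_checkpoint_h: float,
                    mitigation_cost_h: float) -> bool:
    """Trigger mitigation when the expected loss from an uncorrected error
    exceeds the fixed cost of taking a mitigation action."""
    expected_loss_h = p_error * time_since_checkpoint_h  # work lost if the job dies
    return expected_loss_h > mitigation_cost_h

# Example: a predicted 20% chance of an uncorrected error, 8 hours of work at
# risk since the last checkpoint, and a mitigation action costing 1 hour.
print(should_mitigate(p_error=0.2, time_since_checkpoint_h=8.0, mitigation_cost_h=1.0))  # True
```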
Sustainability is becoming a key property of modern software systems. While there is a substantial and growing body of knowledge on engineering sustainable software, end-to-end frameworks that situate sustainability-related activities within the software delivery lifecycle are missing. In this article, we propose the SusDevOps framework, which promotes sustainability to a first-class principle within a DevOps context. We demonstrate the lifecycle phases and techniques of SusDevOps through the case of a software development startup.