Performance issues in Android applications significantly undermine users' experience, engagement, and retention, which is a long-lasting research topic in academia. Unlike functionality issues, performance issues are more difficult to diagnose and resolve due to their complex root causes, which often emerge only under specific conditions or payloads. Although many efforts haven attempt to mitigate the impact of performance issues by developing methods to automatically identify and resolve them, it remains unclear if this objective has been fulfilled, and the existing approaches indeed targeted on the most critical performance issues encountered in real-world settings. To this end, we conducted a large-scale comparative study of Android performance issues in real-world applications and literature. Specifically, we started by investigating real-world performance issues, their underlying root causes (i.e., contributing factors), and common code patterns. We then took an additional step to empirically summarize existing approaches and datasets through a literature review, assessing how well academic research reflects the real-world challenges faced by developers and users. Our comparison results show a substantial divergence exists in the primary performance concerns of researchers, developers, and users. Among all the identified factors, 57.14% have not been examined in academic research, while a substantial 76.39% remain unaddressed by existing tools, and 66.67% lack corresponding datasets. This stark contrast underscores a substantial gap in our understanding and management of performance issues. Consequently, it is crucial for our community to intensify efforts to bridge these gaps and achieve comprehensive detection and resolution of performance issues.
Context. The security of critical infrastructure has been a fundamental concern since the advent of computers, and this concern has only intensified in today's cyber warfare landscape. Protecting mission-critical systems (MCSs), including essential assets like healthcare, telecommunications, and military coordination, is vital for national security. These systems require prompt and comprehensive governance to ensure their resilience, yet recent events have shown that meeting these demands is increasingly challenging. Aim. Building on prior research that demonstrated the potential of GAI, particularly Large Language Models (LLMs), in improving risk analysis tasks, we aim to explore practitioners' perspectives, specifically developers and security personnel, on using generative AI (GAI) in the governance of IT MCSs seeking to provide insights and recommendations for various stakeholders, including researchers, practitioners, and policymakers. Method. We designed a survey to collect practical experiences, concerns, and expectations of practitioners who develop and implement security solutions in the context of MCSs. Analyzing this data will help identify key trends, challenges, and opportunities for introducing GAIs in this niche domain. Conclusions and Future Works. Our findings highlight that the safe use of LLMs in MCS governance requires interdisciplinary collaboration. Researchers should focus on designing regulation-oriented models and focus on accountability; practitioners emphasize data protection and transparency, while policymakers must establish a unified AI framework with global benchmarks to ensure ethical and secure LLMs-based MCS governance.
Sequential Recommendation (SR) plays a critical role in predicting users' sequential preferences. Despite its growing prominence in various industries, the increasing scale of SR models incurs substantial computational costs and unpredictability, challenging developers to manage resources efficiently. Under this predicament, Scaling Laws have achieved significant success by examining the loss as models scale up. However, there remains a disparity between loss and model performance, which is of greater concern in practical applications. Moreover, as data continues to expand, it incorporates repetitive and inefficient data. In response, we introduce the Performance Law for SR models, which aims to theoretically investigate and model the relationship between model performance and data quality. Specifically, we first fit the HR and NDCG metrics to transformer-based SR models. Subsequently, we propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics. Our method enables accurate predictions across various dataset scales and model sizes, demonstrating a strong correlation in large SR models and offering insights into achieving optimal performance for any given model configuration.
Equity is a core concern of learning analytics. However, applications that teach and assess equity skills, particularly at scale are lacking, often due to barriers in evaluating language. Advances in generative AI via large language models (LLMs) are being used in a wide range of applications, with this present work assessing its use in the equity domain. We evaluate tutor performance within an online lesson on enhancing tutors' skills when responding to students in potentially inequitable situations. We apply a mixed-method approach to analyze the performance of 81 undergraduate remote tutors. We find marginally significant learning gains with increases in tutors' self-reported confidence in their knowledge in responding to middle school students experiencing possible inequities from pretest to posttest. Both GPT-4o and GPT-4-turbo demonstrate proficiency in assessing tutors ability to predict and explain the best approach. Balancing performance, efficiency, and cost, we determine that few-shot learning using GPT-4o is the preferred model. This work makes available a dataset of lesson log data, tutor responses, rubrics for human annotation, and generative AI prompts. Future work involves leveling the difficulty among scenarios and enhancing LLM prompts for large-scale grading and assessment.
Mobile apps are predominantly integrated with cloud services to benefit from enhanced functionalities. Adopting authentication using secrets such as API keys is crucial to ensure secure mobile-cloud interactions. However, developers often overlook the proper storage of such secrets, opting to put them directly into their projects. These secrets are checked into the projects and can be easily extracted and exploited by malicious adversaries. While many researchers investigated the issue of checked-in secret in open-source projects, there is a notable research gap concerning checked-in secrets in Android apps deployed on platforms such as Google Play Store. Unlike open-source projects, the lack of direct access to the source code and the presence of obfuscation complicates the checked-in secret detection for Android apps. This motivates us to conduct an empirical analysis to measure and compare the performance of different checked-in secret detection tools on Android apps. We first conducted a literature review to find all the checked-in secret detection tools that can be applied to Android apps. Then, we evaluate three representative tools on 5,135 Android apps, comparing their performance and analyzing their limitations. Our experiment reveals 2,142 checked-in secrets affecting 2,115 Android apps. We also disclose that the current checked-in secret detection techniques suffer from key limitations. All of the evaluated tools can miss a significant number of checked-in secrets in Android apps. Nevertheless, we observed that the tools are complimentary, suggesting the possibility of developing a more effective checked-in secret detection tool by combining their insights. Additionally, we propose that analyzing string groups within methods containing checked-in secrets may provide a more effective strategy to overcome obfuscation challenges.
As one of the most popular software applications, a web application is a program, accessible through the web, to dynamically generate content based on user interactions or contextual data, for example, online shopping platforms, social networking sites, and financial services. Web applications operate in diverse environments and leverage web technologies such as HTML, CSS, JavaScript, and Ajax, often incorporating features like asynchronous operations to enhance user experience. Due to the increasing user and popularity of web applications, approaches to their quality have become increasingly important. Web Application Testing (WAT) plays a vital role in ensuring web applications' functionality, security, and reliability. Given the speed with which web technologies are evolving, WAT is especially important. Over the last decade, various WAT approaches have been developed. The diversity of approaches reflects the many aspects of web applications, such as dynamic content, asynchronous operations, and diverse user environments. This paper provides a comprehensive overview of the main achievements during the past decade: It examines the main steps involved in WAT, including test-case generation and execution, and evaluation and assessment. The currently available tools for WAT are also examined. The paper also discusses some open research challenges and potential future WAT work.
As generative AI technologies find more and more real-world applications, the importance of testing their performance and safety seems paramount. ``Red-teaming'' has quickly become the primary approach to test AI models--prioritized by AI companies, and enshrined in AI policy and regulation. Members of red teams act as adversaries, probing AI systems to test their safety mechanisms and uncover vulnerabilities. Yet we know too little about this work and its implications. This essay calls for collaboration between computer scientists and social scientists to study the sociotechnical systems surrounding AI technologies, including the work of red-teaming, to avoid repeating the mistakes of the recent past. We highlight the importance of understanding the values and assumptions behind red-teaming, the labor involved, and the psychological impacts on red-teamers.
Advancements in large language models (LLMs) have paved the way for LLM-based agent systems that offer enhanced accuracy and interpretability across various domains. Radiology, with its complex analytical requirements, is an ideal field for the application of these agents. This paper aims to investigate the pre-requisite question for building concrete radiology agents which is, `Can modern LLMs act as agent cores in radiology environments?' To investigate it, we introduce RadABench with three-fold contributions: First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents, generated from an extensive taxonomy encompassing 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks. Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow and the capability to simulate a wide range of radiology toolsets. Third, we assess the performance of 7 leading LLMs on our benchmark from 5 perspectives with multiple metrics. Our findings indicate that while current LLMs demonstrate strong capabilities in many areas, they are still not sufficiently advanced to serve as the central agent core in a fully operational radiology agent system. Additionally, we identify key factors influencing the performance of LLM-based agent cores, offering insights for clinicians on how to apply agent systems in real-world radiology practices effectively. All of our code and data are open-sourced in //github.com/MAGIC-AI4Med/RadABench.
Bipedal robots are gaining global recognition due to their potential applications and advancements in artificial intelligence, particularly through Deep Reinforcement Learning (DRL). While DRL has significantly advanced bipedal locomotion, the development of a unified framework capable of handling a wide range of tasks remains an ongoing challenge. This survey systematically categorises, compares, and analyses existing DRL frameworks for bipedal locomotion, organising them into end-to-end and hierarchical control schemes. End-to-end frameworks are evaluated based on their learning approaches, while hierarchical frameworks are examined in terms of layered structures that integrate learning-based or traditional model-based methods. We provide a detailed evaluation of the composition, strengths, limitations, and capabilities of each framework. Additionally, this survey identifies key research gaps and proposes future directions aimed at creating a more integrated and efficient framework for bipedal locomotion, with wide-ranging applications in real-world environments.
Empathetic response generation, aiming to understand the user's situation and feelings and respond empathically, is crucial in building human-like dialogue systems. Traditional approaches typically employ maximum likelihood estimation as the optimization objective during training, yet fail to align the empathy levels between generated and target responses. To this end, we propose an empathetic response generation framework using reinforcement learning (EmpRL). The framework develops an effective empathy reward function and generates empathetic responses by maximizing the expected reward through reinforcement learning. EmpRL utilizes the pre-trained T5 model as the generator and further fine-tunes it to initialize the policy. To align the empathy levels between generated and target responses within a given context, an empathy reward function containing three empathy communication mechanisms -- emotional reaction, interpretation, and exploration -- is constructed using pre-designed and pre-trained empathy identifiers. During reinforcement learning training, the proximal policy optimization algorithm is used to fine-tune the policy, enabling the generation of empathetic responses. Both automatic and human evaluations demonstrate that the proposed EmpRL framework significantly improves the quality of generated responses, enhances the similarity in empathy levels between generated and target responses, and produces empathetic responses covering both affective and cognitive aspects.
To solve the information explosion problem and enhance user experience in various online applications, recommender systems have been developed to model users preferences. Although numerous efforts have been made toward more personalized recommendations, recommender systems still suffer from several challenges, such as data sparsity and cold start. In recent years, generating recommendations with the knowledge graph as side information has attracted considerable interest. Such an approach can not only alleviate the abovementioned issues for a more accurate recommendation, but also provide explanations for recommended items. In this paper, we conduct a systematical survey of knowledge graph-based recommender systems. We collect recently published papers in this field and summarize them from two perspectives. On the one hand, we investigate the proposed algorithms by focusing on how the papers utilize the knowledge graph for accurate and explainable recommendation. On the other hand, we introduce datasets used in these works. Finally, we propose several potential research directions in this field.