Autism Spectrum Disorder (ASD) is a neurodivergent condition with a wide range of characteristics and support levels. Individuals with ASD can exhibit various combinations of traits such as difficulties in social interaction, communication, and language, alongside restricted interests and repetitive activities. Many adults with ASD live independently due to increased awareness and late diagnoses, which help them manage long-standing challenges. Predictability, clarity, and minimized sensory stimuli are crucial for the daily comfort of autistic individuals. In mobile applications, autistic users face significant cognitive overload compared to neurotypicals, resulting in higher effort and time to complete tasks. Urban mobility apps, essential for daily routines, often overlook the needs of autistic users, leading to cognitive overload issues. This study investigates the accessibility of urban mobility apps for autistic individuals using the Interfaces Accessibility Guide for Autism (GAIA). By evaluating various apps, we have identified a common gap regarding accessibility for people with Autism Spectrum Disorder (ASD). This limitation relates to the absence of a functionality that allows users on the autism spectrum to customize the characteristics of the textual and visual elements of the software, such as changing the text font, altering the font type, and adjusting text colors, as well as native audio guidance within the applications themselves. Currently, the only function in this context is for visually impaired people, which completely changes the user experience in terms of navigation.
Educational Process Mining (EPM) is a data analysis technique that is used to improve educational processes. It is based on Process Mining (PM), which involves gathering records (logs) of events to discover process models and analyze the data from a process-centric perspective. One specific application of EPM is curriculum mining, which focuses on understanding the learning program students follow to achieve educational goals. This is important for institutional curriculum decision-making and quality improvement. Therefore, academic institutions can benefit from organizing the existing techniques, capabilities, and limitations. We conducted a systematic literature review to identify works on applying PM to curricular analysis and provide insights for further research. We reviewed 27 primary studies published across seven major databases. Our analysis classified these studies into five main research objectives: discovery of educational trajectories, identification of deviations in student behavior, bottleneck analysis, dropout / stopout analysis, and generation of recommendations. Key findings highlight challenges such as standardization to facilitate cross-university analysis, better integration of process and data mining techniques, and improved tools for educational stakeholders. This review provides a comprehensive overview of the current landscape in curricular process mining and outlines specific research opportunities to support more robust and actionable curricular analyses in educational settings.
With the rise of sophisticated phishing attacks, there is a growing need for effective and economical detection solutions. This paper explores the use of large multimodal agents, specifically Gemini 1.5 Flash and GPT-4o mini, to analyze both URLs and webpage screenshots via APIs, thus avoiding the complexities of training and maintaining AI systems. Our findings indicate that integrating these two data types substantially enhances detection performance over using either type alone. However, API usage incurs costs per query that depend on the number of input and output tokens. To address this, we propose a two-tiered agentic approach: initially, one agent assesses the URL, and if inconclusive, a second agent evaluates both the URL and the screenshot. This method not only maintains robust detection performance but also significantly reduces API costs by minimizing unnecessary multi-input queries. Cost analysis shows that with the agentic approach, GPT-4o mini can process about 4.2 times as many websites per $100 compared to the multimodal approach (107,440 vs. 25,626), and Gemini 1.5 Flash can process about 2.6 times more websites (2,232,142 vs. 862,068). These findings underscore the significant economic benefits of the agentic approach over the multimodal method, providing a viable solution for organizations aiming to leverage advanced AI for phishing detection while controlling expenses.
Recently, Large Language Model (LLM)-based Fault Localization (FL) techniques have been proposed, and showed improved performance with explanations on FL results. However, a major issue with LLM-based FL techniques is their heavy reliance on LLMs, which are often unreliable, expensive, and difficult to analyze or improve. When results are unsatisfactory, it is challenging both to determine a cause and to refine a technique for better outcomes. To address this issue, we propose LogicFL, a novel logical fault localization technique for Null Pointer Exceptions (NPEs). With logic programming, LogicFL imitates human developers' deduction process of fault localization, and identifies causes of NPEs after logical inferences on collected facts about faulty code and test execution. In an empirical evaluation of 76 NPE bugs from Apache Commons projects and the Defects4J benchmark, LogicFL accurately identified the fault locations and pinpointed the exact code fragments causing the NPEs for 67 bugs (88.16%), which were 19.64% and 4.69% more bugs than two compared LLM-based FL techniques respectively. In addition, LogicFL can be executed on a low-performance machine similar to a typical laptop, with an average runtime of 21.63 seconds and a worst-case time of under two minutes, including test execution and output file generation. Moreover, when compared to the two LLM-based FL techniques using the GPT-4o model, LogicFL was significantly more cost-efficient, as those techniques required 343.94 and 3,736.19 times the cost of LogicFL, respectively. Last but not least, the deduction process in LogicFL for providing FL results is fully traceable, enabling us to understand the reasoning behind the technique's outcomes and to further enhance the technique.
Cognitive Behavioral Therapy (CBT) is a well-established, evidence-based treatment for Major Depressive Disorder. Unfortunately, there exist significant barriers to individuals accessing CBT, including cost, scarcity of therapists and stigma. This study explores the feasibility of fine-tuning small open weight large language models (LLMs) to deliver CBT for depression. Using 58 sets of synthetic CBT transcripts generated by the Nous Research fine-tune of Llama 3.1 405b, we fine-tuned three models: Mistral 7b v0.3, Qwen 2.5 7b, and Llama 3.1 8b. CBT fidelity was evaluated through a modified Cognitive Therapy Rating Scale (CTRS). All fine-tuned models were compared against each other, as well as their instruct-tuned variants. Simulated patient transcripts were generated for the purpose of evaluating model performance, with the instruct and CBT-tuned models acting as the therapist and DeepSeek-V2.5 acting as the patient. These simulated transcripts were evaluated on a modified CTRS by Gemini 1.5 Pro-002. Our findings demonstrated that the CBT-tuned models significantly outperformed their instruct-tuned counterparts, with an average improvement of 11.33 points (p < 0.001) on total CTRS score. Llama 3.1 8b had the strongest performance (mean CTRS score 67.86 +/- 7.24), followed by Qwen 2.5 7b (64.28 +/- 9.55) and Mistral 7b v0.3 (64.17 +/- 9.79), with these differences between models being statistically significant. The CBT-tuned models were competent in implementing core CBT techniques and providing empathetic responses, however, there were limitations observed in agenda adherence, exploration depth and long-context coherence. This study establishes that CBT specific fine-tuning can effectively encode therapeutic competencies in small LLMs, though significant technical and ethical considerations must be resolved prior to clinical deployment.
Efficient inference in high-dimensional models is a central challenge in machine learning. We introduce the Gaussian Ensemble Belief Propagation (GEnBP) algorithm, which combines the strengths of the Ensemble Kalman Filter (EnKF) and Gaussian Belief Propagation (GaBP) to address this challenge. GEnBP updates ensembles of prior samples into posterior samples by passing low-rank local messages over the edges of a graphical model, enabling efficient handling of high-dimensional states, parameters, and complex, noisy, black-box generation processes. By utilizing local message passing within a graphical model structure, GEnBP effectively manages complex dependency structures and remains computationally efficient even when the ensemble size is much smaller than the inference dimension -- a common scenario in spatiotemporal modeling, image processing, and physical model inversion. We demonstrate that GEnBP can be applied to various problem structures, including data assimilation, system identification, and hierarchical models, and show through experiments that it outperforms existing belief propagation methods in terms of accuracy and computational efficiency. Supporting code is available at //github.com/danmackinlay/GEnBP
Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE on memory-constrained devices where only a subset of expert weights fit in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution to extend MoE's applicability across real-world applications.
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
Graph Neural Networks (GNNs) have been shown to be effective models for different predictive tasks on graph-structured data. Recent work on their expressive power has focused on isomorphism tasks and countable feature spaces. We extend this theoretical framework to include continuous features - which occur regularly in real-world input domains and within the hidden layers of GNNs - and we demonstrate the requirement for multiple aggregation functions in this context. Accordingly, we propose Principal Neighbourhood Aggregation (PNA), a novel architecture combining multiple aggregators with degree-scalers (which generalize the sum aggregator). Finally, we compare the capacity of different models to capture and exploit the graph structure via a novel benchmark containing multiple tasks taken from classical graph theory, alongside existing benchmarks from real-world domains, all of which demonstrate the strength of our model. With this work, we hope to steer some of the GNN research towards new aggregation methods which we believe are essential in the search for powerful and robust models.
Reasoning with knowledge expressed in natural language and Knowledge Bases (KBs) is a major challenge for Artificial Intelligence, with applications in machine reading, dialogue, and question answering. General neural architectures that jointly learn representations and transformations of text are very data-inefficient, and it is hard to analyse their reasoning process. These issues are addressed by end-to-end differentiable reasoning systems such as Neural Theorem Provers (NTPs), although they can only be used with small-scale symbolic KBs. In this paper we first propose Greedy NTPs (GNTPs), an extension to NTPs addressing their complexity and scalability limitations, thus making them applicable to real-world datasets. This result is achieved by dynamically constructing the computation graph of NTPs and including only the most promising proof paths during inference, thus obtaining orders of magnitude more efficient models. Then, we propose a novel approach for jointly reasoning over KBs and textual mentions, by embedding logic facts and natural language sentences in a shared embedding space. We show that GNTPs perform on par with NTPs at a fraction of their cost while achieving competitive link prediction results on large datasets, providing explanations for predictions, and inducing interpretable models. Source code, datasets, and supplementary material are available online at //github.com/uclnlp/gntp.
Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.