This paper introduces Artificial Intelligence Clinics on Mobile (AICOM), an open-source project devoted to answering the United Nations Sustainable Development Goal 3 (SDG3) on health, which represents a universal recognition that health is fundamental to human capital and social and economic development. The core motivation for the AICOM project is the fact that over 80% of the people in the least developed countries (LDCs) own a mobile phone, even though less than 40% of these people have internet access. Hence, through enabling AI-based disease diagnostics and screening capability on affordable mobile phones without connectivity will be a critical first step to addressing healthcare access problems. The technologies developed in the AICOM project achieve exactly this goal, and we have demonstrated the effectiveness of AICOM on monkeypox screening tasks. We plan to continue expanding and open-sourcing the AICOM platform, aiming for it to evolve into an universal AI doctor for the Underserved and Hard-to-Reach.
The Internet of Things (IoT) devices are rapidly increasing in popularity, with more individuals using Internet-connected devices that continuously monitor their activities. This work explores privacy concerns and expectations of end-users related to Trigger-Action platforms (TAPs) in the context of the Internet of Things (IoT). TAPs allow users to customize their smart environments by creating rules that trigger actions based on specific events or conditions. As personal data flows between different entities, there is a potential for privacy concerns. In this study, we aimed to identify the privacy factors that impact users' concerns and preferences for using IoT TAPs. To address this research objective, we conducted three focus groups with 15 participants and we extracted nine themes related to privacy factors using thematic analysis. Our participants particularly prefer to have control and transparency over the automation and are concerned about unexpected data inferences, risks and unforeseen consequences for themselves and for bystanders that are caused by the automation. The identified privacy factors can help researchers derive predefined and selectable profiles of privacy permission settings for IoT TAPs that represent the privacy preferences of different types of users as a basis for designing usable privacy controls for IoT TAPs.
We introduce TitaNet-LID, a compact end-to-end neural network for Spoken Language Identification (LID) that is based on the ContextNet architecture. TitaNet-LID employs 1D depth-wise separable convolutions and Squeeze-and-Excitation layers to effectively capture local and global context within an utterance. Despite its small size, TitaNet-LID achieves performance similar to state-of-the-art models on the VoxLingua107 dataset while being 10 times smaller. Furthermore, it can be easily adapted to new acoustic conditions and unseen languages through simple fine-tuning, achieving a state-of-the-art accuracy of 88.2% on the FLEURS benchmark. Our model is scalable and can achieve a better trade-off between accuracy and speed. TitaNet-LID performs well even on short utterances less than 5s in length, indicating its robustness to input length.
Explainable AI has the potential to support more interactive and fluid co-creative AI systems which can creatively collaborate with people. To do this, creative AI models need to be amenable to debugging by offering eXplainable AI (XAI) features which are inspectable, understandable, and modifiable. However, currently there is very little XAI for the arts. In this work, we demonstrate how a latent variable model for music generation can be made more explainable; specifically we extend MeasureVAE which generates measures of music. We increase the explainability of the model by: i) using latent space regularisation to force some specific dimensions of the latent space to map to meaningful musical attributes, ii) providing a user interface feedback loop to allow people to adjust dimensions of the latent space and observe the results of these changes in real-time, iii) providing a visualisation of the musical attributes in the latent space to help people understand and predict the effect of changes to latent space dimensions. We suggest that in doing so we bridge the gap between the latent space and the generated musical outcomes in a meaningful way which makes the model and its outputs more explainable and more debuggable.
Visual Inertial Odometry (VIO) is an essential component of modern Augmented Reality (AR) applications. However, VIO only tracks the relative pose of the device, leading to drift over time. Absolute pose estimation methods infer the device's absolute pose, but their accuracy depends on the input quality. This paper introduces VIO-APR, a new framework for markerless mobile AR that combines an absolute pose regressor (APR) with a local VIO tracking system. VIO-APR uses VIO to assess the reliability of the APR and the APR to identify and compensate for VIO drift. This feedback loop results in more accurate positioning and more stable AR experiences. To evaluate VIO-APR, we created a dataset that combines camera images with ARKit's VIO system output for six indoor and outdoor scenes of various scales. Over this dataset, VIO-APR improves the median accuracy of popular APR by up to 36\% in position and 29\% in orientation, increases the percentage of frames in the high ($0.25 m, 2^{\circ}$) accuracy level by up to 112\% and reduces the percentage of frames predicted below the low ($5 m, 10^\circ$) accuracy greatly. We implement VIO-APR into a mobile AR application using Unity to demonstrate its capabilities. VIO-APR results in noticeably more accurate localization and a more stable overall experience.
The application of Transformer neural networks to Electronic Health Records (EHR) is challenging due to the distinct, multidimensional sequential structure of EHR data, often leading to underperformance when compared to simpler linear models. Thus, the advantages of Transformers, such as efficient transfer learning and improved scalability are not fully exploited in EHR applications. To overcome these challenges, we introduce SANSformer, a novel attention-free sequential model designed specifically with inductive biases to cater for the unique characteristics of EHR data. Our main application area is predicting future healthcare utilization, a crucial task for effectively allocating healthcare resources. This task becomes particularly difficult when dealing with divergent patient subgroups. These subgroups, characterized by unique health trajectories and often small in size, such as patients with rare diseases, require specialized modeling approaches. To address this, we adopt a self-supervised pretraining strategy, which we term Generative Summary Pretraining (GSP). GSP predicts summary statistics of a future window in the patient's history based on their past health records, thus demonstrating potential to deal with the noisy and complex nature of EHR data. We pretrain our models on a comprehensive health registry encompassing close to one million patients, before fine-tuning them for specific subgroup prediction tasks. In our evaluations, SANSformer consistently outshines strong EHR baselines. Importantly, our GSP pretraining method greatly enhances model performance, especially for smaller patient subgroups. Our findings underscore the substantial potential of bespoke attention-free models and self-supervised pretraining for enhancing healthcare utilization predictions across a broad range of patient groups.
ModSecurity is widely recognized as the standard open-source Web Application Firewall (WAF), maintained by the OWASP Foundation. It detects malicious requests by matching them against the Core Rule Set, identifying well-known attack patterns. Each rule in the CRS is manually assigned a weight, based on the severity of the corresponding attack, and a request is detected as malicious if the sum of the weights of the firing rules exceeds a given threshold. In this work, we show that this simple strategy is largely ineffective for detecting SQL injection (SQLi) attacks, as it tends to block many legitimate requests, while also being vulnerable to adversarial SQLi attacks, i.e., attacks intentionally manipulated to evade detection. To overcome these issues, we design a robust machine learning model, named AdvModSec, which uses the CRS rules as input features, and it is trained to detect adversarial SQLi attacks. Our experiments show that AdvModSec, being trained on the traffic directed towards the protected web services, achieves a better trade-off between detection and false positive rates, improving the detection rate of the vanilla version of ModSecurity with CRS by 21%. Moreover, our approach is able to improve its adversarial robustness against adversarial SQLi attacks by 42%, thereby taking a step forward towards building more robust and trustworthy WAFs.
We propose a novel method, LoLep, which regresses Locally-Learned planes from a single RGB image to represent scenes accurately, thus generating better novel views. Without the depth information, regressing appropriate plane locations is a challenging problem. To solve this issue, we pre-partition the disparity space into bins and design a disparity sampler to regress local offsets for multiple planes in each bin. However, only using such a sampler makes the network not convergent; we further propose two optimizing strategies that combine with different disparity distributions of datasets and propose an occlusion-aware reprojection loss as a simple yet effective geometric supervision technique. We also introduce a self-attention mechanism to improve occlusion inference and present a Block-Sampling Self-Attention (BS-SA) module to address the problem of applying self-attention to large feature maps. We demonstrate the effectiveness of our approach and generate state-of-the-art results on different datasets. Compared to MINE, our approach has an LPIPS reduction of 4.8%-9.0% and an RV reduction of 73.9%-83.5%. We also evaluate the performance on real-world images and demonstrate the benefits.
The NSF-funded Robust Epidemic Surveillance and Modeling (RESUME) project successfully convened a workshop entitled "High-performance computing and large-scale data management in service of epidemiological modeling" at the University of Chicago on May 1-2, 2023. This was part of a series of workshops designed to foster sustainable and interdisciplinary co-design for predictive intelligence and pandemic prevention. The event brought together 31 experts in epidemiological modeling, high-performance computing (HPC), HPC workflows, and large-scale data management to develop a shared vision for capabilities needed for computational epidemiology to better support pandemic prevention. Through the workshop, participants identified key areas in which HPC capabilities could be used to improve epidemiological modeling, particularly in supporting public health decision-making, with an emphasis on HPC workflows, data integration, and HPC access. The workshop explored nascent HPC workflow and large-scale data management approaches currently in use for epidemiological modeling and sought to draw from approaches used in other domains to determine which practices could be best adapted for use in epidemiological modeling. This report documents the key findings and takeaways from the workshop.
Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (\emph{e.g.,} social network analysis and recommender systems), computer vision (\emph{e.g.,} object detection and point cloud learning), and natural language processing (\emph{e.g.,} relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, \emph{i.e.,} 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
Collecting supporting evidence from large corpora of text (e.g., Wikipedia) is of great challenge for open-domain Question Answering (QA). Especially, for multi-hop open-domain QA, scattered evidence pieces are required to be gathered together to support the answer extraction. In this paper, we propose a new retrieval target, hop, to collect the hidden reasoning evidence from Wikipedia for complex question answering. Specifically, the hop in this paper is defined as the combination of a hyperlink and the corresponding outbound link document. The hyperlink is encoded as the mention embedding which models the structured knowledge of how the outbound link entity is mentioned in the textual context, and the corresponding outbound link document is encoded as the document embedding representing the unstructured knowledge within it. Accordingly, we build HopRetriever which retrieves hops over Wikipedia to answer complex questions. Experiments on the HotpotQA dataset demonstrate that HopRetriever outperforms previously published evidence retrieval methods by large margins. Moreover, our approach also yields quantifiable interpretations of the evidence collection process.