We present and evaluate the Futhark implementation of reverse-mode automatic differentiation (AD) for the basic blocks of parallel programming: reduce, prefix sum (scan), and reduce by index. We first present derivations of general-case algorithms and then discuss several specializations that result in efficient differentiation of most cases of practical interest. We report an experiment that evaluates the performance of the differentiated code in the context of GPU execution and highlights the impact of the proposed specializations as well as the strengths and weaknesses of differentiating at high level vs. low level (i.e., ``differentiating the memory'').
The concept of updating a probability distribution in the light of new evidence lies at the heart of statistics and machine learning. Pearl's and Jeffrey's rule are two natural update mechanisms which lead to different outcomes, yet the similarities and differences remain mysterious. This paper clarifies their relationship in several ways: via separate descriptions of the two update mechanisms in terms of probabilistic programs and sampling semantics, and via different notions of likelihood (for Pearl and for Jeffrey). Moreover, it is shown that Jeffrey's update rule arises via variational inference. In terms of categorical probability theory, this amounts to an analysis of the situation in terms of the behaviour of the multiset functor, extended to the Kleisli category of the distribution monad.
The image annotation stage is a critical and often the most time-consuming part required for training and evaluating object detection and semantic segmentation models. Deployment of the existing models in novel environments often requires detecting novel semantic classes not present in the training data. Furthermore, indoor scenes contain significant viewpoint variations, which need to be handled properly by trained perception models. We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer), all trained on large-scale datasets. We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments, with the ultimate goal of facilitating the training of lightweight models for various downstream tasks. We also propose a multi-view labeling fusion stage, which considers the setting where multiple views of the scenes are available and can be used to identify and rectify single-view inconsistencies. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset. We evaluate the quality of our labeling process by comparing it with human annotations. Also, we demonstrate the effectiveness of the obtained labels in downstream tasks such as object goal navigation and part discovery. In the context of object goal navigation, we depict enhanced performance using this fusion approach compared to a zero-shot baseline that utilizes large monolithic vision-language pre-trained models.
We consider the problem of sequential evaluation, in which an evaluator observes candidates in a sequence and assigns scores to these candidates in an online, irrevocable fashion. Motivated by the psychology literature that has studied sequential bias in such settings -- namely, dependencies between the evaluation outcome and the order in which the candidates appear -- we propose a natural model for the evaluator's rating process that captures the lack of calibration inherent to such a task. We conduct crowdsourcing experiments to demonstrate various facets of our model. We then proceed to study how to correct sequential bias under our model by posing this as a statistical inference problem. We propose a near-linear time, online algorithm for this task and prove guarantees in terms of two canonical ranking metrics. We also prove that our algorithm is information theoretically optimal, by establishing matching lower bounds in both metrics. Finally, we perform a host of numerical experiments to show that our algorithm often outperforms the de facto method of using the rankings induced by the reported scores, both in simulation and on the crowdsourcing data that we collected.
This work focuses on developing methods for approximating the solution operators of a class of parametric partial differential equations via neural operators. Neural operators have several challenges, including the issue of generating appropriate training data, cost-accuracy trade-offs, and nontrivial hyperparameter tuning. The unpredictability of the accuracy of neural operators impacts their applications in downstream problems of inference, optimization, and control. A framework based on the linear variational problem that gives the correction to the prediction furnished by neural operators is considered based on earlier work in JCP 486 (2023) 112104. The operator, called Residual-based Error Corrector Operator or simply Corrector Operator, associated with the corrector problem is analyzed further. Numerical results involving a nonlinear reaction-diffusion model in two dimensions with PCANet-type neural operators show almost two orders of increase in the accuracy of approximations when neural operators are corrected using the correction scheme. Further, topology optimization involving a nonlinear reaction-diffusion model is considered to highlight the limitations of neural operators and the efficacy of the correction scheme. Optimizers with neural operator surrogates are seen to make significant errors (as high as 80 percent). However, the errors are much lower (below 7 percent) when neural operators are corrected.
Computerized cognitive training (CCT) is a scalable, well-tolerated intervention that has promise for slowing cognitive decline. Outcomes from CCT are limited by a lack of effective engagement, which is decreased by factors such as mental fatigue, particularly in older adults at risk for dementia. There is a need for scalable, automated measures that can monitor mental fatigue during CCT. Here, we develop and validate a novel Recurrent Video Transformer (RVT) method for monitoring real-time mental fatigue in older adults with mild cognitive impairment from video-recorded facial gestures during CCT. The RVT model achieved the highest balanced accuracy(78%) and precision (0.82) compared to the prior state-of-the-art models for binary and multi-class classification of mental fatigue and was additionally validated via significant association (p=0.023) with CCT reaction time. By leveraging dynamic temporal information, the RVT model demonstrates the potential to accurately measure real-time mental fatigue, laying the foundation for future personalized CCT that increase effective engagement.
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval methods struggle to meet the needs of users demanding access to data from various modalities. To address this, cross-modal retrieval has emerged, enabling interaction across modalities, facilitating semantic matching, and leveraging complementarity and consistency between different modal data. Although prior literature undertook a review of the cross-modal retrieval field, it exhibits numerous deficiencies pertaining to timeliness, taxonomy, and comprehensiveness. This paper conducts a comprehensive review of cross-modal retrieval's evolution, spanning from shallow statistical analysis techniques to vision-language pre-training models. Commencing with a comprehensive taxonomy grounded in machine learning paradigms, mechanisms, and models, the paper then delves deeply into the principles and architectures underpinning existing cross-modal retrieval methods. Furthermore, it offers an overview of widely used benchmarks, metrics, and performances. Lastly, the paper probes the prospects and challenges that confront contemporary cross-modal retrieval, while engaging in a discourse on potential directions for further progress in the field. To facilitate the research on cross-modal retrieval, we develop an open-source code repository at //github.com/BMC-SDNU/Cross-Modal-Retrieval.
Knowledge graph reasoning (KGR), aiming to deduce new facts from existing facts based on mined logic rules underlying knowledge graphs (KGs), has become a fast-growing research direction. It has been proven to significantly benefit the usage of KGs in many AI applications, such as question answering and recommendation systems, etc. According to the graph types, the existing KGR models can be roughly divided into three categories, \textit{i.e.,} static models, temporal models, and multi-modal models. The early works in this domain mainly focus on static KGR and tend to directly apply general knowledge graph embedding models to the reasoning task. However, these models are not suitable for more complex but practical tasks, such as inductive static KGR, temporal KGR, and multi-modal KGR. To this end, multiple works have been developed recently, but no survey papers and open-source repositories comprehensively summarize and discuss models in this important direction. To fill the gap, we conduct a survey for knowledge graph reasoning tracing from static to temporal and then to multi-modal KGs. Concretely, the preliminaries, summaries of KGR models, and typical datasets are introduced and discussed consequently. Moreover, we discuss the challenges and potential opportunities. The corresponding open-source repository is shared on GitHub: //github.com/LIANGKE23/Awesome-Knowledge-Graph-Reasoning.
The recent proliferation of knowledge graphs (KGs) coupled with incomplete or partial information, in the form of missing relations (links) between entities, has fueled a lot of research on knowledge base completion (also known as relation prediction). Several recent works suggest that convolutional neural network (CNN) based models generate richer and more expressive feature embeddings and hence also perform well on relation prediction. However, we observe that these KG embeddings treat triples independently and thus fail to cover the complex and hidden information that is inherently implicit in the local neighborhood surrounding a triple. To this effect, our paper proposes a novel attention based feature embedding that captures both entity and relation features in any given entity's neighborhood. Additionally, we also encapsulate relation clusters and multihop relations in our model. Our empirical study offers insights into the efficacy of our attention based model and we show marked performance gains in comparison to state of the art methods on all datasets.
We introduce a multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks and develop a unified framework called Scientific Information Extractor (SciIE) for with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.
Within the rapidly developing Internet of Things (IoT), numerous and diverse physical devices, Edge devices, Cloud infrastructure, and their quality of service requirements (QoS), need to be represented within a unified specification in order to enable rapid IoT application development, monitoring, and dynamic reconfiguration. But heterogeneities among different configuration knowledge representation models pose limitations for acquisition, discovery and curation of configuration knowledge for coordinated IoT applications. This paper proposes a unified data model to represent IoT resource configuration knowledge artifacts. It also proposes IoT-CANE (Context-Aware recommendatioN systEm) to facilitate incremental knowledge acquisition and declarative context driven knowledge recommendation.