Remote photoplethysmography (rPPG) is a technology that measures and analyzes blood volume pulse (BVP) using the light-absorption characteristics of hemoglobin captured through a camera. Analysis of the measured BVP can yield various physiological indicators, such as heart rate, stress level, and blood pressure, with applications in telemedicine, remote patient monitoring, and early prediction of cardiovascular disease. rPPG is rapidly evolving and attracting great attention from both academia and industry because it offers strong usability and convenience: biosignals can be measured with a camera-equipped device, without medical or wearable devices. Despite extensive efforts and advances in this field, serious challenges remain, including issues related to skin color, camera characteristics, ambient lighting, and other sources of noise and artifacts, all of which degrade measurement accuracy. We argue that fair and reproducible benchmarking is urgently required to overcome these challenges and make meaningful progress from both academic and commercial perspectives. In most existing work, models are trained, tested, and validated only on limited datasets. Worse, some studies lack available code or reproducibility, making it difficult to evaluate and compare performance fairly. The purpose of this study is therefore to provide a benchmarking framework that evaluates a wide range of rPPG techniques, both conventional non-deep-neural-network (non-DNN) and deep neural network (DNN) methods, across multiple datasets for fair evaluation and comparison. GitHub URL: //github.com/remotebiosensing/rppg
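As a point of reference for what a conventional (non-DNN) rPPG pipeline looks like, the sketch below estimates heart rate from the spatially averaged green channel of a face region, band-passed to plausible pulse frequencies. The `frames` array, the fixed ROI, and the band limits are illustrative assumptions, not the benchmark's implementation.

```python
# Minimal sketch of a conventional (non-DNN) rPPG pipeline: average the green
# channel over an (assumed) face region, band-pass to the pulse band, and take
# the dominant FFT peak as the heart rate. Illustrative only.
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_heart_rate(frames, fps, roi=(slice(100, 200), slice(100, 200))):
    """frames: (T, H, W, 3) RGB video as a NumPy array; fps: frames per second."""
    # 1) Spatially average the green channel inside the ROI.
    green = frames[:, roi[0], roi[1], 1].mean(axis=(1, 2)).astype(np.float64)
    green -= green.mean()

    # 2) Band-pass to the physiological pulse band (~0.7-4 Hz, i.e. 42-240 bpm).
    b, a = butter(3, [0.7, 4.0], btype="band", fs=fps)
    bvp = filtfilt(b, a, green)

    # 3) Heart rate from the dominant spectral peak within the band.
    freqs = np.fft.rfftfreq(len(bvp), d=1.0 / fps)
    power = np.abs(np.fft.rfft(bvp)) ** 2
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return 60.0 * freqs[band][np.argmax(power[band])]  # beats per minute
```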
Background: Bayesian Networks (BNs) are probabilistic graphical models that leverage Bayes' theorem to portray dependencies and cause-and-effect relationships between variables. These networks have gained prominence in the field of health sciences, particularly in diagnostic processes, by allowing the integration of medical knowledge into models and addressing uncertainty in a probabilistic manner. Objectives: This review aims to provide an exhaustive overview of the current state of Bayesian Networks in disease diagnosis and prognosis. Additionally, it seeks to introduce readers to the fundamental methodology of BNs, emphasizing their versatility and applicability across varied medical domains. Methods: Employing a meticulous search strategy with MeSH descriptors in diverse scientific databases, we identified 190 relevant references. These were subjected to a rigorous analysis, resulting in the retention of 60 papers for an in-depth review. The robustness of our approach minimizes the risk of selection bias. Results: The selected studies encompass a wide range of medical areas, providing insights into the statistical methodology, implementation feasibility, and predictive accuracy of BNs, as evidenced by an average AUC exceeding 75%. The comprehensive analysis underscores the adaptability and efficacy of Bayesian Networks in diverse clinical scenarios. Conclusion: The encompassing exploration of Bayesian Networks presented in this review highlights their significance and growing impact in the realm of disease diagnosis and prognosis. It underscores the need for further research and development to optimize their capabilities and broaden their applicability in addressing diverse and intricate healthcare challenges.
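To make the mechanism behind diagnostic BNs concrete, here is a minimal Bayes'-theorem update of the kind a single disease-test node pair performs; the prevalence, sensitivity, and specificity values are hypothetical and purely illustrative.

```python
# Posterior probability of disease given a positive test, from assumed
# (illustrative) prevalence, sensitivity, and specificity.
def posterior_disease_given_positive(prevalence, sensitivity, specificity):
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # Total probability of a positive result.
    p_pos = p_pos_given_disease * prevalence + p_pos_given_healthy * (1.0 - prevalence)
    # Bayes' theorem.
    return p_pos_given_disease * prevalence / p_pos

# Hypothetical numbers: 1% prevalence, 90% sensitivity, 95% specificity.
print(posterior_disease_given_positive(0.01, 0.90, 0.95))  # ~0.154
```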
Connected Medical Devices (CMDs) have a large impact on patients, as they allow them to lead a more normal life. Any malfunction could not only remove the health benefits that CMDs provide but also cause further harm to the patient. For this reason, many safety regulations must be adhered to before a CMD enters the market. However, while detailed safety regulations exist, there is a fundamental lack of cybersecurity frameworks applicable to CMDs. Recent regulations aim to enforce cybersecurity practices, but they are vague and do not contain the concrete steps necessary to implement cybersecurity. This paper aims to fill that gap by describing a framework, CyMed, to be used by vendors and end-users, which contains concrete measures to improve the resilience of CMDs against cyber attacks. The CyMed framework is evaluated through practical tests as well as expert interviews.
AI alignment is about ensuring that AI systems pursue only goals and activities that are beneficial to humans. Most current approaches to AI alignment learn what humans value from their behavioural data. This paper proposes a different way of looking at the notion of alignment, namely by introducing AI alignment dialogues: dialogues in which users and agents try to achieve and maintain alignment through interaction. We argue that alignment dialogues have a number of advantages over data-driven approaches, especially for behaviour support agents, which aim to support users in achieving their desired future behaviours rather than their current behaviours. The advantages of alignment dialogues include allowing users to directly convey higher-level concepts to the agent and making the agent more transparent and trustworthy. In this paper we outline the concept and high-level structure of alignment dialogues. Moreover, we conducted a qualitative focus-group user study, from which we developed a model describing how alignment dialogues affect users and derived design suggestions for AI alignment dialogues. Through this we establish foundations for AI alignment dialogues and shed light on what requires further development and research.
4D millimeter-wave (mmWave) radar is a promising technology for vehicle sensing due to its cost-effectiveness and operability in adverse weather conditions. However, the adoption of this technology has been hindered by sparsity and noise issues in radar point cloud data. This paper introduces spatial multi-representation fusion (SMURF), a novel approach to 3D object detection using a single 4D imaging radar. SMURF leverages multiple representations of radar detection points, including pillarization and density features of a multi-dimensional Gaussian mixture distribution obtained through kernel density estimation (KDE). KDE effectively mitigates measurement inaccuracy caused by the limited angular resolution and multi-path propagation of radar signals; it also helps alleviate point cloud sparsity by capturing density features. Experimental evaluations on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate the effectiveness and generalization ability of SMURF, which outperforms recently proposed 4D imaging radar-based single-representation models. Moreover, while using 4D imaging radar only, SMURF still achieves performance comparable to the state-of-the-art 4D imaging radar and camera fusion-based method, with an increase of 1.22% in mean average precision on the bird's-eye view of the TJ4DRadSet dataset and 1.32% in 3D mean average precision on the entire annotated area of the VoD dataset. The proposed method also addresses the challenge of real-time detection, with an inference time of no more than 0.05 seconds for most scans on both datasets. This research highlights the benefits of 4D mmWave radar and provides a strong benchmark for subsequent work on 3D object detection with 4D imaging radar.
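For intuition on the density-feature idea, the following sketch attaches a KDE-based local density to each radar detection point using an off-the-shelf Gaussian KDE; it is a simplified stand-in for illustration, not SMURF's multi-dimensional Gaussian mixture representation.

```python
# Attach a kernel-density estimate of local point density as an extra
# per-point feature, one way to compensate for sparse radar point clouds.
import numpy as np
from scipy.stats import gaussian_kde

def add_density_feature(points):
    """points: (N, 3) array of radar detections (x, y, z)."""
    kde = gaussian_kde(points.T)          # Gaussian kernels, bandwidth via Scott's rule
    density = kde(points.T)               # local density evaluated at each point
    return np.hstack([points, density[:, None]])  # (N, 4): xyz + density feature

# Usage with random stand-in detections.
pts = np.random.randn(200, 3)
print(add_density_feature(pts).shape)  # (200, 4)
```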
Recent vision-language models have achieved tremendous progress, far beyond what was expected. However, their computational costs are also growing dramatically with this rapid development, especially for large models, which makes model acceleration critical in resource-limited scenarios. Although extensively studied for unimodal models, acceleration for multimodal models, especially vision-language Transformers, remains relatively under-explored. To pursue more efficient and accessible vision-language Transformers, this paper introduces \textbf{Cross}-\textbf{G}uided \textbf{E}nsemble of \textbf{T}okens (\textbf{\emph{CrossGET}}), a universal acceleration framework for vision-language Transformers. This framework adaptively combines tokens through real-time, cross-modal guidance, thereby achieving substantial acceleration while maintaining high performance. \textit{CrossGET} has two key innovations: 1) \textit{Cross-Guided Matching and Ensemble}. \textit{CrossGET} incorporates cross-modal guided token matching and ensemble to exploit cross-modal information effectively, introducing only cross-modal tokens with negligible extra parameters. 2) \textit{Complete-Graph Soft Matching}. In contrast to the existing bipartite soft matching approach, \textit{CrossGET} introduces a complete-graph soft matching policy to achieve more reliable token-matching results while maintaining parallelizability and high efficiency. Extensive experiments are conducted on various vision-language tasks, including image-text retrieval, visual reasoning, image captioning, and visual question answering. Performance on both classic multimodal architectures and emerging multimodal LLMs demonstrates the effectiveness and versatility of the proposed \textit{CrossGET} framework. The code will be available at \url{//github.com/sdc17/CrossGET}.
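To illustrate the general mechanism of similarity-based token merging that such acceleration frameworks build on, the sketch below greedily merges the most similar token pairs over the complete pairwise-similarity graph and averages their features. It is a simplified, single-modality stand-in, not \textit{CrossGET}'s cross-guided soft matching policy.

```python
# Greedy token merging over a complete pairwise cosine-similarity graph:
# repeatedly merge the two most similar tokens into their average.
import torch

def merge_tokens(x, num_merges):
    """x: (N, D) token features; shrink the sequence by `num_merges` tokens."""
    for _ in range(num_merges):
        sim = torch.nn.functional.cosine_similarity(
            x.unsqueeze(1), x.unsqueeze(0), dim=-1)    # (N, N) complete graph
        sim.fill_diagonal_(float("-inf"))               # ignore self-similarity
        i, j = divmod(int(torch.argmax(sim)), x.size(0))
        merged = (x[i] + x[j]) / 2                      # ensemble the matched pair
        keep = [k for k in range(x.size(0)) if k not in (i, j)]
        x = torch.cat([x[keep], merged.unsqueeze(0)], dim=0)
    return x

tokens = torch.randn(16, 64)
print(merge_tokens(tokens, 4).shape)  # torch.Size([12, 64])
```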
The photometric stereo (PS) problem consists of reconstructing the 3D surface of an object from a set of photographs taken under different lighting directions. In this paper, we propose a multi-scale architecture for PS which, combined with a new dataset, yields state-of-the-art results. Our proposed architecture is flexible: it can handle a variable number of images as well as variable image sizes without loss of performance. In addition, we define a set of constraints that allow the generation of a relevant synthetic dataset for training convolutional neural networks for the PS problem. Our proposed dataset is much larger than pre-existing ones and contains many objects with challenging materials that have anisotropic reflectance (e.g., metals, glass). We show on publicly available benchmarks that the combination of these two contributions drastically improves the accuracy of the estimated normal field compared with previous state-of-the-art methods.
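For context, the classical Lambertian baseline that learning-based PS methods improve upon reduces to a per-pixel least-squares problem: with known lighting directions, the observed intensities satisfy I = L (albedo * n). The sketch below solves that system; it assumes grayscale images and calibrated lights, and is not the proposed multi-scale network.

```python
# Classical Lambertian photometric stereo via least squares:
# recover albedo-scaled normals from K images under known lighting.
import numpy as np

def photometric_stereo(images, lights):
    """images: (K, H, W) grayscale observations; lights: (K, 3) unit directions."""
    K, H, W = images.shape
    I = images.reshape(K, -1)                       # (K, H*W) stacked intensities
    G, *_ = np.linalg.lstsq(lights, I, rcond=None)  # (3, H*W): albedo-scaled normals
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)          # unit surface normals
    return normals.T.reshape(H, W, 3), albedo.reshape(H, W)
```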
Visual Question Answering (VQA) is a challenging task that requires cross-modal understanding of, and reasoning over, visual images and natural-language questions. To examine how VQA models relate to human cognition, we designed a survey to record the human thinking process and analyzed VQA models by comparing their outputs and attention maps with those of humans. We found that although VQA models resemble human cognition in architecture and perform similarly to humans at the recognition level, they still struggle with cognitive inference. The analysis of the human thinking procedure serves to direct future research and to introduce more cognitive capacity into model features and architectures.
Task and Motion Planning (TAMP) approaches are effective at planning long-horizon autonomous robot manipulation. However, they can be difficult to apply to domains where the environment and its dynamics are not fully known. We propose to overcome these limitations by leveraging deep generative modeling, specifically diffusion models, to learn constraints and samplers that capture these difficult-to-engineer aspects of the planning model. These learned samplers are composed and combined within a TAMP solver to jointly find action parameter values that satisfy the constraints along a plan. To tractably make predictions for unseen objects in the environment, we define these samplers on low-dimensional learned latent embeddings of changing object state. We evaluate our approach in an articulated object manipulation domain and show how the combination of classical TAMP, generative learning, and latent embeddings enables long-horizon constraint-based reasoning. We also apply the learned sampler in the real world. More details are available at //sites.google.com/view/dimsam-tamp
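As a rough illustration of how a trained diffusion model can serve as a sampler for action parameters, the sketch below runs a generic DDPM-style reverse loop over a low-dimensional parameter vector; the `denoiser` network, the noise schedule, and the dimensionality are assumed placeholders rather than the paper's learned models.

```python
# Generic reverse-diffusion sampling of a low-dimensional parameter vector,
# of the kind a TAMP solver could query to propose constraint-satisfying values.
import torch

def sample_action_parameters(denoiser, dim, steps=50):
    betas = torch.linspace(1e-4, 0.02, steps)       # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(dim)                            # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t)                        # assumed trained noise predictor
        # DDPM mean update toward the learned sample distribution.
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn(dim)  # per-step noise
    return x                                        # candidate action parameters
```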
Graph neural networks (GNNs) have delivered a significant boost in prediction performance on graph data. At the same time, the predictions made by these models are often hard to interpret. In that regard, many efforts have been made to explain the prediction mechanisms of these models through methods such as GNNExplainer, XGNN, and PGExplainer. Although such works present systematic frameworks for interpreting GNNs, a holistic review of explainable GNNs has been unavailable. In this survey, we present a comprehensive review of explainability techniques developed for GNNs. We focus on explainable graph neural networks and categorize them based on the explanation methods they use. We further provide common performance metrics for GNN explanations and point out several future research directions.
Graph Neural Networks (GNNs) have been studied through the lenses of expressive power and generalization. However, their optimization properties are less well understood. We take the first step towards analyzing GNN training by studying the gradient dynamics of GNNs. First, we analyze linearized GNNs and prove that, despite the non-convexity of training, convergence to a global minimum at a linear rate is guaranteed under mild assumptions that we validate on real-world graphs. Second, we study what may affect the GNNs' training speed. Our results show that the training of GNNs is implicitly accelerated by skip connections, greater depth, and/or a good label distribution. Empirical results confirm that our theoretical findings for linearized GNNs align with the training behavior of nonlinear GNNs. Our results provide the first theoretical support for the success of GNNs with skip connections in terms of optimization, and suggest that deep GNNs with skip connections would be promising in practice.
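The linearized setting studied in such analyses is easy to reproduce: removing the nonlinearity reduces a GCN layer to a fixed linear feature propagation followed by a trainable linear map, and gradient descent on the squared loss empirically shows the geometric loss decay the theory describes. The toy graph, features, and step size below are illustrative only.

```python
# Gradient descent on a linearized (single-layer, no nonlinearity) GCN:
# predictions are A_norm @ X @ W, trained with a squared loss.
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 20, 8, 3
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)                        # add self-loops
deg = A.sum(1)
A_norm = A / np.sqrt(np.outer(deg, deg))        # symmetric normalization

X = rng.standard_normal((n, d))                 # toy node features
Y = rng.standard_normal((n, c))                 # toy regression targets
H = A_norm @ X                                  # fixed linear feature propagation
W = np.zeros((d, c))

for step in range(200):
    residual = H @ W - Y
    loss = 0.5 * np.mean(residual ** 2)
    grad = H.T @ residual / residual.size       # gradient of the loss above
    W -= 0.5 * grad
    if step % 50 == 0:
        print(step, loss)                       # loss shrinks roughly geometrically
```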