Accurate prediction of human behavior is crucial for effective human-robot interaction (HRI) systems, especially in dynamic environments where real-time decisions are essential. This paper addresses the challenge of forecasting future human behavior using multivariate time series data from wearable sensors, which capture various aspects of human movement. The presence of hidden confounding factors in this data often leads to biased predictions, limiting the reliability of traditional models. To overcome this, we propose a robust predictive model that integrates deconfounding techniques with advanced time series prediction methods, enhancing the model's ability to isolate true causal relationships and improve prediction accuracy. Evaluation on real-world datasets demonstrates that our approach significantly outperforms traditional methods, providing a more reliable foundation for responsive and adaptive HRI systems.
We propose a framework for adaptive data-centric collaborative learning among self-interested agents, coordinated by an arbiter. Designed to handle the incremental nature of real-world data, the framework operates in an online manner: at each step, the arbiter collects a batch of data from agents, trains a machine learning model, and provides each agent with a distinct model reflecting its data contributions. This setup establishes a feedback loop where shared data influence model updates, and the resulting models guide future data-sharing strategies. Agents evaluate and partition their data, selecting a partition to share using a stochastic parameterized policy that is trained via policy gradient methods to maximize the utility of the received model as defined by agent-specific evaluation functions. On the arbiter side, the expected loss function over the true data distribution is optimized, incorporating agent-specific weights to account for distributional differences arising from diverse sources and selective sharing. A bilevel optimization algorithm jointly learns the model parameters and agent-specific weights. Mean-zero noise, computed using a distortion function that adjusts these agent-specific weights, is introduced to generate distinct agent-specific models, promoting valuable data sharing without requiring separate training. Our framework is underpinned by non-asymptotic analyses, ensuring convergence of the agent-side policy optimization to an approximate stationary point of the evaluation functions and convergence of the arbiter-side optimization to an approximate stationary point of the expected loss function.
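The agent-side step described above, choosing a partition with a stochastic parameterized policy and improving it by policy gradient, can be illustrated with a minimal sketch. It assumes a softmax policy over three candidate partitions and a plain REINFORCE update; the function names, the learning rate, and the fixed `toy_utility` values are hypothetical stand-ins, not the framework's actual policy or evaluation functions.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    peak = max(theta)
    exps = [math.exp(t - peak) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, action, utility, lr=0.1):
    """One REINFORCE update: theta += lr * utility * grad log pi(action)."""
    probs = softmax(theta)
    # For a softmax policy, d/d theta_k log pi(a) = 1[k == a] - pi_k.
    return [t + lr * utility * ((1.0 if k == action else 0.0) - p)
            for k, (t, p) in enumerate(zip(theta, probs))]


def toy_utility(partition):
    """Hypothetical stand-in for the agent's evaluation of the received model."""
    return [0.1, 0.2, 1.0][partition]


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]  # one logit per candidate data partition
for _ in range(2000):
    probs = softmax(theta)
    # Sample which partition to share from the current stochastic policy.
    action = rng.choices(range(len(theta)), weights=probs)[0]
    theta = reinforce_step(theta, action, toy_utility(action))

final_probs = softmax(theta)  # sharing probabilities after training
```

In the framework itself the utility would come from evaluating the model returned by the arbiter, so it changes as the arbiter updates; the fixed table here only keeps the sketch self-contained.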
The outstanding capabilities of large language models (LLMs) render them a crucial component in various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine-tuning, more recent approaches have shifted toward the reinforcement learning strategy to further enhance agents' ability to solve complex interactive tasks with environments and tools. However, previous approaches are constrained by the sparse reward issue, where existing datasets solely provide a final scalar reward for each multi-step reasoning chain, potentially leading to ineffectiveness and inefficiency in policy learning. In this paper, we introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. Inheriting the spirit of novice-to-expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine-grained optimization. Additionally, we propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.
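To make the contrast between a single final reward and step-wise rewards concrete, the following is a deliberately simplified sketch. It assumes that agent and expert actions can be compared by exact string match at each step; the abstract does not specify the comparison, and StepAgent's implicit-reward and inverse reinforcement learning techniques are not reproduced here. The helper name and the example trajectories are hypothetical.

```python
def stepwise_rewards(expert_actions, agent_actions):
    """Return one reward per expert step: 1.0 on a match, else 0.0.

    Hypothetical helper; exact-match comparison is an assumption.
    """
    rewards = []
    for step, expert_action in enumerate(expert_actions):
        # A step the agent never reached counts as a mismatch.
        agent_action = agent_actions[step] if step < len(agent_actions) else None
        rewards.append(1.0 if agent_action == expert_action else 0.0)
    return rewards


expert = ["search[red mug]", "click[item 3]", "click[buy now]"]
agent = ["search[red mug]", "click[item 7]", "click[buy now]"]
step_rewards = stepwise_rewards(expert, agent)
```

Here the second step receives a reward of 0.0, which localizes the error in a way that a single scalar reward for the whole chain cannot.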
Dense 3D correspondence can enhance robotic manipulation by enabling the generalization of spatial, functional, and dynamic information from one object to an unseen counterpart. Compared to shape correspondence, semantic correspondence is more effective in generalizing across different object categories. To this end, we present DenseMatcher, a method capable of computing 3D correspondences between in-the-wild objects that share similar structures. DenseMatcher first computes vertex features by projecting multiview 2D features onto meshes and refining them with a 3D network, and subsequently finds dense correspondences with the obtained features using a functional map. In addition, we craft the first 3D matching dataset that contains colored object meshes across diverse categories. In our experiments, we show that DenseMatcher significantly outperforms prior 3D matching baselines by 43.5%. We demonstrate the downstream effectiveness of DenseMatcher in (i) robotic manipulation, where it achieves cross-instance and cross-category generalization on long-horizon complex manipulation tasks from observing only one demo; and (ii) zero-shot color mapping between digital assets, where appearance can be transferred between different objects with relatable geometry.
Surgical interventions, particularly in neurology, represent complex and high-stakes scenarios that impose substantial cognitive burdens on surgical teams. Although deliberate education and practice can enhance cognitive capabilities, surgical training opportunities remain limited due to patient safety concerns. To address these cognitive challenges in surgical training and operation, we propose SurgBox, an agent-driven sandbox framework to systematically enhance the cognitive capabilities of surgeons in immersive surgical simulations. Specifically, our SurgBox leverages large language models (LLMs) with tailored Retrieval-Augmented Generation (RAG) to authentically replicate various surgical roles, enabling realistic training environments for deliberate practice. In particular, we devise Surgery Copilot, an AI-driven assistant to actively coordinate the surgical information stream and support clinical decision-making, thereby diminishing the cognitive workload of surgical teams during surgery. By incorporating a novel Long-Short Memory mechanism, our Surgery Copilot can effectively balance immediate procedural assistance with comprehensive surgical knowledge. Extensive experiments using real neurosurgical procedure records validate our SurgBox framework in both enhancing surgical cognitive capabilities and supporting clinical decision-making. By providing an integrated solution for training and operational support to address cognitive challenges, our SurgBox framework advances surgical education and practice, potentially transforming surgical outcomes and healthcare quality. The code is available at //github.com/franciszchen/SurgBox.
Recent advances in deep learning and natural language generation have significantly improved image captioning, enabling automated, human-like descriptions for visual content. In this work, we apply these captioning techniques to generate clinician-like interpretations of ECG data. This study leverages existing ECG datasets accompanied by free-text reports authored by healthcare professionals (HCPs) as training data. These reports, while often inconsistent, provide a valuable foundation for automated learning. We introduce an encoder-decoder-based method that uses these reports to train models to generate detailed descriptions of ECG episodes. This represents a significant advancement in ECG analysis automation, with potential applications in zero-shot classification and automated clinical decision support. The model is tested on various datasets, including both 1- and 12-lead ECGs. It significantly outperforms the state-of-the-art reference model by Qiu et al., achieving a METEOR score of 55.53% compared to 24.51% achieved by the reference model. Furthermore, several key design choices are discussed, providing a comprehensive overview of current challenges and innovations in this domain. The source code for this research is publicly available in our Git repository //git.zib.de/ableich/ecg-comment-generation-public
Human intelligence thrives on the concept of cognitive synergy, where collaboration and information integration among different cognitive processes yield superior outcomes compared to individual cognitive processes in isolation. Although Large Language Models (LLMs) have demonstrated promising performance as general task-solving agents, they still struggle with tasks that require intensive domain knowledge and complex reasoning. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist refers to an intelligent agent that collaborates with multiple minds, combining their individual strengths and knowledge, to enhance problem-solving and overall performance in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. We have discovered that assigning multiple, fine-grained personas in LLMs elicits better problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, SPP effectively elicits internal knowledge acquisition abilities, reduces hallucination, and maintains strong reasoning capabilities. Code, data, and prompts can be found at: //github.com/MikeWangWZHL/Solo-Performance-Prompting.git.
The existence of representative datasets is a prerequisite for many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcoming the limitations of purely data-driven approaches, and eventually to increasing the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories of integration, extraction, and conformity. Special attention is given to applications in the field of autonomous driving.
Inspired by the human cognitive system, attention is a mechanism that imitates human cognitive awareness of specific information, amplifying critical details to focus more on the essential aspects of data. Deep learning has employed attention to boost performance for many applications. Interestingly, the same attention design can suit processing different data modalities and can easily be incorporated into large networks. Furthermore, multiple complementary attention mechanisms can be incorporated in one network. Hence, attention techniques have become extremely attractive. However, the literature lacks a comprehensive survey specific to attention techniques to guide researchers in employing attention in their deep models. Note that, besides being demanding in terms of training data and computational resources, transformers only cover a single category in self-attention out of the many categories available. We fill this gap and provide an in-depth survey of 50 attention techniques, categorizing them by their most prominent features. We initiate our discussion by introducing the fundamental concepts behind the success of attention mechanisms. Next, we furnish some essentials such as the strengths and limitations of each attention category, describe their fundamental building blocks, basic formulations with primary usage, and applications specifically for computer vision. We also discuss the challenges and open questions related to attention mechanisms in general. Finally, we recommend possible future research directions for deep attention.
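As a concrete reference point for the self-attention category mentioned above, the following sketch implements standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. It is only one of the many attention designs such a survey covers, and it omits the learned query, key, and value projections used in transformers; the function names and the toy input are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of vectors given as lists of floats."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Self-attention: the same vectors serve as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one, so every output vector is a convex combination of the inputs, which is the amplification of relevant details described above.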
Traffic forecasting is an important factor for the success of intelligent transportation systems. Deep learning models, including convolutional neural networks and recurrent neural networks, have been applied in traffic forecasting problems to model the spatial and temporal dependencies. In recent years, to model the graph structures in the transportation systems as well as the contextual information, graph neural networks (GNNs) have been introduced as new tools and have achieved state-of-the-art performance in a series of traffic forecasting problems. In this survey, we review the rapidly growing body of recent research using different GNNs, e.g., graph convolutional and graph attention networks, in various traffic forecasting problems, e.g., road traffic flow and speed forecasting, passenger flow forecasting in urban rail transit systems, demand forecasting in ride-hailing platforms, etc. We also present a collection of open data and source resources for each problem, as well as future research directions. To the best of our knowledge, this paper is the first comprehensive survey that explores the application of graph neural networks for traffic forecasting problems. We have also created a public GitHub repository to update the latest papers, open data and source resources.
Detecting carried objects is one of the requirements for developing systems to reason about activities involving people and objects. We present an approach to detect carried objects from a single video frame with a novel method that incorporates features from multiple scales. Initially, a foreground mask in a video frame is segmented into multi-scale superpixels. Then the human-like regions in the segmented area are identified by matching a set of extracted features from superpixels against learned features in a codebook. A carried object probability map is generated using the complement of the matching probabilities of superpixels to human-like regions and background information. A group of superpixels with high carried object probability and strong edge support is then merged to obtain the shape of the carried object. We applied our method to two challenging datasets, and results show that our method is competitive with or better than the state-of-the-art.
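The probability-map step can be illustrated with a small sketch. The abstract states only that the map uses the complement of the matching probabilities to human-like regions together with background information; the product rule below, which treats the two cues as independent, as well as the superpixel names, scores, and the 0.5 threshold, are assumptions made for illustration and not the paper's formulation.

```python
def carried_object_probability(p_human, p_background):
    """Complement-based score for one superpixel (assumed combination rule)."""
    # High only when the superpixel looks like neither human nor background.
    return (1.0 - p_human) * (1.0 - p_background)


# Hypothetical (p_human, p_background) scores for three superpixels.
superpixels = {
    "torso": (0.9, 0.05),
    "bag": (0.2, 0.1),
    "wall": (0.1, 0.95),
}

probability_map = {
    name: carried_object_probability(p_human, p_background)
    for name, (p_human, p_background) in superpixels.items()
}

# Keep superpixels whose score exceeds an illustrative threshold.
candidates = [name for name, p in probability_map.items() if p > 0.5]
```

In the described method, the surviving superpixels would additionally need strong edge support before being merged into the final object shape; that step is not sketched here.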