Multiphysics phenomena, the coupling effects that involve different physical laws, are pervasive in the real world and are often encountered in everyday household tasks. Intelligent agents that seek to assist or replace human laborers will need to learn to cope with such phenomena in household task settings. To equip agents with these abilities, the research community needs a simulation environment that can serve as a testbed for training such agents and that supports multiphysics coupling effects. Although mature multiphysics simulation software has long been adopted in industrial production, such techniques have not been applied to robot learning or embodied AI research. To bridge this gap, we propose a novel simulation environment named RFUniverse. This simulator computes not only rigid- and multi-body dynamics, but also the multiphysics coupling effects commonly observed in daily life, such as air-solid interaction, fluid-solid interaction, and heat transfer. Because of these unique multiphysics capabilities, we can benchmark tasks whose dynamics are complicated by multiphysics coupling effects in simulation before deploying to the real world. RFUniverse provides multiple interfaces that let users interact with the virtual world in various ways, which is helpful and essential for learning, planning, and control. We benchmark three tasks with reinforcement learning: food cutting, water pushing, and towel catching. We also evaluate butter pushing with a classic planning-and-control paradigm. This simulator thus enhances physics simulation with the computation of multiphysics coupling effects.
Large language models (LLMs) provide a promising tool that enables robots to perform complex reasoning tasks. However, the limited context window of contemporary LLMs makes reasoning over long time horizons difficult. Embodied tasks, such as those a household robot might be expected to perform, typically require that the planner consider information acquired long ago (e.g., properties of the many objects that the robot previously encountered in the environment). Attempts to capture the world state using an LLM's implicit internal representation are complicated by the paucity of task- and environment-relevant information available in a robot's action history, while methods that rely on conveying information via the prompt are subject to the LLM's limited context window. In this paper, we propose Statler, a framework that endows LLMs with an explicit representation of the world state as a form of ``memory'' that is maintained over time. Integral to Statler is its use of two instances of general LLMs -- a world-model reader and a world-model writer -- that interface with and maintain the world state. By providing access to this world-state ``memory'', Statler improves the ability of existing LLMs to reason over longer time horizons without the constraint of context length. We evaluate the effectiveness of our approach on three simulated table-top manipulation domains and a real robot domain, and show that it improves the state-of-the-art in LLM-based robot reasoning. Project website: //statler-lm.github.io/
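To make the reader/writer decomposition concrete, below is a minimal Python sketch of the loop described above. The prompt templates, the `query_llm` callable, and the textual world-state format are illustrative assumptions, not Statler's actual prompts or interfaces.

```python
# Minimal sketch of a Statler-style reader/writer loop (hypothetical prompts
# and interfaces; `query_llm` stands in for any call to a general LLM).
def statler_step(query_llm, instruction: str, world_state: str):
    # World-model reader: generate the next robot command conditioned on
    # the explicit world state instead of the full interaction history.
    action = query_llm(
        f"Current world state:\n{world_state}\n"
        f"Instruction: {instruction}\n"
        "Write the robot command that satisfies the instruction."
    )
    # World-model writer: update the textual state so later queries do not
    # need to re-read the entire action history.
    new_world_state = query_llm(
        f"Current world state:\n{world_state}\n"
        f"Executed command:\n{action}\n"
        "Return the updated world state."
    )
    return action, new_world_state
```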
Robot programming tools ranging from inverse kinematics (IK) to model predictive control (MPC) are most often formulated as constrained optimization problems. Even though many commercially available second-order solvers exist, the robotics literature has recently focused on efficient implementations and improvements over these solvers for real-time robotic applications. However, these implementations most often remain problem-specific, are not easy to access or implement, or do not exploit the geometric structure of robotics problems. In this work, we propose to solve these problems using a fast, easy-to-implement, first-order method that fully exploits the geometric constraints via Euclidean projections, called Augmented Lagrangian Spectral Projected Gradient Descent (ALSPG). We show that (1) using projections instead of full constraints and gradients improves the performance of the solver, and (2) ALSPG remains competitive with standard second-order methods such as iLQR in the unconstrained case. We showcase these results with IK and motion planning problems on simulated examples and with an MPC problem on a 7-axis manipulator experiment.
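As an illustration of the first-order idea, the following sketch shows a spectral projected gradient step with a Euclidean projection onto box constraints (e.g., joint limits) and a Barzilai-Borwein step length. The objective, bounds, and iteration budget are assumptions for demonstration; the paper's ALSPG additionally wraps such an inner solver in an augmented Lagrangian to handle general constraints.

```python
# Spectral projected gradient sketch with Euclidean projection onto a box.
import numpy as np

def project_box(x, lo, hi):
    # Euclidean projection onto {x : lo <= x <= hi}
    return np.clip(x, lo, hi)

def spg(grad, x, lo, hi, iters=100, alpha=1.0):
    for _ in range(iters):
        g = grad(x)
        x_new = project_box(x - alpha * g, lo, hi)
        s, y = x_new - x, grad(x_new) - g
        sy = float(s @ y)
        alpha = float(s @ s) / sy if sy > 1e-12 else alpha  # Barzilai-Borwein step
        x = x_new
    return x

# Toy example: reach a target configuration subject to joint limits.
target = np.array([0.7, -1.2, 2.5])
x_opt = spg(lambda x: 2 * (x - target), np.zeros(3), lo=-np.ones(3), hi=np.ones(3))
```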
Source code diffs are used on a daily basis as part of code review, inspection, and auditing. To facilitate understanding, they are typically accompanied by explanations that describe the essence of what is changed in the program. As manually crafting high-quality explanations is a cumbersome task, researchers have proposed automatic techniques to generate code diff explanations. Existing explanation generation methods solely focus on static analysis, i.e., they do not take advantage of runtime information to explain code changes. In this paper, we propose Collector-Sahab, a novel tool that augments code diffs with runtime difference information. Collector-Sahab compares the program states of the original (old) and patched (new) versions of a program to find unique variable values. Then, Collector-Sahab adds this novel runtime information to the source code diff so that it can be displayed, for instance, in code review systems. As an evaluation, we run Collector-Sahab on 584 code diffs for Defects4J bugs and find that it successfully augments the code diff for 95% (555/584) of them. We also perform a user study and ask eight participants to score the augmented code diffs generated by Collector-Sahab. Based on this user study, we conclude that developers find the idea of adding runtime data to code diffs promising and useful. Overall, our experiments show the effectiveness and usefulness of Collector-Sahab in augmenting code diffs with runtime difference information. Publicly-available repository: //github.com/ASSERT-KTH/collector-sahab.
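The core state-diffing idea can be sketched as follows: given variable values recorded at the same program location in the old and new versions, report the values that differ. The function and data layout below are hypothetical simplifications; Collector-Sahab's actual collection works by instrumenting the two program versions.

```python
# Hypothetical sketch of comparing recorded program states at one location.
def unique_values(old_state: dict, new_state: dict) -> dict:
    report = {}
    for var in set(old_state) | set(new_state):
        old_v, new_v = old_state.get(var), new_state.get(var)
        if old_v != new_v:
            report[var] = {"old": old_v, "new": new_v}
    return report

# Example annotation for a diff hunk:
print(unique_values({"total": 3, "flag": True}, {"total": 4, "flag": True}))
# {'total': {'old': 3, 'new': 4}}
```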
Increasingly popular home assistants are widely utilized as the central controller for smart home devices. However, current designs rely heavily on voice interfaces, which suffer from accessibility and usability issues; some of the latest devices are equipped with additional cameras and displays, which are costly and raise privacy concerns. These concerns jointly motivate Beyond-Voice, a novel deep-learning-driven acoustic sensing system that allows commodity home assistant devices to track and reconstruct hand poses continuously. It transforms the home assistant into an active sonar system using its existing onboard microphones and speakers. We feed a high-resolution range profile to a deep learning model that analyzes the motions of multiple body parts and predicts the 3D positions of 21 finger joints, substantially improving the granularity of acoustic hand tracking. The system operates across different environments and users without the need for personalized training data. A user study with 11 participants in 3 different environments shows that Beyond-Voice can track joints with an average mean absolute error of 16.47 mm without any training data provided by the testing subject.
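As a rough illustration of the learning stage of such a pipeline, the sketch below maps a window of range-profile frames to 21 three-dimensional joint positions. The network architecture, input shape, and layer sizes are assumptions chosen for demonstration and do not reproduce the authors' model.

```python
# Illustrative regressor from a range-profile window to 21 x 3 joint positions.
import torch
import torch.nn as nn

class HandPoseRegressor(nn.Module):
    def __init__(self, range_bins=256, frames=32, joints=21):
        super().__init__()
        self.joints = joints
        self.backbone = nn.Sequential(
            nn.Conv1d(frames, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, joints * 3)  # x, y, z per joint

    def forward(self, range_profile):  # (batch, frames, range_bins)
        return self.head(self.backbone(range_profile)).view(-1, self.joints, 3)

# HandPoseRegressor()(torch.randn(1, 32, 256)).shape == (1, 21, 3)
```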
We show, for the first time, that neural networks trained only on synthetic data achieve state-of-the-art accuracy on the problem of 3D human pose and shape (HPS) estimation from real images. Previous synthetic datasets have been small, unrealistic, or have lacked realistic clothing. Achieving sufficient realism is non-trivial, and we show how to do this for full bodies in motion. Specifically, our BEDLAM dataset contains monocular RGB videos with ground-truth 3D bodies in SMPL-X format. It includes a diversity of body shapes, motions, skin tones, hair, and clothing. The clothing is realistically simulated on the moving bodies using commercial clothing physics simulation. We render varying numbers of people in realistic scenes with varied lighting and camera motions. We then train various HPS regressors using BEDLAM and achieve state-of-the-art accuracy on real-image benchmarks despite training with synthetic data. We use BEDLAM to gain insights into what model design choices are important for accuracy. With good synthetic training data, we find that a basic method like HMR approaches the accuracy of the current SOTA method (CLIFF). BEDLAM is useful for a variety of tasks, and all images, ground-truth bodies, 3D clothing, support code, and more are available for research purposes. Additionally, we provide detailed information about our synthetic data generation pipeline, enabling others to generate their own datasets. See the project page: //bedlam.is.tue.mpg.de/.
In this article, a benchmark for real-world bin packing problems is proposed. This dataset consists of 12 instances of varying complexity in terms of size (with the number of packages ranging from 38 to 53) and user-defined requirements. In fact, several real-world-oriented restrictions were taken into account to build these instances: i) item and bin dimensions, ii) weight restrictions, iii) affinities among package categories, iv) preferences for package ordering, and v) load balancing. Besides the data, we also provide our own Python script for dataset generation, named Q4RealBPP-DataGen. The benchmark was initially proposed to evaluate the performance of quantum solvers; therefore, the characteristics of this set of instances were designed according to the current limitations of quantum devices. Additionally, the dataset generator is included to allow the construction of general-purpose benchmarks. The data introduced in this article provides a baseline that will encourage quantum computing researchers to work on real-world bin packing problems.
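To give a sense of what such an instance might contain, here is a hypothetical sketch of one instance expressed as a Python dictionary covering the five restriction types listed above. The field names are illustrative and may not match the actual Q4RealBPP-DataGen schema.

```python
# Hypothetical representation of a real-world-oriented bin packing instance.
instance = {
    "bin": {"dims": [120, 80, 100], "max_weight": 400.0},       # i) dimensions, ii) weight limit
    "items": [
        {"id": 0, "dims": [40, 30, 20], "weight": 12.5, "category": "fragile"},
        {"id": 1, "dims": [60, 40, 30], "weight": 25.0, "category": "chemicals"},
    ],
    "incompatible_categories": [["fragile", "chemicals"]],      # iii) category affinities
    "order_before": [[1, 0]],                                    # iv) ordering preferences
    "max_center_of_mass_offset": 10.0,                           # v) load balancing
}
```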
Multi-robot platforms are playing an increasingly important role in warehouse automation for efficient goods transport. This paper proposes a novel customization of a multi-robot system, called Tactile Mobile Manipulators (TacMMs). Each TacMM integrates a soft optical tactile sensor and a mobile robot with a load-lifting mechanism, enabling cooperative transportation in tasks requiring coordinated physical interaction. More specifically, we mount the TacTip (a biomimetic optical tactile sensor) on the Distributed Organisation and Transport System (DOTS) mobile robot. The tactile information then helps the mobile robots adjust the relative robot-object pose, thereby increasing the efficiency of load-lifting tasks. This study compares the performance of two TacMMs using tactile perception against traditional vision-based pose adjustment for load-lifting. The results show that the average success rate of the TacMMs (66%) is improved over a purely vision-based method (34%), with a larger improvement when the mass of the load is non-uniformly distributed. Although this initial study considers two TacMMs, we expect the benefits of tactile perception to extend to multiple mobile robots. Website: //sites.google.com/view/tacmms
The field of human-human-robot interaction (HHRI) uses social robots to positively influence how humans interact with each other. This objective requires models of human understanding that consider multiple humans in an interaction as a collective entity and represent the group dynamics that exist within it. Understanding group dynamics is important because these can influence the behaviors, attitudes, and opinions of each individual within the group, as well as the group as a whole. Such an understanding is also useful when personalizing an interaction between a robot and the humans in its environment, where a group-level model can facilitate the design of robot behaviors that are tailored to a given group, the dynamics that exist within it, and the specific needs and preferences of the individual interactants. In this paper, we highlight the need for group-level models of human understanding in human-human-robot interaction research and how these can be useful in developing personalization techniques. We survey existing models of group dynamics and categorize them into models of social dominance, affect, social cohesion, and conflict resolution. We highlight the important features these models utilize, evaluate their potential to capture interpersonal aspects of a social interaction, and highlight their value for personalization techniques. Finally, we identify directions for future work, and make a case for models of relational affect as an approach that can better capture group-level understanding of human-human interactions and be useful in personalizing human-human-robot interactions.
Recent research on Large Language Models (LLMs) has led to remarkable advancements in general NLP AI assistants. Some studies have further explored the use of LLMs for planning and invoking models or APIs to address more general multi-modal user queries. Despite this progress, complex vision-based tasks remain challenging due to the diverse nature of visual tasks. This diversity is reflected in two aspects: 1) Reasoning paths. For many real-life applications, it is hard to accurately decompose a query simply by examining the query itself; planning based on the specific visual content and the results of each step is usually required. 2) Flexible inputs and intermediate results. Input forms can be flexible in in-the-wild cases and involve not only a single image or video but a mixture of videos and images, e.g., a user-view image with some reference videos. Moreover, a complex reasoning process also generates diverse multimodal intermediate results, e.g., video narrations, segmented video clips, etc. To address such general cases, we propose a multi-modal AI assistant, AssistGPT, with an interleaved code and language reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate LLMs with various tools. Specifically, the Planner uses natural language to decide which tool in the Executor should act next based on the current reasoning progress. The Inspector is an efficient memory manager that assists the Planner in feeding the proper visual information into a specific tool. Finally, since the entire reasoning process is complex and flexible, a Learner is designed to enable the model to autonomously explore and discover the optimal solution. We conducted experiments on the A-OKVQA and NExT-QA benchmarks, achieving state-of-the-art results. Moreover, our showcases demonstrate the ability of our system to handle questions far more complex than those found in the benchmarks.
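The Plan-Execute-Inspect-Learn control flow can be summarized with the schematic loop below. The function names, the dictionary-based step format, and the memory structure are placeholders chosen for illustration rather than AssistGPT's actual interfaces.

```python
# Schematic Plan-Execute-Inspect-Learn (PEIL) loop with placeholder interfaces.
def peil(query, inputs, tools, plan, inspect, learn, max_steps=10):
    memory = {"inputs": inputs, "intermediate": []}    # maintained by the Inspector
    for _ in range(max_steps):
        step = plan(query, memory)                     # Planner: pick the next tool
        if step["tool"] == "final_answer":
            return step["args"]["text"]
        result = tools[step["tool"]](**step["args"])   # Executor: run the chosen tool
        inspect(memory, step, result)                  # Inspector: store/summarize the result
        learn(query, memory, step, result)             # Learner: refine the plan on failure
    return None
```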
In large-scale systems there are fundamental challenges when centralised techniques are used for task allocation. The number of interactions is limited by resource constraints such as computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system, sharing tasks across many agents. However, this also increases the resource cost of communication and synchronisation, and is difficult to scale. In this paper we present four algorithms to solve these problems. The combination of these algorithms enables each agent to improve its task-allocation strategy through reinforcement learning, while changing how much it explores the system in response to how optimal it believes its current strategy is, given its past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, limiting agents to local rather than system-wide knowledge. We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, which then carry out those tasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to within 6.7% of the theoretical optimum for the system configurations considered. It provides 5x better performance recovery than approaches without knowledge retention when system connectivity is impacted, and is tested on systems of up to 100 agents with less than a 9% impact on the algorithms' performance.
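A minimal sketch of the exploration idea follows: an agent values peers by a running reward estimate and explores more while it is unsure that its best-known peer is clearly better than the rest. The value update and the confidence-driven exploration rate are illustrative simplifications of the four algorithms described above, not the paper's exact formulation.

```python
# Sketch: exploration rate tied to how optimal the agent believes its strategy is.
import random
from collections import defaultdict

class AllocatorAgent:
    def __init__(self, lr=0.1):
        self.q = defaultdict(float)   # estimated value of delegating a subtask to each peer
        self.lr = lr

    def epsilon(self, peers):
        # Explore more while the best-known peer barely beats the average peer.
        vals = [self.q[p] for p in peers]
        spread = max(vals) - sum(vals) / len(vals)
        return max(0.05, 1.0 - spread)

    def choose_peer(self, peers):
        if random.random() < self.epsilon(peers):
            return random.choice(peers)               # explore
        return max(peers, key=lambda p: self.q[p])    # exploit current strategy

    def update(self, peer, reward):
        self.q[peer] += self.lr * (reward - self.q[peer])
```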