Perceiving and manipulating 3D articulated objects in diverse environments is essential for home-assistant robots. Recent studies have shown that point-level affordance provides actionable priors for downstream manipulation tasks. However, existing works primarily focus on single-object scenarios with homogeneous agents, overlooking the realistic constraints imposed by the environment and the agent's morphology, e.g., occlusions and physical limitations. In this paper, we propose an environment-aware affordance framework that incorporates both object-level actionable priors and environment constraints. Unlike object-centric affordance approaches, learning environment-aware affordance faces the challenge of combinatorial explosion due to the complexity of various occlusions, characterized by their quantities, geometries, positions and poses. To address this and enhance data efficiency, we introduce a novel contrastive affordance learning framework capable of training on scenes containing a single occluder and generalizing to scenes with complex occluder combinations. Experiments demonstrate the effectiveness of our proposed approach in learning affordance considering environment constraints. Project page at //chengkaiacademycity.github.io/EnvAwareAfford/
Task assignment and scheduling algorithms are powerful tools for autonomously coordinating large teams of robotic or AI agents. However, the decisions these system make often rely on components designed by domain experts, which can be difficult for non-technical end-users to understand or modify to their own ends. In this paper we propose a preliminary design for a flexible natural language interface for a task assignment system. The goal of our approach is both to grant users more control over a task assignment system's decision process, as well as render these decisions more transparent. Users can direct the task assignment system via natural language commands, which are applied as constraints to a mixed-integer linear program (MILP) using a large language model (LLM). Additionally, our proposed system can alert users to potential issues with their commands, and engage them in a corrective dialogue in order to find a viable solution. We conclude with a description of our planned user-evaluation in the simulated environment Overcooked and describe next steps towards developing a flexible and transparent task allocation system.
Generalizable articulated object manipulation is essential for home-assistant robots. Recent efforts focus on imitation learning from demonstrations or reinforcement learning in simulation, however, due to the prohibitive costs of real-world data collection and precise object simulation, it still remains challenging for these works to achieve broad adaptability across diverse articulated objects. Recently, many works have tried to utilize the strong in-context learning ability of Large Language Models (LLMs) to achieve generalizable robotic manipulation, but most of these researches focus on high-level task planning, sidelining low-level robotic control. In this work, building on the idea that the kinematic structure of the object determines how we can manipulate it, we propose a kinematic-aware prompting framework that prompts LLMs with kinematic knowledge of objects to generate low-level motion trajectory waypoints, supporting various object manipulation. To effectively prompt LLMs with the kinematic structure of different objects, we design a unified kinematic knowledge parser, which represents various articulated objects as a unified textual description containing kinematic joints and contact location. Building upon this unified description, a kinematic-aware planner model is proposed to generate precise 3D manipulation waypoints via a designed kinematic-aware chain-of-thoughts prompting method. Our evaluation spanned 48 instances across 16 distinct categories, revealing that our framework not only outperforms traditional methods on 8 seen categories but also shows a powerful zero-shot capability for 8 unseen articulated object categories. Moreover, the real-world experiments on 7 different object categories prove our framework's adaptability in practical scenarios. Code is released at \href{//github.com/GeWu-Lab/LLM_articulated_object_manipulation/tree/main}{here}.
Accurate 3D object detection (3DOD) is crucial for safe navigation of complex environments by autonomous robots. Regressing accurate 3D bounding boxes in cluttered environments based on sparse LiDAR data is however a highly challenging problem. We address this task by exploring recent advances in conditional energy-based models (EBMs) for probabilistic regression. While methods employing EBMs for regression have demonstrated impressive performance on 2D object detection in images, these techniques are not directly applicable to 3D bounding boxes. In this work, we therefore design a differentiable pooling operator for 3D bounding boxes, serving as the core module of our EBM network. We further integrate this general approach into the state-of-the-art 3D object detector SA-SSD. On the KITTI dataset, our proposed approach consistently outperforms the SA-SSD baseline across all 3DOD metrics, demonstrating the potential of EBM-based regression for highly accurate 3DOD. Code is available at //github.com/fregu856/ebms_3dod.
For autonomous vehicles (AVs), visual perception techniques based on sensors like cameras play crucial roles in information acquisition and processing. In various computer perception tasks for AVs, it may be helpful to match landmark patches taken by an onboard camera with other landmark patches captured at a different time or saved in a street scene image database. To perform matching under challenging driving environments caused by changing seasons, weather, and illumination, we utilize the spatial neighborhood information of each patch. We propose an approach, named RobustMat, which derives its robustness to perturbations from neural differential equations. A convolutional neural ODE diffusion module is used to learn the feature representation for the landmark patches. A graph neural PDE diffusion module then aggregates information from neighboring landmark patches in the street scene. Finally, feature similarity learning outputs the final matching score. Our approach is evaluated on several street scene datasets and demonstrated to achieve state-of-the-art matching results under environmental perturbations.
The potential of Martian lava tubes for resource extraction and habitat sheltering highlights the need for robots capable to undertake the grueling task of their exploration. Driven by this motivation, in this work we introduce a legged robot system optimized for jumping in the low gravity of Mars, designed with leg configurations adaptable to both bipedal and quadrupedal systems. This design utilizes torque-controlled actuators coupled with springs for high-power jumping, robust locomotion, and an energy-efficient resting pose. Key design features include a 5-bar mechanism as leg concept, combined with springs connected by a high-strength cord. The selected 5-bar link lengths and spring stiffness were optimized for maximizing the jump height in Martian gravity and realized as a robot leg. Two such legs combined with a compact body allowed jump testing of a bipedal prototype. The robot is 0.472 m tall and weighs 7.9 kg. Jump testing with significant safety margins resulted in a measured jump height of 1.141 m in Earth's gravity, while a total of 4 jumping experiments are presented. Simulations utilizing the full motor torque and kinematic limits of the design resulted in a maximum possible jump height of 1.52 m in Earth's gravity and 3.63 m in Mars' gravity, highlighting the versatility of jumping as a form of locomotion and overcoming obstacles in lower gravity.
Our goal is to perform out-of-distribution (OOD) detection, i.e., to detect when a robot is operating in environments drawn from a different distribution than the ones used to train the robot. We leverage Probably Approximately Correct (PAC)-Bayes theory to train a policy with a guaranteed bound on performance on the training distribution. Our idea for OOD detection relies on the following intuition: violation of the performance bound on test environments provides evidence that the robot is operating OOD. We formalize this via statistical techniques based on p-values and concentration inequalities. The approach provides guaranteed confidence bounds on OOD detection including bounds on both the false positive and false negative rates of the detector and is task-driven and only sensitive to changes that impact the robot's performance. We demonstrate our approach in simulation and hardware for a grasping task using objects with unfamiliar shapes or poses and a drone performing vision-based obstacle avoidance in environments with wind disturbances and varied obstacle densities. Our examples demonstrate that we can perform task-driven OOD detection within just a handful of trials.
Model Predictive Control (MPC) is a popular strategy for controlling robots but is difficult for systems with contact due to the complex nature of hybrid dynamics. To implement MPC for systems with contact, dynamic models are often simplified or contact sequences fixed in time in order to plan trajectories efficiently. In this work, we extend Hybrid iterative Linear Quadratic Regulator to work in a MPC fashion (HiLQR MPC) by 1) modifying how the cost function is computed when contact modes do not align, 2) utilizing parallelizations when simulating rigid body dynamics, and 3) using efficient analytical derivative computations of the rigid body dynamics. The result is a system that can modify the contact sequence of the reference behavior and plan whole body motions cohesively -- which is crucial when dealing with large perturbations. HiLQR MPC is tested on two systems: first, the hybrid cost modification is validated on a simple actuated bouncing ball hybrid system. Then HiLQR MPC is compared against methods that utilize centroidal dynamic assumptions on a quadruped robot (Unitree A1). HiLQR MPC outperforms the centroidal methods in both simulation and hardware tests.
Existing knowledge graph (KG) embedding models have primarily focused on static KGs. However, real-world KGs do not remain static, but rather evolve and grow in tandem with the development of KG applications. Consequently, new facts and previously unseen entities and relations continually emerge, necessitating an embedding model that can quickly learn and transfer new knowledge through growth. Motivated by this, we delve into an expanding field of KG embedding in this paper, i.e., lifelong KG embedding. We consider knowledge transfer and retention of the learning on growing snapshots of a KG without having to learn embeddings from scratch. The proposed model includes a masked KG autoencoder for embedding learning and update, with an embedding transfer strategy to inject the learned knowledge into the new entity and relation embeddings, and an embedding regularization method to avoid catastrophic forgetting. To investigate the impacts of different aspects of KG growth, we construct four datasets to evaluate the performance of lifelong KG embedding. Experimental results show that the proposed model outperforms the state-of-the-art inductive and lifelong embedding baselines.
Translational distance-based knowledge graph embedding has shown progressive improvements on the link prediction task, from TransE to the latest state-of-the-art RotatE. However, N-1, 1-N and N-N predictions still remain challenging. In this work, we propose a novel translational distance-based approach for knowledge graph link prediction. The proposed method includes two-folds, first we extend the RotatE from 2D complex domain to high dimension space with orthogonal transforms to model relations for better modeling capacity. Second, the graph context is explicitly modeled via two directed context representations. These context representations are used as part of the distance scoring function to measure the plausibility of the triples during training and inference. The proposed approach effectively improves prediction accuracy on the difficult N-1, 1-N and N-N cases for knowledge graph link prediction task. The experimental results show that it achieves better performance on two benchmark data sets compared to the baseline RotatE, especially on data set (FB15k-237) with many high in-degree connection nodes.
Detecting carried objects is one of the requirements for developing systems to reason about activities involving people and objects. We present an approach to detect carried objects from a single video frame with a novel method that incorporates features from multiple scales. Initially, a foreground mask in a video frame is segmented into multi-scale superpixels. Then the human-like regions in the segmented area are identified by matching a set of extracted features from superpixels against learned features in a codebook. A carried object probability map is generated using the complement of the matching probabilities of superpixels to human-like regions and background information. A group of superpixels with high carried object probability and strong edge support is then merged to obtain the shape of the carried object. We applied our method to two challenging datasets, and results show that our method is competitive with or better than the state-of-the-art.