The ability of living organisms to perform complex, high-speed manoeuvres in flight with a very small number of neurons and an incredibly low failure rate highlights the efficacy of these resource-constrained biological systems. Event-driven hardware has emerged in recent years as a promising avenue for implementing complex vision tasks in resource-constrained environments. Vision-based autonomous navigation and obstacle avoidance consists of several independent but related tasks such as optical flow estimation, depth estimation, Simultaneous Localization and Mapping (SLAM), object detection, and recognition. To ensure coherence between these tasks, it is imperative that they be trained on a single dataset. However, most existing datasets provide only a selected subset of the required data, which makes inter-network coherence difficult to achieve. Another limitation of existing datasets is their limited temporal resolution. To address these limitations, we present FEDORA, a first-of-its-kind fully synthetic dataset for vision-based tasks, with ground truths for depth, pose, ego-motion, and optical flow. FEDORA is the first dataset to provide optical flow at three different frequencies: 10 Hz, 25 Hz, and 50 Hz.
Persons with visual impairments (PwVI) have difficulty understanding and navigating the spaces around them. Current wayfinding technologies either focus solely on navigation or provide limited communication about the environment. Motivated by recent advances in visual-language grounding and semantic navigation, we propose DRAGON, a guiding robot powered by a dialogue system and the ability to associate the environment with natural language. By understanding commands from the user, DRAGON is able to guide the user to the desired landmarks on the map, describe the environment, and answer questions from visual observations. Through effective use of dialogue, the robot can ground the user's free-form descriptions to landmarks in the environment and give the user semantic information through spoken language. We conduct a user study with blindfolded participants in an everyday indoor environment. Our results demonstrate that DRAGON is able to communicate with the user smoothly, provide a good guiding experience, and connect users with their surrounding environment in an intuitive manner.
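The abstract does not spell out how free-form user descriptions are matched to map landmarks; the toy sketch below illustrates one plausible grounding step using plain bag-of-words cosine similarity. The actual DRAGON system presumably relies on learned visual-language representations, and the landmark names and descriptions here are purely hypothetical.

```python
import math
from collections import Counter

def bow_vector(text):
    """Lowercased bag-of-words term counts for a short phrase."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ground_description(user_utterance, landmarks):
    """Return the map landmark whose description best matches the user's words."""
    scored = [(cosine(bow_vector(user_utterance), bow_vector(desc)), name)
              for name, desc in landmarks.items()]
    return max(scored)[1]

# Hypothetical landmark map, for illustration only.
landmarks = {
    "elevator": "elevator lobby near the main entrance",
    "cafe": "coffee shop with tables and a counter",
    "restroom": "restroom next to the stairwell",
}
print(ground_description("take me to where I can get a coffee", landmarks))  # -> cafe
```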
Deep reinforcement learning (DRL) is a promising method for learning robot control policies purely from demonstration and experience. To cover the whole dynamic behaviour of the robot, DRL training is an active exploration process typically performed in simulation environments. Although simulation training is cheap and fast, applying DRL algorithms to real-world settings is difficult: even agents trained until they perform safely in simulation are hard to transfer to physical systems because of the sim-to-real gap caused by differences between the simulation dynamics and the physical robot. In this paper, we present a method for training a DRL agent online to drive autonomously on a physical vehicle, using a model-based safety supervisor. Our solution uses a supervisory system to check whether the action selected by the agent is safe or unsafe and to ensure that a safe action is always implemented on the vehicle. With this, we can bypass the sim-to-real problem while training the DRL algorithm safely, quickly, and efficiently. We compare our method with conventional learning in simulation and on a physical vehicle. We provide a variety of real-world experiments in which we train a small-scale vehicle online to drive autonomously with no prior simulation training. The evaluation results show that our method trains agents with improved sample efficiency while never crashing, and the trained agents demonstrate better driving performance than those trained in simulation.
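To make the supervisor idea concrete, here is a minimal sketch of one control step with a model-based safety check. Everything in it is illustrative: the `agent`, the `dynamics_model`, the kinematic track-boundary constraint, and the fallback action are assumptions for this sketch, not the paper's actual supervisor.

```python
import numpy as np

def supervised_step(agent, supervisor, obs):
    """One control step with a model-based safety supervisor (illustrative sketch).

    The agent proposes an action; the supervisor rolls it out with its dynamics
    model and, if the predicted states violate the safety constraints, replaces
    it with a known-safe fallback before it reaches the vehicle.
    """
    proposed = agent.select_action(obs)            # e.g. steering and throttle
    if supervisor.is_safe(obs, proposed):          # model-based rollout check
        return proposed, False
    return supervisor.safe_action(obs), True       # fallback, flag the intervention


class SimpleSupervisor:
    """Hypothetical kinematic supervisor: reject actions that leave the track."""

    def __init__(self, dynamics_model, track_half_width, horizon=10):
        self.model = dynamics_model        # callable: (state, action) -> next state
        self.half_width = track_half_width
        self.horizon = horizon

    def is_safe(self, state, action):
        s = np.asarray(state, dtype=float)
        for _ in range(self.horizon):      # short model rollout with the action held
            s = self.model(s, action)
            if abs(s[1]) > self.half_width:   # assume s[1] is the lateral track offset
                return False
        return True

    def safe_action(self, state):
        # Simplest fallback: steer back toward the centre line at low speed.
        return np.array([-np.sign(state[1]) * 0.3, 0.1])
```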
Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, how to best connect them with an interactive visual environment remains an open problem. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as a contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human-written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve a 25%-30% relative improvement in task completion over the previous state of the art on two datasets.
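The landmark-visibility step can be sketched with an off-the-shelf CLIP model. The snippet below uses the Hugging Face `transformers` CLIP implementation and a hypothetical similarity threshold; the exact scoring rule, prompt wording, and threshold used by VELMA may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visible_landmarks(panorama: Image.Image, landmarks, threshold=25.0):
    """Score each landmark phrase against the current panorama with CLIP and
    keep those whose image-text similarity exceeds a (hypothetical) threshold."""
    inputs = processor(text=landmarks, images=panorama,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image.squeeze(0)   # one score per phrase
    return [lm for lm, s in zip(landmarks, sims.tolist()) if s > threshold]

def verbalize(visible):
    """Turn the visibility check into a text observation for the LLM prompt."""
    if not visible:
        return "There are no landmarks visible from here."
    return "You can see " + ", ".join(visible) + "."

# Usage sketch: the landmark phrases would come from the navigation instruction.
# pano = Image.open("panorama.jpg")
# print(verbalize(visible_landmarks(pano, ["a church", "a bus stop"])))
```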
Quantum internetworking is a recent field that promises numerous interesting applications, many of which require the distribution of entanglement between arbitrary pairs of users. This work deals with the problem of scheduling in an arbitrary entanglement-swapping quantum network, often called a first-generation quantum network, in its general-topology, multicommodity, loss-aware formulation. We introduce a linear algebraic framework that exploits quantum memory through the creation of intermediate entangled links. The framework is then employed to mathematically derive a natural class of quadratic scheduling policies for quantum networks by applying Lyapunov Drift Minimization, a standard technique in classical network science. Moreover, an additional class of Max-Weight-inspired policies is proposed and benchmarked, significantly reducing the computational cost at the price of a slight performance degradation. The policies are compared in terms of information availability, localization, and overall network performance through an ad-hoc simulator that accepts user-provided network topologies and scheduling policies, in order to showcase the potential application of the provided tools to quantum network design.
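For readers unfamiliar with the terminology, the classical Max-Weight policy obtained from one-step Lyapunov drift minimization has the form below; the paper's quantum variant replaces packet queues and link rates with entangled-link memories and swapping operations, so this is only an analogy, not the authors' exact policy.

```latex
% Classical Max-Weight scheduling (Tassiulas-Ephremides form), shown only to
% convey the flavour of the policies the abstract refers to.
\[
  \pi^{\mathrm{MW}}(t) \;=\; \arg\max_{s \in \mathcal{S}}
  \sum_{\ell} q_\ell(t)\,\mu_\ell(s),
\]
% where \(\mathcal{S}\) is the set of feasible schedules, \(q_\ell(t)\) the backlog
% on (entangled) link \(\ell\) at time \(t\), and \(\mu_\ell(s)\) the service that
% link \(\ell\) receives under schedule \(s\). Minimizing the one-step drift of the
% quadratic Lyapunov function \(V(q) = \tfrac{1}{2}\sum_\ell q_\ell^2\) yields
% exactly this weighting, which is how quadratic scheduling policies arise.
```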
Memory is an important cognitive function for humans, and how a brain that consumes so little power accomplishes such a complex function is undoubtedly fascinating. Engram theory views memory as the co-activation of specific neuronal clusters. From the perspective of graph theory, nodes represent neurons and directed edges represent synapses; the memory engram is then the connected subgraph formed by the activated nodes. In this paper, we use subgraphs as physical carriers of information and propose a parallel, distributed information storage algorithm based on node scale in active-directed graphs. An active-directed graph is defined as a graph in which each node has autonomous, independent behavior and relies only on information obtained within its local field of view to make decisions. Unlike static directed graphs used for recording facts, active-directed graphs are decentralized like biological neuronal networks and have no super manager with a global view who can control the behavior of each node. Distinct from traditional algorithms with a global field of view, this algorithm is characterized by nodes collaborating on global resource usage through their limited local fields of view. While this strategy may not achieve global optimality as well as algorithms with a global field of view do, it offers better robustness, concurrency, decentralization, and biological plausibility. Finally, we tested the algorithm's network capacity, fault tolerance, and robustness. We found that the algorithm exhibits a larger network capacity in a sparser network structure because the subgraph generated by a single sample is not a single whole but consists of multiple weakly connected components. In this case, the network capacity can be understood as the number of permutations of several weakly connected components in the network.
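To make the engram-as-subgraph picture concrete, here is a toy sketch in which each node only knows its own out-edges, a sample is stored by wiring a chain of activated nodes, and recall spreads activation along the stored edges. This is an illustration of the general idea only, not the paper's node-scale storage algorithm, and the node names are hypothetical.

```python
class Node:
    """A node in an active-directed graph: it only knows its own out-edges
    (its local field of view) and whether it is currently activated."""
    def __init__(self, name):
        self.name = name
        self.out_edges = set()    # names of downstream neighbours (synapses)
        self.active = False

    def connect(self, other):
        self.out_edges.add(other.name)

def store_pattern(nodes, pattern):
    """Store one sample as a subgraph: chain-connect consecutive activated
    nodes, so the engram is the resulting connected subgraph."""
    for a, b in zip(pattern, pattern[1:]):
        nodes[a].connect(nodes[b])

def recall(nodes, cue, steps=10):
    """Recall by spreading activation from a cue node along stored edges."""
    frontier, seen = {cue}, {cue}
    for _ in range(steps):
        frontier = {m for n in frontier for m in nodes[n].out_edges} - seen
        if not frontier:
            break
        seen |= frontier
    return seen

# Toy usage with hypothetical node names.
nodes = {name: Node(name) for name in "ABCDE"}
store_pattern(nodes, ["A", "C", "E"])
print(recall(nodes, "A"))   # {'A', 'C', 'E'}
```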
The ability to accurately represent and localise relevant objects is essential for robots to carry out tasks effectively. Traditional approaches, in which robots simply capture an image, process that image to take an action, and then forget the information, struggle in the presence of occlusions. Methods using multi-view perception, which have the potential to address some of these problems, require a world model that guides the collection, integration, and extraction of information from multiple viewpoints. Furthermore, constructing a generic representation that can be applied in various environments and tasks is a difficult challenge. In this paper, a novel approach for building generic representations in occluded agro-food environments using multi-view perception and 3D multi-object tracking is introduced. The method is based on a detection algorithm that generates partial point clouds for each detected object, followed by a 3D multi-object tracking algorithm that updates the representation over time. The accuracy of the representation was evaluated in a real-world environment, where tomatoes on tomato plants were successfully represented and localised despite high levels of occlusion: the total count of tomatoes was estimated with a maximum error of 5.08% and the tomatoes were tracked with an accuracy of up to 71.47%. Novel tracking metrics were introduced and shown to provide valuable insight into the errors made in localising and representing the fruits. This approach presents a novel solution for building representations in occluded agro-food environments, demonstrating the potential to enable robots to perform tasks effectively in these challenging environments.
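A minimal sketch of the tracking stage is shown below: detections (centroids of the partial point clouds) are greedily associated with existing tracks by nearest neighbour within a distance gate, and unmatched detections spawn new tracks. The gating distance, the greedy association, and the averaging update are simplifying assumptions, not the paper's tracker.

```python
import numpy as np

def update_tracks(tracks, detections, gate=0.05):
    """One step of a minimal 3D multi-object tracker (illustrative sketch).

    `tracks` maps track id -> 3D centroid; `detections` is an (N, 3) array of
    centroids of the partial point clouds produced by the detector. Each
    detection is greedily matched to the nearest track within `gate` metres;
    unmatched detections start new tracks.
    """
    next_id = max(tracks, default=-1) + 1
    unmatched = list(tracks)
    for det in np.asarray(detections, dtype=float):
        if unmatched:
            dists = [np.linalg.norm(tracks[t] - det) for t in unmatched]
            j = int(np.argmin(dists))
            if dists[j] < gate:
                tid = unmatched.pop(j)
                tracks[tid] = 0.5 * (tracks[tid] + det)   # simple position smoothing
                continue
        tracks[next_id] = det                             # new fruit observed
        next_id += 1
    return tracks

# Toy usage: two existing tomato tracks, one re-detected and one new detection.
tracks = {0: np.array([0.10, 0.20, 1.00]), 1: np.array([0.30, 0.25, 1.10])}
update_tracks(tracks, [[0.11, 0.20, 1.01], [0.50, 0.40, 0.90]])
print(len(tracks))   # 3
```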
Applications of the ensemble Kalman filter to high-dimensional problems are feasible only with small ensembles. This necessitates a kind of regularization of the analysis (observation update) problem. We propose a regularization technique based on a new non-stationary, non-parametric spatial model on the sphere. The model, termed the Locally Stationary Convolution Model, is a constrained version of the general Gaussian process convolution model. The constraints on the location-dependent convolution kernel include local isotropy, positive definiteness as a function of distance, and smoothness as a function of location. The model allows for a rigorous definition of the local spectrum, which is required to be a smooth function of spatial wavenumber. We propose and test an ensemble filter in which prior covariances are postulated to obey the Locally Stationary Convolution Model. The model is estimated online in a two-stage procedure. First, ensemble perturbations are bandpass-filtered in several wavenumber bands to extract aggregated local spatial spectra. Second, a neural network recovers the local spectra from the sample variances of the filtered fields. In simulation experiments, the new filter outperformed several existing techniques; with small to moderate ensemble sizes, the improvement was substantial.
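The first stage of the online estimation can be illustrated on a 1-D periodic domain: bandpass-filter the ensemble perturbations in a few wavenumber bands and compute the per-gridpoint sample variance of each filtered field. The sketch below is a toy 1-D analogue (the paper works on the sphere) with arbitrary band limits; a neural network would then map these band variances to smooth local spectra.

```python
import numpy as np

def band_variances(perturbations, bands):
    """Bandpass-filter each ensemble perturbation in the given wavenumber bands
    and return the local sample variance of the filtered fields for every band.

    perturbations: (n_ens, n_grid) array of ensemble deviations from the mean.
    bands: list of (k_min, k_max) wavenumber intervals.
    Returns an array of shape (len(bands), n_grid).
    """
    n_ens, n_grid = perturbations.shape
    freqs = np.abs(np.fft.fftfreq(n_grid, d=1.0 / n_grid))   # integer wavenumbers
    spectra = np.fft.fft(perturbations, axis=1)
    out = np.empty((len(bands), n_grid))
    for i, (k_min, k_max) in enumerate(bands):
        mask = (freqs >= k_min) & (freqs < k_max)
        filtered = np.fft.ifft(spectra * mask, axis=1).real
        out[i] = filtered.var(axis=0, ddof=1)                # sample variance over the ensemble
    return out

# Toy usage with a small synthetic ensemble on a periodic 1-D grid.
rng = np.random.default_rng(0)
ens = rng.standard_normal((20, 128))
ens -= ens.mean(axis=0)                                      # perturbations about the mean
print(band_variances(ens, [(0, 8), (8, 32), (32, 64)]).shape)  # (3, 128)
```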
Video compression relies heavily on exploiting the temporal redundancy between video frames, which is usually achieved by estimating and using motion information. In most existing deep video compression networks, the motion information is represented as optical flows, and these networks often adopt pre-trained optical flow estimation networks for motion estimation. The optical flows, however, may be less suitable for video compression due to two factors. First, the optical flow estimation networks were trained to perform inter-frame prediction as accurately as possible, but the resulting optical flows may cost too many bits to encode. Second, the optical flow estimation networks were trained on synthetic data and may not generalize well to real-world videos. We address this twofold limitation by enhancing the optical flows in two stages: offline and online. In the offline stage, we fine-tune a trained optical flow estimation network with the motion information provided by a traditional (non-deep) video compression scheme, e.g. H.266/VVC, as we believe the motion information of H.266/VVC achieves a better rate-distortion trade-off. In the online stage, we further optimize the latent features of the optical flows with a gradient descent-based algorithm for the specific video to be compressed, so as to enhance the adaptivity of the optical flows. We conduct experiments on a state-of-the-art deep video compression scheme, DCVC. Experimental results demonstrate that the proposed offline and online enhancements together achieve, on average, a 12.8% bitrate saving on the tested videos, without increasing the model or computational complexity of the decoder side.
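The online stage amounts to treating the motion latent as a free variable and running gradient descent on a rate-distortion objective while all network weights stay frozen. The PyTorch-style sketch below uses placeholder hooks (`decode_flow`, `rate_fn`, `distortion_fn`) and an assumed trade-off weight; it is not DCVC's actual interface.

```python
import torch

def refine_flow_latent(latent, decode_flow, rate_fn, distortion_fn,
                       lam=0.01, steps=100, lr=1e-3):
    """Online latent refinement sketch for one frame (placeholder model hooks).

    `latent` is the encoded motion latent; `decode_flow` maps it back to an
    optical flow; `rate_fn` estimates its coding cost in bits; `distortion_fn`
    measures the prediction error of the motion-compensated frame. Only the
    latent is optimized; the encoder, decoder, and entropy model stay frozen.
    """
    z = latent.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        flow = decode_flow(z)
        loss = rate_fn(z) + lam * distortion_fn(flow)   # rate + lambda * distortion
        loss.backward()
        opt.step()
    return z.detach()
```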
A long-standing goal of robotics research is to operate robots safely while achieving high performance, which often involves fast motions. Traditional motor-driven systems frequently struggle to balance these competing demands. Addressing this trade-off is crucial for advancing fields such as manufacturing and healthcare, where seamless collaboration between robots and humans is essential. We introduce a four-degree-of-freedom (DoF) tendon-driven robot arm, powered by pneumatic artificial muscles (PAMs), to tackle this challenge. Our new design features low friction, passive compliance, and inherent impact resilience, enabling rapid, precise, high-force, and safe interactions during dynamic tasks. In addition to fostering safer human-robot collaboration, the inherent safety properties are particularly beneficial for reinforcement learning, where the robot's ability to explore dynamic motions without causing self-damage is crucial. We validate our robotic arm through various experiments, including long-term dynamic motions, impact resilience tests, and assessments of its ease of control. On a challenging dynamic table tennis task, we further demonstrate our robot's capabilities in rapid and precise movements. By showcasing our new design's potential, we aim to inspire further research on robotic systems that balance high performance and safety in diverse tasks. Our open-source hardware design, software, and a large dataset of diverse robot motions can be found at //webdav.tuebingen.mpg.de/pamy2/.
Generative AI tools such as ChatGPT are poised to change the way people engage with online information. Recently, Microsoft announced their "new Bing" search system, which incorporates chat and generative AI technology from OpenAI, and Google has announced plans to deploy search interfaces that incorporate similar technology. These new technologies will transform how people search for information. The research presented here is an early investigation into how people make use of a generative AI chat system (referred to simply as chat from here on) as part of a search process, and how incorporating chat systems into existing search tools may affect users' search behaviors and strategies. We report on an exploratory user study with 10 participants who used a combined Chat+Search system that utilized the OpenAI GPT-3.5 API and the Bing Web Search v5 API. Participants completed three search tasks. In this pre-print of preliminary results, we report on the ways that users integrated AI chat into their search process, what they liked and disliked about the chat system, their trust in the chat responses, and their mental models of how the chat system generated responses.