Motion planning for articulated robots has traditionally been governed by algorithms that operate within manufacturer-defined payload limits. Our empirical analysis of the Franka Emika Panda robot demonstrates that this approach unnecessarily restricts the robot's dynamically-reachable task space. These results establish an expanded operational envelope for such robots, showing that they can handle payloads of more than twice their rated capacity. Additionally, our preliminary findings indicate that integrating non-prehensile motion primitives with grasping-based manipulation has the potential to further increase the success rates of manipulation tasks involving payloads exceeding nominal limits.
As recommender systems become increasingly prevalent, the environmental impact and energy efficiency of training large-scale models have come under scrutiny. This paper investigates the potential for energy-efficient algorithm performance by optimizing dataset sizes through downsampling techniques in the context of Green Recommender Systems. We conducted experiments on the MovieLens 100K, 1M, 10M, and Amazon Toys and Games datasets, analyzing the performance of various recommender algorithms under different portions of dataset size. Our results indicate that while more training data generally leads to higher algorithm performance, certain algorithms, such as FunkSVD and BiasedMF, particularly with unbalanced and sparse datasets like Amazon Toys and Games, maintain high-quality recommendations with up to a 50% reduction in training data, achieving nDCG@10 scores within approximately 13% of full dataset performance. These findings suggest that strategic dataset reduction can decrease computational and environmental costs without substantially compromising recommendation quality. This study advances sustainable and green recommender systems by providing insights for reducing energy consumption while maintaining effectiveness.
Continuous-time batch state estimation using Gaussian processes is an efficient approach to estimate the trajectories of robots over time. In the past, relatively simple physics-motivated priors have been considered for such approaches, using assumptions such as constant velocity or acceleration. This paper presents an approach to incorporating exogenous control inputs, such as velocity or acceleration commands, into the continuous Gaussian process state-estimation framework. It is shown that this approach generalizes across different domains in robotics, making it applicable to both the estimation of continuous-time trajectories for mobile robots and the estimation of quasi-static continuum robot shapes. Results show that incorporating control inputs leads to more informed priors, potentially requiring less measurements and estimation nodes to obtain accurate estimates. This makes the approach particularly useful in situations in which limited sensing is available.
A critical aspect of safe and efficient motion planning for autonomous vehicles (AVs) is to handle the complex and uncertain behavior of surrounding human-driven vehicles (HDVs). Despite intensive research on driver behavior prediction, existing approaches typically overlook the interactions between AVs and HDVs assuming that HDV trajectories are not affected by AV actions. To address this gap, we present a transformer-transfer learning-based interaction-aware trajectory predictor for safe motion planning of autonomous driving, focusing on a vehicle-to-vehicle (V2V) interaction scenario consisting of an AV and an HDV. Specifically, we construct a transformer-based interaction-aware trajectory predictor using widely available datasets of HDV trajectory data and further transfer the learned predictor using a small set of AV-HDV interaction data. Then, to better incorporate the proposed trajectory predictor into the motion planning module of AVs, we introduce an uncertainty quantification method to characterize the errors of the predictor, which are integrated into the path-planning process. Our experimental results demonstrate the value of explicitly considering interactions and handling uncertainties.
Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models like GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns open-source small models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses only less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data. Using this efficient generator, we conduct a comprehensive study on how various factors within the alignment pipeline impact data quality and reveal the scaling law for synthetic embedding data.
A common limitation of autonomous tissue manipulation in robotic minimally invasive surgery (MIS) is the absence of force sensing and control at the tool level. Recently, our team has developed miniature force-sensing forceps that can simultaneously measure the grasping and pulling forces during tissue manipulation. Based on this design, here we further present a method to automate tissue traction that comprises grasping and pulling stages. During this process, the grasping and pulling forces can be controlled either separately or simultaneously through force decoupling. The force controller is built upon a static model of tissue manipulation, considering the interaction between the force-sensing forceps and soft tissue. The efficacy of this force control approach is validated through a series of experiments comparing targeted, estimated, and actual reference forces. To verify the feasibility of the proposed method in surgical applications, various tissue resections are conducted on ex vivo tissues employing a dual-arm robotic setup. Finally, we discuss the benefits of multi-force control in tissue traction, evidenced through comparative analyses of various ex vivo tissue resections with and without the proposed method, and the potential generalization with traction on different tissues. The results affirm the feasibility of implementing automatic tissue traction using miniature forceps with multi-force control, suggesting its potential to promote autonomous MIS. A video demonstrating the experiments can be found at //youtu.be/f5gXuXe67Ak.
Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate high margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively and confidently use the logit margin to detect brittle decisions with such models. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to assess adversarial vulnerability in deployment scenarios efficiently.
In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred many researchers. To further reduce the complexity of attention models, numerous efforts have been made to design more efficient methods. Among them, the State Space Model (SSM), as a possible replacement for the self-attention based Transformer model, has drawn more and more attention in recent years. In this paper, we give the first comprehensive review of these works and also provide experimental comparisons and analysis to better demonstrate the features and advantages of SSM. Specifically, we first give a detailed description of principles to help the readers quickly capture the key ideas of SSM. After that, we dive into the reviews of existing SSMs and their various applications, including natural language processing, computer vision, graph, multi-modal and multi-media, point cloud/event stream, time series data, and other domains. In addition, we give statistical comparisons and analysis of these models and hope it helps the readers to understand the effectiveness of different structures on various tasks. Then, we propose possible research points in this direction to better promote the development of the theoretical model and application of SSM. More related works will be continuously updated on the following GitHub: //github.com/Event-AHU/Mamba_State_Space_Model_Paper_List.
Robotic collectives for military and disaster response applications require coalition formation algorithms to partition robots into appropriate task teams. Collectives' missions will often incorporate tasks that require multiple high-level robot behaviors or services, which coalition formation must accommodate. The highly dynamic and unstructured application domains also necessitate that coalition formation algorithms produce near optimal solutions (i.e., >95% utility) in near real-time (i.e., <5 minutes) with very large collectives (i.e., hundreds of robots). No previous coalition formation algorithm satisfies these requirements. An initial evaluation found that traditional auction-based algorithms' runtimes are too long, even though the centralized simulator incorporated ideal conditions unlikely to occur in real-world deployments (i.e., synchronization across robots and perfect, instantaneous communication). The hedonic game-based GRAPE algorithm can produce solutions in near real-time, but cannot be applied to multiple service collectives. This manuscript integrates GRAPE and a services model, producing GRAPE-S and Pair-GRAPE-S. These algorithms and two auction baselines were evaluated using a centralized simulator with up to 1000 robots, and via the largest distributed coalition formation simulated evaluation to date, with up to 500 robots. The evaluations demonstrate that auctions transfer poorly to distributed collectives, resulting in excessive runtimes and low utility solutions. GRAPE-S satisfies the target domains' coalition formation requirements, producing near optimal solutions in near real-time, and Pair-GRAPE-S more than satisfies the domain requirements, producing optimal solutions in near real-time. GRAPE-S and Pair-GRAPE-S are the first algorithms demonstrated to support near real-time coalition formation for very large, distributed collectives with multiple services.
Transformer architectures have facilitated the development of large-scale and general-purpose sequence models for prediction tasks in natural language processing and computer vision, e.g., GPT-3 and Swin Transformer. Although originally designed for prediction problems, it is natural to inquire about their suitability for sequential decision-making and reinforcement learning problems, which are typically beset by long-standing issues involving sample efficiency, credit assignment, and partial observability. In recent years, sequence models, especially the Transformer, have attracted increasing interest in the RL communities, spawning numerous approaches with notable effectiveness and generalizability. This survey presents a comprehensive overview of recent works aimed at solving sequential decision-making tasks with sequence models such as the Transformer, by discussing the connection between sequential decision-making and sequence modeling, and categorizing them based on the way they utilize the Transformer. Moreover, this paper puts forth various potential avenues for future research intending to improve the effectiveness of large sequence models for sequential decision-making, encompassing theoretical foundations, network architectures, algorithms, and efficient training systems. As this article has been accepted by the Frontiers of Computer Science, here is an early version, and the most up-to-date version can be found at //journal.hep.com.cn/fcs/EN/10.1007/s11704-023-2689-5
Face recognition technology has advanced significantly in recent years due largely to the availability of large and increasingly complex training datasets for use in deep learning models. These datasets, however, typically comprise images scraped from news sites or social media platforms and, therefore, have limited utility in more advanced security, forensics, and military applications. These applications require lower resolution, longer ranges, and elevated viewpoints. To meet these critical needs, we collected and curated the first and second subsets of a large multi-modal biometric dataset designed for use in the research and development (R&D) of biometric recognition technologies under extremely challenging conditions. Thus far, the dataset includes more than 350,000 still images and over 1,300 hours of video footage of approximately 1,000 subjects. To collect this data, we used Nikon DSLR cameras, a variety of commercial surveillance cameras, specialized long-rage R&D cameras, and Group 1 and Group 2 UAV platforms. The goal is to support the development of algorithms capable of accurately recognizing people at ranges up to 1,000 m and from high angles of elevation. These advances will include improvements to the state of the art in face recognition and will support new research in the area of whole-body recognition using methods based on gait and anthropometry. This paper describes methods used to collect and curate the dataset, and the dataset's characteristics at the current stage.