Mixed Reality (MR) is constantly evolving and inspiring new patterns of robot manipulation for more advanced Human-Robot Interaction under the Fourth Industrial Revolution paradigm. Since Mixed Reality aims to connect the physical and digital worlds to provide immersive experiences, it is necessary to establish an information exchange platform and robot control systems within the developed MR scenarios. In this work, we present several effective motion control methods applied to different interactive robotic arms (e.g., UR5, UR5e, myCobot) for Unity-based development of MR applications, including a GUI control panel, a text-input control panel, end-effector object dynamic tracking, and a ROS-Unity digital-twin connection.
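As a minimal illustration of the ROS side of such a ROS-Unity digital-twin link (a schematic sketch only; the topic name, publishing rate, and joint names are assumptions, not the paper's code), a node could stream joint states for a Unity subscriber to mirror:

\begin{verbatim}
#!/usr/bin/env python
# Schematic ROS node: publish joint states that a Unity digital twin could
# mirror (e.g., via a ROS-TCP bridge). Topic and joint names are assumptions.
import rospy
from sensor_msgs.msg import JointState

def main():
    rospy.init_node("ur5_twin_publisher")
    pub = rospy.Publisher("/joint_states", JointState, queue_size=10)
    rate = rospy.Rate(30)  # 30 Hz stream
    names = ["shoulder_pan_joint", "shoulder_lift_joint", "elbow_joint",
             "wrist_1_joint", "wrist_2_joint", "wrist_3_joint"]
    while not rospy.is_shutdown():
        msg = JointState()
        msg.header.stamp = rospy.Time.now()
        msg.name = names
        msg.position = [0.0] * 6  # replace with the robot's actual joint readings
        pub.publish(msg)
        rate.sleep()

if __name__ == "__main__":
    main()
\end{verbatim}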
Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method for PG remains largely open. To address this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed \textit{Accelerated Policy Gradient} (APG). To demonstrate the potential of APG for fast convergence, we formally prove that, with the true gradient and under the softmax policy parametrization, APG converges to an optimal policy at rates of (i) $\tilde{O}(1/t^2)$ with constant step sizes and (ii) $O(e^{-ct})$ with exponentially growing step sizes. To the best of our knowledge, this is the first characterization of the convergence rates of NAG in the context of RL. Notably, our analysis relies on an interesting finding: regardless of the parameter initialization, APG enters a locally nearly concave regime within finitely many iterations, where it can significantly benefit from the momentum. Through numerical validation and experiments on the Atari 2600 benchmarks, we confirm that APG exhibits a $\tilde{O}(1/t^2)$ rate with constant step sizes and a linear convergence rate with exponentially growing step sizes, significantly improving convergence over the standard PG.
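To make the update concrete, the following sketch applies a Nesterov-style momentum step to softmax policy parameters in a simple bandit with the true gradient; it illustrates the general APG idea rather than the authors' exact algorithm, and the reward vector and step size are hypothetical.

\begin{verbatim}
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Hypothetical 3-armed bandit: true expected rewards (illustration only).
r = np.array([1.0, 0.5, 0.2])
eta = 0.4                      # constant step size (assumption)
theta = np.zeros(3)            # policy parameters
theta_prev = theta.copy()

for t in range(1, 201):
    # Nesterov look-ahead point built from the momentum term.
    omega = theta + (t - 1) / (t + 2) * (theta - theta_prev)
    pi = softmax(omega)
    # True policy gradient of the expected reward under softmax parametrization.
    grad = pi * (r - pi @ r)
    theta_prev = theta
    theta = omega + eta * grad   # gradient ascent step at the look-ahead point

print("final policy:", softmax(theta))  # mass should concentrate on the best arm
\end{verbatim}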
Contextualized embeddings are the preferred tool for modeling Lexical Semantic Change (LSC). Current evaluations typically focus on a specific task known as Graded Change Detection (GCD). However, performance comparisons across studies are often misleading due to their reliance on diverse experimental settings. In this paper, we evaluate state-of-the-art models and approaches for GCD under equal conditions. We further break the LSC problem into the Word-in-Context (WiC) and Word Sense Induction (WSI) tasks, and compare models across these different levels. Our evaluation is performed across different languages on eight available LSC benchmarks and shows that (i) APD outperforms other approaches for GCD; (ii) XL-LEXEME outperforms other contextualized models for WiC, WSI, and GCD, while being comparable to GPT-4; and (iii) there is a clear need to improve the modeling of word meanings and to focus on how, when, and why these meanings change, rather than solely on the extent of semantic change.
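For readers unfamiliar with APD, the sketch below shows how Average Pairwise Distance is typically computed for GCD, assuming cosine distance and pre-extracted usage embeddings; the random vectors stand in for contextualized embeddings (e.g., from XL-LEXEME or BERT).

\begin{verbatim}
import numpy as np
from scipy.spatial.distance import cdist

def apd(embs_t1, embs_t2):
    """Average Pairwise (cosine) Distance between usage embeddings of a
    target word drawn from two time periods; higher means more change."""
    return cdist(embs_t1, embs_t2, metric="cosine").mean()

# Toy usage with random stand-ins of shape (num_usages, hidden_dim).
rng = np.random.default_rng(0)
usages_period_1 = rng.normal(size=(50, 768))
usages_period_2 = rng.normal(size=(60, 768))
print(apd(usages_period_1, usages_period_2))
\end{verbatim}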
This paper explores the realm of abstractive text summarization through the lens of SEASON (Salience Allocation as Guidance for Abstractive SummarizatiON), a model designed to enhance summarization by leveraging salience allocation techniques. The study evaluates SEASON's efficacy by comparing it with prominent models such as BART, PEGASUS, and ProphetNet, all fine-tuned for various text summarization tasks. The assessment uses diverse datasets, including CNN/DailyMail, SAMSum, and Financial-news-based Event-Driven Trading (EDT), with a specific focus on a financial dataset containing a substantial volume of news articles from 2020/03/01 to 2021/05/06. The fine-tuned models are evaluated with metrics such as ROUGE, METEOR, BERTScore, and MoverScore on the generated abstractive summaries. The analysis of these metrics offers thorough insight into the strengths and weaknesses each model exhibits when summarizing the news, dialogue, and financial text datasets. The results presented in this paper not only contribute to the evaluation of the SEASON model's effectiveness but also illuminate the intricacies of salience allocation techniques across various types of datasets.
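As a minimal example of the kind of metric computation involved, assuming the rouge-score package and two hypothetical summary strings:

\begin{verbatim}
from rouge_score import rouge_scorer

# Hypothetical reference and model-generated summaries.
reference = "The central bank cut interest rates to support the economy."
candidate = "Interest rates were cut by the central bank to help the economy."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, s in scorer.score(reference, candidate).items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
\end{verbatim}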
Collision avoidance algorithms for Autonomous Surface Vehicles (ASVs) that follow the Convention on the International Regulations for Preventing Collisions at Sea (COLREGs) have been proposed in recent years. However, it may be difficult and unsafe to follow COLREGs in congested waters, where multiple ASVs navigate in the presence of static obstacles and strong currents, due to the complex interactions. To address this problem, we propose a decentralized multi-ASV collision avoidance policy based on Distributional Reinforcement Learning, which considers the interactions among ASVs as well as with static obstacles and current flows. In simulation experiments, we evaluate the proposed Distributional RL-based policy against a traditional RL-based policy and two classical methods, Artificial Potential Fields (APF) and Reciprocal Velocity Obstacles (RVO); the results show that the proposed policy achieves superior navigation safety while requiring minimal travel time and energy. A variant of our framework that automatically adapts its risk sensitivity is also shown to improve ASV safety in highly congested environments.
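As a generic illustration of how a learned return distribution enables risk-sensitive decisions (not the authors' implementation), the sketch below selects actions by the Conditional Value-at-Risk of their return quantiles; all numbers are hypothetical.

\begin{verbatim}
import numpy as np

def cvar(quantile_values, alpha=0.25):
    """Conditional Value-at-Risk: mean of the worst alpha-fraction of returns."""
    sorted_vals = np.sort(quantile_values)
    k = max(1, int(np.ceil(alpha * len(sorted_vals))))
    return sorted_vals[:k].mean()

def select_action(quantiles_per_action, alpha=0.25):
    """Pick the action whose return distribution has the best CVaR.
    quantiles_per_action: (num_actions, num_quantiles), e.g., from a
    QR-DQN-style critic (hypothetical here)."""
    return int(np.argmax([cvar(q, alpha) for q in quantiles_per_action]))

# Toy example: three candidate maneuvers with sampled return quantiles.
rng = np.random.default_rng(1)
q = rng.normal(loc=[[1.0], [1.2], [0.8]], scale=[[0.2], [1.5], [0.1]], size=(3, 32))
print("risk-neutral choice:", int(np.argmax(q.mean(axis=1))))
print("risk-averse (CVaR) choice:", select_action(q, alpha=0.25))
\end{verbatim}

Lowering alpha makes the selection more conservative, which is one natural way a policy could adapt its risk sensitivity in congested scenes.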
Multi-Robot Path Planning (MRPP) on graphs, equivalently known as Multi-Agent Path Finding (MAPF), is a well-established NP-hard problem with critically important applications. As serial computation for (near-)optimally solving MRPP approaches its efficiency limit, parallelization offers a promising route to push the limit further, especially for hard or large MRPP instances. In this study, we initiate a \emph{targeted} parallelization effort to boost the performance of conflict-based search for MRPP. Specifically, when instances are relatively small but robots are densely packed with strong interactions, we apply a decentralized parallel algorithm that concurrently explores multiple branches, leading to markedly enhanced solution discovery. On the other hand, when instances are large with sparse robot-robot interactions, we prioritize node expansion and conflict resolution. Our multi-threaded approach to parallelizing bounded-suboptimal conflict-based search algorithms demonstrates significant improvements over baseline serial methods in success rate or runtime. Our contribution further advances the understanding of MRPP and charts a promising path for improving solution quality and computational efficiency through parallel algorithmic strategies.
Typographic attacks, which involve pasting misleading text onto an image, have been shown to harm the performance of Vision-Language Models such as CLIP. However, the susceptibility of recent Large Vision-Language Models (LVLMs) to these attacks remains understudied. Furthermore, typographic attacks against CLIP in prior work randomly sample a misleading class from a predefined set of categories. This simple strategy misses more effective attacks that exploit the stronger language skills of LVLMs. To address these issues, we first introduce a benchmark for testing typographic attacks against LVLMs. Moreover, we introduce two novel and more effective \textit{Self-Generated} attacks that prompt the LVLM to generate an attack against itself: 1) a Class-Based Attack, in which the LVLM (e.g., LLaVA) is asked which deceiving class is most similar to the target class, and 2) a Descriptive Attack, in which a more advanced LVLM (e.g., GPT-4V) is asked to recommend a typographic attack that includes both a deceiving class and a description. Using our benchmark, we find that Self-Generated attacks pose a significant threat, reducing LVLM classification performance by up to 33\%. We also find that attacks generated by one model (e.g., GPT-4V or LLaVA) are effective against the model itself and against other models such as InstructBLIP and MiniGPT4. Code: \url{//github.com/mqraitem/Self-Gen-Typo-Attack}
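A hypothetical sketch of the Class-Based variant is given below; query_lvlm is a placeholder for whatever inference call serves the target LVLM (e.g., a LLaVA or GPT-4V endpoint), not a real library API, and the prompt wording is an assumption.

\begin{verbatim}
from PIL import Image, ImageDraw

def query_lvlm(image, prompt):
    raise NotImplementedError("Replace with your LVLM inference call.")

def class_based_attack(image, target_class):
    # 1) Ask the model itself which wrong class it finds most confusable.
    prompt = (f"Which class, other than '{target_class}', looks most similar "
              f"to '{target_class}'? Answer with a single class name.")
    deceiving_class = query_lvlm(image, prompt).strip()
    # 2) Paste the misleading class name onto the image as text.
    attacked = image.copy()
    ImageDraw.Draw(attacked).text((10, 10), deceiving_class, fill="white")
    return attacked, deceiving_class
\end{verbatim}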
Open World Object Detection (OWOD) is a challenging and realistic task that extends beyond the scope of the standard Object Detection task. It involves detecting both known and unknown objects while integrating learned knowledge for future tasks. However, the level of "unknownness" varies significantly depending on the context. For example, a tree is typically considered part of the background in a self-driving scene, but it may be significant in a household context. We argue that this contextual information should already be embedded within the known classes; in other words, there should be a semantic or latent structural relationship between the known classes and the unknown items to be discovered. Motivated by this observation, we propose Hyp-OW, a method that learns and models a hierarchical representation of known items through a SuperClass Regularizer. Leveraging this representation allows us to effectively detect unknown objects using a similarity-distance-based relabeling module. Extensive experiments on benchmark datasets demonstrate the effectiveness of Hyp-OW, achieving improvements in both known and unknown detection (of up to 6 percent). These gains are particularly pronounced in our newly designed benchmark, where a strong hierarchical structure exists between known and unknown objects. Our code can be found at //github.com/boschresearch/Hyp-OW
We introduce a multi-agent simulator for economic systems composed of heterogeneous Households, heterogeneous Firms, and Central Bank and Government agents, which can be subjected to exogenous, stochastic shocks. The interactions between agents define the production and consumption of goods in the economy, alongside the flow of money. Each agent can be designed to act according to fixed, rule-based strategies or to learn its strategy through interactions with others in the simulator. We ground our simulator by choosing agent heterogeneity parameters based on the economic literature, while designing their action spaces in accordance with real data from the United States. The simulator facilitates the use of reinforcement learning strategies for the agents via an OpenAI Gym-style environment definition for the economic system. We demonstrate its utility by simulating and analyzing two hypothetical (yet interesting) economic scenarios. The first investigates the impact of heterogeneous household skills on their learned preferences for working at different firms. The second examines the impact of a positive production shock to one of two firms on its pricing strategy relative to the other firm. We hope that our platform sets the stage for subsequent research at the intersection of artificial intelligence and economics.
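A schematic of what an OpenAI Gym-style wrapper for such an economy might look like is sketched below, assuming the classic gym API; the environment, spaces, and reward are illustrative assumptions, not the authors' simulator.

\begin{verbatim}
import numpy as np
import gym
from gym import spaces

class ToyHouseholdEnv(gym.Env):
    """Illustrative single-household environment: choose labor supplied and
    the fraction of wealth spent each step (not the paper's simulator)."""

    def __init__(self, wage=20.0, price=2.0):
        self.wage, self.price = wage, price
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(2,))          # [labor, spend fraction]
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(1,))  # savings
        self.savings = 0.0

    def reset(self):
        self.savings = 0.0
        return np.array([self.savings], dtype=np.float32)

    def step(self, action):
        labor, spend_frac = np.clip(action, 0.0, 1.0)
        income = labor * 40 * self.wage               # hours worked times wage
        spending = spend_frac * (self.savings + income)
        self.savings += income - spending
        consumption = spending / self.price
        reward = np.log1p(consumption) - 0.5 * labor ** 2  # utility minus labor disutility
        return np.array([self.savings], dtype=np.float32), float(reward), False, {}
\end{verbatim}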
Reasoning with knowledge expressed in natural language and Knowledge Bases (KBs) is a major challenge for Artificial Intelligence, with applications in machine reading, dialogue, and question answering. General neural architectures that jointly learn representations and transformations of text are very data-inefficient, and their reasoning process is hard to analyse. These issues are addressed by end-to-end differentiable reasoning systems such as Neural Theorem Provers (NTPs), although these can only be used with small-scale symbolic KBs. In this paper, we first propose Greedy NTPs (GNTPs), an extension to NTPs that addresses their complexity and scalability limitations, thus making them applicable to real-world datasets. This is achieved by dynamically constructing the computation graph of NTPs and including only the most promising proof paths during inference, yielding models that are orders of magnitude more efficient. Then, we propose a novel approach for jointly reasoning over KBs and textual mentions by embedding logical facts and natural language sentences in a shared embedding space. We show that GNTPs perform on par with NTPs at a fraction of their cost while achieving competitive link prediction results on large datasets, providing explanations for predictions, and inducing interpretable models. Source code, datasets, and supplementary material are available online at //github.com/uclnlp/gntp.
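The efficiency gain rests on retrieving only the most promising facts for unification at each proof step; a minimal sketch of that retrieval step (schematic, not the released implementation) is:

\begin{verbatim}
import numpy as np

def top_k_facts(goal_emb, fact_embs, k=5):
    """Indices of the k facts whose embeddings are closest to the sub-goal.
    goal_emb: (d,) embedding of the current sub-goal.
    fact_embs: (num_facts, d) embeddings of KB facts and textual mentions."""
    dists = np.sum((fact_embs - goal_emb) ** 2, axis=1)  # squared Euclidean distance
    return np.argsort(dists)[:k]

# Toy usage: 10,000 facts in a 64-dimensional shared embedding space.
rng = np.random.default_rng(0)
facts = rng.normal(size=(10_000, 64))
goal = rng.normal(size=64)
print(top_k_facts(goal, facts, k=5))  # only these facts enter the proof search
\end{verbatim}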
Visual Question Answering (VQA) models have so far struggled with counting objects in natural images. We identify the soft attention used in these models as a fundamental cause. To circumvent this problem, we propose a neural network component that enables robust counting from object proposals. Experiments on a toy task show the effectiveness of this component, and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On the difficult balanced-pair metric, the component improves counting over a strong baseline by 6.6%.