In this paper, we target the text-to-audio grounding task, namely, grounding the segments of the sound event described by a natural language query in untrimmed audio. This is a newly proposed but challenging audio-language task, since it requires not only precisely localizing all the onsets and offsets of the desired segments in the audio, but also comprehensively understanding the acoustic and linguistic content and reasoning about the multimodal interactions between the audio and the query. To tackle these problems, existing methods often treat the query holistically as a single unit through a global query representation. We argue that this approach suffers from several limitations. Motivated by the above considerations, we propose a novel Cross-modal Graph Interaction (CGI) model, which comprehensively models the relations between the words in a query through a novel language graph. To capture the fine-grained interactions between the audio and the query, a cross-modal attention module is introduced to assign higher weights to the keywords with more important semantics and to generate snippet-specific query representations. Furthermore, we design a cross-gating module to emphasize the crucial parts and weaken the irrelevant ones in the audio and the query. We extensively evaluate the proposed CGI model on the public Audiogrounding dataset, where it achieves significant improvements over several state-of-the-art methods. The ablation study demonstrates the consistent effectiveness of the different modules in our model.
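The idea of snippet-specific query representations can be illustrated with a minimal sketch. It assumes scaled dot-product attention as the scoring function, which the abstract does not specify; the function name `snippet_specific_queries` and the toy feature vectors are hypothetical and are not taken from the CGI model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def snippet_specific_queries(snippets, words):
    """For each audio snippet, attend over the word features.

    Returns a list of (attention_weights, query_vector) pairs, one per snippet.
    Scaled dot-product scoring is an assumption made for this sketch.
    """
    dim = len(words[0])
    results = []
    for snippet in snippets:
        # Score every word against this snippet.
        scores = [dot(snippet, w) / math.sqrt(dim) for w in words]
        weights = softmax(scores)
        # Weighted sum of word features: the query as seen by this snippet.
        query = [sum(wt * w[d] for wt, w in zip(weights, words)) for d in range(dim)]
        results.append((weights, query))
    return results


# Toy example: two snippets, three words, 2-dimensional features.
snippets = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]
for weights, query in snippet_specific_queries(snippets, words):
    print([round(w, 3) for w in weights], [round(q, 3) for q in query])
```

In this toy input each snippet puts its largest weight on the word whose feature vector it aligns with, so the two snippets receive different summaries of the same query.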
In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both the visual inputs and the textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and the language encoder in a 2D TVG model and improves the performance of crossmodal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Code is available at Open.Intel.
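The abstract names the TDIoU loss without defining it, so the following is only an illustrative sketch of the general idea behind IoU-style losses with a distance term. The temporal IoU of two segments is standard. The center-distance penalty and the function name `distance_iou_loss` are assumptions of this sketch and should not be read as the paper's formula.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time segments."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def distance_iou_loss(pred, gt, duration):
    """(1 - IoU) plus a center-distance penalty normalized by video duration.

    The penalty term is a hypothetical choice: it keeps the loss informative
    even when the predicted and ground-truth segments do not overlap.
    """
    iou = temporal_iou(pred, gt)
    pred_center = (pred[0] + pred[1]) / 2.0
    gt_center = (gt[0] + gt[1]) / 2.0
    return (1.0 - iou) + abs(pred_center - gt_center) / duration


# Two non-overlapping predictions: plain IoU is 0 for both,
# but the distance term still ranks the closer one as better.
gt_segment = (6.0, 8.0)
print(distance_iou_loss((0.0, 2.0), gt_segment, duration=10.0))
print(distance_iou_loss((3.0, 5.0), gt_segment, duration=10.0))
```

The design point is that a pure IoU loss is flat whenever the segments are disjoint, whereas a distance term still tells the model in which direction to move the prediction.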
Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations. In this paper, we show that MMD flows with Riesz kernels $K(x,y) = - \Vert x-y\Vert^r$, $r \in (0,2)$ have exceptional properties which allow their efficient computation. We prove that the MMD of Riesz kernels, which is also known as energy distance, coincides with the MMD of their sliced version. As a consequence, the computation of gradients of MMDs can be performed in the one-dimensional setting. Here, for $r=1$, a simple sorting algorithm can be applied to reduce the complexity from $O(MN+N^2)$ to $O((M+N)\log(M+N))$ for two measures with $M$ and $N$ support points. As another interesting follow-up result, the MMD of compactly supported measures can be estimated from above and below by the Wasserstein-1 distance. For the implementations we approximate the gradient of the sliced MMD by using only a finite number $P$ of slices. We show that the resulting error has complexity $O(\sqrt{d/P})$, where $d$ is the data dimension. These results enable us to train generative models by approximating MMD gradient flows by neural networks even for image applications. We demonstrate the efficiency of our model by image generation on MNIST, FashionMNIST and CIFAR10.
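The role of sorting in the one-dimensional case $r=1$ can be made concrete with a small sketch. It computes the value of the 1D energy distance between two empirical measures rather than its gradient, and the normalization used in `energy_distance_1d` is an assumption of this sketch (conventions differ by constant factors). The key fact is that, for sorted points, the $k$-th smallest value (counting from zero) appears with a plus sign $k$ times and a minus sign $N-1-k$ times in $\sum_{i<j} |x_i - x_j|$.

```python
from bisect import bisect_right
from itertools import accumulate


def pairwise_abs_sum(xs):
    """Sum of |x_i - x_j| over all ordered pairs (i, j), in O(N log N)."""
    s = sorted(xs)
    n = len(s)
    # s[k] is added k times and subtracted (n - 1 - k) times among pairs i < j;
    # the factor 2 accounts for ordered pairs.
    return 2.0 * sum((2 * k - n + 1) * v for k, v in enumerate(s))


def cross_abs_sum(xs, ys):
    """Sum of |x_i - y_j| over all i, j using sorting and prefix sums."""
    s = sorted(ys)
    m = len(s)
    prefix = [0.0] + list(accumulate(s))
    total = 0.0
    for x in xs:
        k = bisect_right(s, x)  # number of y values <= x
        total += k * x - prefix[k]  # contributions of y values below x
        total += (prefix[m] - prefix[k]) - (m - k) * x  # y values above x
    return total


def energy_distance_1d(xs, ys):
    """Energy distance between two 1D point sets (normalization assumed)."""
    n, m = len(xs), len(ys)
    return (cross_abs_sum(xs, ys) / (n * m)
            - pairwise_abs_sum(xs) / (2 * n * n)
            - pairwise_abs_sum(ys) / (2 * m * m))


print(energy_distance_1d([0.5, -1.0, 2.0, 2.0], [1.0, 3.0, -2.0]))
```

No double loop over pairs of points is needed, which is where the reduction from quadratic to $O((M+N)\log(M+N))$ cost comes from.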
This paper presents a novel soft tactile skin (STS) technology operating with sound waves. In this approach, sound waves generated by a speaker travel through channels embedded in a soft membrane, are modulated when an external force presses and deforms the channel, and are received by a microphone at the end of the channel. The sensor leverages regression and classification methods to estimate the normal force and its contact location. Our sensor can be affixed to any robot part, e.g., end effectors or arms. We tested several regression and classification methods to learn the relation between the sound wave modulation and, respectively, the applied force and its location, and picked the best-performing models for force and location prediction. Our novel tactile sensor yields 93% of the force estimates within a 1.5 N tolerance for a range of 0-30+1 N and estimates contact locations with over 96% accuracy. We also demonstrated the performance of the STS technology in a real-time gripping force control application.
In this paper, we take the initiative to investigate the performance of LLMs on complex planning tasks that require LLMs to understand a virtual spatial environment simulated via natural language and act correspondingly in text. We propose a benchmark named Natural Language Planning and Action (Natala) composed of a set of novel tasks: Brick World, NLVR-based Manipulations, and Natural Language Navigation. We found that current popular LLMs such as ChatGPT still lack abilities in complex planning. This raises a question: do LLMs have a good understanding of environments described in natural language, or are alternatives such as symbolic representations neater and hence easier for LLMs to understand? To this end, we propose a novel method called CoS (Chain-of-Symbol Prompting) that represents complex environments with condensed symbolic spatial representations during the chained intermediate thinking steps. CoS is easy to use and does not need additional training of LLMs. Extensive experiments indicate that CoS clearly surpasses the performance of Chain-of-Thought (CoT) Prompting in all three planning tasks, with even fewer input tokens than CoT, on ChatGPT and InstructGPT. The performance gain is strong, up to 60.8% in accuracy (from 31.8% to 92.6%) on Brick World for ChatGPT. CoS also substantially reduces the number of tokens in the prompt, by up to 65.8% (from 407 to 139) for the intermediate steps from demonstrations on Brick World. Code and data are available at: //github.com/hanxuhu/chain-of-symbol-planning
In this paper, we propose a method for defending against an eavesdropper that uses a Deep Neural Network (DNN) to learn the modulation of wireless communication signals. Our method is based on manipulating the emitted waveform with the aid of a continuous-time frequency-modulated (FM) obfuscating signal that is mixed with the modulated data. The resulting waveform allows a legitimate receiver (LRx) to demodulate the data, but it increases the test error of a pre-trained or adversarially trained DNN classifier at the eavesdropper. The scheme works for analog modulation and for digital single-carrier and multi-carrier orthogonal frequency division multiplexing (OFDM) waveforms, and it can be implemented in frame-based wireless protocols. The results indicate that careful selection of the parameters of the obfuscating waveform can drop classification performance at the eavesdropper to less than 10% in AWGN and fading channels with no performance loss at the LRx.
In this paper, we address the problem of sim-to-real transfer for object segmentation when there is no access to real examples of an object of interest during training, i.e. zero-shot sim-to-real transfer for segmentation. We focus on the application of shipwreck segmentation in side scan sonar imagery. Our novel segmentation network, STARS, addresses this challenge by fusing a predicted deformation field and anomaly volume, allowing it to generalize better to real sonar images and achieve more effective zero-shot sim-to-real transfer for image segmentation. We evaluate the sim-to-real transfer capabilities of our method on a real, expert-labeled side scan sonar dataset of shipwrecks collected from field work surveys with an autonomous underwater vehicle (AUV). STARS is trained entirely in simulation and performs zero-shot shipwreck segmentation with no additional fine-tuning on real data. Our method provides a significant 20% increase in segmentation performance for the targeted shipwreck class compared to the best baseline.
In this paper, we propose a vision-based solution for indoor Micro Air Vehicle (MAV) navigation, with a primary focus on its application within autonomous warehouses. Our work centers on the use of a single camera as the primary sensor for tasks such as detection, localization, and path planning. To achieve these objectives, we implement HSV color detection and the Hough Line Transform for effective line detection within warehouse environments. The integration of a Kalman filter into our system enables the camera to track yellow lines reliably. We evaluated the performance of our vision-based line-following algorithm through various MAV flight tests conducted on the Gazebo 11 platform using ROS Noetic. The results of these simulations demonstrate the system's capability to successfully navigate narrow indoor spaces. Our proposed system has the potential to significantly reduce labor costs and enhance overall productivity in warehouse operations. This work contributes to the growing field of MAV applications in autonomous warehouses, addressing the need for efficient logistics and supply chain solutions.
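The tracking step can be sketched with a scalar Kalman filter. This is a minimal illustration under stated assumptions, not the system described above: it assumes the tracked state is a single lateral offset of the detected line (for example in pixels), a random-walk motion model, and hypothetical noise variances. A measurement of `None` stands for a frame in which the line detector returned nothing.

```python
def kalman_track(measurements, process_var=1e-2, meas_var=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model.

    measurements: list of floats, or None for frames with no detection.
    Returns the filtered estimate after each frame.
    All parameter values are illustrative assumptions.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is carried over, its uncertainty grows.
        p = p + process_var
        if z is not None:
            # Update: blend prediction and measurement using the Kalman gain.
            gain = p / (p + meas_var)
            x = x + gain * (z - x)
            p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Noisy offsets with one missed detection in the middle.
offsets = [4.8, 5.3, None, 5.1, 4.9]
print([round(e, 3) for e in kalman_track(offsets)])
```

When a detection is missing, the filter keeps its previous estimate and only increases its uncertainty, which is one way a tracker can bridge frames where color thresholding or line detection fails.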
Many Transformer-based pre-trained models for code have been developed and applied to code-related tasks. In this paper, we review the existing literature, examine the suitability of model architectures for different tasks, and look at the generalization ability of models on different datasets and at their resource consumption. We examine three representative pre-trained models for code, CodeBERT, CodeGPT, and CodeT5, and conduct experiments on the four most frequently targeted software engineering tasks found in our literature survey: Code Summarization, Bug Fixing, Bug Detection, and Code Search. In our study, we showcase the capability of decoder-only models (CodeGPT) for specific generation tasks under state-of-the-art evaluation metrics and contest the common belief that the encoder-decoder architecture is optimal for general-purpose coding tasks. Additionally, we found that the most frequently used models are not necessarily the most suitable for certain applications and that developers' needs are not adequately addressed by current research. We also found that the benchmark and the frequent dataset for Bug Fixing and Code Summarization both fail to enable models to generalize to other datasets for the same task (the frequent dataset refers to the dataset used most often in the literature other than the benchmark). We use statistical testing to support our conclusions from experiments. Finally, CodeBERT is highly efficient for understanding tasks, whereas CodeT5's efficiency for generation tasks is in doubt, as the highest resource consumption does not guarantee consistently better performance across metrics. We also discuss the numerous practical issues in advancing future research on Transformer-based models for code-related tasks.
In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.
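Step (1), maximum likelihood estimation of a reward from trajectory preferences, can be illustrated with a small sketch. It makes two assumptions that the abstract does not state: preferences follow a Bradley-Terry (logistic) model, and the trajectory reward is linear in a hypothetical feature vector. The abstract's setting is general function approximation, and step (2), robust planning over a confidence set, is not shown.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_reward_mle(pairs, dim, lr=0.5, steps=500):
    """Gradient ascent on the preference log-likelihood.

    pairs: list of (feat_a, feat_b, pref), where pref is 1 if trajectory a
    was preferred over trajectory b and 0 otherwise.
    Assumed model: P(a preferred) = sigmoid(theta . (feat_a - feat_b)).
    """
    theta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for feat_a, feat_b, pref in pairs:
            diff = [a - b for a, b in zip(feat_a, feat_b)]
            p = sigmoid(sum(t * d for t, d in zip(theta, diff)))
            # Gradient of the log-likelihood of one comparison.
            for i in range(dim):
                grad[i] += (pref - p) * diff[i]
        theta = [t + lr * g / len(pairs) for t, g in zip(theta, grad)]
    return theta


# Trajectory a is preferred in 3 of 4 identical comparisons.
a, b = [1.0, 0.0], [0.0, 1.0]
pairs = [(a, b, 1), (a, b, 1), (a, b, 1), (a, b, 0)]
theta = fit_reward_mle(pairs, dim=2)
print(theta[0] - theta[1], math.log(3))
```

Only reward differences between compared trajectories are identified by such data, which is why the example prints the difference of the two coordinates rather than the coordinates themselves.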
In this paper, we introduce the Reinforced Mnemonic Reader for machine reading comprehension tasks, which enhances previous attentive readers in two aspects. First, a reattention mechanism is proposed to refine current attentions by directly accessing past attentions that are temporally memorized in a multi-round alignment architecture, so as to avoid the problems of attention redundancy and attention deficiency. Second, a new optimization approach, called dynamic-critical reinforcement learning, is introduced to extend the standard supervised method. It always encourages the model to predict a more acceptable answer, so as to address the convergence suppression problem that occurs in traditional reinforcement learning algorithms. Extensive experiments on the Stanford Question Answering Dataset (SQuAD) show that our model achieves state-of-the-art results. Meanwhile, our model outperforms previous systems by over 6% in terms of both Exact Match and F1 metrics on two adversarial SQuAD datasets.