Numerous mobile robots with mounted Ultraviolet-C (UV-C) lamps have been developed recently, yet they cannot work in the same space as humans without irradiating them with UV-C. This paper proposes a novel modular and scalable Human-Aware Genetic-based Coverage Path Planning algorithm (GHACPP) that aims to solve the problem of disinfecting unknown environments by UV-C irradiation while preventing harm to human eyes and skin. The proposed genetic-based algorithm alternates between the stages of exploring a new area, generating parts of the resulting disinfection trajectory, called mini-trajectories, and updating the current state around the robot. The system's performance in terms of effectiveness and human safety is validated and compared with one of the latest state-of-the-art online coverage path planning algorithms, SimExCoverage-STC. The experimental results confirmed both the high level of safety for humans and the efficiency of the developed algorithm in terms of reductions in path length (by 37.1%), number (39.5%) and size (35.2%) of turns, and time (7.6%) to complete the disinfection task, with a small loss in the percentage of area covered (0.6%), in comparison with the state-of-the-art approach.
Metamodel-based DSL development in language workbenches like Xtext allows language engineers to focus more on metamodels and domain concepts than on grammar details. However, the grammar generated from metamodels often requires manual modification, which can be tedious and time-consuming. During rapid prototyping and language evolution in particular, the grammar is regenerated repeatedly, which means that language engineers must repeat the same manual modifications each time. Previous work introduced GrammarOptimizer, which automatically improves the generated grammar using optimization rules. However, the optimization rules need to be configured manually, which is neither user-friendly nor convenient. In this paper, we present our vision for and current progress towards a language workbench that integrates GrammarOptimizer's grammar optimization rules to support rapid prototyping and evolution of metamodel-based languages. It provides a visual configuration of optimization rules and a real-time preview of the effects of grammar optimization to address the limitations of GrammarOptimizer. Furthermore, it supports the inference of a grammar based on examples from model instances and offers a selection of language styles. These features aim to enhance the automation level of metamodel-based DSL development with Xtext and assist language engineers in iterative development and rapid prototyping. Our paper discusses the potential and applications of this language workbench, as well as how it fills the gaps in existing language workbenches.
To facilitate research on intelligent and human-like chatbots with multi-modal context, we introduce a new video-based multi-modal dialogue dataset, called TikTalk. We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context. Compared to previous multi-modal dialogue datasets, the richer context types in TikTalk lead to more diverse conversations, but also increase the difficulty of capturing human interests from intricate multi-modal information to generate personalized responses. Moreover, external knowledge is more frequently evoked in our dataset. These facts reveal new challenges for multi-modal dialogue models. We quantitatively demonstrate the characteristics of TikTalk, propose a video-based multi-modal chitchat task, and evaluate several dialogue baselines. Experimental results indicate that the models incorporating large language models (LLMs) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall. Furthermore, no existing model can solve all the above challenges well. There is still considerable room for future improvement, even for LLMs with visual extensions. Our dataset is available at \url{//ruc-aimind.github.io/projects/TikTalk/}.
Autonomous 3D part assembly is a challenging task in the areas of robotics and 3D computer vision. This task aims to assemble individual components into a complete shape without relying on predefined instructions. In this paper, we formulate this task from a novel generative perspective, introducing the Score-based 3D Part Assembly framework (Score-PA) for 3D part assembly. Score-based methods, however, are typically time-consuming during the inference stage. To address this issue, we introduce a novel algorithm called the Fast Predictor-Corrector Sampler (FPC) that accelerates the sampling process within the framework. We employ various metrics to assess assembly quality and diversity, and our evaluation results demonstrate that our algorithm outperforms existing state-of-the-art approaches. We release our code at //github.com/J-F-Cheng/Score-PA_Score-based-3D-Part-Assembly.
Recently, Transformer-based and convolutional neural network (CNN)-based models have shown promising results in EEG signal processing. Transformer models can capture the global dependencies in EEG signals through a self-attention mechanism, while CNN models can capture local features such as sawtooth waves. In this work, we propose an end-to-end neural epilepsy detection model, EENED, that combines CNN and Transformer. Specifically, by introducing the convolution module into the Transformer encoder, EENED can learn the time-dependent relationship of the patient's EEG signal features and notice local EEG abnormal mutations closely related to epilepsy, such as the appearance of spikes and the sprinkling of sharp and slow waves. Our proposed framework combines the abilities of Transformer and CNN to capture features of EEG signals at different scales and holds promise for improving the accuracy and reliability of epilepsy detection. Our source code will be released soon on GitHub.
Recent advances in omnidirectional cameras and AR/VR headsets have spurred the adoption of 360-degree videos that are widely believed to be the future of online video streaming. 360-degree videos allow users to wear a head-mounted display (HMD) and experience the video as if they are physically present in the scene. Streaming high-quality 360-degree videos at scale is an unsolved problem that is more challenging than traditional (2D) video delivery. The data rate required to stream 360-degree videos is an order of magnitude more than traditional videos. Further, the penalty for rebuffering events where the video freezes or displays a blank screen is more severe as it may cause cybersickness. We propose an online adaptive bitrate (ABR) algorithm for 360-degree videos called BOLA360 that runs inside the client's video player and orchestrates the download of video segments from the server so as to maximize the quality-of-experience (QoE) of the user. BOLA360 conserves bandwidth by downloading only those video segments that are likely to fall within the field-of-view (FOV) of the user. In addition, BOLA360 continually adapts the bitrate of the downloaded video segments so as to enable a smooth playback without rebuffering. We prove that BOLA360 is near-optimal with respect to an optimal offline algorithm that maximizes QoE. Further, we evaluate BOLA360 on a wide range of network and user head movement profiles and show that it provides $13.6\%$ to $372.5\%$ more QoE than state-of-the-art algorithms. While ABR algorithms for traditional (2D) videos have been well-studied over the last decade, our work is the first ABR algorithm for 360-degree videos with both theoretical and empirical guarantees on its performance.
Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at //github.com/sony/bigvsan.
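For reference, the least-squares GAN objective mentioned in the abstract above can be written in its standard textbook form, shown below with target labels 0 for generated and 1 for real data. This is only the baseline LSGAN loss, not the SAN-modified loss the paper proposes; the symbols $D$ (discriminator), $G$ (generator), $x$ (real waveform), and $s$ (conditioning input such as a mel-spectrogram) are notation assumed here for illustration.

```latex
\begin{align}
  \mathcal{L}_D &= \tfrac{1}{2}\,\mathbb{E}_{x}\!\left[(D(x) - 1)^2\right]
                 + \tfrac{1}{2}\,\mathbb{E}_{s}\!\left[D(G(s))^2\right], \\
  \mathcal{L}_G &= \tfrac{1}{2}\,\mathbb{E}_{s}\!\left[(D(G(s)) - 1)^2\right].
\end{align}
```

The discriminator is trained to minimize $\mathcal{L}_D$ and the generator to minimize $\mathcal{L}_G$; vocoder implementations often drop the factor $\tfrac{1}{2}$, which does not change the minimizers.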
Integrating contact-awareness into a soft snake robot and efficiently controlling its locomotion in response to contact information present significant challenges. This paper aims to solve the contact-aware locomotion problem of a soft snake robot by developing bio-inspired contact-aware locomotion controllers. To provide effective contact information for the controllers, we develop a scale-covered sensor structure mimicking natural snakes' \textit{scale sensilla}. In the design of the control framework, our core contribution is the development of a novel sensory feedback mechanism for the Matsuoka central pattern generator (CPG) network. This mechanism allows the Matsuoka CPG system to work like a "spinal cord" in the whole contact-aware control scheme, which simultaneously takes stimuli including tonic input signals from the "brain" (a goal-tracking locomotion controller) and sensory feedback signals from the "reflex arc" (the contact reactive controller), and generates rhythmic signals to effectively actuate the soft snake robot to slither through densely allocated obstacles. In the design of the "reflex arc", we develop two types of reactive controllers -- 1) a reinforcement learning (RL) sensor regulator that learns to manipulate the sensory feedback inputs of the CPG system, and 2) a local reflexive sensor-CPG network that directly connects sensor readings and the CPG's feedback inputs in a special topology. These two reactive controllers respectively facilitate two different contact-aware locomotion control schemes. The two control schemes are tested and evaluated on the soft snake robot, showing promising performance in the contact-aware locomotion tasks. The experimental results further verify the benefit of the Matsuoka CPG system in bio-inspired robot controller design.
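To make the role of the CPG in the abstract above more concrete, the following is a minimal sketch of a generic two-neuron Matsuoka oscillator with mutual inhibition, integrated with forward Euler. It is not the paper's network or its feedback mechanism: the function names, the parameter values, and the way the `feedback` term is added to each neuron's input are assumptions chosen only so that the oscillator produces a rhythmic output.

```python
def simulate_matsuoka(steps=20000, dt=0.001, tonic=1.0, feedback=(0.0, 0.0),
                      tau_r=0.25, tau_a=0.5, beta=2.5, w=2.5):
    """Euler-integrate a two-neuron Matsuoka oscillator.

    Returns the list of outputs y1 - y2, one value per time step.
    All parameter values are illustrative assumptions, not the paper's.
    """
    u = [0.1, 0.0]  # membrane states; asymmetric start breaks the symmetry
    v = [0.0, 0.0]  # adaptation (fatigue) states
    out = []
    for _ in range(steps):
        # Firing rates are the rectified membrane states.
        y = [max(0.0, u[0]), max(0.0, u[1])]
        # Each neuron is driven by the tonic input plus a (hypothetical)
        # sensory feedback term, and inhibited by its own adaptation
        # state and by the other neuron's output.
        du = [(-u[i] - beta * v[i] - w * y[1 - i] + tonic + feedback[i]) / tau_r
              for i in range(2)]
        dv = [(-v[i] + y[i]) / tau_a for i in range(2)]
        for i in range(2):
            u[i] += dt * du[i]
            v[i] += dt * dv[i]
        out.append(y[0] - y[1])
    return out


def count_sign_changes(signal):
    """Count how often the signal switches between positive and negative."""
    changes, last = 0, 0
    for x in signal:
        sign = (x > 0) - (x < 0)
        if sign != 0:
            if last != 0 and sign != last:
                changes += 1
            last = sign
    return changes


if __name__ == "__main__":
    rhythm = simulate_matsuoka()
    print("sign changes over 20 s:", count_sign_changes(rhythm))
```

In this sketch the tonic input sets the overall drive, while a nonzero `feedback` tuple biases one side of the oscillator, which is one simple way to picture how a sensor signal could reshape the rhythm.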
Point cloud-based large-scale place recognition is fundamental for many applications like Simultaneous Localization and Mapping (SLAM). Although many models have been proposed and have achieved good performance by learning short-range local features, long-range contextual properties have often been neglected. Moreover, model size has become a bottleneck to their wide application. To overcome these challenges, we propose a super light-weight network model termed SVT-Net for large-scale place recognition. Specifically, on top of the highly efficient 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn both short-range local features and long-range contextual features in this model. Consisting of ASVT and CSVT, SVT-Net can achieve state-of-the-art performance on benchmark datasets in terms of both accuracy and speed with a super-light model size (0.9M). Meanwhile, two simplified versions of SVT-Net are introduced, which also achieve state-of-the-art performance and further reduce the model size to 0.8M and 0.4M respectively.
Multi-agent influence diagrams (MAIDs) are a popular form of graphical model that, for certain classes of games, have been shown to offer key complexity and explainability advantages over traditional extensive form game (EFG) representations. In this paper, we extend previous work on MAIDs by introducing the concept of a MAID subgame, as well as subgame perfect and trembling hand perfect equilibrium refinements. We then prove several equivalence results between MAIDs and EFGs. Finally, we describe an open source implementation for reasoning about MAIDs and computing their equilibria.
Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, they have not been well visualized or understood. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts using a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. We examine the contextual relationship between these units and their surroundings by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in a scene. We provide open source interpretation tools to help researchers and practitioners better understand their GAN models.