This paper introduces CognitiveOS, a disruptive system based on multiple transformer-based models, endowing robots of various types with cognitive abilities not only for communication with humans but also for task resolution through physical interaction with the environment. The system operates smoothly on different robotic platforms without extra tuning. It autonomously makes decisions for task execution by analyzing the environment and using information from its long-term memory. The system underwent testing on various platforms, including quadruped robots and manipulator robots, showcasing its capability to formulate behavioral plans even for robots whose behavioral examples were absent in the training dataset. Experimental results revealed the system's high performance in advanced task comprehension and adaptability, emphasizing its potential for real-world applications. The chapters of this paper describe the key components of the system and the dataset structure. The dataset for fine-tuning step generation model is provided at the following link: link coming soon
Recent large language models (LLMs) have witnessed significant advancement in various tasks, including mathematical reasoning and theorem proving. As these two tasks require strict and formal multi-step inference, they are appealing domains for exploring the reasoning ability of LLMs but still face important challenges. Previous studies such as Chain-of-Thought (CoT) have revealed the effectiveness of intermediate steps guidance. However, such step-wise annotation requires heavy labor, leading to insufficient training steps for current benchmarks. To fill this gap, this work introduces MUSTARD, a data generation framework that masters uniform synthesis of theorem and proof data of high quality and diversity. MUSTARD synthesizes data in three stages: (1) It samples a few mathematical concept seeds as the problem category. (2) Then, it prompts a generative language model with the sampled concepts to obtain both the problems and their step-wise formal solutions. (3) Lastly, the framework utilizes a proof assistant (e.g., Lean Prover) to filter the valid proofs. With the proposed MUSTARD, we present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. Each data point contains an informal statement, an informal proof, and a translated formal proof that passes the prover validation. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data. We further apply the MUSTARDSAUCE for fine-tuning smaller language models. The fine-tuned Llama 2-7B achieves a 15.41% average relative performance gain in automated theorem proving, and 8.18% in math word problems. Codes and data are available at //github.com/Eleanor-H/MUSTARD.
Medical generative models, acknowledged for their high-quality sample generation ability, have accelerated the fast growth of medical applications. However, recent works concentrate on separate medical generation models for distinct medical tasks and are restricted to inadequate medical multi-modal knowledge, constraining medical comprehensive diagnosis. In this paper, we propose MedM2G, a Medical Multi-Modal Generative framework, with the key innovation to align, extract, and generate medical multi-modal within a unified model. Extending beyond single or two medical modalities, we efficiently align medical multi-modal through the central alignment approach in the unified space. Significantly, our framework extracts valuable clinical knowledge by preserving the medical visual invariant of each imaging modal, thereby enhancing specific medical information for multi-modal generation. By conditioning the adaptive cross-guided parameters into the multi-flow diffusion framework, our model promotes flexible interactions among medical multi-modal for generation. MedM2G is the first medical generative model that unifies medical generation tasks of text-to-image, image-to-text, and unified generation of medical modalities (CT, MRI, X-ray). It performs 5 medical generation tasks across 10 datasets, consistently outperforming various state-of-the-art works.
Amidst task-specific learning-based control synthesis frameworks that achieve impressive empirical results, a unified framework that systematically constructs an optimal policy for sufficiently solving a general notion of a task is absent. Hence, we propose a theoretical framework for a task-centered control synthesis leveraging two critical ideas: 1) oracle-guided policy optimization for the non-limiting integration of sub-optimal task-based priors to guide the policy optimization and 2) task-vital multimodality to break down solving a task into executing a sequence of behavioral modes. The proposed approach results in highly agile parkour and diving on a 16-DoF dynamic bipedal robot. The obtained policy advances indefinitely on a track, performing leaps and jumps of varying lengths and heights for the parkour task. Corresponding to the dive task, the policy demonstrates front, back, and side flips from various initial heights. Finally, we introduce a novel latent mode space reachability analysis to study our policies' versatility and generalization by computing a feasible mode set function through which we certify a set of failure-free modes for our policy to perform at any given state.
Recently, promptable segmentation models, such as the Segment Anything Model (SAM), have demonstrated robust zero-shot generalization capabilities on static images. These promptable models exhibit denoising abilities for imprecise prompt inputs, such as imprecise bounding boxes. In this paper, we explore the potential of applying SAM to track and segment objects in videos where we recognize the tracking task as a prompt denoising task. Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame. Furthermore, to enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy where we provide multiple jittered and scaled box prompts for each object and preserve the mask prediction with the highest semantic similarity to the template mask. We also introduce a point-based refinement stage to handle occlusions and reduce cumulative errors. Without involving tracking modules, our approach demonstrates comparable performance in video object/instance segmentation tasks on three datasets: DAVIS2017, YouTubeVOS2018, and UVO, serving as a concise baseline and endowing SAM-based downstream applications with tracking capabilities.
Beneficial to advanced computing devices, models with massive parameters are increasingly employed to extract more information to enhance the precision in describing and predicting the patterns of objective systems. This phenomenon is particularly pronounced in research domains associated with deep learning. However, investigations of causal relationships based on statistical and informational theories have posed an interesting and valuable challenge to large-scale models in the recent decade. Macroscopic models with fewer parameters can outperform their microscopic counterparts with more parameters in effectively representing the system. This valuable situation is called "Causal Emergence." This paper introduces a quantification framework, according to the Effective Information and Transition Probability Matrix, for assessing numerical conditions of Causal Emergence as theoretical constraints of its occurrence. Specifically, our results quantitatively prove the cause of Causal Emergence. By a particular coarse-graining strategy, optimizing uncertainty and asymmetry within the model's causal structure is significantly more influential than losing maximum information due to variations in model scales. Moreover, by delving into the potential exhibited by Partial Information Decomposition and Deep Learning networks in the study of Causal Emergence, we discuss potential application scenarios where our quantification framework could play a role in future investigations of Causal Emergence.
Deep Learning (DL) models have become crucial in digital transformation, thus raising concerns about their intellectual property rights. Different watermarking techniques have been developed to protect Deep Neural Networks (DNNs) from IP infringement, creating a competitive field for DNN watermarking and removal methods. The predominant watermarking schemes use white-box techniques, which involve modifying weights by adding a unique signature to specific DNN layers. On the other hand, existing attacks on white-box watermarking usually require knowledge of the specific deployed watermarking scheme or access to the underlying data for further training and fine-tuning. We propose DeepEclipse, a novel and unified framework designed to remove white-box watermarks. We present obfuscation techniques that significantly differ from the existing white-box watermarking removal schemes. DeepEclipse can evade watermark detection without prior knowledge of the underlying watermarking scheme, additional data, or training and fine-tuning. Our evaluation reveals that DeepEclipse excels in breaking multiple white-box watermarking schemes, reducing watermark detection to random guessing while maintaining a similar model accuracy as the original one. Our framework showcases a promising solution to address the ongoing DNN watermark protection and removal challenges.
This paper investigates the integration and assessment of IntelliGame, a gamification plugin initially designed for Java development, within the realm of JavaScript unit testing. We aim to verify the generalizability of IntelliGame to JavaScript development and to provide valuable insights into the experiment's design. For this, we first customize IntelliGame for JavaScript, and then conduct a controlled experiment involving 152 participants utilizing the Jest testing framework, and finally examine its influence on testing behavior and the overall developer experience. The findings from this study provide valuable insights for improving JavaScript testing methodologies through the incorporation of gamification.
Neural machine translation (NMT) has progressed rapidly in the past few years, promising improvements and quality translations for different languages. Evaluation of this task is crucial to determine the quality of the translation. Overall, insufficient emphasis is placed on the actual sense of the translation in traditional methods. We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text. This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet. Through the calculation of the semantic distance between the source and its back translation of the output, our method introduces a quantifiable approach that empowers sentence comparison on the same linguistic level. Factual analysis shows a strong correlation between the average evaluation scores generated by our method and the human assessments across various machine translation systems for English-German language pair. Finally, our method proposes a new multilingual approach to rank MT systems without the need for parallel corpora.
Knowledge distillation, the technique of transferring knowledge from large, complex models to smaller ones, marks a pivotal step towards efficient AI deployment. Distilling Step-by-Step (DSS), a novel method utilizing chain-of-thought (CoT) distillation, has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts. In DSS, the distilled model acquires the ability to generate rationales and predict labels concurrently through a multi-task learning framework. However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction. To this end, we investigate the mutual relationship of the two tasks from Information Bottleneck perspective and formulate it as maximizing the mutual information of the representation features of the two tasks. We propose a variational approach to solve this optimization problem using a learning-based method. Our experimental results across four datasets demonstrate that our method outperforms the state-of-the-art DSS. Our findings offer insightful guidance for future research on language model distillation as well as applications involving CoT. Code and models will be released soon.
We introduce a multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks and develop a unified framework called Scientific Information Extractor (SciIE) for with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.