With the rapid advancement of autonomous driving technology, efficient and accurate object detection has become crucial to the safety and reliability of autonomous driving systems. However, in low-visibility environments such as hazy conditions, the performance of traditional object detection algorithms often degrades significantly, failing to meet the demands of autonomous driving. To address this challenge, this paper proposes two innovative deep learning models: YOLO-Vehicle and YOLO-Vehicle-Pro. YOLO-Vehicle is an object detection model tailored specifically for autonomous driving scenarios, employing multimodal fusion techniques to combine image and textual information for object detection. YOLO-Vehicle-Pro builds upon this foundation by introducing an improved image dehazing algorithm, enhancing detection performance in low-visibility environments. In addition to model innovation, this paper also designs and implements a cloud-edge collaborative object detection system, deploying models on edge devices and offloading partial computational tasks to the cloud in complex situations. Experimental results demonstrate that on the KITTI dataset, the YOLO-Vehicle-v1s model achieved 92.1% accuracy while maintaining a detection speed of 226 FPS and an inference time of 12 ms, meeting the real-time requirements of autonomous driving. When processing hazy images, the YOLO-Vehicle-Pro model achieved 82.3% mAP@50 on the Foggy Cityscapes dataset while maintaining a detection speed of 43 FPS.
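As a rough illustration of the cloud-edge collaboration described above, the following Python sketch shows one plausible offloading rule; the complexity signals (mean confidence, haze score), the thresholds, and all names are assumptions for illustration, not the paper's design.

```python
# Illustrative sketch (not the paper's implementation): a minimal edge-cloud
# offloading rule of the kind described in the abstract. The thresholds,
# the complexity signals, and the function names are assumptions.
from dataclasses import dataclass

@dataclass
class EdgeDetection:
    boxes: list              # detected bounding boxes from the on-device model
    mean_confidence: float   # average detection confidence for the frame
    haze_score: float        # estimated haze density in [0, 1]

def should_offload(det: EdgeDetection,
                   conf_threshold: float = 0.5,
                   haze_threshold: float = 0.7) -> bool:
    """Offload a frame to the cloud when the edge model is uncertain
    (low confidence) or the scene is heavily degraded (dense haze)."""
    return det.mean_confidence < conf_threshold or det.haze_score > haze_threshold

# Example: a hazy frame with low-confidence detections is sent to the cloud,
# where a heavier model variant with dehazing could be applied.
frame_result = EdgeDetection(boxes=[], mean_confidence=0.31, haze_score=0.82)
print("offload to cloud" if should_offload(frame_result) else "keep on edge")
```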
Developing generalizable robot policies that can robustly handle varied environmental conditions and object instances remains a fundamental challenge in robot learning. While considerable efforts have focused on collecting large robot datasets and developing policy architectures to learn from such data, naively learning from visual inputs often results in brittle policies that fail to transfer beyond the training data. This work presents Prescriptive Point Priors for Policies (P3-PO), a novel framework that constructs a unique state representation of the environment by leveraging recent advances in computer vision and robot learning to achieve improved out-of-distribution generalization for robot manipulation. This representation is obtained in two steps. First, a human annotator prescribes a set of semantically meaningful points on a single demonstration frame. These points are then propagated through the dataset using off-the-shelf vision models. The derived points serve as input to state-of-the-art policy architectures for policy learning. Our experiments across four real-world tasks demonstrate an overall 43% absolute improvement over prior methods when evaluated in settings identical to training. Furthermore, P3-PO exhibits 58% and 80% gains across tasks for new object instances and more cluttered environments, respectively. Videos illustrating the robot's performance are best viewed at point-priors.github.io.
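A minimal sketch of the two-step point-prior pipeline described above is given below; the `propagate_points` tracker interface, the array shapes, and the helper names are assumptions standing in for the off-the-shelf vision models, not the authors' code.

```python
# Minimal sketch of the two-step pipeline: human-annotated points on one frame
# are propagated through a demonstration and flattened into a policy input.
import numpy as np

def build_point_state(frames: list,
                      annotated_points: np.ndarray,
                      propagate_points) -> np.ndarray:
    """frames: list of (H, W, 3) images from one demonstration.
    annotated_points: (N, 2) pixel coordinates prescribed by a human on frame 0.
    propagate_points: callable(frames, points) -> (T, N, 2) tracked locations
                      (an off-the-shelf point tracker would fill this role).
    Returns a (T, 2*N) state trajectory used as policy input."""
    tracks = propagate_points(frames, annotated_points)   # (T, N, 2)
    return tracks.reshape(len(frames), -1)                # flatten per frame

# Dummy propagator so the sketch runs: keeps points fixed across frames.
dummy = lambda frames, pts: np.repeat(pts[None], len(frames), axis=0)
states = build_point_state([np.zeros((64, 64, 3))] * 5,
                           np.array([[10.0, 20.0], [30.0, 40.0]]), dummy)
print(states.shape)  # (5, 4): one flattened point set per frame
```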
As command-line interfaces remain integral to high-performance computing environments, the risk of exploitation through stealthy and complex command-line abuse grows. Conventional security solutions struggle to detect these anomalies due to their context-specific nature, the lack of labeled data, and the prevalence of sophisticated attacks such as Living-off-the-Land (LOL). To address this gap, we introduce the Scalable Command-Line Anomaly Detection Engine (SCADE), a framework that combines global statistical models with local, context-specific analysis for unsupervised anomaly detection. SCADE leverages novel statistical methods, including BM25 and Log Entropy, alongside dynamic thresholding to adaptively detect rare, malicious command-line patterns in low signal-to-noise ratio (SNR) environments. Experimental results show that SCADE achieves above 98% SNR in identifying anomalous behavior while minimizing false positives. Designed for scalability and precision, SCADE provides an innovative, metadata-enriched approach to anomaly detection for cybersecurity in high-computation environments. This work presents SCADE's architecture and detection methodology and discusses its potential for enhancing anomaly detection in enterprise systems. We argue that SCADE represents a significant advancement in unsupervised anomaly detection, offering a robust, adaptive framework for security analysts and researchers seeking to improve detection accuracy in high-computation environments.
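To make the statistical side concrete, here is a hedged Python sketch of log-entropy token weighting combined with a simple mean-plus-k-standard-deviations dynamic threshold; the tokenization, scoring rule, and constants are illustrative assumptions and not SCADE's actual implementation.

```python
# Hedged sketch of the kind of global statistical scoring described above:
# log-entropy token weights plus a dynamic (mean + k*std) threshold.
import math
from collections import Counter, defaultdict

def log_entropy_weights(commands: list) -> dict:
    """Global weight per token: close to 1 for tokens concentrated in few
    command lines (rare), close to 0 for tokens spread uniformly everywhere."""
    n = len(commands)
    term_freq = [Counter(c.split()) for c in commands]
    global_freq = defaultdict(int)
    for tf in term_freq:
        for tok, cnt in tf.items():
            global_freq[tok] += cnt
    weights = {}
    for tok, gf in global_freq.items():
        entropy = sum((tf[tok] / gf) * math.log(tf[tok] / gf)
                      for tf in term_freq if tok in tf)
        weights[tok] = 1.0 + entropy / math.log(n) if n > 1 else 1.0
    return weights

def flag_anomalies(commands: list, k: float = 2.0) -> list:
    """Score each command by the mean weight of its tokens and flag scores
    above mean + k*std (a simple dynamic threshold)."""
    w = log_entropy_weights(commands)
    scores = [sum(w[t] for t in c.split()) / max(len(c.split()), 1)
              for c in commands]
    mu = sum(scores) / len(scores)
    sd = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
    return [c for c, s in zip(commands, scores) if s > mu + k * sd]
```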
This paper investigates the problem of trajectory planning for autonomous vehicles at unsignalized intersections, focusing specifically on scenarios where the vehicle lacks the right of way and yet must cross safely. To address this problem, we employ a method based on the Partially Observable Markov Decision Process (POMDP) framework, designed for planning under uncertainty, and use the Adaptive Belief Tree (ABT) algorithm as an approximate POMDP solver. We outline the POMDP formulation, beginning with a discretization of the intersection's topology. We then present a dynamics model that predicts the evolving states of the vehicles, such as their position and velocity, and an observation model that relates those states to the imperfect (noisy) measurements that are actually available. Our results confirm that the method plans collision-free trajectories in a series of simulations based on real-world traffic data from aerial footage of two distinct intersections. Furthermore, we study the impact of adjusting the ABT algorithm's parameters on the method's performance, providing guidance for choosing reasonable parameter settings in future applications of the method.
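The following sketch illustrates, under an assumed one-dimensional state and assumed noise magnitudes, what a dynamics and observation model of the kind described above can look like; it is not the paper's model.

```python
# Illustrative constant-velocity dynamics with random acceleration noise and a
# noisy observation of the state; noise magnitudes and state layout are assumed.
import numpy as np

def transition(state: np.ndarray, dt: float, accel_std: float = 0.5,
               rng=np.random.default_rng()) -> np.ndarray:
    """state = [position_along_lane, velocity]; propagate one step with
    random acceleration noise to model uncertain driver behaviour."""
    pos, vel = state
    accel = rng.normal(0.0, accel_std)
    return np.array([pos + vel * dt + 0.5 * accel * dt**2, vel + accel * dt])

def observe(state: np.ndarray, pos_std: float = 0.3, vel_std: float = 0.2,
            rng=np.random.default_rng()) -> np.ndarray:
    """Noisy measurement of the true state, as used in the belief update."""
    return state + rng.normal(0.0, [pos_std, vel_std])

# One simulated step for an observed vehicle approaching the intersection.
s = np.array([-20.0, 8.0])          # 20 m before the conflict point at 8 m/s
s_next = transition(s, dt=0.1)
z = observe(s_next)
print(s_next, z)
```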
Visual servoing, the method of controlling robot motion through feedback from visual sensors, has seen significant advancements with the integration of optical flow-based methods. However, its application remains limited by inherent challenges, such as the necessity for a target image at test time, the requirement of substantial overlap between initial and target images, and the reliance on feedback from a single camera. This paper introduces Imagine2Servo, an innovative approach leveraging diffusion-based image editing techniques to enhance visual servoing algorithms by generating intermediate goal images. This methodology allows for the extension of visual servoing applications beyond traditional constraints, enabling tasks like long-range navigation and manipulation without predefined goal images. We propose a pipeline that synthesizes subgoal images grounded in the task at hand, facilitating servoing in scenarios with minimal initial and target image overlap and integrating multi-camera feedback for comprehensive task execution. Our contributions demonstrate a novel application of image generation to robotic control, significantly broadening the capabilities of visual servoing systems. Real-world experiments validate the effectiveness and versatility of the Imagine2Servo framework in accomplishing a variety of tasks, marking a notable advancement in the field of visual servoing.
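Below is a high-level sketch of the subgoal-driven servoing loop this pipeline implies; `generate_subgoal`, `servo_step`, and `task_done` are hypothetical placeholders for the diffusion-based image editor, the servo controller, and the success check, and the loop budgets are assumptions.

```python
# High-level sketch of the subgoal-driven servoing loop described above;
# all callables are placeholders, not the authors' code.
def imagine2servo_loop(current_image, task_prompt,
                       generate_subgoal, servo_step, task_done,
                       max_subgoals: int = 10, max_steps: int = 200):
    """Alternate between imagining an intermediate goal image for the task
    and servoing toward it, instead of requiring a final goal image upfront."""
    for _ in range(max_subgoals):
        subgoal = generate_subgoal(current_image, task_prompt)  # diffusion edit
        for _ in range(max_steps):
            current_image, reached = servo_step(current_image, subgoal)
            if reached:                  # close enough to the imagined subgoal
                break
        if task_done(current_image, task_prompt):
            return current_image         # task completed
    return current_image                 # budget exhausted
```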
The rise of large language models (LLMs) has highlighted the importance of prompt engineering as a crucial technique for optimizing model outputs. While experimentation with various prompting methods, such as Few-shot, Chain-of-Thought, and role-based techniques, has yielded promising results, these advancements remain fragmented across academic papers, blog posts, and anecdotal experimentation. The lack of a single, unified resource consolidating the field's knowledge impedes both research and practical application. This paper argues for the creation of an overarching framework that synthesizes existing methodologies into a cohesive overview for practitioners. Using a design-based research approach, we present the Prompt Canvas, a structured framework resulting from an extensive literature review on prompt engineering that captures current knowledge and expertise. By combining the conceptual foundations and practical strategies identified in prompt engineering, the Prompt Canvas provides a practical approach to leveraging the potential of LLMs. It is primarily designed as a learning resource for pupils, students, and employees, offering a structured introduction to prompt engineering. This work aims to contribute to the growing discourse on prompt engineering by establishing a unified methodology for researchers and providing guidance for practitioners.
Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is regarded as a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated in the current search step, but also anticipates the quality of the subsequent sentences that may result from it, thus providing a long-term value signal. In this way, VisVM steers VLMs away from sentences prone to hallucinations or insufficient detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods using other visual reward signals. Furthermore, we find that self-training the model with VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at //github.com/si0wang/VisVM.
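A minimal sketch of such value-guided, sentence-level inference-time search follows; `sample_next_sentences` (the VLM) and `value_model` (the value function) are placeholder callables, and the candidate count and stopping rule are assumptions rather than the paper's settings.

```python
# Sketch of value-guided, sentence-level inference-time search: at each step,
# sample several candidate next sentences and keep the highest-value one.
def value_guided_decode(image, prompt, sample_next_sentences, value_model,
                        num_candidates: int = 4, max_sentences: int = 8) -> str:
    """Greedy-over-sentences search guided by a value model that scores the
    long-term quality of each partial response."""
    response = ""
    for _ in range(max_sentences):
        candidates = sample_next_sentences(image, prompt, response, num_candidates)
        if not candidates:
            break
        # value_model scores each (image, partial response + candidate) pair
        best = max(candidates, key=lambda s: value_model(image, response + s))
        response += best
    return response.strip()
```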
Recent advances in deep learning and natural language generation have significantly improved image captioning, enabling automated, human-like descriptions for visual content. In this work, we apply these captioning techniques to generate clinician-like interpretations of ECG data. The study leverages existing ECG datasets accompanied by free-text reports authored by healthcare professionals (HCPs) as training data. These reports, while often inconsistent, provide a valuable foundation for automated learning. We introduce an encoder-decoder-based method that uses these reports to train models to generate detailed descriptions of ECG episodes. This represents a significant advancement in ECG analysis automation, with potential applications in zero-shot classification and automated clinical decision support. The model is tested on various datasets, including both 1- and 12-lead ECGs, and significantly outperforms the state-of-the-art reference model by Qiu et al., achieving a METEOR score of 55.53% compared to the 24.51% of the reference model. Furthermore, several key design choices are discussed, providing a comprehensive overview of current challenges and innovations in this domain. The source code for this research is publicly available in our Git repository: //git.zib.de/ableich/ecg-comment-generation-public.
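For illustration only, the following PyTorch sketch shows one possible encoder-decoder ECG captioner in the spirit described above; the layer sizes, vocabulary size, and lead count are assumptions, not the study's architecture.

```python
# Minimal sketch of an encoder-decoder ECG captioner: a 1D-CNN encoder over
# the signal conditions a GRU text decoder trained with teacher forcing.
import torch
import torch.nn as nn

class ECGCaptioner(nn.Module):
    def __init__(self, num_leads=12, vocab_size=5000, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(            # (B, leads, T) -> (B, hidden)
            nn.Conv1d(num_leads, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(64, hidden, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, ecg, report_tokens):
        """ecg: (B, leads, T) signal; report_tokens: (B, L) token ids.
        Returns (B, L, vocab) logits for teacher-forced training."""
        context = self.encoder(ecg).unsqueeze(0)          # (1, B, hidden)
        emb = self.embed(report_tokens)                   # (B, L, hidden)
        out, _ = self.decoder(emb, context)               # condition on ECG
        return self.head(out)

# Example forward pass with random data.
model = ECGCaptioner()
logits = model(torch.randn(2, 12, 1000), torch.randint(0, 5000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 5000])
```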
Despite impressive advancements in recent multimodal reasoning approaches, they remain limited in flexibility and efficiency: these models typically process only a few fixed modality inputs and require updates to numerous parameters. This paper tackles these critical challenges and proposes CREMA, a generalizable, highly efficient, and modular modality-fusion framework that can incorporate any new modality to enhance video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio, thermal heatmap, and touch map) from given videos without extra human annotation by leveraging sensors or existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. It projects diverse modality features into the LLM token embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a novel progressive multimodal fusion design supported by a lightweight fusion module and a modality-sequential training strategy. It compresses information across the various assisting modalities, maintaining computational efficiency in the LLM while improving performance. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including conventional VideoQA and Video-Audio/3D/Touch/Thermal QA, and achieve better or equivalent performance compared with strong multimodal LLMs, including OneLLM, BLIP-2, and SeViLA, while reducing trainable parameters by over 90%. We provide extensive analyses of CREMA, including the impact of each modality on the reasoning domains, the design of the fusion module, and example visualizations.
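The per-modality projection idea can be sketched as follows; the low-rank module design, the dimensions, and the modality names are assumptions chosen for illustration rather than CREMA's actual modules.

```python
# Sketch: each modality gets its own lightweight projector that maps its
# features into the LLM token-embedding space; projected tokens are concatenated.
import torch
import torch.nn as nn

class ModalityProjectors(nn.Module):
    def __init__(self, modality_dims: dict, llm_dim: int = 4096, rank: int = 64):
        super().__init__()
        # One small low-rank projector per modality (flow, audio, 3D, ...).
        self.projectors = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, rank), nn.GELU(),
                                nn.Linear(rank, llm_dim))
            for name, dim in modality_dims.items()
        })

    def forward(self, features: dict) -> torch.Tensor:
        """features[name]: (B, N_name, dim_name) tokens for each modality.
        Returns (B, sum N_name, llm_dim) tokens to prepend to the LLM input."""
        projected = [self.projectors[name](feat) for name, feat in features.items()]
        return torch.cat(projected, dim=1)

proj = ModalityProjectors({"optical_flow": 768, "audio": 512, "point_cloud": 256})
tokens = proj({"optical_flow": torch.randn(2, 32, 768),
               "audio": torch.randn(2, 16, 512),
               "point_cloud": torch.randn(2, 64, 256)})
print(tokens.shape)  # torch.Size([2, 112, 4096])
```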
In the post-deep-learning era, the Transformer architecture has demonstrated powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred many researchers, and numerous efforts have been made to design more efficient alternatives to attention models. Among them, the State Space Model (SSM), as a possible replacement for the self-attention based Transformer, has drawn increasing attention in recent years. In this paper, we give the first comprehensive review of these works and also provide experimental comparisons and analysis to better demonstrate the features and advantages of SSMs. Specifically, we first give a detailed description of the underlying principles to help readers quickly grasp the key ideas of SSMs. We then review existing SSMs and their various applications, including natural language processing, computer vision, graphs, multi-modal and multi-media data, point clouds/event streams, time series data, and other domains. In addition, we provide statistical comparisons and analysis of these models, which we hope will help readers understand the effectiveness of different structures on various tasks. Finally, we propose possible research directions to promote the development of both the theory and applications of SSMs. More related works will be continuously updated on the following GitHub page: //github.com/Event-AHU/Mamba_State_Space_Model_Paper_List.
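For readers new to the topic, the standard linear state space formulation that most of the surveyed SSMs build on, written as the continuous-time system plus its zero-order-hold discretization, is:

```latex
% Continuous-time state space model and its zero-order-hold discretization,
% the common starting point for structured SSMs such as S4 and Mamba.
\[
\begin{aligned}
  h'(t) &= A\,h(t) + B\,x(t), & y(t) &= C\,h(t),\\
  h_t   &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, & y_t &= C\,h_t,\\
  \bar{A} &= \exp(\Delta A), & \bar{B} &= (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B .
\end{aligned}
\]
```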
Human intelligence thrives on the concept of cognitive synergy, where collaboration and information integration among different cognitive processes yield superior outcomes compared to individual cognitive processes in isolation. Although Large Language Models (LLMs) have demonstrated promising performance as general task-solving agents, they still struggle with tasks that require intensive domain knowledge and complex reasoning. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist refers to an intelligent agent that collaborates with multiple minds, combining their individual strengths and knowledge, to enhance problem-solving and overall performance in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. We have discovered that assigning multiple, fine-grained personas in LLMs elicits better problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, SPP effectively elicits internal knowledge acquisition abilities, reduces hallucination, and maintains strong reasoning capabilities. Code, data, and prompts can be found at: //github.com/MikeWangWZHL/Solo-Performance-Prompting.git.
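As a rough illustration (a paraphrase, not the paper's exact prompt), an SPP-style multi-persona prompt can be assembled as follows; the template wording and the example task are assumptions.

```python
# Simplified illustration of a Solo Performance Prompting style prompt: a
# single LLM is asked to identify relevant personas and let them collaborate
# in turns before producing a final answer.
SPP_TEMPLATE = """When faced with a task, begin by identifying the participants
who will contribute to solving it. Then, initiate a multi-turn collaboration
until a final answer is reached. The participants should give critical comments
and detailed suggestions whenever necessary.

Task: {task}

Participants: """

def build_spp_prompt(task: str) -> str:
    """Fill the template; the resulting prompt is sent to one LLM, which
    simulates all personas itself (no separate model instances are needed)."""
    return SPP_TEMPLATE.format(task=task)

print(build_spp_prompt("Write a short story that mentions the capital of France "
                       "and the chemical symbol for gold."))
```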