In this work, we present COSTREAM, a novel learned cost model for Distributed Stream Processing Systems that provides accurate predictions of the execution costs of a streaming query in an edge-cloud environment. The cost model can be used to find an initial placement of operators across heterogeneous hardware, which is particularly important in these environments. In our evaluation, we demonstrate that COSTREAM can produce highly accurate cost estimates for the initial operator placement and even generalize to unseen placements, queries, and hardware. When using COSTREAM to optimize the placements of streaming operators, a median speed-up of around 21x can be achieved compared to baselines.
This paper introduces RobotCycle, a novel ongoing project that leverages Autonomous Vehicle (AV) research to investigate how road infrastructure influences cyclist behaviour and safety during real-world journeys. The project's requirements were defined in collaboration with key stakeholders, including city planners, cyclists, and policymakers, informing the design of risk and safety metrics and the data collection criteria. We propose a data-driven approach relying on a novel, rich dataset of diverse traffic scenes and scenarios captured using a custom-designed wearable sensing unit. By analysing road-user trajectories, we identify normal path deviations indicating potential risks or hazardous interactions related to infrastructure elements in the environment. Our analysis correlates driving profiles and trajectory patterns with local road segments, driving conditions, and road-user interactions to predict traffic behaviours and identify critical scenarios. Moreover, by leveraging advancements in AV research, the project generates detailed 3D High-Definition Maps (HD Maps), traffic flow patterns, and trajectory models to provide a comprehensive assessment and analysis of the behaviour of all traffic agents. These data can then inform the design of cyclist-friendly road infrastructure, ultimately enhancing road safety and cyclability. The project provides valuable insights for enhancing cyclist protection and advancing sustainable urban mobility.
This paper presents the UniMER dataset to provide the first study on Mathematical Expression Recognition (MER) towards complex real-world scenarios. The UniMER dataset consists of a large-scale training set UniMER-1M offering an unprecedented scale and diversity with one million training instances and a meticulously designed test set UniMER-Test that reflects a diverse range of formula distributions prevalent in real-world scenarios. Therefore, the UniMER dataset enables the training of a robust and high-accuracy MER model and comprehensive evaluation of model performance. Moreover, we introduce the Universal Mathematical Expression Recognition Network (UniMERNet), an innovative framework designed to enhance MER in practical scenarios. UniMERNet incorporates a Length-Aware Module to process formulas of varied lengths efficiently, thereby enabling the model to handle complex mathematical expressions with greater accuracy. In addition, UniMERNet employs our UniMER-1M data and image augmentation techniques to improve the model's robustness under different noise conditions. Our extensive experiments demonstrate that UniMERNet outperforms existing MER models, setting a new benchmark in various scenarios and ensuring superior recognition quality in real-world applications. The dataset and model are available at //github.com/opendatalab/UniMERNet.
In this paper, we present PRISM, a Promptable and Robust Interactive Segmentation Model, aiming for precise segmentation of 3D medical images. PRISM accepts various visual inputs, including points, boxes, and scribbles as sparse prompts, as well as masks as dense prompts. Specifically, PRISM is designed with four principles to achieve robustness: (1) Iterative learning. The model produces segmentations by using visual prompts from previous iterations to achieve progressive improvement. (2) Confidence learning. PRISM employs multiple segmentation heads per input image, each generating a continuous map and a confidence score to optimize predictions. (3) Corrective learning. Following each segmentation iteration, PRISM employs a shallow corrective refinement network to reassign mislabeled voxels. (4) Hybrid design. PRISM integrates hybrid encoders to better capture both the local and global information. Comprehensive validation of PRISM is conducted using four public datasets for tumor segmentation in the colon, pancreas, liver, and kidney, highlighting challenges caused by anatomical variations and ambiguous boundaries in accurate tumor identification. Compared to state-of-the-art methods, both with and without prompt engineering, PRISM significantly improves performance, achieving results that are close to human levels. The code is publicly available at //github.com/MedICL-VU/PRISM.
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) recently gains increasing attention, with the development of vision-language pre-trainings. To enable generalization of arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill the research gap, this paper pioneers to evaluate existing methods by simulating multi-level noises of various types, and reveals their poor robustness. To tackle the noisy OVAR task, we further propose one novel DENOISER framework, covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via one decoding process, i.e., propose text candidates, then utilize inter-modal and intra-modal information to vote for the best. At the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thus obtaining more semantics. For optimization, we alternately iterate between generative and discriminative parts for progressive refinements. The denoised text classes help OVAR models classify visual samples more accurately; in return, classified visual samples help better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.
In this paper, we introduce X-Ray, an innovative approach to 3D generation that employs a new sequential representation, drawing inspiration from the depth-revealing capabilities of X-Ray scans to meticulously capture both the external and internal features of objects. Central to our method is the utilization of ray casting techniques originating from the camera's viewpoint, meticulously recording the geometric and textural details encountered across all intersected surfaces. This process efficiently condenses complete objects or scenes into a multi-frame format, just like videos. Such a structure ensures the 3D representation is composed solely of critical surface information. Highlighting the practicality and adaptability of our X-Ray representation, we showcase its utility in synthesizing 3D objects, employing a network architecture akin to that used in video diffusion models. The outcomes reveal our representation's superior performance in enhancing both the accuracy and efficiency of 3D synthesis, heralding new directions for ongoing research and practical implementations in the field.
This article presents Persistence Administered Collective Navigation (PACNav) as an approach for achieving decentralized collective navigation of Unmanned Aerial Vehicle (UAV) swarms. The technique is inspired by the flocking and collective navigation behavior observed in natural swarms, such as cattle herds, bird flocks, and even large groups of humans. PACNav relies solely on local observations of relative positions of UAVs, making it suitable for large swarms deprived of communication capabilities and external localization systems. We introduce the novel concepts of path persistence and path similarity, which allow each swarm member to analyze the motion of others. PACNav is grounded on two main principles: (1) UAVs with little variation in motion direction exhibit high path persistence and are considered reliable leaders by other UAVs; (2) groups of UAVs that move in a similar direction demonstrate high path similarity, and such groups are assumed to contain a reliable leader. The proposed approach also incorporates a reactive collision avoidance mechanism to prevent collisions with swarm members and environmental obstacles. The method is validated through simulated and real-world experiments conducted in a natural forest.
In this article, we propose an accuracy-assuring technique for finding a solution for unsymmetric linear systems. Such problems are related to different areas such as image processing, computer vision, and computational fluid dynamics. Parallel implementation of Krylov subspace methods speeds up finding approximate solutions for linear systems. In this context, the refined approach in pipelined BiCGStab enhances scalability on distributed memory machines, yielding to substantial speed improvements compared to the standard BiCGStab method. However, it's worth noting that the pipelined BiCGStab algorithm sacrifices some accuracy, which is stabilized with the residual replacement technique. This paper aims to address this issue by employing the ExBLAS-based reproducible approach. We validate the idea on a set of matrices from the SuiteSparse Matrix Collection.
In this paper, we present two novel methods in Network Intrusion Detection Systems (NIDS) using Graph Neural Networks (GNNs). The first approach, Scattering Transform with E-GraphSAGE (STEG), utilizes the scattering transform to conduct multi-resolution analysis of edge feature vectors. This provides a detailed representation that is essential for identifying subtle anomalies in network traffic. The second approach improves node representation by initiating with Node2Vec, diverging from standard methods of using uniform values, thereby capturing a more accurate and holistic network picture. Our methods have shown significant improvements in performance compared to existing state-of-the-art methods in benchmark NIDS datasets.
In this paper, we investigate the retrieval-augmented generation (RAG) based on Knowledge Graphs (KGs) to improve the accuracy and reliability of Large Language Models (LLMs). Recent approaches suffer from insufficient and repetitive knowledge retrieval, tedious and time-consuming query parsing, and monotonous knowledge utilization. To this end, we develop a Hypothesis Knowledge Graph Enhanced (HyKGE) framework, which leverages LLMs' powerful reasoning capacity to compensate for the incompleteness of user queries, optimizes the interaction process with LLMs, and provides diverse retrieved knowledge. Specifically, HyKGE explores the zero-shot capability and the rich knowledge of LLMs with Hypothesis Outputs to extend feasible exploration directions in the KGs, as well as the carefully curated prompt to enhance the density and efficiency of LLMs' responses. Furthermore, we introduce the HO Fragment Granularity-aware Rerank Module to filter out noise while ensuring the balance between diversity and relevance in retrieved knowledge. Experiments on two Chinese medical multiple-choice question datasets and one Chinese open-domain medical Q&A dataset with two LLM turbos demonstrate the superiority of HyKGE in terms of accuracy and explainability.
We propose MeshLRM, a novel LRM-based approach that can reconstruct a high-quality mesh from merely four input images in less than one second. Different from previous large reconstruction models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering within the LRM framework. This allows for end-to-end mesh reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering. Moreover, we improve the LRM architecture by simplifying several complex designs in previous LRMs. MeshLRM's NeRF initialization is sequentially trained with low- and high-resolution images; this new LRM training strategy enables significantly faster convergence and thereby leads to better quality with less compute. Our approach achieves state-of-the-art mesh reconstruction from sparse-view inputs and also allows for many downstream applications, including text-to-3D and single-image-to-3D generation. Project page: //sarahweiii.github.io/meshlrm/