Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recording of procedural activities coupled with their corresponding instruction. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.
Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects. Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition. As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative object position questions to the more dynamic tasks. SAT contains 218K question-answer pairs for 22K synthetic scenes across a training and testing set. Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions. Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks: $23\%$ on CVBench, $8\%$ on the harder BLINK benchmark, and $18\%$ on VSR. When instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning. Our data/code is available at //arijitray1993.github.io/SAT/ .
Agglomerative models have recently emerged as a powerful approach to training vision foundation models, leveraging multi-teacher distillation from existing models such as CLIP, DINO, and SAM. This strategy enables the efficient creation of robust models, combining the strengths of individual teachers while significantly reducing computational and resource demands. In this paper, we thoroughly analyze state-of-the-art agglomerative models, identifying critical challenges including resolution mode shifts, teacher imbalance, idiosyncratic teacher artifacts, and an excessive number of output tokens. To address these issues, we propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions. Specifically, in the context of Vision Language Models, we introduce a token compression technique to maintain high-resolution information within a fixed token count. We release our top-performing models, available in multiple scales (-B, -L, -H, and -g), alongside inference code and pretrained weights.
Recent advances in Meta-learning for Black-Box Optimization (MetaBBO) have shown the potential of using neural networks to dynamically configure evolutionary algorithms (EAs), enhancing their performance and adaptability across various BBO instances. However, they are often tailored to a specific EA, which limits their generalizability and necessitates retraining or redesigns for different EAs and optimization problems. To address this limitation, we introduce ConfigX, a new paradigm of the MetaBBO framework that is capable of learning a universal configuration agent (model) for boosting diverse EAs. To achieve so, our ConfigX first leverages a novel modularization system that enables the flexible combination of various optimization sub-modules to generate diverse EAs during training. Additionally, we propose a Transformer-based neural network to meta-learn a universal configuration policy through multitask reinforcement learning across a designed joint optimization task space. Extensive experiments verify that, our ConfigX, after large-scale pre-training, achieves robust zero-shot generalization to unseen tasks and outperforms state-of-the-art baselines. Moreover, ConfigX exhibits strong lifelong learning capabilities, allowing efficient adaptation to new tasks through fine-tuning. Our proposed ConfigX represents a significant step toward an automatic, all-purpose configuration agent for EAs.
Daily monitoring of intra-personal facial changes associated with health and emotional conditions has great potential to be useful for medical, healthcare, and emotion recognition fields. However, the approach for capturing intra-personal facial changes is relatively unexplored due to the difficulty of collecting temporally changing face images. In this paper, we propose a facial representation learning method using synthetic images for comparing faces, called ComFace, which is designed to capture intra-personal facial changes. For effective representation learning, ComFace aims to acquire two feature representations, i.e., inter-personal facial differences and intra-personal facial changes. The key point of our method is the use of synthetic face images to overcome the limitations of collecting real intra-personal face images. Facial representations learned by ComFace are transferred to three extensive downstream tasks for comparing faces: estimating facial expression changes, weight changes, and age changes from two face images of the same individual. Our ComFace, trained using only synthetic data, achieves comparable to or better transfer performance than general pre-training and state-of-the-art representation learning methods trained using real images.
Urban areas, as the primary human habitat in modern civilization, accommodate a broad spectrum of social activities. With the surge of embodied intelligence, recent years have witnessed an increasing presence of physical agents in urban areas, such as autonomous vehicles and delivery robots. As a result, practitioners significantly value crafting authentic, simulation-ready 3D cities to facilitate the training and verification of such agents. However, this task is quite challenging. Current generative methods fall short in either diversity, controllability, or fidelity. In this work, we resort to the procedural content generation (PCG) technique for high-fidelity generation. It assembles superior assets according to empirical rules, ultimately leading to industrial-grade outcomes. To ensure diverse and self contained creation, we design a management protocol to accommodate extensive PCG plugins with distinct functions and interfaces. Based on this unified PCG library, we develop a multi-agent framework to transform multi-modal instructions, including OSM, semantic maps, and satellite images, into executable programs. The programs coordinate relevant plugins to construct the 3D city consistent with the control condition. A visual feedback scheme is introduced to further refine the initial outcomes. Our method, named CityX, demonstrates its superiority in creating diverse, controllable, and realistic 3D urban scenes. The synthetic scenes can be seamlessly deployed as a real-time simulator and an infinite data generator for embodied intelligence research. Our project page: //cityx-lab.github.io.
Cooperative perception has attracted wide attention given its capability to leverage shared information across connected automated vehicles (CAVs) and smart infrastructures to address sensing occlusion and range limitation issues. However, existing research overlooks the fragile multi-sensor correlations in multi-agent settings, as the heterogeneous agent sensor measurements are highly susceptible to environmental factors, leading to weakened inter-agent sensor interactions. The varying operational conditions and other real-world factors inevitably introduce multifactorial noise and consequentially lead to multi-sensor misalignment, making the deployment of multi-agent multi-modality perception particularly challenging in the real world. In this paper, we propose AgentAlign, a real-world heterogeneous agent cross-modality feature alignment framework, to effectively address these multi-modality misalignment issues. Our method introduces a cross-modality feature alignment space (CFAS) and heterogeneous agent feature alignment (HAFA) mechanism to harmonize multi-modality features across various agents dynamically. Additionally, we present a novel V2XSet-noise dataset that simulates realistic sensor imperfections under diverse environmental conditions, facilitating a systematic evaluation of our approach's robustness. Extensive experiments on the V2X-Real and V2XSet-Noise benchmarks demonstrate that our framework achieves state-of-the-art performance, underscoring its potential for real-world applications in cooperative autonomous driving. The controllable V2XSet-Noise dataset and generation pipeline will be released in the future.
As cyber threats continue to evolve in sophistication and scale, the ability to detect anomalous network behavior has become critical for maintaining robust cybersecurity defenses. Modern cybersecurity systems face the overwhelming challenge of analyzing billions of daily network interactions to identify potential threats, making efficient and accurate anomaly detection algorithms crucial for network defense. This paper investigates the use of variations of the Isolation Forest (iForest) machine learning algorithm for detecting anomalies in internet scan data. In particular, it presents the Set-Partitioned Isolation Forest (siForest), a novel extension of the iForest method designed to detect anomalies in set-structured data. By treating instances such as sets of multiple network scans with the same IP address as cohesive units, siForest effectively addresses some challenges of analyzing complex, multidimensional datasets. Extensive experiments on synthetic datasets simulating diverse anomaly scenarios in network traffic demonstrate that siForest has the potential to outperform traditional approaches on some types of internet scan data.
Robotic devices hold great potential for efficient and reliable assessment of neuromotor abnormalities in post-stroke patients. However, spasticity caused by stroke is still assessed manually in clinical settings. The limited and variable nature of data collected from patients has long posed a major barrier to quantitatively modelling spasticity with robotic measurements and fully validating robotic assessment techniques. This paper presents a simulation framework developed to support the design and validation of elbow spasticity models and mitigate data problems. The framework consists of a simulation environment of robot-assisted spasticity assessment, two motion controllers for the robot and human models, and a stretch reflex controller. Our framework allows simulation based on synthetic data without experimental data from human subjects. Using this framework, we replicated the constant-velocity stretch experiment typically used in robot-assisted spasticity assessment and evaluated four types of spasticity models. Our results show that a spasticity reflex model incorporating feedback on both muscle fibre velocity and length more accurately captures joint resistance characteristics during passive elbow stretching in spastic patients than a force-dependent model. When integrated with an appropriate spasticity model, this simulation framework has the potential to generate extensive datasets of virtual patients for future research on spasticity assessment.
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed and many aspects of common sense remain untested. Consequently, we do not currently have any reliable way of measuring to what extent existing AI systems have achieved these abilities. This paper surveys the development and uses of AI commonsense benchmarks. We discuss the nature of common sense; the role of common sense in AI; the goals served by constructing commonsense benchmarks; and desirable features of commonsense benchmarks. We analyze the common flaws in benchmarks, and we argue that it is worthwhile to invest the work needed ensure that benchmark examples are consistently high quality. We survey the various methods of constructing commonsense benchmarks. We enumerate 139 commonsense benchmarks that have been developed: 102 text-based, 18 image-based, 12 video based, and 7 simulated physical environments. We discuss the gaps in the existing benchmarks and aspects of commonsense reasoning that are not addressed in any existing benchmark. We conclude with a number of recommendations for future development of commonsense AI benchmarks.
Distant supervision can effectively label data for relation extraction, but suffers from the noise labeling problem. Recent works mainly perform soft bag-level noise reduction strategies to find the relatively better samples in a sentence bag, which is suboptimal compared with making a hard decision of false positive samples in sentence level. In this paper, we introduce an adversarial learning framework, which we named DSGAN, to learn a sentence-level true-positive generator. Inspired by Generative Adversarial Networks, we regard the positive samples generated by the generator as the negative samples to train the discriminator. The optimal generator is obtained until the discrimination ability of the discriminator has the greatest decline. We adopt the generator to filter distant supervision training dataset and redistribute the false positive instances into the negative set, in which way to provide a cleaned dataset for relation classification. The experimental results show that the proposed strategy significantly improves the performance of distant supervision relation extraction comparing to state-of-the-art systems.