With recent advancements in aerospace technology, the volume of unlabeled remote sensing image (RSI) data has increased dramatically. Effectively leveraging this data through self-supervised learning (SSL) is vital in the field of remote sensing. However, current methodologies, particularly contrastive learning (CL), a leading SSL method, encounter specific challenges in this domain. Firstly, CL often mistakenly identifies geographically adjacent samples with similar semantic content as negative pairs, leading to confusion during model training. Secondly, as an instance-level discriminative task, it tends to neglect the essential fine-grained features and complex details inherent in unstructured RSIs. To overcome these obstacles, we introduce SwiMDiff, a novel self-supervised pre-training framework designed for RSIs. SwiMDiff employs a scene-wide matching approach that effectively recalibrates labels to recognize data from the same scene as false negatives. This adjustment makes CL more applicable to the nuances of remote sensing. Additionally, SwiMDiff seamlessly integrates CL with a diffusion model. Through the implementation of pixel-level diffusion constraints, we enhance the encoder's ability to capture both the global semantic information and the fine-grained features of the images more comprehensively. Our proposed framework significantly enriches the information available for downstream tasks in remote sensing. Demonstrating exceptional performance in change detection and land-cover classification tasks, SwiMDiff proves its substantial utility and value in the field of remote sensing.
Photo-realistic avatar is a modern term referring to the digital asset that represents a human in computer graphic advance systems such as video games and simulation tools. These avatars utilize the advances in graphic technologies on both software and hardware aspects. While photorealistic avatars are increasingly used in industrial simulations, representing human factors such as human workers internal states, remains a challenge. This article addresses this issue by introducing the concept of MetaStates which are the digitization and representation of the psychophysiological states of a human worker in the digital world. The MetaStates influence the physical representation and performance of a digital human worker while performing a task. To demonstrate this concept the study presents a development of a photorealistic avatar which is integrated into a simulated environment and enhanced with a multi-level graphical representation of different psychophysiological states. This approach represents a major step forward in the use of digital humans for industrial simulations, allowing companies to better leverage the benefits of the Industrial Metaverse in their daily operations and simulations while keeping human workers at the center of the system.
Knowledge Graph Completion (KGC) is crucial for addressing knowledge graph incompleteness and supporting downstream applications. Many models have been proposed for KGC. They can be categorized into two main classes: triple-based and text-based approaches. Triple-based methods struggle with long-tail entities due to limited structural information and imbalanced entity distributions. Text-based methods alleviate this issue but require costly training for language models and specific finetuning for knowledge graphs, which limits their efficiency. To alleviate these limitations, in this paper, we propose KICGPT, a framework that integrates a large language model (LLM) and a triple-based KGC retriever. It alleviates the long-tail problem without incurring additional training overhead. KICGPT uses an in-context learning strategy called Knowledge Prompt, which encodes structural knowledge into demonstrations to guide the LLM. Empirical results on benchmark datasets demonstrate the effectiveness of KICGPT with smaller training overhead and no finetuning.
Intelligent reflecting surfaces (IRSs) have several prominent advantages, including improving the level of wireless communication security and privacy. In this work, we focus on the latter aspect and introduce a strategy to counteract the presence of passive eavesdroppers overhearing transmissions from a base station towards legitimate users that are facilitated by the presence of IRSs. Specifically, we envision a transmission scheme that cycles across a number of IRS-to-user assignments, and we select them in a near-optimal fashion, thus guaranteeing both a high data rate and a good secrecy rate. Unlike most of the existing works addressing passive eavesdropping, the strategy we envision has low complexity and is suitable for scenarios where nodes are equipped with a limited number of antennas. Through our performance evaluation, we highlight the trade-off between the legitimate users' data rate and secrecy rate, and how the system parameters affect such a trade-off.
Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability. Drawing inspiration from the knowledge-driven nature of human driving, we explore the question of how to instill similar capabilities into autonomous driving systems and summarize a paradigm that integrates an interactive environment, a driver agent, as well as a memory component to address this question. Leveraging large language models (LLMs) with emergent abilities, we propose the DiLu framework, which combines a Reasoning and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and evolve continuously. Extensive experiments prove DiLu's capability to accumulate experience and demonstrate a significant advantage in generalization ability over reinforcement learning-based methods. Moreover, DiLu is able to directly acquire experiences from real-world datasets which highlights its potential to be deployed on practical autonomous driving systems. To the best of our knowledge, we are the first to leverage knowledge-driven capability in decision-making for autonomous vehicles. Through the proposed DiLu framework, LLM is strengthened to apply knowledge and to reason causally in the autonomous driving domain. Project page: //pjlab-adg.github.io/DiLu/
Solely relying on test passing to evaluate Large Language Models (LLMs) for code synthesis may result in unfair assessment or promoting models with data leakage. As an alternative, we introduce CodeMind, a framework designed to gauge the code reasoning abilities of LLMs. CodeMind currently supports three code reasoning tasks: Independent Execution Reasoning (IER), Dependent Execution Reasoning (DER), and Specification Reasoning (SR). The first two evaluate models to predict the execution output of an arbitrary code or code the model could correctly synthesize. The third one evaluates the extent to which LLMs implement the specified expected behavior. Our extensive evaluation of nine LLMs across five benchmarks in two different programming languages using CodeMind shows that LLMs fairly follow control flow constructs and, in general, explain how inputs evolve to output, specifically for simple programs and the ones they can correctly synthesize. However, their performance drops for code with higher complexity, non-trivial logical and arithmetic operators, non-primitive types, and API calls. Furthermore, we observe that, while correlated, specification reasoning (essential for code synthesis) does not imply execution reasoning (essential for broader programming tasks such as testing and debugging): ranking LLMs based on test passing can be different compared to code reasoning.
Information retrieval is a rapidly evolving field. However it still faces significant limitations in the scientific and industrial vast amounts of information, such as semantic divergence and vocabulary gaps in sparse retrieval, low precision and lack of interpretability in semantic search, or hallucination and outdated information in generative models. In this paper, we introduce a two-block approach to tackle these hurdles for long documents. The first block enhances language understanding in sparse retrieval by query expansion to retrieve relevant documents. The second block deepens the result by providing comprehensive and informative answers to the complex question using only the information spread in the long document, enabling bidirectional engagement. At various stages of the pipeline, intermediate results are presented to users to facilitate understanding of the system's reasoning. We believe this bidirectional approach brings significant advancements in terms of transparency, logical thinking, and comprehensive understanding in the field of scientific information retrieval.
Eye tracking is routinely being incorporated into virtual reality (VR) systems. Prior research has shown that eye tracking data, if exposed, can be used for re-identification attacks. The state of our knowledge about currently existing privacy mechanisms is limited to privacy-utility trade-off curves based on data-centric metrics of utility, such as prediction error, and black-box threat models. We propose that for interactive VR applications, it is essential to consider user-centric notions of utility and a variety of threat models. We develop a methodology to evaluate real-time privacy mechanisms for interactive VR applications that incorporate subjective user experience and task performance metrics. We evaluate selected privacy mechanisms using this methodology and find that re-identification accuracy can be decreased to as low as 14% while maintaining a high usability score and reasonable task performance. Finally, we elucidate three threat scenarios (black-box, black-box with exemplars, and white-box) and assess how well the different privacy mechanisms hold up to these adversarial scenarios. This work advances the state of the art in VR privacy by providing a methodology for end-to-end assessment of the risk of re-identification attacks and potential mitigating solutions.
Gait recognition is one of the most promising video-based biometric technologies. The edge of silhouettes and motion are the most informative feature and previous studies have explored them separately and achieved notable results. However, due to occlusions and variations in viewing angles, their gait recognition performance is often affected by the predefined spatial segmentation strategy. Moreover, traditional temporal pooling usually neglects distinctive temporal information in gait. To address the aforementioned issues, we propose a novel gait recognition framework, denoted as GaitASMS, which can effectively extract the adaptive structured spatial representations and naturally aggregate the multi-scale temporal information. The Adaptive Structured Representation Extraction Module (ASRE) separates the edge of silhouettes by using the adaptive edge mask and maximizes the representation in semantic latent space. Moreover, the Multi-Scale Temporal Aggregation Module (MSTA) achieves effective modeling of long-short-range temporal information by temporally aggregated structure. Furthermore, we propose a new data augmentation, denoted random mask, to enrich the sample space of long-term occlusion and enhance the generalization of the model. Extensive experiments conducted on two datasets demonstrate the competitive advantage of proposed method, especially in complex scenes, i.e. BG and CL. On the CASIA-B dataset, GaitASMS achieves the average accuracy of 93.5\% and outperforms the baseline on rank-1 accuracies by 3.4\% and 6.3\%, respectively, in BG and CL. The ablation experiments demonstrate the effectiveness of ASRE and MSTA. The source code is available at //github.com/YanSungithub/GaitASMS.
With the extremely rapid advances in remote sensing (RS) technology, a great quantity of Earth observation (EO) data featuring considerable and complicated heterogeneity is readily available nowadays, which renders researchers an opportunity to tackle current geoscience applications in a fresh way. With the joint utilization of EO data, much research on multimodal RS data fusion has made tremendous progress in recent years, yet these developed traditional algorithms inevitably meet the performance bottleneck due to the lack of the ability to comprehensively analyse and interpret these strongly heterogeneous data. Hence, this non-negligible limitation further arouses an intense demand for an alternative tool with powerful processing competence. Deep learning (DL), as a cutting-edge technology, has witnessed remarkable breakthroughs in numerous computer vision tasks owing to its impressive ability in data representation and reconstruction. Naturally, it has been successfully applied to the field of multimodal RS data fusion, yielding great improvement compared with traditional methods. This survey aims to present a systematic overview in DL-based multimodal RS data fusion. More specifically, some essential knowledge about this topic is first given. Subsequently, a literature survey is conducted to analyse the trends of this field. Some prevalent sub-fields in the multimodal RS data fusion are then reviewed in terms of the to-be-fused data modalities, i.e., spatiospectral, spatiotemporal, light detection and ranging-optical, synthetic aperture radar-optical, and RS-Geospatial Big Data fusion. Furthermore, We collect and summarize some valuable resources for the sake of the development in multimodal RS data fusion. Finally, the remaining challenges and potential future directions are highlighted.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.