亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.

相關內容

Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training on diverse large-scale datasets. However, these approaches still face two primary challenges: (i) how to universally integrate diverse data sources for end-to-end training, and (ii) how to effectively leverage the language-aware capability for region-level cross-modality understanding. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which pre-trains on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enable the language-aware ability of the model through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmark datasets, achieving state-of-the-art results with an AP of 50.6\% on the COCO dataset and 40.0\% on the LVIS dataset in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4\% AP, outperforming many existing methods with the same backbone. The code for OV-DINO will be available at \href{//github.com/wanghao9610/OV-DINO}{//github.com/wanghao9610/OV-DINO}.

We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring image segmentation (RIS). These pseudo supervisions allow the training of any supervised RIS methods without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, the naive incorporation of these models may generate non-distinctive expressions that do not distinctively refer to the target masks. To address this challenge, we propose two-fold strategies that generate distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model, to generate multiple expression candidates with detailed words focusing on the target. 2) 'distinctiveness-based text filtering' to further validate the candidates and filter out those with a low level of distinctiveness. These two strategies ensure that the generated text supervisions can distinguish the target from other objects, making them appropriate for the RIS annotations. Our method significantly outperforms both weakly and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, proving its capability to tackle the open-world challenge within RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.

Binary similarity involves determining whether two binary programs exhibit similar functionality, often originating from the same source code. In this work, we propose VexIR2Vec, an approach for binary similarity using VEX-IR, an architecture-neutral Intermediate Representation (IR). We extract the embeddings from sequences of basic blocks, termed peepholes, derived by random walks on the control-flow graph. The peepholes are normalized using transformations inspired by compiler optimizations. The VEX-IR Normalization Engine mitigates, with these transformations, the architectural and compiler-induced variations in binaries while exposing semantic similarities. We then learn the vocabulary of representations at the entity level of the IR using the knowledge graph embedding techniques in an unsupervised manner. This vocabulary is used to derive function embeddings for similarity assessment using VexNet, a feed-forward Siamese network designed to position similar functions closely and separate dissimilar ones in an n-dimensional space. This approach is amenable for both diffing and searching tasks, ensuring robustness against Out-Of-Vocabulary (OOV) issues. We evaluate VexIR2Vec on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled across 12 compilers targeting x86 and ARM architectures. In diffing experiments, VexIR2Vec outperforms the nearest baselines by $40\%$, $18\%$, $21\%$, and $60\%$ in cross-optimization, cross-compilation, cross-architecture, and obfuscation settings, respectively. In the searching experiment, VexIR2Vec achieves a mean average precision of $0.76$, outperforming the nearest baseline by $46\%$. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open-source tools. VexIR2Vec is $3.1$-$3.5 \times$ faster than the closest baselines and orders-of-magnitude faster than other tools.

The depth/thermal information is beneficial for detecting salient object with conventional RGB images. However, in dual-modal salient object detection (SOD) model, the robustness against noisy inputs and modality missing is crucial but rarely studied. To tackle this problem, we introduce \textbf{Co}nditional Dropout and \textbf{LA}nguage-driven(\textbf{CoLA}) framework comprising two core components. 1) Language-driven Quality Assessment (LQA): Leveraging a pretrained vision-language model with a prompt learner, the LQA recalibrates image contributions without requiring additional quality annotations. This approach effectively mitigates the impact of noisy inputs. 2) Conditional Dropout (CD): A learning method to strengthen the model's adaptability in scenarios with missing modalities, while preserving its performance under complete modalities. The CD serves as a plug-in training scheme that treats modality-missing as conditions, strengthening the overall robustness of various dual-modal SOD models. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art dual-modal SOD models, under both modality-complete and modality-missing conditions. We will release source code upon acceptance.

As the application of federated learning becomes increasingly widespread, the issue of imbalanced training data distribution has emerged as a significant challenge. Federated learning utilizes local data stored on different training clients for model training, rather than centralizing data on a server, thereby greatly enhancing the privacy and security of training data. However, the distribution of training data across different clients may be imbalanced, with different categories of data potentially residing on different clients. This presents a challenge to traditional federated learning, which assumes data distribution is independent and identically distributed (IID). This paper proposes a Blockchain-based Federated Learning Model for Non-IID Data (BFLN), which combines federated learning with blockchain technology. By introducing a new aggregation method and incentive algorithm, BFLN enhances the model performance of federated learning on non-IID data. Experiments on public datasets demonstrate that, compared to other state-of-the-art models, BFLN improves training accuracy and provides a sustainable incentive mechanism for personalized federated learning.

Model predictive control (MPC) has been applied to many platforms in robotics and autonomous systems for its capability to predict a system's future behavior while incorporating constraints that a system may have. To enhance the performance of a system with an MPC controller, one can manually tune the MPC's cost function. However, it can be challenging due to the possibly high dimension of the parameter space as well as the potential difference between the open-loop cost function in MPC and the overall closed-loop performance metric function. This paper presents DiffTune-MPC, a novel learning method, to learn the cost function of an MPC in a closed-loop manner. The proposed framework is compatible with the scenario where the time interval for performance evaluation and MPC's planning horizon have different lengths. We show the auxiliary problem whose solution admits the analytical gradients of MPC and discuss its variations in different MPC settings, including nonlinear MPCs that are solved using sequential quadratic programming. Simulation results demonstrate the learning capability of DiffTune-MPC and the generalization capability of the learned MPC parameters.

Multi-modal 3D scene understanding has gained considerable attention due to its wide applications in many areas, such as autonomous driving and human-computer interaction. Compared to conventional single-modal 3D understanding, introducing an additional modality not only elevates the richness and precision of scene interpretation but also ensures a more robust and resilient understanding. This becomes especially crucial in varied and challenging environments where solely relying on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over past three years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this article, we present a systematic survey of recent progress to bridge this gap. We begin by briefly introducing a background that formally defines various 3D multi-modal tasks and summarizes their inherent challenges. After that, we present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations. Furthermore, comparative results of recent approaches on several benchmark datasets, together with insightful analysis, are offered. Finally, we discuss the unresolved issues and provide several potential avenues for future research.

Multi-modal fusion is a fundamental task for the perception of an autonomous driving system, which has recently intrigued many researchers. However, achieving a rather good performance is not an easy task due to the noisy raw data, underutilized information, and the misalignment of multi-modal sensors. In this paper, we provide a literature review of the existing multi-modal-based methods for perception tasks in autonomous driving. Generally, we make a detailed analysis including over 50 papers leveraging perception sensors including LiDAR and camera trying to solve object detection and semantic segmentation tasks. Different from traditional fusion methodology for categorizing fusion models, we propose an innovative way that divides them into two major classes, four minor classes by a more reasonable taxonomy in the view of the fusion stage. Moreover, we dive deep into the current fusion methods, focusing on the remaining problems and open-up discussions on the potential research opportunities. In conclusion, what we expect to do in this paper is to present a new taxonomy of multi-modal fusion methods for the autonomous driving perception tasks and provoke thoughts of the fusion-based techniques in the future.

Visual information extraction (VIE) has attracted considerable attention recently owing to its various advanced applications such as document understanding, automatic marking and intelligent education. Most existing works decoupled this problem into several independent sub-tasks of text spotting (text detection and recognition) and information extraction, which completely ignored the high correlation among them during optimization. In this paper, we propose a robust visual information extraction system (VIES) towards real-world scenarios, which is a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction by taking a single document image as input and outputting the structured information. Specifically, the information extraction branch collects abundant visual and semantic representations from text spotting for multimodal feature fusion and conversely, provides higher-level semantic clues to contribute to the optimization of text spotting. Moreover, regarding the shortage of public benchmarks, we construct a fully-annotated dataset called EPHOIE (//github.com/HCIILAB/EPHOIE), which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE consists of 1,494 images of examination paper head with complex layouts and background, including a total of 15,771 Chinese handwritten or printed text instances. Compared with the state-of-the-art methods, our VIES shows significant superior performance on the EPHOIE dataset and achieves a 9.01% F-score gain on the widely used SROIE dataset under the end-to-end scenario.

Most object recognition approaches predominantly focus on learning discriminative visual patterns while overlooking the holistic object structure. Though important, structure modeling usually requires significant manual annotations and therefore is labor-intensive. In this paper, we propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions into the traditional framework. We show the recognition backbone can be substantially enhanced for more robust representation learning, without any cost of extra annotation and inference speed. Specifically, we first propose an object-extent learning module for localizing the object according to the visual patterns shared among the instances in the same category. We then design a spatial context learning module for modeling the internal structures of the object, through predicting the relative positions within the extent. These two modules can be easily plugged into any backbone networks during training and detached at inference time. Extensive experiments show that our look-into-object approach (LIO) achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft). We also show that this learning paradigm is highly generalizable to other tasks such as object detection and segmentation (MS COCO). Project page: //github.com/JDAI-CV/LIO.

北京阿比特科技有限公司