In recent years, there have been significant advancements in 3D reconstruction and dense RGB-D SLAM systems. One notable development is the application of Neural Radiance Fields (NeRF) in these systems, which utilizes implicit neural representation to encode 3D scenes. This extension of NeRF to SLAM has shown promising results. However, the depth images obtained from consumer-grade RGB-D sensors are often sparse and noisy, which poses significant challenges for 3D reconstruction and affects the accuracy of the representation of the scene geometry. Moreover, the original hierarchical feature grid with occupancy value is inaccurate for scene geometry representation. Furthermore, the existing methods select random pixels for camera tracking, which leads to inaccurate localization and is not robust in real-world indoor environments. To this end, we present NeSLAM, an advanced framework that achieves accurate and dense depth estimation, robust camera tracking, and realistic synthesis of novel views. First, a depth completion and denoising network is designed to provide dense geometry prior and guide the neural implicit representation optimization. Second, the occupancy scene representation is replaced with Signed Distance Field (SDF) hierarchical scene representation for high-quality reconstruction and view synthesis. Furthermore, we also propose a NeRF-based self-supervised feature tracking algorithm for robust real-time tracking. Experiments on various indoor datasets demonstrate the effectiveness and accuracy of the system in reconstruction, tracking quality, and novel view synthesis.

Related Content

Robustness

MoDELS · Performer · state-of-the-art · HTTPS · Image Restoration

May 21, 2024

Ship in Sight: Diffusion Models for Ship-Image Super Resolution

Luigi Sigillo, Riccardo Fosco Gramaccioni, Alessandro Nicolosi, Danilo Comminiello

from arxiv, Accepted at 2024 International Joint Conference on Neural Networks (IJCNN)

In recent years, remarkable advancements have been achieved in the field of image generation, primarily driven by the escalating demand for high-quality outcomes across various image generation subtasks, such as inpainting, denoising, and super resolution. A major effort is devoted to exploring the application of super-resolution techniques to enhance the quality of low-resolution images. In this context, our method explores in depth the problem of ship image super resolution, which is crucial for coastal and port surveillance. We investigate the opportunity given by the growing interest in text-to-image diffusion models, taking advantage of the prior knowledge that such foundation models have already learned. In particular, we present a diffusion-model-based architecture that leverages text conditioning during training while being class-aware, to best preserve the crucial details of the ships during the generation of the super-resolved image. Given the specificity of this task and the scarcity of off-the-shelf data, we also introduce a large labeled ship dataset scraped from online ship images, mostly from the ShipSpotting website (www.shipspotting.com). Our method achieves more robust results than other deep learning models previously employed for super resolution, as shown by multiple experiments. Moreover, we investigate how this model can benefit downstream tasks, such as classification and object detection, thus emphasizing practical implementation in a real-world scenario. Experimental results show flexibility, reliability, and impressive performance of the proposed framework over state-of-the-art methods for different tasks. The code is available at https://github.com/LuigiSigillo/ShipinSight.

Offset · General Dynamics · NeRF · Dataset · Performer

May 21, 2024

Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos

Seoha Kim, Jeongmin Bae, Youngsik Yun, Hahyun Lee, Gun Bang, Youngjung Uh

from arxiv, I need to revise the text (it takes more than a month)

Recent advancements in 4D scene reconstruction using neural radiance fields (NeRF) have demonstrated the ability to represent dynamic scenes from multi-view videos. However, they fail to reconstruct the dynamic scenes and struggle to fit even the training views in unsynchronized settings. This happens because they employ a single latent embedding for a frame, while the multi-view images at the same frame were actually captured at different moments. To address this limitation, we introduce time offsets for individual unsynchronized videos and jointly optimize the offsets with NeRF. By design, our method is applicable to various baselines and improves them by large margins. Furthermore, finding the offsets naturally synchronizes the videos without manual effort. Experiments are conducted on the common Plenoptic Video Dataset and a newly built Unsynchronized Dynamic Blender Dataset to verify the performance of our method. Project page: https://seoha-kim.github.io/sync-nerf
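The per-video time offset idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the scene is a one-dimensional sine signal, the names `scene_signal`, `observe`, and `fit_offset` are hypothetical, and the offset is recovered by a simple grid search rather than by the joint gradient-based optimization with NeRF that the abstract describes.

```python
import math

def scene_signal(t):
    """Toy stand-in for a dynamic scene: one scalar value per time (assumption)."""
    return math.sin(2.0 * math.pi * t)

def observe(camera_offset, frame_times):
    """Values a camera records when its frames are shifted by camera_offset
    relative to the shared timeline."""
    return [scene_signal(t + camera_offset) for t in frame_times]

def fit_offset(observations, frame_times, candidates):
    """Pick the candidate offset whose shifted signal best matches the observations.
    Grid search is a swapped-in simplification, not the paper's optimizer."""
    def loss(offset):
        return sum((scene_signal(t + offset) - obs) ** 2
                   for t, obs in zip(frame_times, observations))
    return min(candidates, key=loss)

frame_times = [i / 30.0 for i in range(30)]        # nominal frame timestamps in seconds
true_offsets = {"cam0": 0.0, "cam1": 0.05, "cam2": -0.03}
candidates = [k / 100.0 for k in range(-10, 11)]   # search grid in 10 ms steps

# One offset per video; fitting it aligns that video with the shared timeline.
estimated = {
    name: fit_offset(observe(offset, frame_times), frame_times, candidates)
    for name, offset in true_offsets.items()
}
```

In this toy setting, each estimated offset matches the shift that was used to generate the corresponding observations, which mirrors the abstract's point that finding the offsets also synchronizes the videos.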

Large Language Models · Performer · Inference · Estimation/Estimator · Optimizer

May 21, 2024

Vidur: A Large-Scale Simulation Framework For LLM Inference

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov

Optimizing the deployment of large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring the large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur, a large-scale, high-fidelity, easily extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration, which would require 42K GPU hours and cost about 218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.
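As a quick check on the cost figures quoted above, the hourly GPU price they imply can be worked out directly. Both inputs are the rounded values from the abstract, so the result is only approximate, and the variable names are chosen here for illustration.

```python
# Rounded figures quoted in the abstract.
gpu_hours = 42_000          # "42K GPU hours"
total_cost_usd = 218_000    # "~218K dollars"

# Implied average price per GPU hour for the deployment-based exploration.
implied_rate = total_cost_usd / gpu_hours
print(f"{implied_rate:.2f} USD per GPU hour")  # prints "5.19 USD per GPU hour"
```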

Robustness · TOOLS · Performer · Notability · Exchangeable

May 20, 2024

AgentScope: A Flexible yet Robust Multi-Agent Platform

Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, Liuyi Yao, Hongyi Peng, Zeyu Zhang, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, Jingren Zhou

from arxiv, We have released code on https://github.com/modelscope/agentscope

With the rapid advancement of Large Language Models (LLMs), significant progress has been made in multi-agent applications. However, the complexities in coordinating agents' cooperation and LLMs' erratic performance pose notable challenges in developing robust and efficient multi-agent applications. To tackle these challenges, we propose AgentScope, a developer-centric multi-agent platform with message exchange as its core communication mechanism. The abundant syntactic tools, built-in agents and service functions, user-friendly interfaces for application demonstration and utility monitoring, zero-code programming workstation, and automatic prompt tuning mechanism significantly lower the barriers to both development and deployment. To support robust and flexible multi-agent applications, AgentScope provides both built-in and customizable fault tolerance mechanisms. At the same time, it also offers system-level support for managing and utilizing multi-modal data, tools, and external knowledge. Additionally, we design an actor-based distribution framework, enabling easy conversion between local and distributed deployments and automatic parallel optimization without extra effort. With these features, AgentScope empowers developers to build applications that fully realize the potential of intelligent agents. We have released AgentScope at https://github.com/modelscope/agentscope, and hope AgentScope invites wider participation and innovation in this fast-moving field.

Color · MoDELS · GroupViT · Vision · Language Modeling

May 19, 2024

ColorFoil: Investigating Color Blindness in Large Vision and Language Models

Ahnaf Mozib Samin, M. Firoz Ahmed, Md. Mushtaq Shahriyar Rafee

With the use of the Transformer architecture, large Vision and Language (V&L) models have shown promising performance even in zero-shot settings. Several studies, however, indicate a lack of robustness of the models when dealing with complex linguistic and visual attributes. In this work, we introduce a novel V&L benchmark, ColorFoil, which uses color-related foils to assess the models' ability to perceive colors such as red, white, and green. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower, in a zero-shot setting and present intriguing findings from the V&L models. The experimental evaluation indicates that ViLT and BridgeTower demonstrate much better color perception capabilities than CLIP, its variants, and GroupViT. Moreover, CLIP-based models and GroupViT struggle to distinguish colors that are visually distinct to humans with normal color perception ability.

Multimodal · Mixture of Experts · MoDELS · Scaling · Generalization Theory

May 18, 2024

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang

from arxiv, 22 pages, 13 figures. Project Website: https://uni-moe.github.io/. Work in progress

Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to efficiently scale large language and image-text models, these efforts typically involve fewer experts and limited modalities. To address this, our work presents a pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance the multi-expert collaboration and generalization, we present a progressive training strategy: 1) cross-modality alignment using various connectors with different cross-modality data, 2) training modality-specific experts with cross-modality instruction data to activate experts' preferences, and 3) tuning the Uni-MoE framework utilizing Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. The extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization. Our findings highlight the substantial potential of MoE frameworks in advancing MLLMs, and the code is available at https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.
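For readers unfamiliar with sparse MoE layers, the general routing idea can be sketched as follows. This is a generic illustration under stated assumptions, not the Uni-MoE implementation: the top-2 selection, the softmax renormalization over the chosen experts, the toy scaling experts, and the names `softmax` and `sparse_moe` are all hypothetical choices made here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_moe(token, router_logits, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs (assumed scheme)."""
    # Indices of the top_k highest router scores; ties keep their original order.
    chosen = sorted(range(len(experts)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    gates = softmax([router_logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)   # only the chosen experts are evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

# Toy experts: each one scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
output, chosen = sparse_moe([1.0, -1.0],
                            router_logits=[0.1, 2.0, -1.0, 2.0],
                            experts=experts)
```

Because only `top_k` experts run per token, compute per token stays roughly constant as the number of experts grows, which is the usual motivation for sparse routing.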

API · Few-Shot Learning · Anomaly Detection · Learning · Identifiable

May 18, 2024

Few-Shot API Attack Detection: Overcoming Data Scarcity with GAN-Inspired Learning

Udi Aharon, Revital Marbel, Ran Dubin, Amit Dvir, Chen Hajaj

from arxiv, 8 pages, 2 figures, 7 tables

Web applications and APIs face constant threats from malicious actors seeking to exploit vulnerabilities for illicit gains. These threats necessitate robust anomaly detection systems capable of identifying malicious API traffic efficiently despite limited and diverse datasets. This paper proposes a novel few-shot detection approach motivated by Natural Language Processing (NLP) and advanced Generative Adversarial Network (GAN)-inspired techniques. Leveraging state-of-the-art Transformer architectures, particularly RoBERTa, our method enhances the contextual understanding of API requests, leading to improved anomaly detection compared to traditional methods. We showcase the technique's versatility by demonstrating its effectiveness with both Out-of-Distribution (OOD) and Transformer-based binary classification methods on two distinct datasets: CSIC 2010 and ATRDF 2023. Our evaluations reveal consistently enhanced or, at worst, equivalent detection rates across various metrics in most vectors, highlighting the promise of our approach for improving API security.

SLAM · Loop · Performance · state-of-the-art · Integration

May 18, 2024

NGM-SLAM: Gaussian Splatting SLAM with Radiance Field Submap

Mingrui Li, Jingwei Huang, Lei Sun, Aaron Xuxiang Tian, Tianchen Deng, Hongyu Wang

from arxiv, 9 pages, 4 figures

Gaussian Splatting has garnered widespread attention due to its exceptional performance. Consequently, SLAM systems based on Gaussian Splatting have emerged, leveraging its capabilities for rapid real-time rendering and high-fidelity mapping. However, current Gaussian Splatting SLAM systems usually struggle with large scene representation and lack effective loop closure adjustments and scene generalization capabilities. To address these issues, we introduce NGM-SLAM, the first GS-SLAM system that utilizes neural radiance field submaps for progressive scene expression, effectively integrating the strengths of neural radiance fields and 3D Gaussian Splatting. We use neural implicit submaps as supervision and achieve high-quality scene expression and online loop closure adjustments through Gaussian rendering of fused submaps. Our results on multiple real-world scenes and large-scale scene datasets demonstrate that our method achieves accurate gap filling and high-quality scene expression, supports monocular, stereo, and RGB-D inputs, and attains state-of-the-art scene reconstruction and tracking performance.

Range · Robot · Extensibility · Manipulation · TOOLS

May 17, 2024

YORI: Autonomous Cooking System Utilizing a Modular Robotic Kitchen and a Dual-Arm Proprioceptive Manipulator

Donghun Noh, Hyunwoo Nam, Kyle Gillespie, Yeting Liu, Dennis Hong

from arxiv, This manuscript is 13 pages long, includes 10 figures, and cites 20 references. It is to be submitted

This article introduces the development and implementation of the Yummy Operations Robot Initiative (YORI), an innovative, autonomous robotic cooking system. YORI marks a major advancement in culinary automation, adept at handling a diverse range of cooking tasks, capable of preparing multiple dishes simultaneously, and offering the flexibility to adapt to an extensive array of culinary activities. This versatility is achieved through the use of custom tools and appliances operated by a dual-arm manipulator utilizing proprioceptive actuators. The use of proprioceptive actuators enables fast yet precise movements, while allowing for accurate force control and effectively mitigating the inevitable impacts encountered in cooking. These factors underscore this technology's boundless potential. A key to YORI's adaptability is its modular kitchen design, which can be easily adapted to accommodate a continuously increasing range of culinary tasks. This article provides a comprehensive look at YORI's design process and highlights its role in revolutionizing the culinary world by enhancing efficiency, consistency, and versatility in food preparation.

Comprehensibility · Multimodal · MoDELS · Extensibility · Performer

February 15, 2020

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, Ming Zhou

We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video and language pre-training using narrated instructional videos. Unlike these works, which pre-train only for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components: two single-modal encoders, a cross encoder, and a decoder, all with the Transformer backbone. We first pre-train our model to learn a universal representation for both video and language on a large instructional video dataset. Then we fine-tune the model on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method improves the performance of both understanding and generation tasks and achieves state-of-the-art results.