Current slot-oriented approaches for compositional scene segmentation from images and videos rely on provided background information or slot assignments. We present a segmented location and identity tracking system, Loci-Segmented (Loci-s), which requires neither. It learns to dynamically segment scenes into an interpretable background encoding and slot-based object encodings, separating RGB, mask, location, and depth information for each. The results reveal largely superior video decomposition performance on the MOVi datasets and on another established dataset collection targeting scene segmentation. The system's well-interpretable, compositional latent encodings may serve as a foundation model for downstream tasks.
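For illustration, the sketch below shows how per-slot RGB, mask, and depth encodings of the kind described above could be composited into a frame together with a background estimate. The shapes, the depth-based weighting, and all variable names are our own assumptions, not Loci-s's actual decoder.

```python
import numpy as np

# Hypothetical shapes: K object slots, each with its own RGB, mask, and depth estimate.
H, W, K = 64, 64, 4
rng = np.random.default_rng(0)

background = rng.random((H, W, 3))       # background RGB estimate
slot_rgb   = rng.random((K, H, W, 3))    # per-slot RGB reconstruction
slot_mask  = rng.random((K, H, W))       # per-slot mask confidence in [0, 1]
slot_depth = rng.random((K, H, W))       # per-slot depth estimate (smaller = closer)

# Occlusion-aware compositing: weight each slot by its mask and by proximity,
# then give the leftover probability mass to the background.
visibility = slot_mask * np.exp(-slot_depth)
alpha = visibility / (visibility.sum(axis=0, keepdims=True) + 1.0)  # "+1" reserves mass for the background
bg_alpha = 1.0 - alpha.sum(axis=0)

frame = (alpha[..., None] * slot_rgb).sum(axis=0) + bg_alpha[..., None] * background
print(frame.shape)  # (64, 64, 3)
```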
The emergence of diffusion models has revolutionized the field of image generation, providing new methods for creating high-quality, high-resolution images across various applications. However, the potential of these models for generating domain-specific images, particularly remote sensing (RS) images, remains largely untapped. RS images, which are notable for their high resolution, extensive coverage, and rich information content, bring new challenges that general diffusion models may not adequately address. This paper proposes CRS-Diff, a pioneering diffusion modeling framework specifically tailored for generating remote sensing imagery, leveraging the inherent advantages of diffusion models while integrating advanced control mechanisms to ensure that the imagery is not only visually clear but also enriched with geographic and temporal information. The model integrates global and local control inputs, enabling precise combinations of generation conditions to refine the generation process. A comprehensive evaluation of CRS-Diff demonstrates its superior capability to generate RS imagery under both single and multiple conditions compared with previous methods, in terms of image quality and diversity.
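As a rough illustration of combining global and local control inputs, the following minimal PyTorch module adds an encoded spatial control map and a projected global embedding to an intermediate feature map. The layer sizes, shapes, and fusion-by-addition scheme are assumptions made for this sketch, not the CRS-Diff architecture.

```python
import torch
import torch.nn as nn

class GlobalLocalControl(nn.Module):
    """Illustrative fusion of one global and one local condition into a feature map."""
    def __init__(self, feat_ch=64, local_ch=3, global_dim=512):
        super().__init__()
        self.local_enc = nn.Sequential(            # encode a spatial control map (e.g., a sketch or mask)
            nn.Conv2d(local_ch, feat_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )
        self.global_proj = nn.Linear(global_dim, feat_ch)  # project a global embedding (e.g., text/metadata)

    def forward(self, feats, local_map, global_emb):
        feats = feats + self.local_enc(local_map)                       # add spatial guidance
        feats = feats + self.global_proj(global_emb)[:, :, None, None]  # broadcast global guidance
        return feats

x = torch.randn(1, 64, 32, 32)
ctrl = GlobalLocalControl()
y = ctrl(x, torch.randn(1, 3, 32, 32), torch.randn(1, 512))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```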
Large text-to-image models have achieved astonishing performance in synthesizing diverse and high-quality images guided by text. With detail-oriented conditioning control, even finer-grained spatial control can be achieved. However, some generated images still appear unreasonable, even with plentiful object features and a harmonious style. In this paper, we delve into the underlying causes and find that deep-level logical information, serving as common-sense knowledge, plays a significant role in understanding and processing images. Nonetheless, almost all models have neglected the importance of logical relations in images, resulting in poor performance in this aspect. Following this observation, we propose LogicalDefender, which combines images with the logical knowledge already summarized by humans in text. This encourages models to learn logical knowledge faster and better and, concurrently, extracts widely applicable logical knowledge from both images and human knowledge. Experiments show that our model achieves better logical performance, and the extracted logical knowledge can be effectively applied to other scenarios.
Underwater video enhancement (UVE) aims to improve the visibility and frame quality of underwater videos, which has significant implications for marine research and exploration. However, existing methods primarily focus on image enhancement algorithms that process each frame independently, and there is a lack of supervised datasets and models specifically tailored for UVE. To fill this gap, we construct the Synthetic Underwater Video Enhancement (SUVE) dataset, comprising 840 diverse underwater-style videos paired with ground-truth reference videos. Based on this dataset, we train a novel underwater video enhancement model, UVENet, which exploits inter-frame relationships to achieve better enhancement performance. Through extensive experiments on both synthetic and real underwater videos, we demonstrate the effectiveness of our approach. To our knowledge, this study represents the first comprehensive exploration of UVE. The code is available at //anonymous.4open.science/r/UVENet.
We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, which constrains their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide an appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.
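The alternating structure of such an algorithm can be sketched as a simple loop in which one role predicts the object's appearance at target viewpoints and the other corrects the geometry. In the paper both roles are played by a pretrained diffusion model; the placeholder functions below are hypothetical stubs that only make the control flow runnable.

```python
import numpy as np

def render_novel_view(obj_rgb, depth, pose):
    # Appearance prior (hypothetical stub): predict the object's look from `pose`
    # using the current depth estimate; in the paper this is a diffusion-model call.
    return obj_rgb

def geometry_critic(views, depth):
    # Geometry critic (hypothetical stub): nudge the shape so the sampled views
    # stay 3D-consistent; also a diffusion-model role in the paper.
    return depth

obj_rgb = np.zeros((64, 64, 3))          # selected object crop
depth = np.ones((64, 64))                # initial depth estimate
poses = [np.eye(4) for _ in range(4)]    # target viewpoints around the object

for _ in range(3):                       # alternate the two roles for a few rounds
    views = [render_novel_view(obj_rgb, depth, p) for p in poses]
    depth = geometry_critic(views, depth)
```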
Learning-based image compression methods have emerged as state of the art, showing higher performance than conventional compression solutions. These data-driven approaches learn the parameters of a neural network model through iterative training on large amounts of data. The optimization process typically involves minimizing the distortion between the decoded images and the original ground truth. This paper focuses on perceptual optimization of learning-based image compression solutions and proposes: i) a novel loss function to be used during training, and ii) a novel subjective test methodology that aims to evaluate decoded image fidelity. According to experimental results from subjective tests conducted with the new methodology, the optimization procedure can enhance image quality at low rates while offering no advantage at high rates.
This work presents an effective depth-consistency self-prompt Transformer for image dehazing. It is motivated by the observation that the estimated depth of an image containing haze residuals differs from that of its clear counterpart. Enforcing depth consistency between dehazed images and clear ones is therefore essential for dehazing. For this purpose, we develop a prompt based on the features of depth differences between hazy input images and their corresponding clear counterparts, which can guide dehazing models toward better restoration. Specifically, we first combine deep features extracted from the input images with the depth-difference features to generate a prompt that carries the haze-residual information of the input. Then we propose a prompt embedding module, designed to perceive the haze residuals, that linearly adds the prompt to the deep features. Further, we develop an effective prompt attention module that pays more attention to haze residuals for better removal. By incorporating the prompt, prompt embedding, and prompt attention into an encoder-decoder network based on VQGAN, we achieve better perceptual quality. As the depths of clear images are not available at inference, and a dehazed image produced by a single feed-forward pass may still contain haze residuals, we propose a new continuous self-prompt inference scheme that iteratively corrects the dehazing model toward better haze-free image generation. Extensive experiments show that our method performs favorably against state-of-the-art approaches on both synthetic and real-world datasets in terms of perceptual metrics, including NIQE, PI, and PIQE.
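A toy sketch of the prompt idea, under our own assumptions about shapes and layer choices rather than the paper's exact modules: the prompt is generated from depth-difference-modulated features and injected by linear addition to the deep features.

```python
import torch
import torch.nn as nn

class SelfPromptStep(nn.Module):
    """Toy prompt generation + linear prompt embedding (sizes are assumptions)."""
    def __init__(self, ch=64):
        super().__init__()
        self.prompt_gen = nn.Conv2d(ch, ch, 3, padding=1)  # turns depth-difference-modulated features into a prompt

    def forward(self, deep_feats, depth_hazy, depth_dehazed):
        depth_diff = depth_hazy - depth_dehazed            # where haze residuals distort the depth estimate
        prompt = self.prompt_gen(deep_feats * depth_diff)  # prompt carries haze-residual information
        return deep_feats + prompt                         # "prompt embedding" = linear addition

feats = torch.randn(1, 64, 32, 32)
d_hazy, d_dehazed = torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
out = SelfPromptStep()(feats, d_hazy, d_dehazed)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```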
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancements across several dimensions: by adopting Shifted Window Attention with zero initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; hypothesizing that images may contain redundant tokens, we use token similarity to filter out redundant tokens and retain the significant ones, which not only shortens the token sequence but also improves the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability. The model also learns to perform screenshot tasks through finetuning. Evaluation on 12 benchmarks shows notable improvements: 5.2% in Scene Text-Centric tasks (including STVQA, TextVQA, and OCRVQA), 6.9% in Document-Oriented tasks (such as DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, and WikiTableQuestions), and 2.8% in Key Information Extraction tasks (comprising FUNSD, SROIE, and POIE). It also improves scene text spotting by 10.9% and sets a new standard on OCRBench, a comprehensive benchmark consisting of 29 OCR-related assessments, with a score of 561, surpassing previous open-sourced large multimodal models for document understanding. Code will be released at //github.com/Yuliang-Liu/Monkey.
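A minimal sketch of similarity-based token filtering, assuming ViT-style patch tokens; the ranking rule and keep ratio below are our own simplifications rather than TextMonkey's exact criterion.

```python
import torch

def filter_redundant_tokens(tokens, keep_ratio=0.5):
    """Rank tokens by how similar they are to all other tokens and keep the most distinctive ones."""
    t = torch.nn.functional.normalize(tokens, dim=-1)       # (N, D) unit vectors
    sim = t @ t.T                                           # pairwise cosine similarity
    redundancy = (sim.sum(dim=-1) - 1.0) / (len(t) - 1)     # mean similarity to the other tokens
    n_keep = max(1, int(keep_ratio * len(t)))
    keep = torch.topk(-redundancy, n_keep).indices          # least redundant tokens survive
    return tokens[keep.sort().values]                       # preserve the original token order

tokens = torch.randn(196, 768)                              # e.g., 14x14 patch tokens
print(filter_redundant_tokens(tokens).shape)                # torch.Size([98, 768])
```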
We introduce EGIC, an enhanced generative image compression method that allows traversing the distortion-perception curve efficiently from a single model. EGIC is based on two novel building blocks: i) OASIS-C, a conditional pre-trained semantic segmentation-guided discriminator, which provides both spatially and semantically aware gradient feedback to the generator, conditioned on the latent image distribution, and ii) Output Residual Prediction (ORP), a retrofit solution for multi-realism image compression that allows control over the synthesis process by adjusting the impact of the residual between an MSE-optimized and a GAN-optimized decoder output on the GAN-based reconstruction. Together, these components make EGIC a powerful codec, outperforming state-of-the-art diffusion- and GAN-based methods (e.g., HiFiC, MS-ILLM, and DIRAC-100) while performing almost on par with VTM-20.0 on the distortion end. EGIC is simple to implement, very lightweight, and provides excellent interpolation characteristics, which makes it a promising candidate for practical applications targeting the low-bit-rate range.
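The ORP control knob can be illustrated as a simple blend in which a weight scales the impact of the residual between the two reconstructions; the function and variable names below are ours, and the actual method predicts the residual rather than running two decoders.

```python
import numpy as np

def orp_blend(gan_out, mse_out, alpha):
    """alpha = 0 keeps the GAN reconstruction (perception end);
    alpha = 1 moves fully toward the MSE reconstruction (distortion end)."""
    return gan_out + alpha * (mse_out - gan_out)

gan_out = np.random.rand(64, 64, 3)           # GAN-optimized decoder output
mse_out = np.random.rand(64, 64, 3)           # MSE-optimized decoder output
mid_point = orp_blend(gan_out, mse_out, 0.5)  # one point on the distortion-perception curve
```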
Robustly estimating camera poses from a set of images is a fundamental task that remains challenging for differentiable methods, especially in the case of small and sparse camera pose graphs. To overcome this challenge, we propose Pose-refined Rotation Averaging Graph Optimization (PRAGO). From a set of objectness detections on unordered images, our method reconstructs the rotational pose, and in turn the absolute pose, in a differentiable manner, benefiting from the optimization of a sequence of geometrical tasks. We show how the objectness pose-refinement module in PRAGO resolves the inherent ambiguities in pairwise relative pose estimation without removing edges and without making early decisions on the viability of graph edges. PRAGO then refines the absolute rotations through iterative graph construction, reweighting the graph edges to compute the final rotational pose, which can be converted into absolute poses using translation averaging. We show that PRAGO outperforms non-differentiable solvers on small and sparse scenes extracted from 7-Scenes, achieving a relative improvement of 21% for rotations while producing comparable translation estimates.
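A rough, non-differentiable toy of rotation averaging on a tiny pose graph, assuming SciPy's Rotation utilities: absolute rotations are initialized by chaining relative rotations and then re-estimated with consistency-based edge reweighting. This only illustrates the general idea, not PRAGO's differentiable pipeline.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Toy chain graph: edge (i, j) stores the relative rotation R_ij with R_j = R_ij * R_i.
rng = np.random.default_rng(0)
gt = [R.from_rotvec(rng.normal(size=3) * 0.3) for _ in range(4)]   # hidden ground-truth absolute rotations
edges = {(i, i + 1): gt[i + 1] * gt[i].inv() for i in range(3)}

# 1) Initialize absolute rotations (up to a global gauge) by chaining from node 0.
abs_rot = [R.identity()]
for i in range(3):
    abs_rot.append(edges[(i, i + 1)] * abs_rot[i])

# 2) One reweighted averaging sweep: re-estimate each node from its incident edges,
#    down-weighting edges that disagree with the current estimate.
for j in range(1, 4):
    preds, weights = [], []
    for (a, b), rel in edges.items():
        if b == j:
            pred = rel * abs_rot[a]
        elif a == j:
            pred = rel.inv() * abs_rot[b]
        else:
            continue
        err = (pred * abs_rot[j].inv()).magnitude()   # angular disagreement with the current estimate
        preds.append(pred)
        weights.append(1.0 / (err + 1e-3))
    abs_rot[j] = R.concatenate(preds).mean(weights=np.asarray(weights))
```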
We present MMKG, a collection of three knowledge graphs that contain both numerical features and (links to) images for all entities, as well as entity alignments between pairs of KGs. Therefore, the multi-relational link prediction and entity matching communities can benefit from this resource. We believe this dataset has the potential to facilitate the development of novel multi-modal learning approaches for knowledge graphs. We validate the utility of MMKG in the sameAs link prediction task with an extensive set of experiments. These experiments show that the task at hand benefits from learning over multiple feature types.