
Deformable image registration is crucial for aligning medical images in a non-linear fashion across different modalities, allowing for precise spatial correspondence between varying anatomical structures. This paper presents NestedMorph, a novel network utilizing a Nested Attention Fusion approach to improve intra-subject deformable registration between T1-weighted (T1w) MRI and diffusion MRI (dMRI) data. NestedMorph integrates high-resolution spatial details from an encoder with semantic information from a decoder using a multi-scale framework, enhancing both local and global feature extraction. Our model notably outperforms existing methods, including CNN-based approaches like VoxelMorph, MIDIR, and CycleMorph, as well as Transformer-based models such as TransMorph and ViT-V-Net, and traditional techniques like NiftyReg and SyN. Evaluations on the HCP dataset demonstrate that NestedMorph achieves superior performance across key metrics, including SSIM, HD95, and SDlogJ, with the highest SSIM of 0.89, and the lowest HD95 of 2.5 and SDlogJ of 0.22. These results highlight NestedMorph's ability to capture both local and global image features effectively, leading to superior registration performance. The promising outcomes of this study underscore NestedMorph's potential to significantly advance deformable medical image registration, providing a robust framework for future research and clinical applications. The source code and our implementation are available at: //bit.ly/3zdVqcg
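SDlogJ, one of the metrics reported above, is commonly computed as the standard deviation of the log Jacobian determinant of the deformation; lower values indicate a smoother field. The following is a minimal 2-D sketch under stated assumptions: the displacement field is stored as nested lists, derivatives use forward differences, and non-positive determinants are clamped. The function name `sd_log_jacobian` and all of these choices are illustrative and are not taken from the NestedMorph code.

```python
import math

def sd_log_jacobian(ux, uy):
    """Std. dev. of log det(J) for a 2-D displacement field (forward differences).

    ux[r][c] and uy[r][c] are displacements along x (columns) and y (rows),
    in pixel units. The deformation is phi(p) = p + u(p).
    """
    rows, cols = len(ux), len(ux[0])
    logs = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            dux_dx = ux[r][c + 1] - ux[r][c]
            dux_dy = ux[r + 1][c] - ux[r][c]
            duy_dx = uy[r][c + 1] - uy[r][c]
            duy_dy = uy[r + 1][c] - uy[r][c]
            # Jacobian of phi is the identity plus the gradient of u.
            det = (1 + dux_dx) * (1 + duy_dy) - dux_dy * duy_dx
            # Assumption: clamp non-positive determinants (folding) so log is defined.
            logs.append(math.log(max(det, 1e-9)))
    mean = sum(logs) / len(logs)
    # Population standard deviation; some implementations may use the sample form.
    return math.sqrt(sum((v - mean) ** 2 for v in logs) / len(logs))

zero = [[0.0] * 4 for _ in range(4)]
scale_x = [[0.1 * c for c in range(4)] for _ in range(4)]
scale_y = [[0.1 * r for _ in range(4)] for r in range(4)]
uneven_x = [[0.05 * c * c for c in range(4)] for _ in range(4)]

sd_identity = sd_log_jacobian(zero, zero)        # no deformation at all
sd_scaling = sd_log_jacobian(scale_x, scale_y)   # uniform scaling, constant det
sd_uneven = sd_log_jacobian(uneven_x, zero)      # stretch varies with position
```

A uniform scaling has the same determinant everywhere, so its SDlogJ is essentially zero even though the image is deformed; only spatial variation in local volume change raises the metric.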

Related content

Image registration


Image registration is a classic problem and a technical challenge in image processing research. Its goal is to compare or fuse images of the same object acquired under different conditions, for example images from different acquisition devices, taken at different times, or captured from different viewpoints; some applications also require registering images of different objects. Specifically, for two images in a dataset, registration seeks a spatial transformation that maps one image onto the other, so that points corresponding to the same spatial location in the two images are brought into one-to-one correspondence, which makes information fusion possible. The technique is widely used in computer vision, medical image processing, and materials mechanics. Depending on the application, some work focuses on fusing the two images through the resulting transformation, while other work studies the transformation itself to derive mechanical properties of the object.
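The idea of searching for a spatial transformation that maps one image onto another can be illustrated with a deliberately small example. The sketch below assumes 1-D signals, integer translations only, and a sum-of-squared-differences cost; deformable registration instead estimates a dense, non-linear displacement field, so this is an illustration of the principle rather than a practical method. All function names are hypothetical.

```python
def shift(signal, offset, fill=0.0):
    """Return `signal` translated by `offset` samples (positive moves right)."""
    n = len(signal)
    out = [fill] * n
    for i, value in enumerate(signal):
        j = i + offset
        if 0 <= j < n:
            out[j] = value
    return out

def ssd(a, b):
    """Sum of squared differences, used here as the dissimilarity cost."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def register_translation(moving, fixed, max_offset=5):
    """Brute-force search for the translation that best aligns moving to fixed."""
    candidates = range(-max_offset, max_offset + 1)
    return min(candidates, key=lambda k: ssd(shift(moving, k), fixed))

fixed = [0, 0, 0, 1, 3, 1, 0, 0, 0, 0]
moving = [0, 1, 3, 1, 0, 0, 0, 0, 0, 0]

best_offset = register_translation(moving, fixed)
aligned = shift(moving, best_offset)
```

Here the peak of `moving` sits two samples to the left of the peak of `fixed`, so the search recovers an offset of 2 and the aligned signal matches the fixed one.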

Latent · Processing (programming language) · International Digital Economy Academy (IDEA) · Disentangled · Motivation

November 12, 2024

LEO: Generative Latent Image Animator for Human Video Synthesis

Yaohui Wang, Xin Ma, Xinyuan Chen, Cunjian Chen, Antitza Dantcheva, Bo Dai, Yu Qiao

from arxiv, IJCV 2024, Project webpage: //wyhsirius.github.io/LEO-project/

Spatio-temporal coherency is a major challenge in synthesizing high quality videos, particularly in synthesizing human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, a separation of motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break the spatio-temporal coherency. Motivated by this, we here propose LEO, a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps, and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the datasets TaichiHD, FaceForensics and CelebV-HQ. In addition, the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis, as well as content-preserving video editing.

MoDELS · Origin · Integration · Extensibility · INFORMS

November 12, 2024

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, Gal Chechik

from arxiv, Project page is at //research.nvidia.com/labs/par/addit/

Adding objects to images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.

Estimation/Estimator · Multimodal · Language modeling · MoDELS · Optimizer

November 11, 2024

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

Junho Kim, Hyungjin Chung, Byung-Hoon Kim

Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring the use of text-based queries, where the need for support keypoints is eliminated. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method employs only a query image and detailed text descriptions as input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as choosing the optimal description for keypoints, neural network architectures, and training strategies. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state-of-the-art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.

Performer · INFORMS · Question answering · Large language models · Diversity

November 11, 2024

EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries

Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwanghyun Kim, Jeewon Yang, Seunghyun Won, Edward Choi

from arxiv, NeurIPS 2024 (Datasets and Benchmarks)

Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries. We offer EHRNoteQA in two formats: open-ended and multi-choice question answering, and propose a reliable evaluation method for each. We evaluate 27 LLMs using EHRNoteQA and examine various factors affecting the model performance (e.g., the length and number of discharge summaries). Furthermore, to validate EHRNoteQA as a reliable proxy for expert evaluations in clinical practice, we measure the correlation between the LLM performance on EHRNoteQA and the LLM performance manually evaluated by clinicians. Results show that LLM performance on EHRNoteQA has a higher correlation with clinician-evaluated performance (Spearman: 0.78, Kendall: 0.62) than performance on other benchmarks, demonstrating its practical relevance in evaluating LLMs in clinical settings.
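The Spearman and Kendall figures quoted above are rank correlations between two lists of per-model scores. As a minimal sketch of how such coefficients are computed, assuming no tied scores and using made-up numbers that are not from the paper:

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical scores for five models: benchmark accuracy vs. clinician rating.
benchmark_scores = [55.0, 62.5, 70.1, 81.3, 90.2]
clinician_scores = [50.0, 66.0, 61.0, 85.0, 88.0]

rho = spearman(benchmark_scores, clinician_scores)
tau = kendall_tau(benchmark_scores, clinician_scores)
```

With one swapped pair among five models, Spearman's rho is 0.9 and Kendall's tau is 0.8; Kendall's coefficient is typically the smaller of the two on the same data, which is consistent with the 0.78 and 0.62 reported in the abstract.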

entity · Unlabeled · Learning · CASES · Entity alignment

November 9, 2024

Lambda: Learning Matchable Prior For Entity Alignment with Unlabeled Dangling Cases

Hang Yin, Liyao Xiang, Dong Ding, Yuheng He, Yihan Wu, Xinbing Wang, Chenghu Zhou

from arxiv, Accepted in NeurIPS 2024 as a poster

We investigate the entity alignment (EA) problem with unlabeled dangling cases, meaning that some entities have no counterparts in the other knowledge graph (KG) and that these entities remain unlabeled. To address this challenge, we propose the framework Lambda for dangling detection and then entity alignment. Lambda features a GNN-based encoder called KEESA with spectral contrastive learning for EA and a positive-unlabeled learning algorithm for dangling detection called iPULE. iPULE offers theoretical guarantees of unbiasedness, uniform deviation bounds, and convergence. Experimental results demonstrate that each component contributes to overall performance that is superior to baselines, even when baselines additionally exploit 30% of dangling entities labeled for training.

MoDELS · Language modeling · Performer · Learning · Dataset

November 8, 2024

NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts

Yen-Ting Lin, Chao-Han Huck Yang, Zhehuai Chen, Piotr Zelasko, Xuesong Yang, Zih-Ching Chen, Krishna C Puvvada, Szu-Wei Fu, Ke Hu, Jun Wei Chiu, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang

from arxiv, NeKo work has been done in June 2024. NeKo LMs will be open source on //huggingface.co/nvidia under the MIT license

Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets by learning to route each dataset's tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that we achieve new state-of-the-art performance, with an average relative 5.0% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reduction in the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
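The phrase "route each dataset's tokens to its mapped expert" describes a routing target in which the task of origin decides which expert should receive a token. The sketch below shows only that target assignment, as a hard lookup; in the model described above a router is trained to produce it, and nothing here reproduces that training. The task names, the mapping `TASK_TO_EXPERT`, and the function `route_batch` are hypothetical.

```python
from collections import defaultdict

# Hypothetical task-to-expert mapping; not taken from the NeKo paper.
TASK_TO_EXPERT = {"speech": 0, "translation": 1, "vision": 2}

def route_batch(examples):
    """Group tokens by expert index according to each example's task label.

    `examples` is a list of (task, tokens) pairs. The result maps an expert
    index to all tokens that the task mapping assigns to that expert.
    """
    per_expert = defaultdict(list)
    for task, tokens in examples:
        per_expert[TASK_TO_EXPERT[task]].extend(tokens)
    return dict(per_expert)

batch = [
    ("speech", ["helo", "wrld"]),
    ("vision", ["c4t"]),
    ("speech", ["teh"]),
]
assignment = route_batch(batch)
```

Tokens from both speech examples end up with expert 0, the vision example goes to expert 2, and expert 1 receives nothing because the batch contains no translation data.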

MoDELS · Similarity · Image captioning · Score · Cost

November 8, 2024

Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models

Jia-Hong Huang, Hongyi Zhu, Yixian Shen, Stevan Rudinac, Evangelos Kanoulas

from arxiv, arXiv admin note: substantial text overlap with arXiv:2408.01723

Evaluating the quality of automatically generated image descriptions is a complex task that requires metrics capturing various dimensions, such as grammaticality, coverage, accuracy, and truthfulness. Although human evaluation provides valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics like BLEU, ROUGE, METEOR, and CIDEr attempt to fill this gap, but they often exhibit weak correlations with human judgment. To address this challenge, we propose a novel evaluation framework called Image2Text2Image, which leverages diffusion models, such as Stable Diffusion or DALL-E, for text-to-image generation. In the Image2Text2Image framework, an input image is first processed by a selected image captioning model, chosen for evaluation, to generate a textual description. Using this generated description, a diffusion model then creates a new image. By comparing features extracted from the original and generated images, we measure their similarity using a designated similarity metric. A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies, revealing potential weaknesses in the model's performance. Notably, our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models. Extensive experiments and human evaluations validate the efficacy of our proposed Image2Text2Image evaluation framework. The code and dataset will be published to support further research in the community.
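The evaluation loop described above (caption the image, regenerate an image from the caption, compare features of the two images) can be sketched as a small function. This is a sketch under stated assumptions: the captioning model, diffusion model, and feature extractor are passed in as placeholder callables, and cosine similarity stands in for the "designated similarity metric", which the abstract does not name. None of the names below come from the authors' code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def evaluate_captioner(image, caption_fn, generate_fn, feature_fn):
    """Score a captioning model without reference captions.

    caption_fn:  image -> text   (the model under evaluation)
    generate_fn: text -> image   (a text-to-image model)
    feature_fn:  image -> vector (an image feature extractor)
    """
    caption = caption_fn(image)
    regenerated = generate_fn(caption)
    return cosine_similarity(feature_fn(image), feature_fn(regenerated))

# Stand-ins so the sketch runs: an "image" is already a feature vector.
original = [1.0, 0.0, 1.0]
identity_features = lambda img: img
faithful = evaluate_captioner(
    original, lambda img: "a cat", lambda text: [1.0, 0.0, 1.0], identity_features
)
unfaithful = evaluate_captioner(
    original, lambda img: "a car", lambda text: [0.0, 1.0, 0.0], identity_features
)
```

With these stand-ins, the caption whose regenerated image matches the original scores close to 1, while the mismatched caption scores 0, mirroring the high-score and low-score cases described in the abstract.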

MoDELS · Image segmentation · Mamba · Vision · UNet

November 8, 2024

VM-UNet: Vision Mamba UNet for Medical Image Segmentation

Jiacheng Ruan, Jincheng Li, Suncheng Xiang

from arxiv, 9 pages, 5 figures, 6 tables. Work in progress

In the realm of medical image segmentation, both CNN-based and Transformer-based models have been extensively explored. However, CNNs exhibit limitations in long-range modeling capabilities, whereas Transformers are hampered by their quadratic computational complexity. Recently, State Space Models (SSMs), exemplified by Mamba, have emerged as a promising approach. They not only excel in modeling long-range interactions but also maintain a linear computational complexity. In this paper, leveraging state space models, we propose a U-shape architecture model for medical image segmentation, named Vision Mamba UNet (VM-UNet). Specifically, the Visual State Space (VSS) block is introduced as the foundation block to capture extensive contextual information, and an asymmetrical encoder-decoder structure is constructed with fewer convolution layers to save calculation cost. We conduct comprehensive experiments on the ISIC17, ISIC18, and Synapse datasets, and the results indicate that VM-UNet performs competitively in medical image segmentation tasks. To the best of our knowledge, this is the first medical image segmentation model built on a purely SSM-based model. We aim to establish a baseline and provide valuable insights for the future development of more efficient and effective SSM-based segmentation systems. Our code is available at //github.com/JCruan519/VM-UNet.

Learning · Image segmentation · Deep models · Lecture notes · Reviewer

July 28, 2022

Learning with Limited Annotations: A Survey on Deep Semi-Supervised Learning for Medical Image Segmentation

Rushi Jiao, Yichi Zhang, Le Ding, Rong Cai, Jicong Zhang

Medical image segmentation is a fundamental and critical step in many image-guided clinical approaches. Recent success of deep learning-based segmentation methods usually relies on a large amount of labeled data, which is particularly difficult and costly to obtain, especially in the medical imaging domain where only experts can provide reliable and accurate annotations. Semi-supervised learning has emerged as an appealing strategy and has been widely applied to medical image segmentation tasks to train deep models with limited annotations. In this paper, we present a comprehensive review of recently proposed semi-supervised learning methods for medical image segmentation and summarize both the technical novelties and empirical results. Furthermore, we analyze and discuss the limitations and several unsolved problems of existing approaches. We hope this review can inspire the research community to explore solutions for this challenge and further promote developments in the field of medical image segmentation.

Graph · INTERACT · Comprehensibility · Extensibility · Learned

December 16, 2021

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

from arxiv, AAAI 2022

Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted in the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods and prove the efficacy of each proposed component.