亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis. SurGen produces videos with the highest resolution and longest duration among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.

相關內容

ACM/IEEE第23屆模型驅動工程語言和系統國際會議,是模型驅動軟件和系統工程的首要會議系列,由ACM-SIGSOFT和IEEE-TCSE支持組織。自1998年以來,模型涵蓋了建模的各個方面,從語言和方法到工具和應用程序。模特的參加者來自不同的背景,包括研究人員、學者、工程師和工業專業人士。MODELS 2019是一個論壇,參與者可以圍繞建模和模型驅動的軟件和系統交流前沿研究成果和創新實踐經驗。今年的版本將為建模社區提供進一步推進建模基礎的機會,并在網絡物理系統、嵌入式系統、社會技術系統、云計算、大數據、機器學習、安全、開源等新興領域提出建模的創新應用以及可持續性。 官網鏈接: · Guidance · 控制器 · state-of-the-art · HTTPS ·
2024 年 11 月 4 日

We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: //johannakarras.github.io/Fashion-VDM.

Community detection plays a pivotal role in uncovering closely connected subgraphs, aiding various real-world applications such as recommendation systems and anomaly detection. With the surge of rich information available for entities in real-world networks, the community detection problem in attributed networks has attracted widespread attention. While previous research has effectively leveraged network topology and attribute information for attributed community detection, these methods overlook two critical issues: (i) the semantic similarity between node attributes within the community, and (ii) the inherent mesoscopic structure, which differs from the pairwise connections of the micro-structure. To address these limitations, we propose HACD, a novel attributed community detection model based on heterogeneous graph attention networks. HACD treats node attributes as another type of node, constructs attributed networks into heterogeneous graph structures and employs attribute-level attention mechanisms to capture semantic similarity. Furthermore, HACD introduces a community membership function to explore mesoscopic community structures, enhancing the robustness of detected communities. Extensive experiments demonstrate the effectiveness and efficiency of HACD, outperforming state-of-the-art methods in attributed community detection tasks. Our code is publicly available at //github.com/Anniran1/HACD1-wsdm.

In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended from image-based generative models in a zero-shot manner or necessitated extensive fine-tuning, which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the editing guidance, leading to ambiguities and limiting the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V can also support any video length. Our evaluation shows that AnyV2V achieved CLIP-scores comparable to other baseline methods. Furthermore, AnyV2V significantly outperformed these baselines in human evaluations, demonstrating notable improvements in visual consistency with the source video while producing high-quality edits across all editing tasks.

Integration of diverse visual prompts like clicks, scribbles, and boxes in interactive image segmentation significantly facilitates users' interaction as well as improves interaction efficiency. However, existing studies primarily encode the position or pixel regions of prompts without considering the contextual areas around them, resulting in insufficient prompt feedback, which is not conducive to performance acceleration. To tackle this problem, this paper proposes a simple yet effective Probabilistic Visual Prompt Unified Transformer (PVPUFormer) for interactive image segmentation, which allows users to flexibly input diverse visual prompts with the probabilistic prompt encoding and feature post-processing to excavate sufficient and robust prompt features for performance boosting. Specifically, we first propose a Probabilistic Prompt-unified Encoder (PPuE) to generate a unified one-dimensional vector by exploring both prompt and non-prompt contextual information, offering richer feedback cues to accelerate performance improvement. On this basis, we further present a Prompt-to-Pixel Contrastive (P$^2$C) loss to accurately align both prompt and pixel features, bridging the representation gap between them to offer consistent feature representations for mask prediction. Moreover, our approach designs a Dual-cross Merging Attention (DMA) module to implement bidirectional feature interaction between image and prompt features, generating notable features for performance improvement. A comprehensive variety of experiments on several challenging datasets demonstrates that the proposed components achieve consistent improvements, yielding state-of-the-art interactive segmentation performance. Our code is available at //github.com/XuZhang1211/PVPUFormer.

Online video platforms have gained increased popularity due to their ability to support information consumption and sharing and the diverse social interactions they afford. Danmaku, a real-time commentary feature that overlays user comments on a video, has been found to improve user engagement, however, the use of Danmaku can lead to toxic behaviors and inappropriate comments. To address these issues, we propose a proactive moderation approach inspired by Impact Captions, a visual technique used in East Asian variety shows. Impact Captions combine textual content and visual elements to construct emotional and cognitive resonance. Within the context of this work, Impact Captions were used to guide viewers towards positive Danmaku-related activities and elicit more pro-social behaviors. Leveraging Impact Captions, we developed DanModCap, an moderation tool that collected and analyzed Danmaku and used it as input to large generative language models to produce Impact Captions. Our evaluation of DanModCap demonstrated that Impact Captions reduced negative antagonistic emotions, increased users' desire to share positive content, and elicited self-control in Danmaku social action to fostering proactive community maintenance behaviors. Our approach highlights the benefits of using LLM-supported content moderation methods for proactive moderation in a large-scale live content contexts.

Adapting pretrained image-based diffusion models to generate temporally consistent videos has become an impactful generative modeling research direction. Training-free noise-space manipulation has proven to be an effective technique, where the challenge is to preserve the Gaussian white noise distribution while adding in temporal consistency. Recently, Chang et al. (2024) formulated this problem using an integral noise representation with distribution-preserving guarantees, and proposed an upsampling-based algorithm to compute it. However, while their mathematical formulation is advantageous, the algorithm incurs a high computational cost. Through analyzing the limiting-case behavior of their algorithm as the upsampling resolution goes to infinity, we develop an alternative algorithm that, by gathering increments of multiple Brownian bridges, achieves their infinite-resolution accuracy while simultaneously reducing the computational cost by orders of magnitude. We prove and experimentally validate our theoretical claims, and demonstrate our method's effectiveness in real-world applications. We further show that our method readily extends to the 3-dimensional space.

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, improving the long video captioning score from 2.00 to 3.26 (out of 5), achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack. LongVILA-7B demonstrates strong accuracy on the VideoMME benchmark, i.e., 61.8% with subtitle. Besides, MM-SP is 2.1x - 5.7x faster than ring style sequence parallelism and 1.1x - 1.4x faster than Megatron with a hybrid context and tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.

Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing, changing subjects, and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN suggests a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, simplifying precise video content editing based on text for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. (3) RACCooN also plans to imagine new objects in a given video, so users simply prompt the model to receive a detailed video editing plan for complex video editing. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.

Holographic displays hold the promise of providing authentic depth cues, resulting in enhanced immersive visual experiences for near-eye applications. However, current holographic displays are hindered by speckle noise, which limits accurate reproduction of color and texture in displayed images. We present HoloChrome, a polychromatic holographic display framework designed to mitigate these limitations. HoloChrome utilizes an ultrafast, wavelength-adjustable laser and a dual-Spatial Light Modulator (SLM) architecture, enabling the multiplexing of a large set of discrete wavelengths across the visible spectrum. By leveraging spatial separation in our dual-SLM setup, we independently manipulate speckle patterns across multiple wavelengths. This novel approach effectively reduces speckle noise through incoherent averaging achieved by wavelength multiplexing. Our method is complementary to existing speckle reduction techniques, offering a new pathway to address this challenge. Furthermore, the use of polychromatic illumination broadens the achievable color gamut compared to traditional three-color primary holographic displays. Our simulations and tabletop experiments validate that HoloChrome significantly reduces speckle noise and expands the color gamut. These advancements enhance the performance of holographic near-eye displays, moving us closer to practical, immersive next-generation visual experiences.

Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both image global structures and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output. During the above multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can be also applied for conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditioned and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we achieved state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.

北京阿比特科技有限公司