丰满人妻被公侵犯高清版,久久国产高清最新地址,无翼乌全彩工囗侵犯本子H,亚洲一级黄片A级不卡视频在线,制服诱惑中文字幕一区不卡

Large-scale pre-trained vision models (PVMs) have shown great potential for adaptability across various downstream vision tasks. However, with state-of-the-art PVMs growing to billions or even trillions of parameters, the standard full fine-tuning paradigm is becoming unsustainable due to high computational and storage demands. In response, researchers are exploring parameter-efficient fine-tuning (PEFT), which seeks to exceed the performance of full fine-tuning with minimal parameter modifications. This survey provides a comprehensive overview and future directions for visual PEFT, offering a systematic review of the latest advancements. First, we provide a formal definition of PEFT and discuss model pre-training methods. We then categorize existing methods into three categories: addition-based, partial-based, and unified-based. Finally, we introduce the commonly used datasets and applications and suggest potential future research challenges. A comprehensive collection of resources is available at //github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning.

相關內容

Vision

關注 4

MoDELS · 縮放 · Performer · AIM · 可理解性 ·

2024 年 3 月 21 日

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Zeyu Han,Chao Gao,Jinyang Liu, Jeff, Zhang,Sai Qian Zhang

from arxiv, 25 pages, 13 figures

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. Especially, the expansive scale and computational demands pose considerable challenges when customizing them for particular downstream tasks, particularly over the hardware platforms constrained by computational capabilities. Parameter Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapt the large models over the various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large models to adapt it to a specific task while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large language models with high parameter counts, as fine-tuning these models from scratch can be computationally expensive and resource-intensive, posing considerable challenges in the supporting system platform design. In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to the algorithmic perspective, we overview various real-world system designs to investigate the implementation costs associated with different PEFT algorithms. This survey serves as an indispensable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications.

MoDELS · 機器人 · 數據集 · Learning · 分離的 ·

2024 年 3 月 21 日

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration,Abby O'Neill,Abdul Rehman,Abhiram Maddukuri,Abhishek Gupta,Abhishek Padalkar,Abraham Lee,Acorn Pooley,Agrim Gupta,Ajay Mandlekar,Ajinkya Jain,Albert Tung,Alex Bewley,Alex Herzog,Alex Irpan,Alexander Khazatsky,Anant Rai,Anchit Gupta,Andrew Wang,Anikait Singh,Animesh Garg,Aniruddha Kembhavi,Annie Xie,Anthony Brohan,Antonin Raffin,Archit Sharma,Arefeh Yavary,Arhan Jain,Ashwin Balakrishna,Ayzaan Wahid,Ben Burgess-Limerick,Beomjoon Kim,Bernhard Sch?lkopf,Blake Wulfe,Brian Ichter,Cewu Lu,Charles Xu,Charlotte Le,Chelsea Finn,Chen Wang,Chenfeng Xu,Cheng Chi,Chenguang Huang,Christine Chan,Christopher Agia,Chuer Pan,Chuyuan Fu,Coline Devin,Danfei Xu,Daniel Morton,Danny Driess,Daphne Chen,Deepak Pathak,Dhruv Shah,Dieter Büchler,Dinesh Jayaraman,Dmitry Kalashnikov,Dorsa Sadigh,Edward Johns,Ethan Foster,Fangchen Liu,Federico Ceola,Fei Xia,Feiyu Zhao,Freek Stulp,Gaoyue Zhou,Gaurav S. Sukhatme,Gautam Salhotra,Ge Yan,Gilbert Feng,Giulio Schiavi,Glen Berseth,Gregory Kahn,Guanzhi Wang,Hao Su,Hao-Shu Fang,Haochen Shi,Henghui Bao,Heni Ben Amor,Henrik I Christensen,Hiroki Furuta,Homer Walke,Hongjie Fang,Huy Ha,Igor Mordatch,Ilija Radosavovic,Isabel Leal,Jacky Liang,Jad Abou-Chakra,Jaehyung Kim,Jaimyn Drake,Jan Peters,Jan Schneider,Jasmine Hsu,Jeannette Bohg,Jeffrey Bingham,Jeffrey Wu,Jensen Gao,Jiaheng Hu,Jiajun Wu,Jialin Wu,Jiankai Sun,Jianlan Luo,Jiayuan Gu,Jie Tan,Jihoon Oh,Jimmy Wu,Jingpei Lu,Jingyun Yang,Jitendra Malik,Jo?o Silvério,Joey Hejna,Jonathan Booher,Jonathan Tompson,Jonathan Yang,Jordi Salvador,Joseph J. Lim,Junhyek Han,Kaiyuan Wang,Kanishka Rao,Karl Pertsch,Karol Hausman,Keegan Go,Keerthana Gopalakrishnan,Ken Goldberg,Kendra Byrne,Kenneth Oslund,Kento Kawaharazuka,Kevin Black,Kevin Lin,Kevin Zhang,Kiana Ehsani,Kiran Lekkala,Kirsty Ellis,Krishan Rana,Krishnan Srinivasan,Kuan Fang,Kunal Pratap Singh,Kuo-Hao Zeng,Kyle Hatch,Kyle Hsu,Laurent Itti,Lawrence Yunliang Chen,Lerrel Pinto,Li Fei-Fei,Liam Tan,Linxi "Jim" Fan,Lionel Ott,Lisa Lee,Luca Weihs,Magnum Chen,Marion Lepert,Marius Memmel,Masayoshi Tomizuka,Masha Itkina,Mateo Guaman Castro,Max Spero,Maximilian Du,Michael Ahn,Michael C. Yip,Mingtong Zhang,Mingyu Ding,Minho Heo,Mohan Kumar Srirama,Mohit Sharma,Moo Jin Kim,Naoaki Kanazawa,Nicklas Hansen,Nicolas Heess,Nikhil J Joshi,Niko Suenderhauf,Ning Liu,Norman Di Palo,Nur Muhammad Mahi Shafiullah,Oier Mees,Oliver Kroemer,Osbert Bastani,Pannag R Sanketi,Patrick "Tree" Miller,Patrick Yin,Paul Wohlhart,Peng Xu,Peter David Fagan,Peter Mitrano,Pierre Sermanet,Pieter Abbeel,Priya Sundaresan,Qiuyu Chen,Quan Vuong,Rafael Rafailov,Ran Tian,Ria Doshi,Roberto Martín-Martín,Rohan Baijal,Rosario Scalise,Rose Hendrix,Roy Lin,Runjia Qian,Ruohan Zhang,Russell Mendonca,Rutav Shah,Ryan Hoque,Ryan Julian,Samuel Bustamante,Sean Kirmani,Sergey Levine,Shan Lin,Sherry Moore,Shikhar Bahl,Shivin Dass,Shubham Sonawani,Shuran Song,Sichun Xu,Siddhant Haldar,Siddharth Karamcheti,Simeon Adebola,Simon Guist,Soroush Nasiriany,Stefan Schaal,Stefan Welker,Stephen Tian,Subramanian Ramamoorthy,Sudeep Dasari,Suneel Belkhale,Sungjae Park,Suraj Nair,Suvir Mirchandani,Takayuki Osa,Tanmay Gupta,Tatsuya Harada,Tatsuya Matsushima,Ted Xiao,Thomas Kollar,Tianhe Yu,Tianli Ding,Todor Davchev,Tony Z. Zhao,Travis Armstrong,Trevor Darrell,Trinity Chung,Vidhi Jain,Vincent Vanhoucke,Wei Zhan,Wenxuan Zhou,Wolfram Burgard,Xi Chen,Xiaolong Wang,Xinghao Zhu,Xinyang Geng,Xiyuan Liu,Xu Liangwei,Xuanlin Li,Yao Lu,Yecheng Jason Ma,Yejin Kim,Yevgen Chebotar,Yifan Zhou,Yifeng Zhu,Yilin Wu,Ying Xu,Yixuan Wang,Yonatan Bisk,Yoonyoung Cho,Youngwoon Lee,Yuchen Cui,Yue Cao,Yueh-Hua Wu,Yujin Tang,Yuke Zhu,Yunchu Zhang,Yunfan Jiang,Yunshuang Li,Yunzhu Li,Yusuke Iwasawa,Yutaka Matsuo,Zehan Ma,Zhuo Xu,Zichen Jeff Cui,Zichen Zhang,Zipeng Lin

from arxiv, Project website: //robotics-transformer-x.github.io

Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website //robotics-transformer-x.github.io.

MoDELS · 正則化項 · Synopsys · Prompt · Attention ·

2024 年 3 月 20 日

VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis

Yumeng Li,William Beluch,Margret Keuper,Dan Zhang,Anna Khoreva

from arxiv, Project page: //yumengli007.github.io/VSTAR

Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis based on the original single prompt leveraging LLMs, which gives accurate textual guidance to different visual states of longer videos, and 2) Temporal Attention Regularization (TAR) - a regularization technique to refine the temporal attention units of the pre-trained T2V diffusion models, which enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models. We additionally analyze the temporal attention maps realized with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.

LIDAR · MoDELS · Processing（編程語言） · 3D · 縮放 ·

2024 年 3 月 20 日

Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion

Lucas Nunes,Rodrigo Marcuzzi,Benedikt Mersch,Jens Behley,Cyrill Stachniss

Computer vision techniques play a central role in the perception stack of autonomous vehicles. Such methods are employed to perceive the vehicle surroundings given sensor data. 3D LiDAR sensors are commonly used to collect sparse 3D point clouds from the scene. However, compared to human perception, such systems struggle to deduce the unseen parts of the scene given those sparse point clouds. In this matter, the scene completion task aims at predicting the gaps in the LiDAR measurements to achieve a more complete scene representation. Given the promising results of recent diffusion models as generative models for images, we propose extending them to achieve scene completion from a single 3D LiDAR scan. Previous works used diffusion models over range images extracted from LiDAR data, directly applying image-based diffusion methods. Distinctly, we propose to directly operate on the points, reformulating the noising and denoising diffusion process such that it can efficiently work at scene scale. Together with our approach, we propose a regularization loss to stabilize the noise predicted during the denoising process. Our experimental evaluation shows that our method can complete the scene given a single LiDAR scan as input, producing a scene with more details compared to state-of-the-art scene completion methods. We believe that our proposed diffusion process formulation can support further research in diffusion models applied to scene-scale point cloud data.

Integration · 多峰值 · contrastive · 損失 · MoDELS ·

2024 年 3 月 20 日

MEDBind: Unifying Language and Multimodal Medical Data Embeddings

Yuan Gao,Sangwook Kim,David E Austin,Chris McIntosh

Medical vision-language pretraining models (VLPM) have achieved remarkable progress in fusing chest X-rays (CXR) with clinical texts, introducing image-text data binding approaches that enable zero-shot learning and downstream clinical tasks. However, the current landscape lacks the holistic integration of additional medical modalities, such as electrocardiograms (ECG). We present MEDBind (Medical Electronic patient recorD), which learns joint embeddings across CXR, ECG, and medical text. Using text data as the central anchor, MEDBind features tri-modality binding, delivering competitive performance in top-K retrieval, zero-shot, and few-shot benchmarks against established VLPM, and the ability for CXR-to-ECG zero-shot classification and retrieval. This seamless integration is achieved through combination of contrastive loss on modality-text pairs with our proposed contrastive loss function, Edge-Modality Contrastive Loss, fostering a cohesive embedding space for CXR, ECG, and text. Finally, we demonstrate that MEDBind can improve downstream tasks by directly integrating CXR and ECG embeddings into a large-language model for multimodal prompt tuning.

ALT · MoDELS · Learning · Performer · 獨熱 ·

2024 年 3 月 19 日

Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition

Yifei Chen,Dapeng Chen,Ruijin Liu,Sai Zhou,Wenyuan Xue,Wei Peng

from arxiv, Accepted at CVPR 2024

Large-scale visual-language pre-trained models have achieved significant success in various video tasks. However, most existing methods follow an "adapt then align" paradigm, which adapts pre-trained image encoders to model video-level representations and utilizes one-hot or text embedding of the action labels for supervision. This paradigm overlooks the challenge of mapping from static images to complicated activity concepts. In this paper, we propose a novel "Align before Adapt" (ALT) paradigm. Prior to adapting to video representation learning, we exploit the entity-to-region alignments for each frame. The alignments are fulfilled by matching the region-aware image embeddings to an offline-constructed text corpus. With the aligned entities, we feed their text embeddings to a transformer-based video adapter as the queries, which can help extract the semantics of the most important entities from a video to a vector. This paradigm reuses the visual-language alignment of VLP during adaptation and tries to explain an action by the underlying entities. This helps understand actions by bridging the gap with complex activity semantics, particularly when facing unfamiliar or unseen categories. ALT demonstrates competitive performance while maintaining remarkably low computational costs. In fully supervised experiments, it achieves 88.1% top-1 accuracy on Kinetics-400 with only 4947 GFLOPs. Moreover, ALT outperforms the previous state-of-the-art methods in both zero-shot and few-shot experiments, emphasizing its superior generalizability across various learning scenarios.

蒸餾 · MoDELS · state-of-the-art · 基 · 模態 ·

2024 年 3 月 19 日

AnimateDiff-Lightning: Cross-Model Diffusion Distillation

Shanchuan Lin,Xiao Yang

We present AnimateDiff-Lightning for lightning-fast video generation. Our model uses progressive adversarial diffusion distillation to achieve new state-of-the-art in few-step video generation. We discuss our modifications to adapt it for the video modality. Furthermore, we propose to simultaneously distill the probability flow of multiple base diffusion models, resulting in a single distilled motion module with broader style compatibility. We are pleased to release our distilled AnimateDiff-Lightning model for the community's use.

MoDELS · Performer · INFORMS · 掩碼 · 語言模型化 ·

2024 年 3 月 18 日

Fisher Mask Nodes for Language Model Merging

Thennal D K,Ganesh Nathan,Suchithra M S

from arxiv, Accepted at LREC-COLING 2024

Fine-tuning pre-trained models provides significant advantages in downstream performance. The ubiquitous nature of pre-trained models such as BERT and its derivatives in natural language processing has also led to a proliferation of task-specific fine-tuned models. As these models typically only perform one task well, additional training or ensembling is required in multi-task scenarios. The growing field of model merging provides a solution, dealing with the challenge of combining multiple task-specific models into a single multi-task model. In this study, we introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning. Utilizing the Fisher information of mask nodes within the Transformer architecture, we devise a computationally efficient weighted-averaging scheme. Our method exhibits a regular and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging in a fraction of the computational cost, with baseline performance improvements of up to +6.5 and a speedup of 57.4x in the biggest model. Our results prove the potential of our method in current multi-task learning environments and suggest its scalability and adaptability to new model architectures and learning scenarios.

Prompt · MoDELS · 學成 · Extensibility · 向量化 ·

2022 年 3 月 10 日

Conditional Prompt Learning for Vision-Language Models

Kaiyang Zhou,Jingkang Yang,Chen Change Loy,Ziwei Liu

from arxiv, CVPR 2022. TL;DR: We propose a conditional prompt learning approach to solve the generalizability issue of static prompts

With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well. Code is available at //github.com/KaiyangZhou/CoOp.

圖 · 圖卷積 · 卷積 · Neural Networks · Networking ·

2018 年 6 月 6 日

Graph Convolutional Neural Networks for Web-Scale Recommender Systems

Rex Ying,Ruining He,Kaifeng Chen,Pong Eksombatchai,William L. Hamilton,Jure Leskovec

from arxiv, KDD 2018

Recent advancements in deep neural networks for graph-structured data have led to state-of-the-art performance on recommender system benchmarks. However, making these methods practical and scalable to web-scale recommendation tasks with billions of items and hundreds of millions of users remains a challenge. Here we describe a large-scale deep recommendation engine that we developed and deployed at Pinterest. We develop a data-efficient Graph Convolutional Network (GCN) algorithm PinSage, which combines efficient random walks and graph convolutions to generate embeddings of nodes (i.e., items) that incorporate both graph structure as well as node feature information. Compared to prior GCN approaches, we develop a novel method based on highly efficient random walks to structure the convolutions and design a novel training strategy that relies on harder-and-harder training examples to improve robustness and convergence of the model. We also develop an efficient MapReduce model inference algorithm to generate embeddings using a trained model. We deploy PinSage at Pinterest and train it on 7.5 billion examples on a graph with 3 billion nodes representing pins and boards, and 18 billion edges. According to offline metrics, user studies and A/B tests, PinSage generates higher-quality recommendations than comparable deep learning and graph-based alternatives. To our knowledge, this is the largest application of deep graph embeddings to date and paves the way for a new generation of web-scale recommender systems based on graph convolutional architectures.