Vision Transformers have attracted considerable attention since the successful application of the Vision Transformer (ViT) to vision tasks. With vision Transformers, specifically their multi-head self-attention modules, networks can inherently capture long-term dependencies. However, these attention modules normally need to be trained on large datasets, and vision Transformers trained from scratch on small datasets perform worse than widely used backbones such as ResNets. Note that the Transformer model was first proposed for natural language processing, where the input carries denser information than natural images. To boost the performance of vision Transformers on small datasets, this paper proposes to explicitly increase the input information density in the frequency domain. Specifically, we select channels by calculating channel-wise heatmaps in the frequency domain using the Discrete Cosine Transform (DCT), which reduces the size of the input while keeping most of the information and hence increases the information density. As a result, 25% fewer channels are kept while better performance is achieved compared with previous work. Extensive experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets: CIFAR-10/100, SVHN, Flowers-102, and Tiny ImageNet. Accuracy is boosted by up to 17.05% with Swin and Focal Transformers. Code is available at //github.com/xiangyu8/DenseVT.
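The channel-selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 8x8 block size, the use of the mean absolute DCT coefficient as a stand-in for the channel-wise "heatmap", and the function names `channel_energy` and `select_channels` are all assumptions made for illustration. Each position (u, v) inside a block is treated as one frequency channel, and only the highest-scoring channels are kept.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2-D DCT-II: transform rows first, then columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[i][j] for i in range(len(rows))])
            for j in range(len(rows[0]))]
    # cols[v][u] holds the coefficient at vertical frequency u, horizontal frequency v
    return [[cols[v][u] for v in range(len(cols))] for u in range(len(cols[0]))]

def channel_energy(image, b=8):
    """Mean |coefficient| of each of the b*b frequency channels over all blocks.

    Assumption: this average magnitude plays the role of a per-channel heatmap.
    """
    h, w = len(image), len(image[0])
    energy = [[0.0] * b for _ in range(b)]
    n_blocks = 0
    for top in range(0, h - b + 1, b):
        for left in range(0, w - b + 1, b):
            block = [row[left:left + b] for row in image[top:top + b]]
            coeffs = dct_2d(block)
            for u in range(b):
                for v in range(b):
                    energy[u][v] += abs(coeffs[u][v])
            n_blocks += 1
    return [[e / n_blocks for e in row] for row in energy]

def select_channels(energy, keep):
    """Return the (u, v) indices of the `keep` highest-energy channels."""
    flat = [(energy[u][v], (u, v))
            for u in range(len(energy)) for v in range(len(energy[0]))]
    flat.sort(key=lambda t: t[0], reverse=True)
    return [idx for _, idx in flat[:keep]]
```

For a constant image all energy lands in the DC channel (0, 0), so `select_channels(channel_energy(image), 1)` returns that index. A real pipeline would compute the scores over a dataset rather than a single image.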

Related content

INFORMS

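This journal publishes high-quality papers that broaden the scope of operations research and computing. It seeks original research papers on theory, methods, experiments, systems, and applications, novel survey and tutorial papers, and papers describing new and useful software tools.

Robot · Controller · Scaling · Transformer

December 13, 2022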

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan,Noah Brown,Justice Carbajal,Yevgen Chebotar,Joseph Dabis,Chelsea Finn,Keerthana Gopalakrishnan,Karol Hausman,Alex Herzog,Jasmine Hsu,Julian Ibarz,Brian Ichter,Alex Irpan,Tomas Jackson,Sally Jesmonth,Nikhil J Joshi,Ryan Julian,Dmitry Kalashnikov,Yuheng Kuang,Isabel Leal,Kuang-Huei Lee,Sergey Levine,Yao Lu,Utsav Malla,Deeksha Manjunath,Igor Mordatch,Ofir Nachum,Carolina Parada,Jodilyn Peralta,Emily Perez,Karl Pertsch,Jornell Quiambao,Kanishka Rao,Michael Ryoo,Grecia Salazar,Pannag Sanketi,Kevin Sayed,Jaspiar Singh,Sumedh Sontakke,Austin Stone,Clayton Tan,Huong Tran,Vincent Vanhoucke,Steve Vega,Quan Vuong,Fei Xia,Ted Xiao,Peng Xu,Sichun Xu,Tianhe Yu,Brianna Zitkovich

From arXiv. See the website at robotics-transformer.github.io

By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing, or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity, based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io.

Layer · Mixing · INTERACT · MoDELS · Transformer

December 13, 2022

OAMixer: Object-aware Mixing Layer for Vision Transformers

Hyunwoo Kang,Sangwoo Mo,Jinwoo Shin

From arXiv. CVPR Transformers for Vision Workshop 2022. First two authors contributed equally

Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive results on various visual recognition tasks as alternatives to classic convolutional networks. While the initial patch-based models (ViTs) treated all patches equally, recent studies reveal that incorporating inductive bias such as spatiality benefits the representations. However, most prior works focused solely on the location of patches, overlooking the scene structure of images. Thus, we aim to further guide the interaction of patches using object information. Specifically, we propose OAMixer (object-aware mixing layer), which calibrates the patch mixing layers of patch-based models based on the object labels. Here, we obtain the object labels in an unsupervised or weakly-supervised manner, i.e., no additional human annotation cost is necessary. Using the object labels, OAMixer computes a reweighting mask with a learnable scale parameter that intensifies the interaction of patches containing similar objects, and applies the mask to the patch mixing layers. By learning an object-centric representation, we demonstrate that OAMixer improves the classification accuracy and background robustness of various patch-based models, including ViTs, MLP-Mixers, and ConvMixers. Moreover, we show that OAMixer enhances various downstream tasks, including large-scale classification, self-supervised learning, and multi-object recognition, verifying the generic applicability of OAMixer.
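A rough sketch of the reweighting step described above, under stated assumptions: the mask is taken to be exp(-kappa * d), where d is the total-variation distance between two patches' object-label distributions and `kappa` stands in for the learnable scale. The abstract does not give the exact formula, so the distance, the exponential form, and the function names are illustrative choices rather than the paper's definition.

```python
import math

def reweighting_mask(labels, kappa):
    """Pairwise mask in (0, 1]; patches with similar object labels get values near 1.

    labels: one probability distribution over objects per patch.
    kappa:  non-negative scale (learnable in a real model; a plain float here).
    """
    def dist(a, b):
        # Total-variation distance between two label distributions, in [0, 1].
        return sum(abs(x - y) for x, y in zip(a, b)) / 2.0

    n = len(labels)
    return [[math.exp(-kappa * dist(labels[i], labels[j])) for j in range(n)]
            for i in range(n)]

def apply_mask(mixing, mask):
    """Multiply a row-stochastic patch-mixing matrix by the mask and renormalize rows."""
    out = []
    for row, mrow in zip(mixing, mask):
        weighted = [a * m for a, m in zip(row, mrow)]
        total = sum(weighted)
        out.append([w / total for w in weighted])
    return out
```

With three patches where the first two share an object and the third does not, applying the mask to a uniform mixing matrix shifts weight from the third patch toward the first two. Setting `kappa` to 0 leaves the mixing matrix unchanged.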

Masked autoencoder (MAE) · Transformer · MoDELS · Mask · Autoencoder

December 13, 2022

Masked autoencoders are effective solution to transformer data-hungry

Jiawei Mao,Honggu Zhou,Xuesong Yin,Yuanqi Chang,Binling Nie,Rui Xu

Vision Transformers (ViTs) outperform convolutional neural networks (CNNs) in several vision tasks thanks to their global modeling capabilities. However, ViT lacks the inductive bias inherent to convolution, so it requires a large amount of training data. As a result, ViT does not perform as well as CNNs on small datasets such as those in medicine and science. We found experimentally that masked autoencoders (MAE) can make the transformer focus more on the image itself, thus alleviating the data-hungry issue of ViT to some extent. Yet the current MAE model is too complex, which causes over-fitting on small datasets. This still leaves a gap between MAEs trained on small datasets and advanced CNN models. Therefore, we investigated how to reduce the decoder complexity in MAE and found a more suitable architectural configuration for small datasets. In addition, we designed a location prediction task and a contrastive learning task to introduce localization and invariance characteristics into MAE. Our contrastive learning task not only enables the model to learn high-level visual information but also allows the training of MAE's class token, which most MAE improvement efforts do not consider. Extensive experiments show that our method achieves state-of-the-art performance on standard small datasets as well as medical datasets with few samples, compared with current popular masked image modeling (MIM) methods and vision transformers for small datasets. The code and models are available at //github.com/Talented-Q/SDMAE.

Masked autoencoder (MAE) · Transformer · MoDELS · Mask · Autoencoder

December 12, 2022

Masked autoencoders is an effective solution to transformer data-hungry

Jiawei Mao,Honggu Zhou,Xuesong Yin,Yuanqi Chang,Binling Nie,Rui Xu

Anomaly detection · Transformer · Performer · Normalized · CASE

December 9, 2022

ADTR: Anomaly Detection Transformer with Feature Reconstruction

Zhiyuan You,Kai Yang,Wenhan Luo,Lei Cui,Yu Zheng,Xinyi Le

From arXiv. Accepted by ICONIP 2022

Anomaly detection using only prior knowledge from normal samples is attracting more attention because anomaly samples are scarce. Existing CNN-based pixel reconstruction approaches suffer from two concerns. First, the reconstruction source and target are raw pixel values that contain indistinguishable semantic information. Second, CNNs tend to reconstruct both normal samples and anomalies well, making them still hard to distinguish. In this paper, we propose Anomaly Detection TRansformer (ADTR), which applies a transformer to reconstruct pre-trained features. The pre-trained features contain distinguishable semantic information. Also, adopting a transformer limits how well anomalies can be reconstructed, so anomalies can be detected easily once the reconstruction fails. Moreover, we propose novel loss functions to make our approach compatible with both the normal-sample-only case and the anomaly-available case, with image-level and pixel-level labeled anomalies. The performance can be further improved by adding simple synthetic or external irrelevant anomalies. Extensive experiments are conducted on anomaly detection datasets including MVTec-AD and CIFAR-10. Our method achieves superior performance compared with all baselines.
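The detection principle (an anomaly shows up where feature reconstruction fails) can be sketched as follows. The scoring rule is an assumption for illustration: the abstract does not specify it, so this sketch uses the squared L2 distance between each pre-trained feature vector and its reconstruction as the per-location score, and the maximum over locations as the image-level score. The function names are hypothetical.

```python
def anomaly_map(features, reconstruction):
    """Per-location squared L2 reconstruction error.

    features, reconstruction: H x W x C nested lists of floats, where
    `features` would come from a frozen pre-trained backbone and
    `reconstruction` from the transformer (both are simply inputs here).
    """
    return [
        [sum((f - r) ** 2 for f, r in zip(fvec, rvec))
         for fvec, rvec in zip(frow, rrow)]
        for frow, rrow in zip(features, reconstruction)
    ]

def image_score(amap):
    """Image-level anomaly score: the largest per-location error (an assumed choice)."""
    return max(max(row) for row in amap)
```

A location whose features are reconstructed exactly scores 0, while a poorly reconstructed location stands out in the map and drives the image-level score.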

3D · Dataset · Extensibility · Continuity · Performer

December 9, 2022

FLAG3D: A 3D Fitness Activity Dataset with Language Instruction

Yansong Tang,Jinpeng Liu,Aoyang Liu,Bin Yang,Wenxun Dai,Yongming Rao,Jiwen Lu,Jie Zhou,Xiu Li

With its continuously growing popularity around the world, fitness activity analytics has become an emerging research topic in computer vision. While a variety of new tasks and algorithms have been proposed recently, there is a growing hunger for data resources with high-quality data, fine-grained labels, and diverse environments. In this paper, we present FLAG3D, a large-scale 3D fitness activity dataset with language instruction, containing 180K sequences of 60 categories. FLAG3D features the following three aspects: 1) accurate and dense 3D human pose captured with an advanced MoCap system to handle complex activities and large movements, 2) detailed and professional language instruction describing how to perform a specific activity, 3) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. Extensive experiments and in-depth analysis show that FLAG3D offers great research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation. Our dataset and source code will be publicly available at //andytang15.github.io/FLAG3D.

Transformer · Sparse · Extensibility · MoDELS · INFORMS

December 8, 2022

Transformer Inertial Poser: Real-time Human Motion Reconstruction from Sparse IMUs with Simultaneous Terrain Generation

Yifeng Jiang,Yuting Ye,Deepak Gopinath,Jungdam Won,Alexander W. Winkler,C. Karen Liu

From arXiv. SIGGRAPH Asia 2022. Video: //youtu.be/rXb6SaXsnc0. Code: //github.com/jyf588/transformer-inertial-poser

Real-time human motion reconstruction from a sparse set of (e.g. six) wearable IMUs provides a non-intrusive and economical approach to motion capture. Because position information cannot be acquired directly from IMUs, recent works took data-driven approaches that utilize large human motion datasets to tackle this under-determined problem. Still, challenges remain, such as temporal consistency, drifting of global and joint motions, and diverse coverage of motion types on various terrains. We propose a novel method to simultaneously estimate full-body motion and generate plausible visited terrain from only six IMU sensors in real time. Our method incorporates 1. a conditional Transformer decoder model that gives consistent predictions by explicitly reasoning about prediction history, 2. a simple yet general learning target named "stationary body points" (SBPs), which can be stably predicted by the Transformer model and utilized by analytical routines to correct joint and global drifting, and 3. an algorithm to generate regularized terrain height maps from noisy SBP predictions, which can in turn correct noisy global motion estimation. We evaluate our framework extensively on synthesized and real IMU data, and with real-time live demos, and show superior performance over strong baseline methods.

Networking · MoDELS · Convolutional neural network · Transformer · DNN

August 30, 2021

A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Yucheng Zhao,Guangting Wang,Chuanxin Tang,Chong Luo,Wenjun Zeng,Zheng-Jun Zha

Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models will be made publicly available.

Learned · Surrogate loss · Online · Bandits · Bandit (slot machine)

December 31, 2019

A Modern Introduction to Online Learning

Francesco Orabona

In this monograph, I introduce the basic concepts of Online Learning through a modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are dealt with through convex surrogate losses and through randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. These notes do not require prior knowledge of convex analysis and all the required mathematical tools are rigorously explained. Moreover, all the proofs have been carefully chosen to be as simple and as short as possible.
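As a concrete instance of regret minimization, the sketch below runs projected online gradient descent (Online Mirror Descent with the squared Euclidean norm as regularizer) on a one-dimensional problem. The setup is chosen for illustration and is not taken from the monograph: the learner plays a point in [lo, hi], the loss at round t is (x - z_t)^2 with z_t assumed to lie in the same interval, and the step size D / (G * sqrt(t)) uses the interval length D and a gradient bound G.

```python
import math

def project(x, lo, hi):
    """Euclidean projection onto the interval [lo, hi]."""
    return max(lo, min(hi, x))

def online_gradient_descent(targets, lo=-1.0, hi=1.0):
    """Play x_t, suffer (x_t - z_t)^2, then take a projected gradient step.

    Assumes every target z_t lies in [lo, hi], so |gradient| <= 2 * (hi - lo).
    Returns the list of plays and the learner's cumulative loss.
    """
    diameter = hi - lo
    grad_bound = 2.0 * diameter
    x = 0.0
    plays, total_loss = [], 0.0
    for t, z in enumerate(targets, start=1):
        plays.append(x)
        total_loss += (x - z) ** 2
        grad = 2.0 * (x - z)
        eta = diameter / (grad_bound * math.sqrt(t))  # decreasing step size
        x = project(x - eta * grad, lo, hi)
    return plays, total_loss

def regret(targets, total_loss, lo=-1.0, hi=1.0):
    """Cumulative loss minus the loss of the best fixed point in hindsight."""
    best = project(sum(targets) / len(targets), lo, hi)
    best_loss = sum((best - z) ** 2 for z in targets)
    return total_loss - best_loss
```

With this step size, the standard analysis bounds the regret by (3/2) * G * D * sqrt(T), so the average regret per round vanishes as T grows even when the targets are chosen adversarially.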

INTERACT · Link prediction · entity · Extensibility · Graph

November 1, 2019

InteractE: Improving Convolution-based Knowledge Graph Embeddings by Increasing Feature Interactions

Shikhar Vashishth,Soumya Sanyal,Vikram Nitin,Nilesh Agrawal,Partha Talukdar

From arXiv. 11 pages

Most existing knowledge graphs suffer from incompleteness, which can be alleviated by inferring missing links based on known facts. One popular way to accomplish this is to generate low-dimensional embeddings of entities and relations, and use these to make inferences. ConvE, a recently proposed approach, applies convolutional filters on 2D reshapings of entity and relation embeddings in order to capture rich interactions between their components. However, the number of interactions that ConvE can capture is limited. In this paper, we analyze how increasing the number of these interactions affects link prediction performance, and utilize our observations to propose InteractE. InteractE is based on three key ideas -- feature permutation, a novel feature reshaping, and circular convolution. Through extensive experiments, we find that InteractE outperforms state-of-the-art convolutional link prediction baselines on FB15k-237. Further, InteractE achieves an MRR score that is 9%, 7.5%, and 23% better than ConvE on the FB15k-237, WN18RR and YAGO3-10 datasets respectively. The results validate our central hypothesis -- that increasing feature interaction is beneficial to link prediction performance. We make the source code of InteractE available to encourage reproducible research.
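Of the three ideas listed, circular convolution is the easiest to show in isolation. The sketch below is a generic 2D circular cross-correlation (the operation deep-learning libraries call convolution) in which indices wrap around the borders instead of being zero-padded. It illustrates the operation only and is not InteractE's implementation; the function name and the nested-list input format are assumptions.

```python
def circular_conv2d(x, kernel):
    """2D cross-correlation with wrap-around (circular) padding.

    x:      H x W nested list, e.g. a 2D reshaping of embeddings.
    kernel: kh x kw nested list with odd kh and kw.
    The output has the same H x W shape as the input.
    """
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    # Modulo indexing wraps past the border to the opposite side.
                    s += x[(i + a - kh // 2) % h][(j + b - kw // 2) % w] * kernel[a][b]
            out[i][j] = s
    return out
```

Because of the wrap-around, entries on opposite edges of the reshaped input interact through the same filter, which zero padding would prevent. A kernel with a single 1 to the right of its center, for example, rotates each row one step to the left.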