
Correlation operations have played an important role in visual object tracking: through a simple similarity comparison, correlation fuses template features with search-region features and outputs a similarity map. However, the correlation operation itself is a local linear matching process, which loses semantic information and lacks global context. To address these limitations, this work proposes a Transformer-based feature fusion model that effectively aggregates global information about the target and the search region by building nonlinear semantic fusion and mining long-range feature dependencies, significantly improving tracking accuracy. TransT achieves state-of-the-art performance on multiple tracking benchmarks while running at roughly 50 fps.

//www.zhuanzhi.ai/paper/7dc7d2e7e635f18776db3f04e7c58bbb
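
For context on what the "local linear matching" above refers to, the following minimal PyTorch sketch (hypothetical shapes and names, not the TransT code) implements the depth-wise cross-correlation commonly used in Siamese trackers: the template feature acts as a convolution kernel slid over the search-region feature, so each output value is a purely local, linear similarity.

```python
# Minimal sketch of the correlation-based fusion that attention-based fusion replaces.
# Hypothetical shapes/names; assumes PyTorch is available.
import torch
import torch.nn.functional as F

def depthwise_xcorr(search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
    """Slide the template feature over the search feature, channel by channel.

    search:   (B, C, Hs, Ws) search-region features
    template: (B, C, Ht, Wt) template features (used as the "kernel")
    returns:  (B, C, Hs-Ht+1, Ws-Wt+1) similarity map
    """
    b, c, hs, ws = search.shape
    # Fold batch into channels so each sample is correlated with its own template.
    search = search.reshape(1, b * c, hs, ws)
    kernel = template.reshape(b * c, 1, *template.shape[2:])
    out = F.conv2d(search, kernel, groups=b * c)  # local, linear matching
    return out.reshape(b, c, out.shape[2], out.shape[3])

if __name__ == "__main__":
    z = torch.randn(2, 256, 8, 8)    # template features
    x = torch.randn(2, 256, 16, 16)  # search-region features
    print(depthwise_xcorr(x, z).shape)  # torch.Size([2, 256, 9, 9])
```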


Related Content

This is the first work to apply Transformers to video instance segmentation. Video instance segmentation is the task of simultaneously classifying, segmenting, and tracking object instances of interest in a video. Existing methods typically design complex pipelines to solve this problem. This paper proposes VisTR, a new Transformer-based video instance segmentation framework that views the task as a direct, end-to-end parallel sequence decoding and prediction problem. Given a video clip consisting of multiple frames as input, VisTR directly outputs the sequence of masks for each instance in the video in order. At its core is a new instance-sequence matching and segmentation strategy that supervises and segments instances at the whole-sequence level. VisTR unifies instance segmentation and tracking under a similarity-learning framework, greatly simplifying the pipeline. Without any tricks, VisTR achieves the best result among all methods using a single model, as well as the fastest speed, on the YouTube-VIS dataset.

//www.zhuanzhi.ai/paper/0dfba6abdc5e6a189d86770822c17859
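
To make the sequence-level supervision idea concrete, here is a toy sketch (hypothetical names; an L1 box cost stands in for the mask and classification costs the paper actually uses) that accumulates per-frame matching costs over a whole clip and then performs a single Hungarian assignment between predicted and ground-truth instance sequences.

```python
# Toy sketch of sequence-level instance matching in the spirit of VisTR.
# Costs are illustrative; names and shapes are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def sequence_level_matching(pred_boxes: np.ndarray, gt_boxes: np.ndarray) -> list:
    """Match predicted instance *sequences* to ground-truth sequences.

    pred_boxes: (num_queries, num_frames, 4) one box per query per frame
    gt_boxes:   (num_instances, num_frames, 4)
    returns:    list of (query_index, instance_index) pairs
    """
    num_q, num_t, _ = pred_boxes.shape
    num_gt = gt_boxes.shape[0]
    # Accumulate the per-frame cost over the whole clip, so an instance is
    # supervised as a single sequence rather than frame by frame.
    cost = np.zeros((num_q, num_gt))
    for t in range(num_t):
        diff = np.abs(pred_boxes[:, t, None, :] - gt_boxes[None, :, t, :])  # (q, gt, 4)
        cost += diff.sum(-1)
    rows, cols = linear_sum_assignment(cost)  # one-to-one assignment at sequence level
    return list(zip(rows.tolist(), cols.tolist()))

if __name__ == "__main__":
    preds = np.random.rand(10, 5, 4)  # 10 instance queries over 5 frames
    gts = np.random.rand(3, 5, 4)     # 3 ground-truth instances
    print(sequence_level_matching(preds, gts))
```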


Correlation plays a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple fusion manner that considers the similarity between the template and the search region. However, the correlation operation itself is a local linear matching process, which loses semantic information and easily falls into local optima; this may be the bottleneck in designing high-accuracy tracking algorithms. Is there any better feature fusion method than correlation? To address this issue, inspired by the Transformer, this work presents a novel attention-based feature fusion network, which effectively combines the template and search-region features using attention alone. Specifically, the proposed method includes an ego-context augment module based on self-attention and a cross-feature augment module based on cross-attention. Finally, we present a Transformer tracking method (named TransT) based on a Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and a classification and regression head. Experiments show that our TransT achieves very promising results on six challenging datasets, especially on the large-scale LaSOT, TrackingNet, and GOT-10k benchmarks. Our tracker runs at approximately 50 fps on a GPU. Code and models are available at //github.com/chenxin-dlut/TransT.
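
The sketch below illustrates, in rough form, how a self-attention "ego-context" block and a cross-attention "cross-feature" block could be composed for template/search-region fusion; module names, layer sizes, and the composition order are assumptions, not the released TransT implementation.

```python
# Rough sketch of attention-based feature fusion in the spirit of the modules
# described above. Hyperparameters and wiring are assumptions.
import torch
import torch.nn as nn

class EgoContextAugment(nn.Module):
    """Self-attention over one feature set (template or search region)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual connection + layer norm

class CrossFeatureAugment(nn.Module):
    """Cross-attention: queries from one branch, keys/values from the other."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(q, kv, kv)
        return self.norm(q + out)

if __name__ == "__main__":
    template = torch.randn(2, 64, 256)   # e.g. 8x8 template tokens
    search = torch.randn(2, 256, 256)    # e.g. 16x16 search-region tokens
    eca, cfa = EgoContextAugment(), CrossFeatureAugment()
    fused = cfa(eca(search), eca(template))  # search tokens attend to the template
    print(fused.shape)  # torch.Size([2, 256, 256])
```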

This paper proposes a Transformer-assisted tracking framework that can be combined with discriminative trackers (e.g., forming TrDiMP). It achieves state-of-the-art results, outperforming SiamRPN++ and other trackers, and the code has just been open-sourced.

In video object tracking, abundant temporal contexts exist among consecutive frames, which have been largely overlooked by existing trackers. In this work, we bridge individual video frames and explore the temporal contexts across them via a Transformer architecture for robust object tracking. Unlike the classic use of Transformers in natural language processing tasks, we separate the encoder and decoder into two parallel branches and carefully design them within a Siamese-like tracking pipeline. The Transformer encoder promotes the target templates via attention-based feature reinforcement, which benefits the generation of high-quality tracking models. The Transformer decoder propagates tracking cues from previous templates to the current frame, which simplifies the object search process. Our Transformer-assisted tracking framework is neat and trained end-to-end. With the proposed Transformer, a simple Siamese matching approach is able to outperform the current top-performing trackers. By combining our Transformer with the recent discriminative tracking pipeline, our method sets several new state-of-the-art records on popular tracking benchmarks.

//www.zhuanzhi.ai/paper/c862787c6e21054a17ed51c178372f5e
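
As a rough illustration of the two parallel branches described above, the following sketch (assumed shapes and module names, not the authors' code) lets an encoder reinforce a stack of historical template features with self-attention, while a decoder propagates cues from those templates into the current search frame via cross-attention.

```python
# Illustrative sketch of a template-encoder / search-decoder pair for tracking.
# Shapes, module names, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class TemplateEncoder(nn.Module):
    """Self-attention over stacked historical template features."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, templates: torch.Tensor) -> torch.Tensor:  # (B, N_t, dim)
        out, _ = self.attn(templates, templates, templates)
        return templates + out  # attention-based feature reinforcement

class SearchDecoder(nn.Module):
    """Cross-attention: search-frame tokens query the encoded templates."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, search: torch.Tensor, templates: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(search, templates, templates)
        return search + out  # tracking cues propagated into the current frame

if __name__ == "__main__":
    past_templates = torch.randn(1, 3 * 64, 256)  # 3 previous templates, 64 tokens each
    current_frame = torch.randn(1, 256, 256)      # current search-frame tokens
    enc, dec = TemplateEncoder(), SearchDecoder()
    propagated = dec(current_frame, enc(past_templates))
    print(propagated.shape)  # torch.Size([1, 256, 256])
```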


The recently proposed DETR eliminates the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference point. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10x fewer training epochs. Extensive experiments on the COCO dataset demonstrate the effectiveness of our approach.
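
A condensed sketch of the sampling idea behind deformable attention is given below: each query predicts a few offsets around its reference point and aggregates bilinearly sampled values with learned weights. This is a single-head, single-scale simplification with assumed names and an arbitrary offset scale, not the official multi-scale implementation.

```python
# Simplified sketch of deformable attention sampling (single head, single level).
# Names, the 0.05 offset scale, and layer sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    def __init__(self, dim: int = 256, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, num_points * 2)  # (dx, dy) per sampling point
        self.weights = nn.Linear(dim, num_points)      # attention weight per point

    def forward(self, query: torch.Tensor, ref_points: torch.Tensor,
                value: torch.Tensor) -> torch.Tensor:
        """query: (B, Nq, dim); ref_points: (B, Nq, 2) in [0, 1]; value: (B, dim, H, W)."""
        b, nq, _ = query.shape
        offsets = self.offsets(query).view(b, nq, self.num_points, 2)
        weights = self.weights(query).softmax(-1)  # (B, Nq, P)
        # Sampling locations in normalized [-1, 1] coordinates for grid_sample.
        locs = (ref_points.unsqueeze(2) + 0.05 * offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value, locs, align_corners=False)  # (B, dim, Nq, P)
        return (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)  # (B, Nq, dim)

if __name__ == "__main__":
    feat = torch.randn(2, 256, 32, 32)   # image feature map
    queries = torch.randn(2, 100, 256)   # object queries
    refs = torch.rand(2, 100, 2)         # reference points in [0, 1]
    print(DeformableSampling()(queries, refs, feat).shape)  # torch.Size([2, 100, 256])
```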


Retrieving video content relevant to natural language queries plays a critical role in handling internet-scale datasets effectively. Most existing caption-to-video retrieval methods do not fully exploit the cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal Transformer that jointly encodes the different modalities in a video, allowing each of them to attend to the others. The Transformer architecture is also leveraged to encode and model temporal information. On the natural language side, we investigate best practices for jointly optimizing the language embedding together with the multi-modal Transformer. This new framework allows us to establish state-of-the-art video retrieval results on three datasets. More details are available at //thoth.inrialpes.fr/research/MMT.
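
The schematic sketch below shows one way to realize the idea of letting each modality attend to the others: per-modality feature tokens receive modality and temporal embeddings, are concatenated, and are passed through a shared Transformer encoder whose pooled output is scored against a caption embedding. Dimensions, pooling, and the scoring function are assumptions rather than the MMT release.

```python
# Schematic sketch of joint multi-modal video encoding for caption-video retrieval.
# Shapes, embedding scheme, and pooling are assumptions.
import torch
import torch.nn as nn

class MultiModalVideoEncoder(nn.Module):
    def __init__(self, dim: int = 512, num_modalities: int = 3, max_frames: int = 32):
        super().__init__()
        self.modality_emb = nn.Embedding(num_modalities, dim)
        self.temporal_emb = nn.Embedding(max_frames, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, features: list) -> torch.Tensor:
        """features: list of (B, T, dim) tensors, one per modality (e.g. RGB, audio, motion)."""
        tokens = []
        for m, feat in enumerate(features):
            b, t, _ = feat.shape
            mod = self.modality_emb(torch.full((b, t), m, dtype=torch.long))
            tim = self.temporal_emb(torch.arange(t).expand(b, t))
            tokens.append(feat + mod + tim)  # tag tokens with modality and time
        x = torch.cat(tokens, dim=1)         # every modality can attend to the others
        return self.encoder(x).mean(dim=1)   # (B, dim) pooled video embedding

if __name__ == "__main__":
    enc = MultiModalVideoEncoder()
    video = [torch.randn(4, 16, 512) for _ in range(3)]  # three modalities, 16 frames
    text = torch.randn(4, 512)                           # caption embeddings (assumed given)
    scores = torch.einsum("bd,cd->bc", enc(video), text) # caption-video similarity matrix
    print(scores.shape)  # torch.Size([4, 4])
```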


Discrete correlation filter (DCF) based trackers have shown considerable success in visual object tracking. These trackers often make use of low- to mid-level features such as histograms of gradients (HoG) and mid-layer activations from convolutional neural networks (CNNs). We argue that including semantically higher-level information in the tracked features may provide further robustness in challenging cases such as viewpoint changes. Deep salient object detection is one example of such high-level features, as it makes use of semantic information to highlight the important regions of a given scene. In this work, we propose an improvement over DCF-based trackers by combining saliency-based and other feature-based filter responses. This combination applies an adaptive weight to the saliency-based filter responses, selected automatically according to the temporal consistency of visual saliency. We show that our method consistently improves a baseline DCF-based tracker, especially in challenging cases, and outperforms the state of the art. Our improved tracker operates at 9.3 fps, introducing a small computational burden over the baseline, which operates at 11 fps.
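
The small NumPy sketch below illustrates the kind of adaptive blending described above: the saliency-based filter response is mixed with the appearance-based response using a weight driven by how temporally consistent consecutive saliency maps are. The specific consistency measure and weighting scheme are assumptions, not the authors' formulation.

```python
# Sketch of adaptively weighting a saliency-based response against an appearance-based one.
# The consistency measure and weight range are assumptions.
import numpy as np

def temporal_consistency(prev_saliency: np.ndarray, curr_saliency: np.ndarray) -> float:
    """Normalized correlation between consecutive saliency maps, clipped to [0, 1]."""
    a = (prev_saliency - prev_saliency.mean()) / (prev_saliency.std() + 1e-8)
    b = (curr_saliency - curr_saliency.mean()) / (curr_saliency.std() + 1e-8)
    return float(np.clip((a * b).mean(), 0.0, 1.0))

def fuse_responses(appearance_resp: np.ndarray, saliency_resp: np.ndarray,
                   prev_saliency: np.ndarray, curr_saliency: np.ndarray,
                   max_weight: float = 0.5) -> np.ndarray:
    """Blend the two filter responses; saliency contributes more when it is stable over time."""
    w = max_weight * temporal_consistency(prev_saliency, curr_saliency)
    return (1.0 - w) * appearance_resp + w * saliency_resp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    resp_hog = rng.random((50, 50))   # response from HoG/CNN-feature filters
    resp_sal = rng.random((50, 50))   # response from the saliency-based filter
    sal_prev, sal_curr = rng.random((50, 50)), rng.random((50, 50))
    fused = fuse_responses(resp_hog, resp_sal, sal_prev, sal_curr)
    print(np.unravel_index(fused.argmax(), fused.shape))  # predicted target position
```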
