We present EgoHumans, a new multi-view multi-human video benchmark to advance the state-of-the-art of egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks either capture a single subject or cover indoor-only scenarios, which limits the generalization of computer vision algorithms for real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild with annotations to support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities like playing soccer, fencing, volleyball, etc. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images, spanning diverse scenes with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address such limitations, we propose EgoFormer, a novel approach with a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track the human pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 and 9.3 HOTA on the EgoHumans dataset.
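
For readers unfamiliar with the IDF1 tracking metric quoted above, the following minimal sketch shows the standard identity F1 formula, IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The function name and the example counts are illustrative assumptions and are not taken from the EgoHumans evaluation code; HOTA is not covered here.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1 score from identity-level true/false positives and false negatives.

    idtp: detections matched to the correct identity
    idfp: predicted detections with a wrong or spurious identity
    idfn: ground-truth detections left without a correct identity match
    """
    denominator = 2 * idtp + idfp + idfn
    # Convention assumed here: an empty sequence scores 0.0.
    return 0.0 if denominator == 0 else 2 * idtp / denominator


# Hypothetical counts, for illustration only.
example_score = idf1(idtp=8, idfp=2, idfn=2)
```

Because IDF1 is computed from identity-consistent matches over the whole sequence, identity switches in crowded, fast-moving egocentric views lower it even when per-frame detection is accurate.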

Related content

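3D is short for "Three Dimensions", meaning three dimensions or three coordinates: length, width, and height. In other words, it refers to something solid, as opposed to a flat plane (2D), which has only length and width.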

Multimodal · Performer · Distillation · MoDELS · Knowledge ·

July 14, 2023

Multimodal Distillation for Egocentric Action Recognition

Gorjan Radevski,Dusan Grujicic,Marie-Francine Moens,Matthew Blaschko,Tinne Tuytelaars

from arxiv, Accepted at ICCV 2023

The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well. However, their performance improves further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, audio, etc. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students which are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth labels in a unimodal or multimodal fashion. We further adopt a principled multimodal knowledge distillation framework, allowing us to deal with issues which occur when applying multimodal knowledge distillation in a naive manner. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance with the reduction of the number of input views.
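
To make the idea of a unimodal student learning from multimodal teachers concrete, here is a minimal sketch of a generic temperature-scaled distillation loss in which the softened predictions of several modality-specific teachers are averaged. This is an assumption-laden illustration of standard knowledge distillation, not the principled framework the authors adopt; the function names, the equal-weight averaging, and the temperature value are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher_probs(teacher_logits, temperature):
    """Average the softened class distributions of all teachers (equal weights assumed)."""
    per_teacher = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(per_teacher[0])
    return [
        sum(probs[c] for probs in per_teacher) / len(per_teacher)
        for c in range(num_classes)
    ]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p_teacher = ensemble_teacher_probs(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Hypothetical logits: an RGB-only student and two teachers (e.g. flow and audio).
loss = distillation_loss([3.0, 2.0, 1.0], [[1.0, 2.0, 3.0], [0.0, 1.0, 4.0]])
```

At inference time only the student is run, which is where the reduction in computational cost described in the abstract would come from.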

Analysis · Performer · Comprehensibility · Dataset · Diversity ·

July 13, 2023

Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events

Weiyao Lin,Huabin Liu,Shizhan Liu,Yuxi Li,Rui Qian,Tao Wang,Ning Xu,Hongkai Xiong,Guo-Jun Qi,Nicu Sebe

from arxiv, Dataset for Large-scale Human-centric Video Analysis in Complex Events (//humaninevents.org), the paper has been published in Int J Comput Vis (2023)

Along with the development of modern smart cities, human-centric video analysis has been encountering the challenge of analyzing diverse and complex events in real scenes. A complex event relates to dense crowds, anomalous individuals, or collective behaviors. However, limited by the scale and coverage of existing video datasets, few human analysis approaches have reported their performances on such complex events. To this end, we present a new large-scale dataset with comprehensive annotations, named Human-in-Events or HiEve (Human-centric video analysis in complex Events), for the understanding of human motions, poses, and actions in a variety of realistic events, especially in crowd & complex events. It contains a record number of poses (>1M), the largest number of action instances (>56k) under complex events, as well as one of the largest numbers of trajectories lasting for longer time (with an average trajectory length of >480 frames). Based on its diverse annotation, we present two simple baselines for action recognition and pose estimation, respectively. They leverage cross-label information during training to enhance the feature learning in corresponding visual tasks. Experiments show that they could boost the performance of existing action recognition and pose estimation pipelines. More importantly, they prove the widely ranged annotations in HiEve can improve various video tasks. Furthermore, we conduct extensive experiments to benchmark recent video analysis approaches together with our baseline methods, demonstrating HiEve is a challenging dataset for human-centric video analysis. We expect that the dataset will advance the development of cutting-edge techniques in human-centric analysis and the understanding of complex events. The dataset is available at //humaninevents.org

Networking · Sample · Representation · Example · Learning ·

July 13, 2023

Free-Form Composition Networks for Egocentric Action Recognition

Haoran Wang,Qinghua Cheng,Baosheng Yu,Yibing Zhan,Dapeng Tao,Liang Ding,Haibin Ling

Egocentric action recognition is gaining significant attention in the field of human action recognition. In this paper, we address the data scarcity issue in egocentric action recognition from a compositional generalization perspective. To tackle this problem, we propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations, and then use them to compose new samples in the feature space for rare classes of action videos. First, we use a graph to capture the spatial-temporal relations among different hand/object instances in each action video. We thus decompose each action into a set of verb and preposition spatial-temporal representations using the edge features in the graph. The temporal decomposition extracts verb and preposition representations from different video frames, while the spatial decomposition adaptively learns verb and preposition representations from action-related instances in each frame. With these spatial-temporal representations of verbs and prepositions, we can compose new samples for those rare classes in a free-form manner, which is not restricted to a rigid form of a verb and a noun. The proposed FFCN can directly generate new training data samples for rare classes, and hence significantly improves action recognition performance. We evaluated our method on three popular egocentric action recognition datasets, Something-Something V2, H2O, and EPIC-KITCHENS-100, and the experimental results demonstrate the effectiveness of the proposed method for handling data scarcity problems, including long-tailed and few-shot egocentric action recognition.

Networking · Performer · RGB-D · Wireless Networks · Coverage ·

July 12, 2023

A Dynamic Obstacle Tracking Strategy for Proactive Handoffs in Millimeter-wave Networks

Rathindra Nath Dutta,Subhojit Sarkar,Sasthi C. Ghosh

Stringent line-of-sight demands necessitated by the fast attenuating nature of millimeter waves (mmWaves) through obstacles pose one of the central problems of next generation wireless networks. These mmWave links are easily disrupted due to obstacles, including vehicles and pedestrians, which cause degradation in link quality and even link failure. Dynamic obstacles are usually tracked by dedicated tracking hardware like RGB-D cameras, which usually have small ranges, and hence lead to prohibitively increased deployment costs to achieve complete coverage of the deployment area. In this manuscript, we propose an altogether different approach to track multiple dynamic obstacles in an mmWave network, solely based on short-term historical link failure information, without resorting to any dedicated tracking hardware. After proving that the said problem is NP-complete, we employ a greedy set-cover based approach to solve it. Using the obtained trajectories, we perform proactive handoffs for at-risk links. We compare our approach with an RGB-D camera-based approach and show that our approach provides better tracking and handoff performances when the camera coverage is low to moderate, which is often the case in real deployment scenarios.
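
The greedy set-cover idea mentioned above can be sketched generically as follows. The mapping to the mmWave setting is an assumption made for illustration: the universe is taken to be the set of recently failed links, and each candidate is a hypothetical obstacle position together with the links it would block. The paper's actual formulation and its trajectory construction are not reproduced here.

```python
def greedy_set_cover(universe, candidates):
    """Greedily choose candidate sets until every element of `universe` is covered.

    universe: iterable of elements to explain (here: hypothetical failed link ids)
    candidates: dict mapping a candidate name to the set of elements it covers
    Returns the chosen candidate names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate that explains the most still-uncovered elements.
        best = max(
            candidates,
            key=lambda name: len(candidates[name] & uncovered),
            default=None,
        )
        gain = candidates[best] & uncovered if best is not None else set()
        if not gain:
            raise ValueError("remaining elements cannot be covered by any candidate")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical data: four failed links and three candidate obstacle positions.
failed_links = {"l1", "l2", "l3", "l4"}
obstacle_positions = {
    "A": {"l1", "l2"},
    "B": {"l2", "l3", "l4"},
    "C": {"l4"},
}
explanation = greedy_set_cover(failed_links, obstacle_positions)
```

The greedy rule is a natural choice for an NP-complete covering problem because it runs in polynomial time and carries the classical logarithmic approximation guarantee for set cover.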

Visual Question Answering · Automator · Performer · MoDELS · state-of-the-art ·

July 11, 2023

Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting

Chantal Pellegrini,Matthias Keicher,Ege Özsoy,Nassir Navab

from arxiv, accepted at MICCAI 2023

Radiology reporting is a crucial part of the communication between radiologists and other medical professionals, but it can be time-consuming and error-prone. One approach to alleviate this is structured reporting, which saves time and enables a more accurate evaluation than free-text reports. However, there is limited research on automating structured reporting, and no public benchmark is available for evaluating and comparing different methods. To close this gap, we introduce Rad-ReStruct, a new benchmark dataset that provides fine-grained, hierarchically ordered annotations in the form of structured reports for X-Ray images. We model the structured reporting task as hierarchical visual question answering (VQA) and propose hi-VQA, a novel method that considers prior context in the form of previously asked questions and answers for populating a structured radiology report. Our experiments show that hi-VQA achieves performance competitive with the state-of-the-art on the medical VQA benchmark VQARad while performing best among methods without domain-specific vision-language pretraining and provides a strong baseline on Rad-ReStruct. Our work represents a significant step towards the automated population of structured radiology reports and provides a valuable first benchmark for future research in this area. We will make all annotations and our code for annotation generation, model evaluation, and training publicly available upon acceptance. Our dataset and code are available at //github.com/ChantalMP/Rad-ReStruct.

contrastive · Contrastive Learning · Learned · Extensibility · Sparse ·

March 25, 2022

Versatile Multi-Modal Pre-Training for Human-Centric Perception

Fangzhou Hong,Liang Pan,Zhongang Cai,Ziwei Liu

from arxiv, CVPR 2022; Project Page //hongfz16.github.io/projects/HCMoCo.html; Codes available at //github.com/hongfz16/HCMoCo

Human-centric perception plays a vital role in vision and graphics. However, the required data annotations are prohibitively expensive. Therefore, it is desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose the Human-Centric Multi-Modal Contrastive Learning framework HCMoCo that leverages the multi-modal nature of human data (e.g. RGB, depth, 2D keypoints) for effective representation learning. The objective comes with two main challenges: dense pre-training for multi-modality data and efficient usage of sparse human priors. To tackle the challenges, we design the novel Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets by hierarchically learning a modal-invariant latent space featured with continuous and ordinal feature distribution and structure-aware semantic consistency. HCMoCo provides pre-training for different modalities by combining heterogeneous datasets, which allows efficient usage of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvement on DensePose Estimation and Human Parsing). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong ability in cross-modal association and reasoning.

3D · Continuity · Estimation/Estimator · MoDELS · Regularization Term ·

March 8, 2022

Recovering 3D Human Mesh from Monocular Images: A Survey

Yating Tian,Hongwen Zhang,Yebin Liu,Limin Wang

from arxiv, Survey paper on monocular 3D human mesh recovery, Project page: //github.com/tinatiansjz/hmr-survey

Estimating human pose and shape from monocular images is a long-standing problem in computer vision. Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention. With the same goal of obtaining well-aligned and physically plausible mesh results, two paradigms have been developed to overcome challenges in the 2D-to-3D lifting process: i) an optimization-based paradigm, where different data terms and regularization terms are exploited as optimization objectives; and ii) a regression-based paradigm, where deep learning techniques are embraced to solve the problem in an end-to-end fashion. Meanwhile, continuous efforts are devoted to improving the quality of 3D mesh labels for a wide range of datasets. Though remarkable progress has been achieved in the past decade, the task is still challenging due to flexible body motions, diverse appearances, complex environments, and insufficient in-the-wild annotations. To the best of our knowledge, this is the first survey to focus on the task of monocular 3D human mesh recovery. We start with the introduction of body models and then elaborate recovery frameworks and training objectives by providing in-depth analyses of their strengths and weaknesses. We also summarize datasets, evaluation metrics, and benchmark results. Open issues and future directions are discussed in the end, hoping to motivate researchers and facilitate their research in this area. A regularly updated project page can be found at //github.com/tinatiansjz/hmr-survey.

Estimation/Estimator · Performer · 3D · Dataset · HTTPS ·

December 24, 2020

Deep Learning-Based Human Pose Estimation: A Survey

Ce Zheng,Wenhan Wu,Taojiannan Yang,Sijie Zhu,Chen Chen,Ruixu Liu,Ju Shen,Nasser Kehtarnavaz,Mubarak Shah

Human pose estimation aims to locate the human body parts and build human body representation (e.g., body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although the recently developed deep learning-based solutions have achieved high performance in human pose estimation, there still remain challenges due to insufficient training data, depth ambiguities, and occlusions. The goal of this survey paper is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers since 2014 are covered in this survey. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the challenges involved, applications, and future research directions are concluded. We also provide a regularly updated project page on: //github.com/zczcwh/DL-HPE

MoDELS · Graph Convolutional Neural Network/Graph Convolutional Network · Graph · Graph Convolution · Networking ·

December 14, 2020

Temporal Relational Modeling with Self-Supervision for Action Segmentation

Dong Wang,Di Hu,Xingjian Li,Dejing Dou

from arxiv, Accepted by the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Temporal relational modeling in video is essential for human action understanding, such as action recognition and action segmentation. Although Graph Convolution Networks (GCNs) have shown promising advantages in relation reasoning on many tasks, it is still a challenge to apply graph convolution networks on long video sequences effectively. The main reason is that the large number of nodes (i.e., video frames) makes it hard for GCNs to capture and model temporal relations in videos. To tackle this problem, in this paper, we introduce an effective GCN module, Dilated Temporal Graph Reasoning Module (DTGRM), designed to model temporal relations and dependencies between video frames at various time spans. In particular, we capture and model temporal relations via constructing multi-level dilated temporal graphs where the nodes represent frames from different moments in video. Moreover, to enhance temporal reasoning ability of the proposed model, an auxiliary self-supervised task is proposed to encourage the dilated temporal graph reasoning module to find and correct wrong temporal relations in videos. Our DTGRM model outperforms state-of-the-art action segmentation models on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset. The code is available at //github.com/redwang/DTGRM.
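
As a rough illustration of what a multi-level dilated temporal graph could look like, the sketch below links each frame to the frame a fixed number of steps later, once per dilation level. The edge rule and the dilation values (1, 2, 4) are assumptions chosen for illustration; the abstract does not specify how DTGRM actually builds its graphs, and the released code should be consulted for that.

```python
def dilated_temporal_edges(num_frames, dilation):
    """Edges (t, t + dilation) between frame indices, treated as undirected links."""
    return [(t, t + dilation) for t in range(num_frames - dilation)]


def multi_level_graphs(num_frames, dilations=(1, 2, 4)):
    """One edge list per dilation level; larger dilations span longer time ranges."""
    return {d: dilated_temporal_edges(num_frames, d) for d in dilations}


# Hypothetical 6-frame clip.
graphs = multi_level_graphs(6)
```

Each level has only one outgoing edge per frame, so the graph stays sparse on long videos while larger dilations still connect distant moments.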

FCN · Fully Convolutional Network · 3D · Cascade · MoDELS ·

March 20, 2018

An application of cascaded 3D fully convolutional networks for medical image segmentation

Holger R. Roth,Hirohisa Oda,Xiangrong Zhou,Natsuki Shimizu,Ying Yang,Yuichiro Hayashi,Masahiro Oda,Michitaka Fujiwara,Kazunari Misawa,Kensaku Mori

from arxiv, Preprint accepted for publication in Computerized Medical Imaging and Graphics. Substantial extension of arXiv:1704.06382; Corrected references to figure numbers in this version

Recent advances in 3D fully convolutional networks (FCN) have made it feasible to produce dense voxel-wise predictions of volumetric images. In this work, we show that a multi-class 3D FCN trained on manually labeled CT scans of several anatomical structures (ranging from the large organs to thin vessels) can achieve competitive segmentation results, while avoiding the need for handcrafting features or training class-specific models. To this end, we propose a two-stage, coarse-to-fine approach that will first use a 3D FCN to roughly define a candidate region, which will then be used as input to a second 3D FCN. This reduces the number of voxels the second FCN has to classify to ~10% and allows it to focus on more detailed segmentation of the organs and vessels. We utilize training and validation sets consisting of 331 clinical CT images and test our models on a completely unseen data collection acquired at a different hospital that includes 150 CT scans, targeting three anatomical organs (liver, spleen, and pancreas). In challenging organs such as the pancreas, our cascaded approach improves the mean Dice score from 68.5 to 82.2%, achieving the highest reported average score on this dataset. We compare with a 2D FCN method on a separate dataset of 240 CT scans with 18 classes and achieve a significantly higher performance in small organs and vessels. Furthermore, we explore fine-tuning our models to different datasets. Our experiments illustrate the promise and robustness of current 3D FCN based semantic segmentation of medical images, achieving state-of-the-art results. Our code and trained models are available for download: //github.com/holgerroth/3Dunet_abdomen_cascade.
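
The Dice score used to report segmentation quality above is defined as 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a reference mask T. The sketch below computes it for one class on flattened binary masks; the function name, the flat-list representation, and the empty-mask convention are assumptions made for illustration and are not taken from the authors' released code.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as equal-length 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 4-voxel masks sharing one foreground voxel.
overlap = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

For a multi-class segmentation, the score would typically be computed per class and then averaged, which is how a mean Dice figure such as the one quoted for the pancreas is usually obtained.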