Modern methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual images. Consequently, very few take advantage of temporal consistency in video sequences, and those that do only impose weak smoothness constraints across consecutive frames. In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people conservation constraints in both a spatial and temporal manner makes it possible to train a deep crowd counting model in an active learning setting with far fewer annotations. This significantly reduces the annotation cost while still achieving performance similar to that of full supervision.
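The conservation idea in the abstract above can be illustrated with a small sketch. It is not the paper's network or loss. It assumes a hypothetical grid in which each cell sends people to its nine neighbours (itself included) between two frames, and that nobody enters or leaves the grid. The names `OFFSETS`, `density_before`, and `density_after` are illustrative only.

```python
# Offsets to the 9 grid neighbours, including (0, 0) for "stay in place".
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def density_before(flow, height, width):
    """Density at the earlier frame: everyone in a cell flows somewhere."""
    density = [[0.0] * width for _ in range(height)]
    for (y, x), outgoing in flow.items():
        density[y][x] = sum(outgoing)
    return density


def density_after(flow, height, width):
    """Density at the later frame: sum of the flows arriving in each cell."""
    density = [[0.0] * width for _ in range(height)]
    for (y, x), outgoing in flow.items():
        for (dy, dx), amount in zip(OFFSETS, outgoing):
            ty, tx = y + dy, x + dx
            # Assumption: flows that would leave the grid are ignored.
            if 0 <= ty < height and 0 <= tx < width:
                density[ty][tx] += amount
    return density


def total(density):
    """Total people count of a density map."""
    return sum(sum(row) for row in density)


# Two people in the centre cell: 1.5 stay (index 4), 0.5 move right (index 5).
flow = {(1, 1): [0.0, 0.0, 0.0, 0.0, 1.5, 0.5, 0.0, 0.0, 0.0]}
before = density_before(flow, 3, 3)
after = density_after(flow, 3, 3)
```

In this toy case both density maps sum to the same count, which is the kind of constraint that direct per-frame density regression cannot express.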

Related content

Estimation / estimator

Precision / accuracy · Estimation / estimator · state-of-the-art · Neural Networks · Forward

October 3, 2021

Precise Object Placement with Pose Distance Estimations for Different Objects and Grippers

Kilian Kleeberger,Jonathan Schnitzler,Muhammad Usman Khalid,Richard Bormann,Werner Kraus,Marco F. Huber

from arxiv, Accepted at 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021)

This paper introduces a novel approach for the grasping and precise placement of various known rigid objects using multiple grippers within highly cluttered scenes. Using a single depth image of the scene, our method estimates multiple 6D object poses together with an object class, a pose distance for object pose estimation, and a pose distance from a target pose for object placement for each automatically obtained grasp pose with a single forward pass of a neural network. By incorporating model knowledge into the system, our approach has higher success rates for grasping than state-of-the-art model-free approaches. Furthermore, our method chooses grasps that result in significantly more precise object placements than prior model-based work.

Processing (programming language) · SimPLe · CASE · Correlation coefficient · Linear regression

October 1, 2021

A Review and Critique of Auxiliary Information-Based Process Monitoring Methods

Nesma A. Saleh,Mahmoud A. Mahmoud,William H. Woodall,Sven Knoth

from arxiv, 21 pages, 2 figures

We review the rapidly growing literature on auxiliary information-based (AIB) process monitoring methods. Under this approach, there is an assumption that the auxiliary variable, which is correlated with the quality variable of interest, has a known mean, or some other parameter, which cannot change over time. We demonstrate that violations of this assumption can have serious adverse effects both when the process is stable and when there has been a process shift. Some process shifts can become undetectable. We also show that the basic AIB approach is a special case of simple linear regression profile monitoring. The AIB charting techniques require strong assumptions. Based on our results, we warn against the use of the AIB approach in quality control applications.

Identifiable · CARS · INTERACT · Estimation / estimator · Understandability

October 1, 2021

Omnimatte: Associating Objects and Their Effects in Video

Erika Lu,Forrester Cole,Tali Dekel,Andrew Zisserman,William T. Freeman,Michael Rubinstein

from arxiv, CVPR 2021 Oral. Project webpage: //omnimatte.github.io/. Added references

Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects -- shadows, reflections, generated smoke, etc. -- are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and a rough segmentation mask over time of one or more subjects of interest, we estimate an omnimatte for each subject -- an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic -- it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject.

Layer · Correlation coefficient · Separated · Directed · GROUP

October 1, 2021

Layered Neural Rendering for Retiming People in Video

Erika Lu,Forrester Cole,Tali Dekel,Weidi Xie,Andrew Zisserman,David Salesin,William T. Freeman,Michael Rubinstein

from arxiv, In SIGGRAPH Asia 2020. Project webpage: //retiming.github.io/. Added references

We present a method for retiming people in an ordinary, natural video -- manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether. We achieve these effects computationally via a dedicated learning-based layered video representation, where each frame in the video is decomposed into separate RGBA layers, representing the appearance of different people in the video. A key property of our model is that it not only disentangles the direct motions of each person in the input video, but also correlates each person automatically with the scene changes they generate -- e.g., shadows, reflections, and motion of loose clothing. The layers can be individually retimed and recombined into a new video, allowing us to achieve realistic, high-quality renderings of retiming effects for real-world videos depicting complex actions and involving multiple individuals, including dancing, trampoline jumping, or group running.
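The last step described above, retiming individual RGBA layers and recombining them, can be sketched with ordinary back-to-front alpha compositing. This is a minimal illustration under assumptions, not the paper's learned renderer. Frames are lists of pixels, alpha is non-premultiplied, and the helper names `over`, `composite`, and `retime` are hypothetical.

```python
def over(front, back):
    """Blend one RGBA pixel over an opaque RGB pixel (non-premultiplied alpha)."""
    r, g, b, a = front
    return tuple(a * f + (1.0 - a) * k for f, k in zip((r, g, b), back))


def composite(background, layers):
    """Blend RGBA layer frames, ordered back to front, onto an RGB background."""
    out = list(background)
    for layer in layers:
        out = [over(px, bg) for px, bg in zip(layer, out)]
    return out


def retime(layer_frames, offset):
    """Delay a layer by `offset` frames, clamping at the first and last frame."""
    n = len(layer_frames)
    return [layer_frames[min(max(t - offset, 0), n - 1)] for t in range(n)]


# One-pixel video: a red "person" layer that is visible only in frame 1.
background = [(0.0, 0.0, 0.0)]
person = [
    [(1.0, 0.0, 0.0, 0.0)],  # frame 0: transparent
    [(1.0, 0.0, 0.0, 1.0)],  # frame 1: opaque red
    [(1.0, 0.0, 0.0, 0.0)],  # frame 2: transparent
]
delayed = retime(person, 1)  # the person now appears one frame later
video = [composite(background, [frame]) for frame in delayed]
```

Because each person lives in a separate layer, shifting one layer in time leaves the others untouched, which is what makes per-person retiming or removal possible.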

Example · End-to-end · Transformation · MoDELS · Understandability

March 24, 2021

End-to-End Video Instance Segmentation with Transformers

Yuqing Wang,Zhaoliang Xu,Xinlong Wang,Chunhua Shen,Baoshan Cheng,Hao Shen,Huaxia Xia

from arxiv, CVPR 2021 Oral

Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video. Recent methods typically develop sophisticated pipelines to tackle this task. Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR directly outputs the sequence of masks for each instance in the video, in order. At the core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments instances at the sequence level as a whole. VisTR frames instance segmentation and tracking from the same perspective of similarity learning, which considerably simplifies the overall pipeline and differs significantly from existing approaches. Without bells and whistles, VisTR achieves the highest speed among all existing VIS models, and achieves the best result among methods using a single model on the YouTube-VIS dataset. For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy. We hope that VisTR can motivate future research for more video understanding tasks.

Estimation / estimator · SCAN · Extensibility · 3D · Robustness

January 17, 2021

MultiBodySync: Multi-Body Segmentation and Motion Estimation via 3D Scan Synchronization

Jiahui Huang,He Wang,Tolga Birdal,Minhyuk Sung,Federica Arrigoni,Shi-Min Hu,Leonidas Guibas

from arxiv, Contact: huang-jh18<at>mails<dot>tsinghua<dot>edu<dot>cn

We present MultiBodySync, a novel, end-to-end trainable multi-body motion segmentation and rigid registration framework for multiple input 3D point clouds. The two non-trivial challenges posed by this multi-scan multibody setting that we investigate are: (i) guaranteeing correspondence and segmentation consistency across multiple input point clouds capturing different spatial arrangements of bodies or body parts; and (ii) obtaining robust motion-based rigid body segmentation applicable to novel object categories. We propose an approach to address these issues that incorporates spectral synchronization into an iterative deep declarative network, so as to simultaneously recover consistent correspondences as well as motion segmentation. At the same time, by explicitly disentangling the correspondence and motion segmentation estimation modules, we achieve strong generalizability across different object categories. Our extensive evaluations demonstrate that our method is effective on various datasets ranging from rigid parts in articulated objects to individually moving objects in a 3D scene, be it single-view or full point clouds.

Learned · INTERACT · Inference · INFORMS · Performer

March 26, 2020

Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects

Kiana Ehsani,Shubham Tulsiani,Saurabh Gupta,Ali Farhadi,Abhinav Gupta

from arxiv, CVPR 2020 -- (Oral presentation)

When we humans look at a video of human-object interaction, we can not only infer what is happening but we can even extract actionable information and imitate those interactions. On the other hand, current recognition or geometric approaches lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects and enforce that estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few-shot examples.

Estimation / estimator · Shaping · INTERACT · Networking · INFORMS

March 8, 2019

Learning to Estimate Pose and Shape of Hand-Held Objects from RGB Images

Mia Kokic,Danica Kragic,Jeannette Bohg

We develop a system for modeling hand-object interactions in 3D from RGB images that show a hand which is holding a novel object from a known category. We design a Convolutional Neural Network (CNN) for Hand-held Object Pose and Shape estimation called HOPS-Net and utilize prior work to estimate the hand pose and configuration. We leverage the insight that information about the hand facilitates object pose and shape estimation by incorporating the hand into both training and inference of the object pose and shape as well as the refinement of the estimated pose. The network is trained on a large synthetic dataset of objects in interaction with a human hand. To bridge the gap between real and synthetic images, we employ an image-to-image translation model (Augmented CycleGAN) that generates realistically textured objects given a synthetic rendering. This provides a scalable way of generating annotated data for training HOPS-Net. Our quantitative experiments show that even noisy hand parameters significantly help object pose and shape estimation. The qualitative experiments show results of pose and shape estimation of objects held by a hand "in the wild".

Less · Identifiable · Unsupervised · state-of-the-art · Learned

March 3, 2019

Less is More: Learning Highlight Detection from Video Duration

Bo Xiong,Yannis Kalantidis,Deepti Ghadiyaram,Kristen Grauman

from arxiv, To appear in CVPR 2019

Highlight detection has the potential to significantly ease video browsing, but existing methods often suffer from expensive supervision requirements, where human viewers must manually identify highlights in training videos. We propose a scalable unsupervised solution that exploits video duration as an implicit supervision signal. Our key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, we introduce a novel ranking framework that prefers segments from shorter videos, while properly accounting for the inherent noise in the (unlabeled) training data. We use it to train a highlight detector with 10M hashtagged Instagram videos. In experiments on two challenging public video highlight detection benchmarks, our method substantially improves the state-of-the-art for unsupervised highlight detection.
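The ranking idea above, preferring segments from shorter videos, can be sketched as a plain pairwise hinge loss. This is an illustrative simplification. The abstract says the actual framework also accounts for noise in the unlabeled training data, which this sketch omits. The function names and the `margin` parameter are assumptions.

```python
def pairwise_ranking_loss(score_short, score_long, margin=1.0):
    """Hinge loss that is zero once the short-video segment outscores
    the long-video segment by at least `margin`."""
    return max(0.0, margin - (score_short - score_long))


def mean_ranking_loss(pairs, margin=1.0):
    """Average the pairwise loss over (score_short, score_long) pairs."""
    return sum(pairwise_ranking_loss(s, l, margin) for s, l in pairs) / len(pairs)


# Hypothetical detector scores for three (short, long) segment pairs.
pairs = [(2.0, 0.5), (0.0, 0.0), (0.25, 0.75)]
loss = mean_ranking_loss(pairs)
```

Only the ordering within each pair is supervised, so no human highlight labels are needed. Video duration alone decides which segment of a pair should rank higher.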

Networking · Convolution · Shaping · Model evaluation · Dataset

February 1, 2018

DensePose: Dense Human Pose Estimation In The Wild

Rıza Alp Güler,Natalia Neverova,Iasonas Kokkinos

In this work, we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We first gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an 'inpainting' network that can fill in missing ground-truth values, and report clear improvements with respect to the best results that were previously achievable. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter; we further improve accuracy through cascading, obtaining a system that delivers highly accurate results in real time. Supplementary materials and videos are provided on the project page //densepose.org