精品亚洲中文一区二区三区_日本一卡2卡3卡4卡乱码免费网站_中国东北老肥熟露脸视频_欧美成在线手机版1003_亚洲欧美在线搜索精品性色_蜜臀AV一区二区三区人妻_一级女人18片毛片蜜桃AV直播

Recent advances in omnidirectional cameras and AR/VR headsets have spurred the adoption of 360-degree videos that are widely believed to be the future of online video streaming. 360-degree videos allow users to wear a head-mounted display (HMD) and experience the video as if they are physically present in the scene. Streaming high-quality 360-degree videos at scale is an unsolved problem that is more challenging than traditional (2D) video delivery. The data rate required to stream 360-degree videos is an order of magnitude more than traditional videos. Further, the penalty for rebuffering events where the video freezes or displays a blank screen is more severe as it may cause cybersickness. We propose an online adaptive bitrate (ABR) algorithm for 360-degree videos called BOLA360 that runs inside the client's video player and orchestrates the download of video segments from the server so as to maximize the quality-of-experience (QoE) of the user. BOLA360 conserves bandwidth by downloading only those video segments that are likely to fall within the field-of-view (FOV) of the user. In addition, BOLA360 continually adapts the bitrate of the downloaded video segments so as to enable a smooth playback without rebuffering. We prove that BOLA360 is near-optimal with respect to an optimal offline algorithm that maximizes QoE. Further, we evaluate BOLA360 on a wide range of network and user head movement profiles and show that it provides $13.6\%$ to $372.5\%$ more QoE than state-of-the-art algorithms. While ABR algorithms for traditional (2D) videos have been well-studied over the last decade, our work is the first ABR algorithm for 360-degree videos with both theoretical and empirical guarantees on its performance.

相關內容

QoE

關注 0

MoDELS · PULSE · Learning · state-of-the-art · Performer ·

2023 年 10 月 23 日

CalibrationPhys: Self-supervised Video-based Heart and Respiratory Rate Measurements by Calibrating Between Multiple Cameras

Yusuke Akamatsu,Terumi Umematsu,Hitoshi Imaoka

from arxiv, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Video-based heart and respiratory rate measurements using facial videos are more useful and user-friendly than traditional contact-based sensors. However, most of the current deep learning approaches require ground-truth pulse and respiratory waves for model training, which are expensive to collect. In this paper, we propose CalibrationPhys, a self-supervised video-based heart and respiratory rate measurement method that calibrates between multiple cameras. CalibrationPhys trains deep learning models without supervised labels by using facial videos captured simultaneously by multiple cameras. Contrastive learning is performed so that the pulse and respiratory waves predicted from the synchronized videos using multiple cameras are positive and those from different videos are negative. CalibrationPhys also improves the robustness of the models by means of a data augmentation technique and successfully leverages a pre-trained model for a particular camera. Experimental results utilizing two datasets demonstrate that CalibrationPhys outperforms state-of-the-art heart and respiratory rate measurement methods. Since we optimize camera-specific models using only videos from multiple cameras, our approach makes it easy to use arbitrary cameras for heart and respiratory rate measurements.

估計/估計量 · 點云 · SIFT · SURF · 傳感器 ·

2023 年 10 月 23 日

Converting Depth Images and Point Clouds for Feature-based Pose Estimation

Robert L?sch,Mark Sastuba,Jonas Toth,Bernhard Jung

from arxiv, to be published in IROS 2023 conference proceedings

In recent years, depth sensors have become more and more affordable and have found their way into a growing amount of robotic systems. However, mono- or multi-modal sensor registration, often a necessary step for further processing, faces many challenges on raw depth images or point clouds. This paper presents a method of converting depth data into images capable of visualizing spatial details that are basically hidden in traditional depth images. After noise removal, a neighborhood of points forms two normal vectors whose difference is encoded into this new conversion. Compared to Bearing Angle images, our method yields brighter, higher-contrast images with more visible contours and more details. We tested feature-based pose estimation of both conversions in a visual odometry task and RGB-D SLAM. For all tested features, AKAZE, ORB, SIFT, and SURF, our new Flexion images yield better results than Bearing Angle images and show great potential to bridge the gap between depth data and classical computer vision. Source code is available here: //rlsch.github.io/depth-flexion-conversion.

Networking · 圖 · MoDELS · 任務對話系統 · state-of-the-art ·

2023 年 10 月 23 日

IRRGN: An Implicit Relational Reasoning Graph Network for Multi-turn Response Selection

Jingcheng Deng,Hengwei Dai,Xuewei Guo,Yuanchen Ju,Wei Peng

from arxiv, Accepted by EMNLP 2022

The task of response selection in multi-turn dialogue is to find the best option from all candidates. In order to improve the reasoning ability of the model, previous studies pay more attention to using explicit algorithms to model the dependencies between utterances, which are deterministic, limited and inflexible. In addition, few studies consider differences between the options before and after reasoning. In this paper, we propose an Implicit Relational Reasoning Graph Network to address these issues, which consists of the Utterance Relational Reasoner (URR) and the Option Dual Comparator (ODC). URR aims to implicitly extract dependencies between utterances, as well as utterances and options, and make reasoning with relational graph convolutional networks. ODC focuses on perceiving the difference between the options through dual comparison, which can eliminate the interference of the noise options. Experimental results on two multi-turn dialogue reasoning benchmark datasets MuTual and MuTual+ show that our method significantly improves the baseline of four pretrained language models and achieves state-of-the-art performance. The model surpasses human performance for the first time on the MuTual dataset.

模態 · HTTPS · 可理解性 · 數據集 · 模型評估 ·

2023 年 10 月 23 日

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu,Bin Lin,Munan Ning,Yang Yan,Jiaxi Cui,HongFa Wang,Yatian Pang,Wenhao Jiang,Junwu Zhang,Zongwei Li,Wancai Zhang,Zhifeng Li,Wei Liu,Li Yuan

from arxiv, Under review as a conference paper at ICLR 2024

The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 5.8% R@1 on the MSR-VTT dataset with only 15% of the parameters in the zero-shot video-text retrieval task. Beyond this, our LanguageBind has greatly improved in the zero-shot video, audio, depth, and infrared understanding tasks. For instance, LanguageBind surpassing InterVideo by 1.9% on MSR-VTT, 8.8% on MSVD, 6.3% on DiDeMo, and 4.4% on ActivityNet. On the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind with 23.8% and 11.1% top-1 accuracy. Code address: //github.com/PKU-YuanGroup/LanguageBind.

Continuity · TOOLS · Analysis · 數據分析 · 穩健性 ·

2023 年 10 月 22 日

CMDA: a tool for Continuous Monitoring Data Analysis

Pejman Farhadi Ghalati,Andreas Schuppert

Over the last few years, with the growth of time-series collecting and storing, there has been a great demand for tools and software for temporal data engineering and modeling. This paper presents a generic workflow for time series data research, including temporal data importing, preprocessing, and feature extraction. This framework is developed and built as a robust and easy-to-use Python package, called CMDA, with a modular structure that offers tools to prepare raw data, allowing both scientists and non-experts to analyze various temporal data structures.

可辨認的 · state-of-the-art · HTTPS · 真實值 · 數據集 ·

2023 年 10 月 20 日

Reference-based Restoration of Digitized Analog Videotapes

Lorenzo Agnolucci,Leonardo Galteri,Marco Bertini,Alberto Del Bimbo

Analog magnetic tapes have been the main video data storage device for several decades. Videos stored on analog videotapes exhibit unique degradation patterns caused by tape aging and reader device malfunctioning that are different from those observed in film and digital video restoration tasks. In this work, we present a reference-based approach for the resToration of digitized Analog videotaPEs (TAPE). We leverage CLIP for zero-shot artifact detection to identify the cleanest frames of each video through textual prompts describing different artifacts. Then, we select the clean frames most similar to the input ones and employ them as references. We design a transformer-based Swin-UNet network that exploits both neighboring and reference frames via our Multi-Reference Spatial Feature Fusion (MRSFF) blocks. MRSFF blocks rely on cross-attention and attention pooling to take advantage of the most useful parts of each reference frame. To address the absence of ground truth in real-world videos, we create a synthetic dataset of videos exhibiting artifacts that closely resemble those commonly found in analog videotapes. Both quantitative and qualitative experiments show the effectiveness of our approach compared to other state-of-the-art methods. The code, the model, and the synthetic dataset are publicly available at //github.com/miccunifi/TAPE.

MoDELS · 特化 · Vision · 跡 · Facebook AI Research ·

2023 年 10 月 20 日

Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks

Haobo Song,Soumajit Majumder,Tao Lin

Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite layer models with elegant solution-finding procedures and constant memory footprint. However, despite several attempts, these methods are heavily constrained by model inefficiency and optimization instability. Furthermore, fair benchmarking across relevant methods for vision tasks is missing. In this work, we revisit the line of implicit models and trace them back to the original weight-tied models. Surprisingly, we observe that weight-tied models are more effective, stable, as well as efficient on vision tasks, compared to the DEQ variants. Through the lens of these simple-yet-clean weight-tied models, we further study the fundamental limits in the model capacity of such models and propose the use of distinct sparse masks to improve the model capacity. Finally, for practitioners, we offer design guidelines regarding the depth, width, and sparsity selection for weight-tied models, and demonstrate the generalizability of our insights to other learning paradigms.

DeepFakes · Networking · 集成 · 數據集 · 模態 ·

2023 年 10 月 19 日

AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection

Ammarah Hashmi,Sahibzada Adil Shahzad,Chia-Wen Lin,Yu Tsao,Hsin-Min Wang

Forged content shared widely on social media platforms is a major social problem that requires increased regulation and poses new challenges to the research community. The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos only utilizes visual modality or audio modality. While there are some methods in the literature that exploit audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multi-modal datasets of deepfake videos involving acoustic and visual manipulations. Moreover, these existing methods are mostly based on CNN and suffer from low detection accuracy. Inspired by the recent success of Transformer in various fields, to address the challenges posed by deepfake technology, in this paper, we propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation to achieve effective video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multi-modal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that our best model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset.

視覺問答 · 自頂向下 · 圖像字幕 · 注意力機制 · 自下而上 ·

2018 年 3 月 14 日

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson,Xiaodong He,Chris Buehler,Damien Teney,Mark Johnson,Stephen Gould,Lei Zhang

from arxiv, CVPR 2018 full oral, winner of the 2017 Visual Question Answering challenge

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

2018 年 1 月 5 日

Deep Reinforcement Learning for List-wise Recommendations

Xiangyu Zhao,Liang Zhang,Zhuoye Ding,Dawei Yin,Yihong Zhao,Jiliang Tang

Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wide recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.