
Motion estimation and motion compensation are indispensable parts of inter prediction in video coding. Since the motion vectors of objects are mostly in fractional-pixel units, the original reference pictures may not provide a suitable reference for motion compensation. In this paper, we propose a deep reference picture generator that creates a picture more relevant to the current encoding frame, thereby further reducing temporal redundancy and improving video compression efficiency. Inspired by the recent progress of Convolutional Neural Networks (CNNs), this paper proposes to use a dilated CNN to build the generator. Moreover, we insert the generated deep picture into Versatile Video Coding (VVC) as a reference picture and perform a comprehensive set of experiments to evaluate the effectiveness of our network on the latest VVC Test Model (VTM). The experimental results demonstrate that our proposed method achieves on average 9.7% bit savings compared with VVC under the low-delay P configuration.
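To make the idea concrete, the following PyTorch sketch shows what a dilated-CNN reference picture generator could look like: it fuses two decoded reference frames and predicts a synthesized frame to be inserted into the reference picture list. This is not the authors' released network; the module name `DilatedRefGenerator`, the layer widths, and the dilation schedule are illustrative assumptions.

```python
# A minimal sketch (not the authors' released model) of a dilated-CNN
# reference picture generator: it takes two decoded reference frames and
# predicts a new frame to be used as an extra reference picture.
import torch
import torch.nn as nn

class DilatedRefGenerator(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.head = nn.Conv2d(2, channels, kernel_size=3, padding=1)
        # Dilated convolutions enlarge the receptive field without
        # downsampling, which helps model large fractional-pixel motion.
        self.body = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True))
            for d in (1, 2, 4, 8, 4, 2, 1)
        ])
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, ref0, ref1):
        # ref0, ref1: (N, 1, H, W) luma planes of two reference pictures.
        x = self.head(torch.cat([ref0, ref1], dim=1))
        return self.tail(self.body(x))

gen = DilatedRefGenerator()
deep_ref = gen(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))
print(deep_ref.shape)  # torch.Size([1, 1, 64, 64])
```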

Related Content

With the recent massive development of convolutional neural networks, numerous lightweight CNN-based image super-resolution methods have been proposed for practical deployment on edge devices. However, most existing methods focus on one specific aspect: network or loss design, which makes it difficult to minimize the model size. To address this issue, we combine block design, architecture search, and loss design to obtain a more efficient SR structure. In this paper, we propose an edge-enhanced feature distillation network, named EFDN, to preserve high-frequency information under constrained resources. In detail, we build an edge-enhanced convolution block based on existing reparameterization methods. Meanwhile, we propose an edge-enhanced gradient loss to calibrate the training of the reparameterized path. Experimental results show that our edge-enhanced strategies preserve edges and significantly improve the final restoration quality. Code is available at //github.com/icandle/EFDN.
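As a rough illustration of the edge-enhanced gradient loss idea (not the exact EFDN formulation), the sketch below adds an L1 penalty on Sobel gradient maps to a standard pixel-wise L1 loss; the function names and the `edge_weight` value are assumptions.

```python
# A minimal sketch of an "edge-enhanced gradient loss": an L1 loss on
# Sobel gradient maps added to a pixel loss. The weighting and filter
# choice are illustrative assumptions, not the exact EFDN loss.
import torch
import torch.nn.functional as F

def sobel_gradients(img):
    # img: (N, C, H, W); returns horizontal and vertical gradient maps.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = img.shape[1]
    gx = F.conv2d(img, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return gx, gy

def edge_enhanced_loss(sr, hr, edge_weight=0.1):
    gx_sr, gy_sr = sobel_gradients(sr)
    gx_hr, gy_hr = sobel_gradients(hr)
    pixel = F.l1_loss(sr, hr)
    edge = F.l1_loss(gx_sr, gx_hr) + F.l1_loss(gy_sr, gy_hr)
    return pixel + edge_weight * edge

loss = edge_enhanced_loss(torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32))
```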

Deep learning has achieved remarkable results in many computer vision tasks. Deep neural networks typically rely on large amounts of training data to avoid overfitting. However, labeled data for real-world applications may be limited. As an effective way to improve the quantity and diversity of training data, data augmentation has become a necessary part of successfully applying deep learning models to image data. In this paper, we systematically review different image data augmentation methods. We propose a taxonomy of the reviewed methods and present their strengths and limitations. We also conduct extensive experiments with various data augmentation methods on three typical computer vision tasks: semantic segmentation, image classification, and object detection. Finally, we discuss the current challenges faced by data augmentation and future research directions that offer useful guidance.
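For readers unfamiliar with the basic families of augmentations such a survey covers, the snippet below shows a typical geometric-plus-photometric pipeline built with torchvision; the specific transforms and parameters are arbitrary examples, not recommendations from the paper.

```python
# A small illustration of common image augmentations (geometric and
# photometric transforms plus random erasing), using torchvision.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),        # geometric: random crop + rescale
    transforms.RandomHorizontalFlip(p=0.5),   # geometric: mirror
    transforms.ColorJitter(0.4, 0.4, 0.4),    # photometric: brightness/contrast/saturation
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),         # occlusion-style augmentation on the tensor
])
```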

In recent years, the channel attention mechanism has been widely investigated due to its great potential for improving the performance of deep convolutional neural networks (CNNs) in many vision tasks. However, in most existing methods, only the output of the adjacent convolution layer is fed into the attention layer to calculate the channel weights; information from other convolution layers is ignored. Based on these observations, a simple strategy, named Bridge Attention Net (BA-Net), is proposed in this paper to make better use of the channel attention mechanism. The core idea of this design is to bridge the outputs of previous convolution layers through skip connections for channel weight generation. Based on our experiments and theoretical analysis, we find that features from previous layers also contribute significantly to the weights. Comprehensive evaluation demonstrates that the proposed approach achieves state-of-the-art (SOTA) performance compared with existing methods in terms of accuracy and speed, which shows that Bridge Attention provides a new perspective on the design of neural network architectures with great potential for improving performance. The code is available at //github.com/zhaoy376/Bridge-Attention.
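A minimal sketch of the bridging idea, assuming a squeeze-and-excitation-style weight branch, is given below: channel weights are computed from the pooled outputs of both the current layer and earlier layers. The class name `BridgeChannelAttention` and all sizes are illustrative, not the released BA-Net code.

```python
# Channel attention whose weights are computed not only from the current
# layer's output but also from earlier layers' outputs, bridged in via
# global average pooling.
import torch
import torch.nn as nn

class BridgeChannelAttention(nn.Module):
    def __init__(self, channels_list, out_channels, reduction=16):
        super().__init__()
        total = sum(channels_list) + out_channels
        self.fc = nn.Sequential(
            nn.Linear(total, out_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(out_channels // reduction, out_channels),
            nn.Sigmoid(),
        )

    def forward(self, current, previous_feats):
        # Global-average-pool the current output and every bridged feature.
        pooled = [f.mean(dim=(2, 3)) for f in [current, *previous_feats]]
        weights = self.fc(torch.cat(pooled, dim=1))          # (N, C_out)
        return current * weights.unsqueeze(-1).unsqueeze(-1)

att = BridgeChannelAttention([32, 64], out_channels=128)
out = att(torch.rand(2, 128, 8, 8),
          [torch.rand(2, 32, 32, 32), torch.rand(2, 64, 16, 16)])
```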

The detection of tiny objects in microscopic videos is a challenging problem, especially in large-scale experiments. For tiny objects (such as sperms) in microscopic videos, current detection methods struggle with fuzzy appearance, irregular shapes, and precise localization. To address this, we present a convolutional neural network for tiny object detection (TOD-CNN) together with an underlying dataset of high-quality sperm microscopic videos (111 videos, $>$ 278,000 annotated objects), and a graphical user interface (GUI) is designed to deploy and test the proposed model effectively. TOD-CNN is highly accurate, achieving $85.60\%$ AP$_{50}$ in the task of real-time sperm detection in microscopic videos. To demonstrate the importance of sperm detection technology in sperm quality analysis, we compute relevant sperm quality evaluation metrics and compare them with the diagnoses of medical doctors.

Crowd counting is a challenging problem due to scene complexity and scale variation. Although deep learning has brought great improvements to crowd counting, scene complexity affects the judgement of these methods, and they often mistakenly regard some objects as people, causing potentially enormous errors in the counting result. To address this problem, we propose a novel end-to-end model called Crowd Attention Convolutional Neural Network (CAT-CNN). Our CAT-CNN can adaptively assess the importance of a human head at each pixel location by automatically encoding a confidence map. Guided by the confidence map, the positions of human heads in the estimated density map receive more attention when encoding the final density map, which effectively avoids large misjudgements. The crowd count is obtained by integrating the final density map. To encode a highly refined density map, the total crowd count of each image is classified in a dedicated classification task, and we first explicitly map the prior of the population-level category to feature maps. To verify the efficiency of our proposed method, extensive experiments are conducted on three highly challenging datasets. Results establish the superiority of our method over many state-of-the-art methods.
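The confidence-guided counting mechanism can be sketched as follows, under the assumption of a simple two-branch head: the confidence map modulates the density map, and the count is obtained by summing (integrating) the final density map. Module names and channel sizes are hypothetical, not the CAT-CNN implementation.

```python
# Confidence-guided density head: a per-pixel confidence map suppresses
# non-person regions in the density map, and the crowd count is the
# integral (sum) of the final density map.
import torch
import torch.nn as nn

class ConfidenceGuidedHead(nn.Module):
    def __init__(self, in_channels=64):
        super().__init__()
        self.density = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.confidence = nn.Sequential(
            nn.Conv2d(in_channels, 1, kernel_size=1),
            nn.Sigmoid(),   # per-pixel probability that a head is present
        )

    def forward(self, feats):
        density = torch.relu(self.density(feats))
        conf = self.confidence(feats)
        final_density = density * conf             # suppress non-person regions
        count = final_density.sum(dim=(1, 2, 3))   # integrate the density map
        return final_density, count

head = ConfidenceGuidedHead()
dmap, count = head(torch.rand(2, 64, 48, 64))
print(count.shape)  # torch.Size([2])
```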

Deep learning has enabled a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning is to create models that can process and link information from various modalities. Despite the extensive development of unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning helps models understand and analyze better when various senses are engaged in processing information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A detailed analysis of past and current baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in greater depth. The architectures and datasets used in these applications are also discussed, along with their evaluation metrics. Finally, the main issues are highlighted separately for each domain, along with their possible future research directions.

Knowledge is a formal way of understanding the world, providing human-level cognition and intelligence for next-generation artificial intelligence (AI). One representation of knowledge is the structural relations between entities. Relation Extraction (RE), a sub-task of information extraction, is an effective way to automatically acquire this important knowledge and plays a vital role in Natural Language Processing (NLP). Its purpose is to identify semantic relations between entities in natural language text. To date, numerous studies on RE have documented that techniques based on Deep Neural Networks (DNNs) have become the prevailing approach in this research area. In particular, supervised and distantly supervised methods based on DNNs are the most popular and reliable solutions for RE. This article 1) introduces some general concepts, and further 2) gives a comprehensive overview of DNNs in RE from two points of view: supervised RE, which attempts to improve standard RE systems, and distant supervision RE, which adopts DNNs to design the sentence encoder and the de-noising method. We further 3) cover some novel methods, describe recent trends, and discuss possible future research directions for this task.

We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred to by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations and do not sufficiently capture long-range correlations between the two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the input image. In addition, we propose a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image. This module controls the information flow of features at different levels. We validate the proposed approach on four evaluation datasets. Our proposed approach consistently outperforms existing state-of-the-art methods.
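A minimal sketch of the cross-modal self-attention idea, assuming flattened visual features and word embeddings already projected to a shared dimension: the two modalities are concatenated into one token sequence and processed by a single self-attention layer, so every word can attend to every image region and vice versa. This is an illustration, not the paper's exact CMSA module.

```python
# Joint self-attention over concatenated visual and linguistic tokens.
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual, words):
        # visual: (N, HW, dim) flattened image features
        # words:  (N, L, dim) linguistic features
        tokens = torch.cat([visual, words], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        # Return only the visual positions, now language-aware.
        return out[:, : visual.shape[1]]

cmsa = CrossModalSelfAttention()
fused = cmsa(torch.rand(2, 26 * 26, 256), torch.rand(2, 15, 256))
```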

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods for dense video captioning tackle this problem by building two models, i.e., an event proposal model and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents the language description from directly influencing the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposed event over the encoded features. This masking network converts the event proposal to a differentiable mask, which ensures consistency between the proposal and the caption during training. In addition, our model employs a self-attention mechanism, which enables the use of an efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on the ActivityNet Captions and YouCookII datasets, where we achieve METEOR scores of 10.12 and 6.58, respectively.
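The differentiable-mask idea can be sketched as follows, assuming an event proposal parameterized by a normalized center and length: two sigmoids approximate a rectangular window over the temporal axis, so gradients from the captioning loss can flow back into the proposal. The `sharpness` constant is an illustrative assumption.

```python
# Convert a continuous event proposal (center, length) into a soft 0-1
# mask over the temporal axis, keeping the operation differentiable.
import torch

def proposal_to_mask(center, length, num_steps, sharpness=50.0):
    # center, length: (N,) in normalized [0, 1] time; returns (N, num_steps).
    t = torch.linspace(0.0, 1.0, num_steps, device=center.device).unsqueeze(0)
    start = (center - length / 2).unsqueeze(1)
    end = (center + length / 2).unsqueeze(1)
    # Product of two sigmoids approximates a rectangular window.
    return torch.sigmoid(sharpness * (t - start)) * torch.sigmoid(sharpness * (end - t))

mask = proposal_to_mask(torch.tensor([0.5]), torch.tensor([0.2]), num_steps=100)
# The captioning decoder would attend to encoder features weighted by this mask.
```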

Recent advances in 3D fully convolutional networks (FCNs) have made it feasible to produce dense voxel-wise predictions for volumetric images. In this work, we show that a multi-class 3D FCN trained on manually labeled CT scans of several anatomical structures (ranging from large organs to thin vessels) can achieve competitive segmentation results, while avoiding the need to handcraft features or train class-specific models. To this end, we propose a two-stage, coarse-to-fine approach that first uses a 3D FCN to roughly define a candidate region, which is then used as input to a second 3D FCN. This reduces the number of voxels the second FCN has to classify to ~10% and allows it to focus on more detailed segmentation of the organs and vessels. We use training and validation sets consisting of 331 clinical CT images and test our models on a completely unseen data collection acquired at a different hospital, comprising 150 CT scans and targeting three anatomical organs (liver, spleen, and pancreas). For challenging organs such as the pancreas, our cascaded approach improves the mean Dice score from 68.5% to 82.2%, achieving the highest reported average score on this dataset. We compare with a 2D FCN method on a separate dataset of 240 CT scans with 18 classes and achieve significantly higher performance for small organs and vessels. Furthermore, we explore fine-tuning our models to different datasets. Our experiments illustrate the promise and robustness of current 3D FCN-based semantic segmentation of medical images, achieving state-of-the-art results. Our code and trained models are available for download: //github.com/holgerroth/3Dunet_abdomen_cascade.
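A minimal sketch of the coarse-to-fine cascade at inference time, assuming two generic 3D segmentation networks `coarse_fcn` and `fine_fcn`: the first pass defines a candidate bounding box, and the second pass segments only the cropped region, which corresponds to the ~10% of voxels mentioned above.

```python
# Two-stage coarse-to-fine 3D segmentation: the coarse prediction defines
# a candidate bounding box, and the fine network runs only on the crop.
import torch

def cascade_segment(volume, coarse_fcn, fine_fcn, threshold=0.5, margin=8):
    # volume: (1, 1, D, H, W) CT volume; both FCNs return (1, C, d, h, w) logits.
    coarse = torch.sigmoid(coarse_fcn(volume))
    fg = (coarse.max(dim=1, keepdim=True).values > threshold)[0, 0]
    idx = fg.nonzero()
    if idx.numel() == 0:
        return coarse                                    # nothing found; keep coarse result
    lo = (idx.min(dim=0).values - margin).clamp(min=0)
    hi = idx.max(dim=0).values + margin + 1
    crop = volume[..., lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    fine = torch.sigmoid(fine_fcn(crop))                 # detailed pass on the candidate region
    out = torch.zeros_like(coarse)
    out[..., lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = fine
    return out
```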
