亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Decreasing projection views to lower X-ray radiation dose usually leads to severe streak artifacts. To improve image quality from sparse-view data, a Multi-domain Integrative Swin Transformer network (MIST-net) was developed in this article. First, MIST-net incorporated lavish domain features from data, residual-data, image, and residual-image using flexible network architectures, where residual-data and residual-image sub-network was considered as data consistency module to eliminate interpolation and reconstruction errors. Second, a trainable edge enhancement filter was incorporated to detect and protect image edges. Third, a high-quality reconstruction Swin transformer (i.e., Recformer) was designed to capture image global features. The experiment results on numerical and real cardiac clinical datasets with 48-views demonstrated that our proposed MIST-net provided better image quality with more small features and sharp edges than other competitors.

相關內容

We propose methods to train convolutional neural networks (CNNs) with both binarized weights and activations, leading to quantized models that are specifically friendly to mobile devices with limited power capacity and computation resources. Previous works on quantizing CNNs often seek to approximate the floating-point information using a set of discrete values, which we call value approximation, typically assuming the same architecture as the full-precision networks. Here we take a novel "structure approximation" view of quantization -- it is very likely that different architectures designed for low-bit networks may be better for achieving good performance. In particular, we propose a "network decomposition" strategy, termed Group-Net, in which we divide the network into groups. Thus, each full-precision group can be effectively reconstructed by aggregating a set of homogeneous binary branches. In addition, we learn effective connections among groups to improve the representation capability. Moreover, the proposed Group-Net shows strong generalization to other tasks. For instance, we extend Group-Net for accurate semantic segmentation by embedding rich context into the binary structure. Furthermore, for the first time, we apply binary neural networks to object detection. Experiments on both classification, semantic segmentation and object detection tasks demonstrate the superior performance of the proposed methods over various quantized networks in the literature. Our methods outperform the previous best binary neural networks in terms of accuracy and computation efficiency.

Accurate brain tumor segmentation from Magnetic Resonance Imaging (MRI) is desirable to joint learning of multimodal images. However, in clinical practice, it is not always possible to acquire a complete set of MRIs, and the problem of missing modalities causes severe performance degradation in existing multimodal segmentation methods. In this work, we present the first attempt to exploit the Transformer for multimodal brain tumor segmentation that is robust to any combinatorial subset of available modalities. Concretely, we propose a novel multimodal Medical Transformer (mmFormer) for incomplete multimodal learning with three main components: the hybrid modality-specific encoders that bridge a convolutional encoder and an intra-modal Transformer for both local and global context modeling within each modality; an inter-modal Transformer to build and align the long-range correlations across modalities for modality-invariant features with global semantics corresponding to tumor region; a decoder that performs a progressive up-sampling and fusion with the modality-invariant features to generate robust segmentation. Besides, auxiliary regularizers are introduced in both encoder and decoder to further enhance the model's robustness to incomplete modalities. We conduct extensive experiments on the public BraTS $2018$ dataset for brain tumor segmentation. The results demonstrate that the proposed mmFormer outperforms the state-of-the-art methods for incomplete multimodal brain tumor segmentation on almost all subsets of incomplete modalities, especially by an average 19.07% improvement of Dice on tumor segmentation with only one available modality. The code is available at //github.com/YaoZhang93/mmFormer.

As manipulating images by copy-move, splicing and/or inpainting may lead to misinterpretation of the visual content, detecting these sorts of manipulations is crucial for media forensics. Given the variety of possible attacks on the content, devising a generic method is nontrivial. Current deep learning based methods are promising when training and test data are well aligned, but perform poorly on independent tests. Moreover, due to the absence of authentic test images, their image-level detection specificity is in doubt. The key question is how to design and train a deep neural network capable of learning generalizable features sensitive to manipulations in novel data, whilst specific to prevent false alarms on the authentic. We propose multi-view feature learning to jointly exploit tampering boundary artifacts and the noise view of the input image. As both clues are meant to be semantic-agnostic, the learned features are thus generalizable. For effectively learning from authentic images, we train with multi-scale (pixel / edge / image) supervision. We term the new network MVSS-Net and its enhanced version MVSS-Net++. Experiments are conducted in both within-dataset and cross-dataset scenarios, showing that MVSS-Net++ performs the best, and exhibits better robustness against JPEG compression, Gaussian blur and screenshot based image re-capturing.

Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on using text information in a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to extract latent features from the audio and visual modality. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, the existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on well-known MELD and IEMOCAP datasets and sets a new state-of-the-art performance in ERC.

Biometrics on mobile devices has attracted a lot of attention in recent years as it is considered a user-friendly authentication method. This interest has also been motivated by the success of Deep Learning (DL). Architectures based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been established to be convenient for the task, improving the performance and robustness in comparison to traditional machine learning techniques. However, some aspects must still be revisited and improved. To the best of our knowledge, this is the first article that intends to explore and propose novel gait biometric recognition systems based on Transformers, which currently obtain state-of-the-art performance in many applications. Several state-of-the-art architectures (Vanilla, Informer, Autoformer, Block-Recurrent Transformer, and THAT) are considered in the experimental framework. In addition, new configurations of the Transformers are proposed to further increase the performance. Experiments are carried out using the two popular public databases whuGAIT and OU-ISIR. The results achieved prove the high ability of the proposed Transformer, outperforming state-of-the-art CNN and RNN architectures.

It is time-consuming and expensive to take high-quality or high-resolution electron microscopy (EM) and fluorescence microscopy (FM) images. Taking these images could be even invasive to samples and may damage certain subtleties in the samples after long or intense exposures, often necessary for achieving high-quality or high resolution in the first place. Advances in deep learning enable us to perform image-to-image transformation tasks for various types of microscopy image reconstruction, computationally producing high-quality images from the physically acquired low-quality ones. When training image-to-image transformation models on pairs of experimentally acquired microscopy images, prior models suffer from performance loss due to their inability to capture inter-image dependencies and common features shared among images. Existing methods that take advantage of shared features in image classification tasks cannot be properly applied to image reconstruction tasks because they fail to preserve the equivariance property under spatial permutations, something essential in image-to-image transformation. To address these limitations, we propose the augmented equivariant attention networks (AEANets) with better capability to capture inter-image dependencies, while preserving the equivariance property. The proposed AEANets captures inter-image dependencies and shared features via two augmentations on the attention mechanism, which are the shared references and the batch-aware attention during training. We theoretically derive the equivariance property of the proposed augmented attention model and experimentally demonstrate its consistent superiority in both quantitative and visual results over the baseline methods.

In this paper, we propose a novel Feature Decomposition and Reconstruction Learning (FDRL) method for effective facial expression recognition. We view the expression information as the combination of the shared information (expression similarities) across different expressions and the unique information (expression-specific variations) for each expression. More specifically, FDRL mainly consists of two crucial networks: a Feature Decomposition Network (FDN) and a Feature Reconstruction Network (FRN). In particular, FDN first decomposes the basic features extracted from a backbone network into a set of facial action-aware latent features to model expression similarities. Then, FRN captures the intra-feature and inter-feature relationships for latent features to characterize expression-specific variations, and reconstructs the expression feature. To this end, two modules including an intra-feature relation modeling module and an inter-feature relation modeling module are developed in FRN. Experimental results on both the in-the-lab databases (including CK+, MMI, and Oulu-CASIA) and the in-the-wild databases (including RAF-DB and SFEW) show that the proposed FDRL method consistently achieves higher recognition accuracy than several state-of-the-art methods. This clearly highlights the benefit of feature decomposition and reconstruction for classifying expressions.

Conventionally, spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity meanwhile efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones and model complexity is well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and it can specialize to be existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.

Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and obviously facilitates the performances on two VQA tasks that require reading scene texts.

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report 79.0% accuracy on the Kinetics dataset without using any pre-training, largely surpassing the previous best results of this kind. On AVA action detection we achieve a new state-of-the-art of 28.3 mAP. Code will be made publicly available.

北京阿比特科技有限公司