
Biometrics on mobile devices has attracted a lot of attention in recent years as it is considered a user-friendly authentication method. This interest has also been motivated by the success of Deep Learning (DL). Architectures based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have proven well suited to the task, improving performance and robustness in comparison to traditional machine learning techniques. However, some aspects must still be revisited and improved. To the best of our knowledge, this is the first article to explore and propose novel gait biometric recognition systems based on Transformers, which currently achieve state-of-the-art performance in many applications. Several state-of-the-art architectures (Vanilla, Informer, Autoformer, Block-Recurrent Transformer, and THAT) are considered in the experimental framework. In addition, new configurations of the Transformers are proposed to further increase performance. Experiments are carried out using the two popular public databases whuGAIT and OU-ISIR. The results demonstrate the effectiveness of the proposed Transformer, which outperforms state-of-the-art CNN and RNN architectures.
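
As a rough illustration of the approach described above, the sketch below (PyTorch) classifies fixed-length windows of 6-axis inertial data with a small Transformer encoder. The window length, model dimensions, and subject count are assumptions for the example, not the configurations evaluated in the paper.

```python
# Minimal sketch (not the paper's exact architecture): a Transformer encoder that
# classifies fixed-length windows of 6-axis inertial data (accelerometer + gyroscope)
# into subject identities. All dimensions below are illustrative only.
import torch
import torch.nn as nn

class GaitTransformer(nn.Module):
    def __init__(self, n_channels=6, d_model=64, n_heads=4, n_layers=2, n_subjects=118):
        super().__init__()
        self.input_proj = nn.Linear(n_channels, d_model)            # per-timestep embedding
        self.pos_emb = nn.Parameter(torch.randn(1, 128, d_model))   # learned positions, 128-step window
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                                   dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        self.classifier = nn.Linear(d_model, n_subjects)

    def forward(self, x):                        # x: (batch, 128, 6)
        h = self.input_proj(x) + self.pos_emb[:, : x.size(1)]
        h = self.encoder(h)                      # self-attention over time steps
        return self.classifier(h.mean(dim=1))    # mean-pool over time, then classify

# Example: one batch of 8 windows, 128 samples each, 6 inertial channels.
logits = GaitTransformer()(torch.randn(8, 128, 6))
print(logits.shape)  # torch.Size([8, 118])
```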

Related Content

Attention mechanisms have emerged as important tools that boost the performance of deep models by allowing them to focus on key parts of learned embeddings. However, current attention mechanisms used in speaker recognition tasks fail to consider fine-grained information items such as frequency bins in the input spectral representations used by the deep networks. To address this issue, we propose the novel Fine-grained Early Frequency Attention (FEFA) for speaker recognition in-the-wild. Once integrated into a deep neural network, our proposed mechanism works by obtaining queries from early layers of the network and generating learnable weights to attend to information items as small as the frequency bins in the input spectral representations. To evaluate the performance of FEFA, we use several well-known deep models as backbone networks and integrate our attention module in their pipelines. The overall performance of these networks (with and without FEFA) is evaluated on the VoxCeleb1 dataset, where we observe considerable improvements when FEFA is used.
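
A minimal sketch of the idea behind FEFA, assuming a spectrogram input and a pooled early-layer feature vector as the query source; the module name, shapes, and the sigmoid gating are illustrative choices, not the authors' implementation.

```python
# Illustrative sketch only: attend over individual frequency bins of a spectrogram
# using a query derived from an early-layer feature map (here already pooled).
import torch
import torch.nn as nn

class FrequencyBinAttention(nn.Module):
    def __init__(self, n_freq_bins=64, query_dim=32):
        super().__init__()
        self.score = nn.Linear(query_dim, n_freq_bins)   # one attention weight per frequency bin

    def forward(self, spec, early_feat):
        # spec: (batch, n_freq_bins, n_frames); early_feat: (batch, query_dim)
        weights = torch.sigmoid(self.score(early_feat))  # (batch, n_freq_bins)
        return spec * weights.unsqueeze(-1)              # re-weight each frequency bin

spec = torch.randn(4, 64, 300)       # e.g. 64 mel bins, 300 frames (assumed sizes)
early_feat = torch.randn(4, 32)      # pooled early-layer features (assumed)
print(FrequencyBinAttention()(spec, early_feat).shape)  # torch.Size([4, 64, 300])
```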

Numerous studies in the literature have already shown the potential of biometrics on mobile devices for authentication purposes. However, it has been shown that the learning processes associated with biometric systems might expose sensitive personal information about the subjects. This study proposes GaitPrivacyON, a novel mobile gait biometrics verification approach that provides accurate authentication results while preserving the sensitive information of the subject. It comprises two modules: i) a convolutional Autoencoder that transforms attributes of the biometric raw data, such as the gender or the activity being performed, into a new privacy-preserving representation; and ii) a mobile gait verification system based on the combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with a Siamese architecture. The main advantage of GaitPrivacyON is that the first module (convolutional Autoencoder) is trained in an unsupervised way, without specifying the sensitive attributes of the subject to protect. The experimental results achieved using two popular databases (MotionSense and MobiAct) suggest the potential of GaitPrivacyON to significantly improve the privacy of the subject while keeping user authentication performance above 99% Area Under the Curve (AUC). To the best of our knowledge, this is the first mobile gait verification approach that considers privacy-preserving methods trained in an unsupervised way.
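
The following is a hedged sketch of the kind of 1-D convolutional Autoencoder the first module describes, trained unsupervised with a reconstruction loss on raw inertial sequences; layer sizes are illustrative and not the GaitPrivacyON configuration.

```python
# Minimal unsupervised convolutional Autoencoder over raw inertial sequences.
# Channel counts, kernel sizes, and window length are assumptions for this sketch.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, n_channels=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, kernel_size=5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, n_channels, kernel_size=5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x):                      # x: (batch, channels, time)
        return self.decoder(self.encoder(x))   # encoder output acts as the transformed representation

x = torch.randn(8, 6, 128)
recon = ConvAutoencoder()(x)
loss = nn.functional.mse_loss(recon, x)        # unsupervised reconstruction objective
print(recon.shape, loss.item())
```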

In the last five years, the rise of self-attention-based Transformer architectures has led to state-of-the-art performance on many natural language tasks. Although these approaches are increasingly popular, they require large amounts of data and computational resources. There is still a substantial need for benchmarking methodologies on under-resourced languages in data-scarce application conditions. Most pre-trained language models have been massively studied using the English language, and only a few of them have been evaluated on French. In this paper, we propose a unified benchmark, focused on evaluating model quality and ecological impact on two well-known French spoken language understanding tasks. In particular, we benchmark thirteen well-established Transformer-based models on the two available spoken language understanding tasks for French: MEDIA and ATIS-FR. Within this framework, we show that compact models can reach results comparable to bigger ones while their ecological impact is considerably lower. However, this conclusion is nuanced and depends on the considered compression method.

Understanding the temporal dynamics of video is an essential aspect of learning better video representations. Recently, transformer-based architectural designs have been extensively explored for video tasks due to their capability to capture long-term dependencies in input sequences. However, we found that these Video Transformers are still biased toward learning spatial dynamics rather than temporal ones, and debiasing this spurious correlation is critical to their performance. Based on these observations, we design simple yet effective self-supervised tasks for video models to learn temporal dynamics better. Specifically, to counteract the spatial bias, our method learns the temporal order of video frames as extra self-supervision and enforces randomly shuffled frames to have low-confidence outputs. Also, our method learns the temporal flow direction of video tokens among consecutive frames to strengthen the correlation with temporal dynamics. Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
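
One of the self-supervised signals described above can be sketched as a loss that pushes the model toward near-uniform (low-confidence) predictions on temporally shuffled clips. The function below is an illustration under stated assumptions; `model`, the input shapes, and the KL formulation are placeholders, not the authors' code.

```python
# Hedged sketch: encourage maximum-entropy outputs on clips whose frame order
# has been destroyed, so confident predictions must rely on temporal structure.
import torch
import torch.nn.functional as F

def shuffled_frame_debias_loss(model, clip):
    # clip: (batch, time, channels, height, width) -- assumed layout
    perm = torch.randperm(clip.size(1))
    shuffled = clip[:, perm]                      # destroy temporal order
    logits = model(shuffled)                      # (batch, n_classes)
    log_probs = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    # KL(uniform || model) is minimized when the model's output is near-uniform,
    # i.e. the model is not confident about shuffled (temporally broken) clips.
    return F.kl_div(log_probs, uniform, reduction="batchmean")
```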

The ever-increasing computational complexity of deep learning models makes their training and deployment difficult on various cloud and edge platforms. Replacing floating-point arithmetic with low-bit integer arithmetic is a promising approach to save energy, memory footprint, and latency of deep learning models. As such, quantization has attracted the attention of researchers in recent years. However, using integer numbers to form a fully functional integer training pipeline, including the forward pass, back-propagation, and stochastic gradient descent, has not been studied in detail. Our empirical and mathematical results reveal that integer arithmetic is sufficient to train deep learning models. Unlike recent proposals, instead of quantization, we directly switch the number representation of computations. Our novel training method forms a fully integer training pipeline that does not change the trajectory of the loss and accuracy compared to floating-point, nor does it need any special hyper-parameter tuning, distribution adjustment, or gradient clipping. Our experimental results show that our proposed method is effective in a wide variety of tasks such as classification (including vision transformers), object detection, and semantic segmentation.
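
To make the idea of computing with integers concrete, the toy sketch below performs a linear layer's forward pass in fixed-point integer arithmetic. The scale factor and rounding scheme are assumptions for the example, and this is only a small illustration of integer arithmetic, not the paper's full integer training pipeline (which also covers back-propagation and SGD).

```python
# Toy illustration only: represent activations and weights as scaled integers and
# perform the forward multiply-accumulate entirely in integer arithmetic.
import torch

SCALE = 2 ** 8   # fixed-point scale factor (assumption for this sketch)

def to_fixed_point(x):
    return torch.round(x * SCALE).to(torch.int32)

def fixed_point_linear(x_int, w_int):
    # integer multiply-accumulate (broadcasted product + sum in int64),
    # then rescale by SCALE to stay in the same fixed-point format
    acc = (x_int.unsqueeze(1).to(torch.int64) * w_int.unsqueeze(0).to(torch.int64)).sum(-1)
    return acc // SCALE

x = torch.randn(4, 16)
w = torch.randn(8, 16)
y_int = fixed_point_linear(to_fixed_point(x), to_fixed_point(w))
# Compare against the floating-point result; the difference is small quantization error.
print((y_int.float() / SCALE - x @ w.t()).abs().max())
```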

Human activity recognition (HAR) is a machine learning task with applications in many domains including health care, but it has proven a challenging research problem. In health care, it is used mainly as an assistive technology for elder care, often together with other related technologies such as the Internet of Things (IoT), because HAR can be achieved with the help of IoT devices such as smartphones, wearables, and environmental and on-body sensors. Deep neural network techniques like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used for HAR, both in centralized and federated settings. However, these techniques have certain limitations: RNNs cannot be easily parallelized, CNNs are limited by sequence length, and both are computationally expensive. Moreover, the centralized approach raises privacy concerns in sensitive applications such as healthcare. In this paper, to address some of the existing challenges facing HAR, we present a novel one-patch transformer based on inertial sensors that can combine the advantages of RNNs and CNNs without their major limitations. We designed a testbed to collect real-time human activity data and used the data to train and test the proposed transformer-based HAR classifier. We also propose TransFed: a federated learning-based HAR classifier using the proposed transformer to address privacy concerns. The experimental results show that the proposed solution outperformed state-of-the-art HAR classifiers based on CNNs and RNNs, in both federated and centralized settings. Moreover, the proposed HAR classifier is computationally inexpensive, as it uses far fewer parameters than existing CNN/RNN-based classifiers.
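
The federated side of a setup like TransFed relies on a server aggregating locally trained classifiers; a plain FedAvg-style parameter average, sketched below with placeholder client models, conveys the idea (this is not the authors' aggregation code).

```python
# Simple federated averaging step: each client trains the HAR classifier locally,
# the server averages the corresponding parameters into a new global model.
import copy
import torch

def federated_average(client_state_dicts):
    """Average corresponding parameters from several clients (plain FedAvg)."""
    avg = copy.deepcopy(client_state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in client_state_dicts]).mean(dim=0)
    return avg

# Usage sketch (client_models and global_model are placeholders):
# global_model.load_state_dict(federated_average([c.state_dict() for c in client_models]))
```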

Behavioural biometrics have proven effective against identity theft and are considered user-friendly authentication methods. One of the most popular traits in the literature is keystroke dynamics, due to the large deployment of computers and mobile devices in our society. This paper focuses on improving keystroke biometric systems in the free-text scenario. This scenario is very challenging due to the uncontrolled text conditions, the influence of the user's emotional and physical state, and the application in use. To overcome these drawbacks, methods based on deep learning such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been proposed in the literature, outperforming traditional machine learning methods. However, these architectures still have aspects that need to be reviewed and improved. To the best of our knowledge, this is the first study that proposes keystroke biometric systems based on Transformers. The proposed Transformer architecture has achieved Equal Error Rate (EER) values of 3.84% on the popular Aalto mobile keystroke database using only 5 enrolment sessions, outperforming other state-of-the-art approaches in the literature by a large margin.
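
For context on the reported metric, the snippet below shows one common way to compute the Equal Error Rate (EER) from genuine and impostor verification scores; the scores here are synthetic placeholders, and this is not the paper's evaluation code.

```python
# EER: the operating point where the false positive rate equals the false negative rate.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)     # labels: 1 = genuine, 0 = impostor
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))       # threshold where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2.0

labels = np.array([1] * 100 + [0] * 100)
scores = np.concatenate([np.random.normal(1.0, 0.5, 100),    # genuine comparison scores
                         np.random.normal(-1.0, 0.5, 100)])  # impostor comparison scores
print(f"EER ~= {equal_error_rate(labels, scores):.2%}")
```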

Most action recognition models today are highly parameterized and evaluated on datasets with predominantly spatially distinct classes. It has also been shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape in still image recognition tasks. Taken together, this raises the suspicion that large video models partly learn spurious correlations rather than tracking relevant shapes over time to infer generalizable semantics from their movement. A natural way to avoid parameter explosion when learning visual patterns over time is to make use of recurrence. In this article, we empirically study whether the choice of low-level temporal modeling has consequences for texture bias and cross-domain robustness. In order to enable a light-weight and systematic assessment of the ability to capture temporal structure not revealed from single frames, we provide the Temporal Shape (TS) dataset, as well as modified domains of Diving48 allowing for the investigation of texture bias in video models. We find that across a variety of model sizes, convolutional-recurrent and attention-based models show better out-of-domain robustness on TS than 3D CNNs. In domain shift experiments on Diving48, our experiments indicate that 3D CNNs and attention-based models exhibit more texture bias than convolutional-recurrent models. Moreover, qualitative examples suggest that convolutional-recurrent models learn more correct class attributes from the diving data than the other two types of models at the same global validation performance.

Creating agents that can both appropriately respond to conversations and understand complex human linguistic tendencies and social cues has been a long-standing challenge in the NLP community. A recent pillar of research revolves around emotion recognition in conversation (ERC), a sub-field of emotion recognition that focuses on conversations or dialogues that contain two or more utterances. In this work, we explore an approach to ERC that exploits neural embeddings along with complex structures in dialogues. We implement our approach in a framework called Probabilistic Soft Logic (PSL), a declarative templating language that uses first-order-like logical rules that, when combined with data, define a particular class of graphical model. Additionally, PSL provides functionality for incorporating results from neural models into PSL models. This allows our model to take advantage of advanced neural methods, such as sentence embeddings, and logical reasoning over the structure of a dialogue. We compare our method with state-of-the-art purely neural ERC systems and see almost a 20% improvement. With these results, we provide an extensive qualitative and quantitative analysis over the DailyDialog conversation dataset.
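
To give a flavour of how neural predictions and dialogue structure can be combined, the strings below show PSL-style rules with invented predicate names; they are illustrative only and do not reproduce the rules used in the paper.

```python
# Illustrative PSL-style rules (written as strings) that mix a neural emotion
# classifier's output with dialogue structure. Predicate names are hypothetical.
RULES = [
    # trust the neural model's per-utterance prediction as a soft prior
    "5.0: NeuralEmotion(Utt, Emo) -> Emotion(Utt, Emo) ^2",
    # consecutive utterances by the same speaker tend to keep the same emotion
    "2.0: Precedes(U1, U2) & SameSpeaker(U1, U2) & Emotion(U1, Emo) -> Emotion(U2, Emo) ^2",
    # each utterance takes exactly one emotion label (hard constraint)
    "Emotion(Utt, +Emo) = 1 .",
]
```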

Detection and recognition of text in natural images are two main problems in the field of computer vision, with a wide variety of applications such as analysis of sports videos, autonomous driving, and industrial automation, to name a few. They share challenging problems arising from how text is represented and how it is affected by environmental conditions. The current state-of-the-art scene text detection and/or recognition methods have exploited recent advances in deep learning architectures and reported superior accuracy on benchmark datasets when tackling multi-resolution and multi-oriented text. However, several challenges remain for text in the wild that cause existing methods to underperform, because their models fail to generalize to unseen data and because labeled data is insufficient. Thus, unlike previous surveys in this field, the objectives of this survey are as follows: first, to offer the reader not only a review of recent advances in scene text detection and recognition, but also the results of extensive experiments using a unified evaluation framework that assesses pre-trained models of the selected methods on challenging cases and applies the same evaluation criteria to these techniques; second, to identify several existing challenges for detecting or recognizing text in wild images, namely in-plane rotation, multi-oriented and multi-resolution text, perspective distortion, illumination reflection, partial occlusion, complex fonts, and special characters; finally, to provide insight into potential research directions that address some of the challenges still facing scene text detection and recognition techniques.
