国产欧美日韩视频一区二区_黄色片视频免费观看国产_日韩欧中文字幕精品电影_免费人妻AV无码专区无码专区_在线观看亚洲精品自拍_老熟妇乱子伦A级视频_国产午夜影视二区三区四区

Neural autoregressive sequence models smear the probability among many possible sequences including degenerate ones, such as empty or repetitive sequences. In this work, we tackle one specific case where the model assigns a high probability to unreasonably short sequences. We define the oversmoothing rate to quantify this issue. After confirming the high degree of oversmoothing in neural machine translation, we propose to explicitly minimize the oversmoothing rate during training. We conduct a set of experiments to study the effect of the proposed regularization on both model distribution and decoding performance. We use a neural machine translation task as the testbed and consider three different datasets of varying size. Our experiments reveal three major findings. First, we can control the oversmoothing rate of the model by tuning the strength of the regularization. Second, by enhancing the oversmoothing loss contribution, the probability and the rank of <eos> token decrease heavily at positions where it is not supposed to be. Third, the proposed regularization impacts the outcome of beam search especially when a large beam is used. The degradation of translation quality (measured in BLEU) with a large beam significantly lessens with lower oversmoothing rate, but the degradation compared to smaller beam sizes remains to exist. From these observations, we conclude that the high degree of oversmoothing is the main reason behind the degenerate case of overly probable short sequences in a neural autoregressive model.

相關內容

MoDELS

關注 43

ACM/IEEE第23屆模型驅動工程語言和系統國際會議，是模型驅動軟件和系統工程的首要會議系列，由ACM-SIGSOFT和IEEE-TCSE支持組織。自1998年以來，模型涵蓋了建模的各個方面，從語言和方法到工具和應用程序。模特的參加者來自不同的背景，包括研究人員、學者、工程師和工業專業人士。MODELS 2019是一個論壇，參與者可以圍繞建模和模型驅動的軟件和系統交流前沿研究成果和創新實踐經驗。今年的版本將為建模社區提供進一步推進建模基礎的機會，并在網絡物理系統、嵌入式系統、社會技術系統、云計算、大數據、機器學習、安全、開源等新興領域提出建模的創新應用以及可持續性。官網鏈接： · 模型評估 · 深度前饋網絡 · 樸素貝葉斯分類器 · 學成 ·

2022 年 2 月 18 日

Multi-Objective Optimisation of Multi-Output Neural Trees

Varun Ojha,Giuseppe Nicosia

from arxiv, 19-24 July 2020

We propose an algorithm and a new method to tackle the classification problems. We propose a multi-output neural tree (MONT) algorithm, which is an evolutionary learning algorithm trained by the non-dominated sorting genetic algorithm (NSGA)-III. Since evolutionary learning is stochastic, a hypothesis found in the form of MONT is unique for each run of evolutionary learning, i.e., each hypothesis (tree) generated bears distinct properties compared to any other hypothesis both in topological space and parameter-space. This leads to a challenging optimisation problem where the aim is to minimise the tree-size and maximise the classification accuracy. Therefore, the Pareto-optimality concerns were met by hypervolume indicator analysis. We used nine benchmark classification learning problems to evaluate the performance of the MONT. As a result of our experiments, we obtained MONTs which are able to tackle the classification problems with high accuracy. The performance of MONT emerged better over a set of problems tackled in this study compared with a set of well-known classifiers: multilayer perceptron, reduced-error pruning tree, naive Bayes classifier, decision tree, and support vector machine. Moreover, the performances of three versions of MONT's training using genetic programming, NSGA-II, and NSGA-III suggest that the NSGA-III gives the best Pareto-optimal solution.

分離的 · Obvious · 正則化項 · 確切的 · FAST ·

2022 年 2 月 17 日

Chunk List: Concurrent Data Structures

Daniel Szelogowski

from arxiv, 20 pages, 3 figures A full implementation can be found at //github.com/danielathome19/Chunk-List Update: Revised format to align closer to IEEE standards

Chunking data is obviously no new concept; however, I had never found any data structures that used chunking as the basis of their implementation. I figured that by using chunking alongside concurrency, I could create an extremely fast run-time in regards to particular methods as searching and/or sorting. By using chunking and concurrency to my advantage, I came up with the chunk list - a dynamic list-based data structure that would separate large amounts of data into specifically sized chunks, each of which should be able to be searched at the exact same time by searching each chunk on a separate thread. As a result of implementing this concept into its own class, I was able to create something that almost consistently gives around 20x-300x faster results than a regular ArrayList. However, should speed be a particular issue even after implementation, users can modify the size of the chunks and benchmark the speed of using smaller or larger chunks, depending on the amount of data being stored.

層規范化 · 規范化的 · 圖 · 層 · Extensibility ·

2022 年 2 月 17 日

Revisiting Over-smoothing in BERT from the Perspective of Graph

Han Shi,Jiahui Gao,Hang Xu,Xiaodan Liang,Zhenguo Li,Lingpeng Kong,Stephen M. S. Lee,James T. Kwok

from arxiv, Accepted by ICLR 2022 (Spotlight)

Recently over-smoothing phenomenon of Transformer-based models is observed in both vision and language fields. However, no existing work has delved deeper to further investigate the main cause of this phenomenon. In this work, we make the attempt to analyze the over-smoothing problem from the perspective of graph, where such problem was first discovered and explored. Intuitively, the self-attention matrix can be seen as a normalized adjacent matrix of a corresponding graph. Based on the above connection, we provide some theoretical analysis and find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models. Specifically, if the standard deviation of layer normalization is sufficiently large, the output of Transformer stacks will converge to a specific low-rank subspace and result in over-smoothing. To alleviate the over-smoothing problem, we consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse. Extensive experiment results on various data sets illustrate the effect of our fusion method.

語言模型化 · 規范化的 · MoDELS · 錯誤率 · 可約的 ·

2022 年 2 月 16 日

Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Hao Zhang,You-Chi Cheng,Shankar Kumar,W. Ronny Huang,Mingqing Chen,Rajiv Mathews

from arxiv, arXiv admin note: substantial text overlap with arXiv:2108.11943

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.

可約的 · Machine Translation · MoDELS · NMT · Extensibility ·

2018 年 5 月 29 日

Bi-Directional Neural Machine Translation with Synthetic Parallel Data

Xing Niu,Michael Denkowski,Marine Carpuat

from arxiv, Accepted at the 2nd Workshop on Neural Machine Translation and Generation (WNMT 2018)

Despite impressive progress in high-resource settings, Neural Machine Translation (NMT) still struggles in low-resource and out-of-domain scenarios, often failing to match the quality of phrase-based translation. We propose a novel technique that combines back-translation and multilingual NMT to improve performance in these difficult cases. Our technique trains a single model for both directions of a language pair, allowing us to back-translate source or target monolingual data without requiring an auxiliary model. We then continue training on the augmented parallel data, enabling a cycle of improvement for a single model that can incorporate any source, target, or parallel data to improve both translation directions. As a byproduct, these models can reduce training and deployment costs significantly compared to uni-directional models. Extensive experiments show that our technique outperforms standard back-translation in low-resource scenarios, improves quality on cross-domain tasks, and effectively reduces costs across the board.

平滑 · 語言模型化 · 極大似然 · 似然 · 真實值 ·

2018 年 5 月 14 日

Token-level and sequence-level loss smoothing for RNN language models

Maha Elbayad,Laurent Besacier,Jakob Verbeek

from arxiv, Accepted by ACL 2018

Despite the effectiveness of recurrent neural network language models, their maximum likelihood estimation suffers from two limitations. It treats all sentences that do not match the ground truth as equally poor, ignoring the structure of the output space. Second, it suffers from "exposure bias": during training tokens are predicted given ground-truth sequences, while at test time prediction is conditioned on generated output sequences. To overcome these limitations we build upon the recent reward augmented maximum likelihood approach \ie sequence-level smoothing that encourages the model to predict sentences close to the ground truth according to a given performance metric. We extend this approach to token-level loss smoothing, and propose improvements to the sequence-level smoothing approach. Our experiments on two different tasks, image captioning and machine translation, show that token-level and sequence-level loss smoothing are complementary, and significantly improve results.

詞向量表示 · Machine Translation · MoDELS · 注意力機制 · INFORMS ·

2018 年 5 月 10 日

Attention Focusing for Neural Machine Translation by Bridging Source and Target Embeddings

Shaohui Kuang,Junhui Li,António Branco,Weihua Luo,Deyi Xiong

from arxiv, 9 pages, 6 figures. Accepted by ACL2018

In neural machine translation, a source sequence of words is encoded into a vector from which a target sequence is generated in the decoding phase. Differently from statistical machine translation, the associations between source words and their possible target counterparts are not explicitly stored. Source and target words are at the two ends of a long information processing procedure, mediated by hidden states at both the source encoding and the target decoding phases. This makes it possible that a source word is incorrectly translated into a target word that is not any of its admissible equivalent counterparts in the target language. In this paper, we seek to somewhat shorten the distance between source and target words in that procedure, and thus strengthen their association, by means of a method we term bridging source and target word embeddings. We experiment with three strategies: (1) a source-side bridging model, where source word embeddings are moved one step closer to the output target sequence; (2) a target-side bridging model, which explores the more relevant source word embeddings for the prediction of the target sequence; and (3) a direct bridging model, which directly connects source and target word embeddings seeking to minimize errors in the translation of ones by the others. Experiments and analysis presented in this paper demonstrate that the proposed bridging models are able to significantly improve quality of both sentence translation, in general, and alignment and translation of individual source words with target words, in particular.

INFORMS · Machine Translation · 注意力機制 · 解碼 · Networking ·

2018 年 3 月 22 日

Self-Attentive Residual Decoder for Neural Machine Translation

Lesly Miculicich Werlen,Nikolaos Pappas,Dhananjay Ram,Andrei Popescu-Belis

from arxiv, Accepted on NAACL-HLT 2018

Neural sequence-to-sequence networks with attention have achieved remarkable performance for machine translation. One of the reasons for their effectiveness is their ability to capture relevant source-side contextual information at each time-step prediction through an attention mechanism. However, the target-side context is solely based on the sequence model which, in practice, is prone to a recency bias and lacks the ability to capture effectively non-sequential dependencies among words. To address this limitation, we propose a target-side-attentive residual recurrent network for decoding, where attention over previous words contributes directly to the prediction of the next word. The residual learning facilitates the flow of information from the distant past and is able to emphasize any of the previously translated words, hence it gains access to a wider context. The proposed model outperforms a neural MT baseline as well as a memory and self-attention network on three language pairs. The analysis of the attention learned by the decoder confirms that it emphasizes a wider context, and that it captures syntactic-like structures.

Machine Translation · MoDELS · TOOLS · 假設空間 · Performer ·

2018 年 2 月 28 日

Analyzing Uncertainty in Neural Machine Translation

Myle Ott,Michael Auli,David Granger,Marc'Aurelio Ranzato

Machine translation is a popular test bed for research in neural sequence-to-sequence models but despite much recent research, there is still a lack of understanding of these models. Practitioners report performance degradation with large beams, the under-estimation of rare words and a lack of diversity in the final translations. Our study relates some of these issues to the inherent uncertainty of the task, due to the existence of multiple valid translations for a single source sentence, and to the extrinsic uncertainty caused by noisy training data. We propose tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations. Our results show that search works remarkably well but that the models tend to spread too much probability mass over the hypothesis space. Next, we propose tools to assess model calibration and show how to easily fix some shortcomings of current models. We release both code and multiple human reference translations for two popular benchmarks.

測試誤差 · 極大似然估計 · Machine Translation · 最大似然估計 · 子采樣 ·

2018 年 1 月 29 日

SEARNN: Training RNNs with Global-Local Losses

Rémi Leblond,Jean-Baptiste Alayrac,Anton Osokin,Simon Lacoste-Julien

from arxiv, Accepted at ICLR 2018, 17 pages

We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the "learning to search" (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an appropriate surrogate for the test error: by only maximizing the ground truth probability, it fails to exploit the wealth of information offered by structured losses. Further, it introduces discrepancies between training and predicting (such as exposure bias) that may hurt test performance. Instead, SEARNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error. We first demonstrate improved performance over MLE on two different tasks: OCR and spelling correction. Then, we propose a subsampling strategy to enable SEARNN to scale to large vocabulary sizes. This allows us to validate the benefits of our approach on a machine translation task.