亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

<tr id='0M6PE'><strong id='a6GBL'></strong><small id='B7Zup'></small><button id='kJTea'></button><li id='2s2QT'><noscript id='gA2wT'><big id='NNGqn'></big><dt id='EmVkl'></dt></noscript></li></tr><ol id='aEFWY'><option id='NmPlZ'><table id='7LoHo'><blockquote id='QUo6x'><tbody id='JdkS9'></tbody></blockquote></table></option></ol><u id='XAIgV'></u><kbd id='8XJ4a'><kbd id='wXf3f'></kbd></kbd>

<code id='FVoqu'><strong id='LHPPW'></strong></code>

<fieldset id='e18X4'></fieldset>

<span id='WyZ9t'></span>

<ins id='jBg1D'></ins>

<acronym id='171Tb'><em id='4Wrn0'></em><td id='zOgDh'><div id='7Igc5'></div></td></acronym><address id='ZMHC0'><big id='pNgnM'><big id='u27wy'></big><legend id='Xdg88'></legend></big></address>

<i id='M4pXN'><div id='Rlwfh'><ins id='JsJ0M'></ins></div></i>

<i id='BfsY1'></i>

·

entity · 命名實體識別 · MoDELS · Performer · NLP ·

2022 年 1 月 18 日

Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis

Hang Jiang,Yining Hua,Doug Beeferman,Deb Roy

Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. While there are some publicly available annotated datasets of tweets, they are all purpose-built for solving one task at a time. As yet there is no complete training corpus for both syntactic analysis (e.g., part of speech tagging, dependency parsing) and NER of tweets. In this study, we aim to create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and we use these datasets to train state-of-the-art NLP models. We first annotate named entities in TB2 using Amazon Mechanical Turk and measure the quality of our annotations. We train a Stanza NER model on the new benchmark, achieving competitive performance against other non-transformer NER systems. Finally, we train other Twitter NLP models (a tokenizer, lemmatizer, part of speech tagger, and dependency parser) on TB2 based on Stanza, and achieve state-of-the-art or competitive performance on these tasks. We release the dataset and make the models available to use in an "off-the-shelf" manner for future Tweet NLP research. Our source code, data, and pre-trained models are available at: \url{//github.com/social-machines/TweebankNLP}.

相關內容

entity

語音合成 · 講稿 · 話題 · MoDELS · 穩健性 ·

2022 年 4 月 20 日

KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics

Saida Mussakhojayeva,Yerbolat Khassanov,Huseyin Atakan Varol

from arxiv, 8 pages, 2 figures, 5 tables, accepted to LREC 2022

We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In the new KazakhTTS2 corpus, the overall size has increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three females and two males), and the topic coverage has been diversified with the help of new sources, including a book and Wikipedia articles. This corpus is necessary for building high-quality TTS systems for Kazakh, a Central Asian agglutinative language from the Turkic family, which presents several linguistic challenges. We describe the corpus construction process and provide the details of the training and evaluation procedures for the TTS system. Our experimental results indicate that the constructed corpus is sufficient to build robust TTS models for real-world applications, with a subjective mean opinion score ranging from 3.6 to 4.2 for all the five speakers. We believe that our corpus will facilitate speech and language research for Kazakh and other Turkic languages, which are widely considered to be low-resource due to the limited availability of free linguistic data. The constructed corpus, code, and pretrained models are publicly available in our GitHub repository.

MoDELS · 語言模型化 · 可約的 · 講稿 · 可理解性 ·

2022 年 4 月 19 日

ALBETO and DistilBETO: Lightweight Spanish Language Models

José Ca?ete,Sebastián Donoso,Felipe Bravo-Marquez,Andrés Carvallo,Vladimir Araujo

from arxiv, Accepted paper at LREC2022

In recent years there have been considerable advances in pre-trained language models, where non-English language versions have also been made available. Due to their increasing use, many lightweight versions of these models (with reduced parameters) have also been released to speed up training and inference times. However, versions of these lighter models (e.g., ALBERT, DistilBERT) for languages other than English are still scarce. In this paper we present ALBETO and DistilBETO, which are versions of ALBERT and DistilBERT pre-trained exclusively on Spanish corpora. We train several versions of ALBETO ranging from 5M to 223M parameters and one of DistilBETO with 67M parameters. We evaluate our models in the GLUES benchmark that includes various natural language understanding tasks in Spanish. The results show that our lightweight models achieve competitive results to those of BETO (Spanish-BERT) despite having fewer parameters. More specifically, our larger ALBETO model outperforms all other models on the MLDoc, PAWS-X, XNLI, MLQA, SQAC and XQuAD datasets. However, BETO remains unbeaten for POS and NER. As a further contribution, all models are publicly available to the community for future research.

entity · 詞元分析器 · Performer · BERT · Extensibility ·

2022 年 4 月 19 日

FiNER: Financial Numeric Entity Recognition for XBRL Tagging

Lefteris Loukas,Manos Fergadiotis,Ilias Chalkidis,Eirini Spyropoulou,Prodromos Malakasiotis,Ion Androutsopoulos,Georgios Paliouras

from arxiv, 13 pages, long paper at ACL 2022

Publicly traded companies are required to submit periodic reports with eXtensive Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context, rather than the token itself. We show that subword fragmentation of numeric expressions harms BERT's performance, allowing word-level BILSTMs to perform better. To improve BERT's performance, we propose two simple and effective solutions that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes. We also experiment with FIN-BERT, an existing BERT model for the financial domain, and release our own BERT (SEC-BERT), pre-trained on financial filings, which performs best. Through data and error analysis, we finally identify possible limitations to inspire future work on XBRL tagging.

entity · 數據集 · 命名實體識別 · 類別 · 數據獲取 ·

2022 年 4 月 19 日

Named Entity Recognition for Partially Annotated Datasets

Michael Strobl,Amine Trabelsi,Osmar Zaiane

from arxiv, Long version of our short paper accepted at NLDB 2022

The most common Named Entity Recognizers are usually sequence taggers trained on fully annotated corpora, i.e. the class of all words for all entities is known. Partially annotated corpora, i.e. some but not all entities of some types are annotated, are too noisy for training sequence taggers since the same entity may be annotated one time with its true type but not another time, misleading the tagger. Therefore, we are comparing three training strategies for partially annotated datasets and an approach to derive new datasets for new classes of entities from Wikipedia without time-consuming manual data annotation. In order to properly verify that our data acquisition and training approaches are plausible, we manually annotated test datasets for two new classes, namely food and drugs.

MoDELS · 原點 · 數據集 · state-of-the-art · 訓練數據 ·

2022 年 4 月 17 日

On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?

Nouha Dziri,Sivan Milton,Mo Yu,Osmar Zaiane,Siva Reddy

from arxiv, NAACL 2022, 14 pages

Knowledge-grounded conversational models are known to suffer from producing factually invalid statements, a phenomenon commonly called hallucination. In this work, we investigate the underlying causes of this phenomenon: is hallucination due to the training data, or to the models? We conduct a comprehensive human study on both existing knowledge-grounded conversational benchmarks and several state-of-the-art models. Our study reveals that the standard benchmarks consist of >60% hallucinated responses, leading to models that not only hallucinate but even amplify hallucinations. Our findings raise important questions on the quality of existing datasets and models trained using them. We make our annotations publicly available for future research.

命名實體識別 · entity · 學成 · 深度學習 · 可辨認的 ·

2020 年 3 月 13 日

A Survey on Deep Learning for Named Entity Recognition

Jing Li,Aixin Sun,Jianglei Han,Chenliang Li

from arxiv, 20 pages, 12 figures, 3 tables. arXiv admin note: text overlap with arXiv:1702.02098, arXiv:1904.10503 by other authors

Named entity recognition (NER) is the task to identify text spans that mention named entities, and to classify them into predefined categories such as person, location, organization etc. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems are successful in producing decent recognition accuracy, they often require much human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding stat-of-the-art performance. In this paper, we provide a comprehensive review on existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.

掩碼 · BERT · MoDELS · 掩碼語言模型化 · Extensibility ·

2019 年 6 月 19 日

Pre-Training with Whole Word Masking for Chinese BERT

Yiming Cui,Wanxiang Che,Ting Liu,Bing Qin,Ziqing Yang,Shijin Wang,Guoping Hu

from arxiv, 10 pages

Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks. Recently, an upgraded version of BERT has been released with Whole Word Masking (WWM), which mitigate the drawbacks of masking partial WordPiece tokens in pre-training BERT. In this technical report, we adapt whole word masking in Chinese text, that masking the whole word instead of masking Chinese characters, which could bring another challenge in Masked Language Model (MLM) pre-training task. The model was trained on the latest Chinese Wikipedia dump. We aim to provide easy extensibility and better performance for Chinese BERT without changing any neural architecture or even hyper-parameters. The model is verified on various NLP tasks, across sentence-level to document-level, including sentiment classification (ChnSentiCorp, Sina Weibo), named entity recognition (People Daily, MSRA-NER), natural language inference (XNLI), sentence pair matching (LCQMC, BQ Corpus), and machine reading comprehension (CMRC 2018, DRCD, CAIL RC). Experimental results on these datasets show that the whole word masking could bring another significant gain. Moreover, we also examine the effectiveness of Chinese pre-trained models: BERT, ERNIE, BERT-wwm. We release the pre-trained model (both TensorFlow and PyTorch) on GitHub: //github.com/ymcui/Chinese-BERT-wwm

entity · 命名實體識別 · Networking · 卷積 · Extensibility ·

2019 年 4 月 3 日

CAN-NER: Convolutional Attention Network forChinese Named Entity Recognition

Yuying Zhu,Guoxin Wang,B?rje F. Karlsson

Named entity recognition (NER) in Chinese is essential but difficult because of the lack of natural delimiters. Therefore, Chinese Word Segmentation (CWS) is usually considered as the first step for Chinese NER. However, models based on word-level embeddings and lexicon features often suffer from segmentation errors and out-of-vocabulary (OOV) words. In this paper, we investigate a Convolutional Attention Network called CAN for Chinese NER, which consists of a character-based convolutional neural network (CNN) with local-attention layer and a gated recurrent unit (GRU) with global self-attention layer to capture the information from adjacent characters and sentence contexts. Also, compared to other models, not depending on any external resources like lexicons and employing small size of char embeddings make our model more practical. Extensive experimental results show that our approach outperforms state-of-the-art methods without word embedding and external lexicon resources on different domain datasets including Weibo, MSRA and Chinese Resume NER dataset.

情感分析 · MoDELS · 循環神經網絡 · entity · Neural Networks ·

2018 年 6 月 8 日

Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data

Ethem F. Can,Aysu Ezen-Can,Fazli Can

from arxiv, ACM SIGIR 2018 Workshop on Learning from Limited or Noisy Data (LND4IR'18)

Sentiment analysis is a widely studied NLP task where the goal is to determine opinions, emotions, and evaluations of users towards a product, an entity or a service that they are reviewing. One of the biggest challenges for sentiment analysis is that it is highly language dependent. Word embeddings, sentiment lexicons, and even annotated data are language specific. Further, optimizing models for each language is very time consuming and labor intensive especially for recurrent neural network models. From a resource perspective, it is very challenging to collect data for different languages. In this paper, we look for an answer to the following research question: can a sentiment analysis model trained on a language be reused for sentiment analysis in other languages, Russian, Spanish, Turkish, and Dutch, where the data is more limited? Our goal is to build a single model in the language with the largest dataset available for the task, and reuse it for languages that have limited resources. For this purpose, we train a sentiment analysis model using recurrent neural networks with reviews in English. We then translate reviews in other languages and reuse this model to evaluate the sentiments. Experimental results show that our robust approach of single model trained on English reviews statistically significantly outperforms the baselines in several different languages.

entity · Performer · 命名實體識別 · state-of-the-art · 主動學習 ·

2018 年 2 月 4 日

Deep Active Learning for Named Entity Recognition

Yanyao Shen,Hyokun Yun,Zachary C. Lipton,Yakov Kronrod,Animashree Anandkumar

Deep learning has yielded state-of-the-art performance on many natural language processing tasks including named entity recognition (NER). However, this typically requires large amounts of labeled data. In this work, we demonstrate that the amount of labeled training data can be drastically reduced when deep learning is combined with active learning. While active learning is sample-efficient, it can be computationally expensive since it requires iterative retraining. To speed this up, we introduce a lightweight architecture for NER, viz., the CNN-CNN-LSTM model consisting of convolutional character and word encoders and a long short term memory (LSTM) tag decoder. The model achieves nearly state-of-the-art performance on standard datasets for the task while being computationally much more efficient than best performing models. We carry out incremental active learning, during the training process, and are able to nearly match state-of-the-art performance with just 25\% of the original training data.

閱讀: 0 點贊: 0

小貼士

登錄享

相關主題

命名(ming)實(shi)體識別

北京阿比特科技有限公司

注冊地址：北京市海淀區羊坊店路18號2幢3層301-191

<tr id='6xeyf'><strong id='6xeyf'></strong><small id='6xeyf'></small><button id='6xeyf'></button><li id='6xeyf'><noscript id='6xeyf'><big id='6xeyf'></big><dt id='6xeyf'></dt></noscript></li></tr><ol id='6xeyf'><option id='6xeyf'><table id='6xeyf'><blockquote id='6xeyf'><tbody id='6xeyf'></tbody></blockquote></table></option></ol><u id='6xeyf'></u><kbd id='6xeyf'><kbd id='6xeyf'></kbd></kbd>

<code id='6xeyf'><strong id='6xeyf'></strong></code>

<fieldset id='6xeyf'></fieldset>

<span id='6xeyf'></span>

<ins id='6xeyf'></ins>

<acronym id='6xeyf'><em id='6xeyf'></em><td id='6xeyf'><div id='6xeyf'></div></td></acronym><address id='6xeyf'><big id='6xeyf'><big id='6xeyf'></big><legend id='6xeyf'></legend></big></address>

<i id='6xeyf'><div id='6xeyf'><ins id='6xeyf'></ins></div></i>

<i id='6xeyf'></i>