一级欧美一级日韩大片,日韩无码抽插黄片,亚洲AV永久无码精品古装

Text normalization is a crucial technology for low-resource languages which lack rigid spelling conventions or that have undergone multiple spelling reforms. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian. We show that, in spite of the small amounts of data available, a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization.

相關內容

規范化的

關注 0

大語言模型 · 語言模型化 · MoDELS · 知識 (knowledge) · Extensibility ·

2024 年 2 月 12 日

Quantitative knowledge retrieval from large language models

David Selby,Kai Spriestersbach,Yuichiro Iwashita,Dennis Bappert,Archana Warrier,Sumantrak Mukherjee,Muhammad Nabeel Asim,Koichi Kise,Sebastian Vollmer

from arxiv, 13 pages plus supplementary materials

Large language models (LLMs) have been extensively studied for their abilities to generate convincing natural language sequences, however their utility for quantitative information retrieval is less well understood. In this paper we explore the feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid data analysis tasks such as elicitation of prior distributions for Bayesian models and imputation of missing data. We present a prompt engineering framework, treating an LLM as an interface to a latent space of scientific literature, comparing responses in different contexts and domains against more established approaches. Implications and challenges of using LLMs as 'experts' are discussed.

2024 年 2 月 12 日

Large language models can enhance persuasion through linguistic feature alignment

Minkyu Shin,Jin Kim

Although large language models (LLMs) are reshaping various aspects of human life, our current understanding of their impacts remains somewhat constrained. Here we investigate the impact of LLMs on human communication, using data on consumer complaints in the financial industry. By employing an AI detection tool on more than 820K complaints gathered by the Consumer Financial Protection Bureau (CFPB), we find a sharp increase in the likely use of LLMs shortly after the release of ChatGPT. Moreover, the likely LLM usage was positively correlated with message persuasiveness (i.e., increased likelihood of obtaining relief from financial firms). Computational linguistic analyses suggest that the positive correlation may be explained by LLMs' enhancement of various linguistic features. Based on the results of these observational studies, we hypothesize that LLM usage may enhance a comprehensive set of linguistic features, increasing message persuasiveness to receivers with heterogeneous linguistic preferences (i.e., linguistic feature alignment). We test this hypothesis in preregistered experiments and find support for it. As an instance of early empirical demonstrations of LLM usage for enhancing persuasion, our research highlights the transformative potential of LLMs in human communication.

MoDELS · 視覺問答 · 數據集 · 標記空間 · Performer ·

2024 年 2 月 11 日

Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

Simon Ging,María A. Bravo,Thomas Brox

from arxiv, Accepted as Spotlight Paper for ICLR 2024. The first two authors contributed equally to this work

The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.

估計/估計量 · 無偏 · 推斷 · 有偏 · 磁流變材料 ·

2024 年 2 月 10 日

Unbiased estimation and asymptotically valid inference in multivariable Mendelian randomization with many weak instrumental variables

Yihe Yang,Noah Lorincz-Comi,Xiaofeng Zhu

from arxiv, We have observed potential competitors, so we reverted to the version prior to the fourth update (v4). However, this paper and //www.biorxiv.org/content/10.1101/2023.01.10.523480v3.abstract have been merged, with the main content summarized in Supplemental Material 2

Mendelian randomization (MR) is an instrumental variable (IV) approach to infer causal relationships between exposures and outcomes with genome-wide association studies (GWAS) summary data. However, the multivariable inverse-variance weighting (IVW) approach, which serves as the foundation for most MR approaches, cannot yield unbiased causal effect estimates in the presence of many weak IVs. To address this problem, we proposed the MR using Bias-corrected Estimating Equation (MRBEE) that can infer unbiased causal relationships with many weak IVs and account for horizontal pleiotropy simultaneously. While the practical significance of MRBEE was demonstrated in our parallel work (Lorincz-Comi (2023)), this paper established the statistical theories of multivariable IVW and MRBEE with many weak IVs. First, we showed that the bias of the multivariable IVW estimate is caused by the error-in-variable bias, whose scale and direction are inflated and influenced by weak instrument bias and sample overlaps of exposures and outcome GWAS cohorts, respectively. Second, we investigated the asymptotic properties of multivariable IVW and MRBEE, showing that MRBEE outperforms multivariable IVW regarding unbiasedness of causal effect estimation and asymptotic validity of causal inference. Finally, we applied MRBEE to examine myopia and revealed that education and outdoor activity are causal to myopia whereas indoor activity is not.

正則化項 · 泛化理論 · 穩健性 · 示例 · 類別 ·

2024 年 2 月 8 日

Syntactically and semantically regular languages of lambda-terms coincide through logical relations

Vincent Moreau,Lê Thành D?ng Nguyên

from arxiv, The proofs on "finitely pointable" CCCs in versions 1 and 2 were wrong; we now make slightly weaker claims on well-pointed locally finite CCCs. New in this version: added reference [3] and official DOI (proceedings of CSL 2024)

A fundamental theme in automata theory is regular languages of words and trees, and their many equivalent definitions. Salvati has proposed a generalization to regular languages of simply typed $\lambda$-terms, defined using denotational semantics in finite sets. We provide here some evidence for its robustness. First, we give an equivalent syntactic characterization that naturally extends the seminal work of Hillebrand and Kanellakis connecting regular languages of words and syntactic $\lambda$-definability. Second, we show that any finitary extensional model of the simply typed $\lambda$-calculus, when used in Salvati's definition, recognizes exactly the same class of languages of $\lambda$-terms as the category of finite sets does. The proofs of these two results rely on logical relations and can be seen as instances of a more general construction of a categorical nature, inspired by previous categorical accounts of logical relations using the gluing construction.

評論員 · 數據集 · 音素 · 相似度 · Prompt ·

2024 年 2 月 8 日

Phonetically rich corpus construction for a low-resourced language

Marcellus Amadeus,William Alberto Cruz Casta?eda,Wilmer Lobato,Niasche Aquino

Speech technologies rely on capturing a speaker's voice variability while obtaining comprehensive language information. Textual prompts and sentence selection methods have been proposed in the literature to comprise such adequate phonetic data, referred to as a phonetically rich \textit{corpus}. However, they are still insufficient for acoustic modeling, especially critical for languages with limited resources. Hence, this paper proposes a novel approach and outlines the methodological aspects required to create a \textit{corpus} with broad phonetic coverage for a low-resourced language, Brazilian Portuguese. Our methodology includes text dataset collection up to a sentence selection algorithm based on triphone distribution. Furthermore, we propose a new phonemic classification according to acoustic-articulatory speech features since the absolute number of distinct triphones, or low-probability triphones, does not guarantee an adequate representation of every possible combination. Using our algorithm, we achieve a 55.8\% higher percentage of distinct triphones -- for samples of similar size -- while the currently available phonetic-rich corpus, CETUC and TTS-Portuguese, 12.6\% and 12.3\% in comparison to a non-phonetically rich dataset.

LORA · tuning · MoDELS · 泛化理論 · Extensibility ·

2024 年 2 月 8 日

Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning

Yunhao Gou,Zhili Liu,Kai Chen,Lanqing Hong,Hang Xu,Aoxue Li,Dit-Yan Yeung,James T. Kwok,Yu Zhang

from arxiv, Project website: //kaichen1998.github.io/projects/mocle/

Instruction tuning of the Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, diversity of training tasks of different sources and formats would lead to inevitable task conflicts, where different tasks conflicts for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address that, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate the task-customized model parameters based on the instruction clusters. A separate universal expert is further incorporated to improve the generalization capabilities of MoCLE for novel instructions. Extensive experiments on 10 zero-shot tasks demonstrate the effectiveness of MoCLE.

知識 (knowledge) · INFORMS · 語言表示 · MoDELS · Extensibility ·

2022 年 7 月 28 日

MLRIP: Pre-training a military language representation model with informative factual knowledge and professional knowledge base

Hui Li,Xuekang Yang,Xin Zhao,Lin Yu,Jiping Zheng,Wei Sun

from arxiv, 11 pages, 6 figures

Incorporating prior knowledge into pre-trained language models has proven to be effective for knowledge-driven NLP tasks, such as entity typing and relation extraction. Current pre-training procedures usually inject external knowledge into models by using knowledge masking, knowledge fusion and knowledge replacement. However, factual information contained in the input sentences have not been fully mined, and the external knowledge for injecting have not been strictly checked. As a result, the context information cannot be fully exploited and extra noise will be introduced or the amount of knowledge injected is limited. To address these issues, we propose MLRIP, which modifies the knowledge masking strategies proposed by ERNIE-Baidu, and introduce a two-stage entity replacement strategy. Extensive experiments with comprehensive analyses illustrate the superiority of MLRIP over BERT-based models in military knowledge-driven NLP tasks.

entity · 命名實體識別 · CASE · Extensibility · MoDELS ·

2019 年 11 月 14 日

Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources

Qianhui Wu,Zijia Lin,Guoxin Wang,Hui Chen,B?rje F. Karlsson,Biqing Huang,Chin-Yew Lin

from arxiv, This paper is accepted by AAAI2020

For languages with no annotated resources, transferring knowledge from rich-resource languages is an effective solution for named entity recognition (NER). While all existing methods directly transfer from source-learned model to a target language, in this paper, we propose to fine-tune the learned model with a few similar examples given a test case, which could benefit the prediction by leveraging the structural and semantic information conveyed in such similar examples. To this end, we present a meta-learning algorithm to find a good model parameter initialization that could fast adapt to the given test case and propose to construct multiple pseudo-NER tasks for meta-training by computing sentence similarities. To further improve the model's generalization ability across different languages, we introduce a masking scheme and augment the loss function with an additional maximum term during meta-training. We conduct extensive experiments on cross-lingual named entity recognition with minimal resources over five target languages. The results show that our approach significantly outperforms existing state-of-the-art methods across the board.

學成 · 圖 · Processing（編程語言） · 知識圖譜 · MoDELS ·

2018 年 2 月 16 日

Learning beyond datasets: Knowledge Graph Augmented Neural Networks for Natural language Processing

K M Annervaz,Somnath Basu Roy Chowdhury,Ambedkar Dukkipati

Machine Learning has been the quintessential solution for many AI problems, but learning is still heavily dependent on the specific training data. Some learning models can be incorporated with a prior knowledge in the Bayesian set up, but these learning models do not have the ability to access any organised world knowledge on demand. In this work, we propose to enhance learning models with world knowledge in the form of Knowledge Graph (KG) fact triples for Natural Language Processing (NLP) tasks. Our aim is to develop a deep learning model that can extract relevant prior support facts from knowledge graphs depending on the task using attention mechanism. We introduce a convolution-based model for learning representations of knowledge graph entity and relation clusters in order to reduce the attention space. We show that the proposed method is highly scalable to the amount of prior information that has to be processed and can be applied to any generic NLP task. Using this method we show significant improvement in performance for text classification with News20, DBPedia datasets and natural language inference with Stanford Natural Language Inference (SNLI) dataset. We also demonstrate that a deep learning model can be trained well with substantially less amount of labeled training data, when it has access to organised world knowledge in the form of knowledge graph.