
We present Camelira, a web-based Arabic multi-dialect morphological disambiguation tool that covers four major variants of Arabic: Modern Standard Arabic, Egyptian, Gulf, and Levantine. Camelira offers a user-friendly web interface that allows researchers and language learners to explore various linguistic information, such as part-of-speech, morphological features, and lemmas. Our system also provides an option to automatically choose an appropriate dialect-specific disambiguator based on the prediction of a dialect identification component. Camelira is publicly accessible at //camelira.camel-lab.com.
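The automatic dialect-selection option can be pictured as a small dispatch step: identify the dialect of the input, then route the sentence to the matching disambiguator. The sketch below is purely illustrative; all names in it (identify_dialect, DISAMBIGUATORS, disambiguate) are placeholders and not actual Camelira components.

```python
from typing import Callable, Dict, List

def identify_dialect(tokens: List[str]) -> str:
    """Placeholder dialect-identification component (MSA / EGY / GLF / LEV).
    A real system would run a classifier over the sentence; here we simply
    default to Modern Standard Arabic."""
    return "MSA"

def msa_disambiguator(tokens: List[str]) -> List[dict]:
    """Placeholder disambiguator: returns POS, features, and lemma per token."""
    return [{"token": t, "pos": "noun", "lemma": t} for t in tokens]

# One disambiguator per supported variant (MSA, Egyptian, Gulf, Levantine).
DISAMBIGUATORS: Dict[str, Callable[[List[str]], List[dict]]] = {
    "MSA": msa_disambiguator,
    "EGY": msa_disambiguator,   # stand-ins; each variant would have its own model
    "GLF": msa_disambiguator,
    "LEV": msa_disambiguator,
}

def disambiguate(tokens: List[str]) -> List[dict]:
    dialect = identify_dialect(tokens)       # automatic dialect selection
    return DISAMBIGUATORS[dialect](tokens)   # dialect-specific morphological analysis
```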

Related Content

Part of speech is a basic grammatical attribute of a word, also commonly called word class. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence, identifying its part of speech, and labeling it accordingly; it is an important foundational problem in Chinese information processing. In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also known as grammatical tagging, is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech,[1] based on both their definition and their context.
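As a concrete illustration, the snippet below tags a short English sentence with NLTK's off-the-shelf tagger (assuming the punkt and averaged_perceptron_tagger resources have been downloaded via nltk.download); the tags shown in the comment are indicative only.

```python
import nltk

sentence = "Part-of-speech tagging assigns a grammatical category to each word."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('Part-of-speech', 'JJ'), ('tagging', 'NN'), ('assigns', 'VBZ'), ...]
```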

Algebraic effects and handlers are a powerful abstraction for building non-local control-flow mechanisms such as resumable exceptions, lightweight threads, co-routines, generators, and asynchronous I/O. All of these features have intricate semantics, and hence pose interesting challenges to deductive verification techniques. In fact, few techniques have been proposed to deductively verify programs featuring these constructs, and even fewer support automated proofs. In this paper, we outline some of the currently available techniques for the verification of programs with algebraic effects. We then build on them to create a mostly automated verification framework by extending Cameleer, a tool that verifies OCaml code using GOSPEL and Why3. This framework embeds the behavior of effects and handlers using exceptions and defunctionalized functions.
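Purely as an illustration of what an effect and its handler look like (the paper's framework targets OCaml via Cameleer, GOSPEL, and Why3), here is a small Python sketch that encodes a resumable "ask" effect with generators; this encoding is analogous in spirit to, but distinct from, the exception-plus-defunctionalization embedding described above.

```python
class Ask:
    """An effect operation: 'ask' the handler for a value."""
    def __init__(self, prompt):
        self.prompt = prompt

def computation():
    # The computation performs two effects and resumes with the handler's answers.
    x = yield Ask("first")
    y = yield Ask("second")
    return x + y

def handle(gen, answers):
    """A handler that interprets each Ask by looking it up in `answers`
    and resuming the suspended computation with the result."""
    try:
        request = next(gen)
        while True:
            request = gen.send(answers[request.prompt])  # resume with a value
    except StopIteration as done:
        return done.value

print(handle(computation(), {"first": 1, "second": 41}))  # 42
```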

Analogical reasoning is at the core of human and artificial intelligence and creativity. Analogical proportions are expressions of the form ``$a$ is to $b$ what $c$ is to $d$'' and are central to analogical reasoning, with numerous applications. This paper introduces proportional algebras as algebras endowed with a 4-ary analogical proportion relation $a:b::c:d$ satisfying a suitable set of axioms. Functions preserving analogical proportions have already proven to be of practical interest, and studying their mathematical properties is essential for understanding proportions. We therefore introduce proportional homomorphisms (and their associated congruences) and functors and show that they are closely related notions. This provides us with mathematical tools for transferring knowledge across different domains, which is crucial for future AI systems. In a broader sense, this paper is a further step towards a mathematical theory of analogical reasoning.
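For orientation, the axioms most commonly imposed on such a relation in the literature (the paper's exact axiom set may differ) are reflexivity, $a:b::a:b$; symmetry, $a:b::c:d \Rightarrow c:d::a:b$; and central permutation, $a:b::c:d \Rightarrow a:c::b:d$.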

Lemmatization is a Natural Language Processing (NLP) task which consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process to obtain a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without analyzing whether that is optimal in terms of downstream performance. Thus, in this paper we empirically investigate the role of morphological information to develop contextual lemmatizers in six languages within a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitutes, after all, their most common application setting. The results of our study are rather surprising: (i) providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages; (ii) in fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain good contextual lemmatizers without seeing any explicit morphological signal; (iii) the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology; (iv) current evaluation practices for lemmatization are not adequate to clearly discriminate between models.
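As a minimal, non-contextual illustration of the task, and of how a part-of-speech hint can change the result, the snippet below uses NLTK's dictionary-based WordNetLemmatizer (assuming the WordNet data has been downloaded via nltk.download). Contextual lemmatizers as studied in the paper infer such information from the sentence instead of taking it as input.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The same surface form can map to different lemmas depending on the POS hint.
print(lemmatizer.lemmatize("studies", pos="n"))  # 'study'  (noun reading)
print(lemmatizer.lemmatize("studies", pos="v"))  # 'study'  (verb reading)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   (adjective reading)
print(lemmatizer.lemmatize("better"))            # 'better' (default noun reading)
```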

Satellite imagery is gaining popularity as a valuable tool to lower the impact on natural resources and increase profits for farmers. The purpose of this study is twofold: to mine the scientific literature to reveal the structure of this research domain, and to investigate to what extent scientific results are able to reach a wider public. To address these aims, a Web of Science dataset and a Twitter dataset, respectively, were retrieved and analysed. Regarding academic literature, different performances of the various countries were observed: the USA and China emerged as the leading actors, both in terms of published papers and employed researchers. Among the categorised keywords, "resolution", "Landsat", "yield", "wheat" and "multispectral" are the most used. Then, analysing the semantic network of the words used in the various abstracts, the different facets of the research in satellite remote sensing were detected. The analysis highlighted the importance of retrieving meteorological parameters through remote sensing and the broad use of vegetation indices. As emerging topics, classification tasks for land use assessment and crop recognition stand out, together with the use of hyperspectral sensors. Regarding the interaction of academia with the public, the analysis showed that it is practically absent on Twitter: most of the activity therein is due to private companies advertising their business. Therefore, there is still a communication gap between academia and actors from other societal sectors.

Table-based reasoning has shown remarkable progress in combining deep models with discrete reasoning, which requires reasoning over both free-form natural language (NL) questions and structured tabular data. However, previous table-based reasoning solutions usually suffer from significant performance degradation on huge evidence (tables). In addition, most existing methods struggle to reason over complex questions since the required information is scattered in different places. To alleviate the above challenges, we exploit large language models (LLMs) as decomposers for effective table-based reasoning, which (i) decompose huge evidence (a huge table) into sub-evidence (a small table) to mitigate the interference of useless information for table reasoning; and (ii) decompose complex questions into simpler sub-questions for text reasoning. Specifically, we first use the LLMs to break down the evidence (tables) involved in the current question, retaining the relevant evidence and excluding the remaining irrelevant evidence from the huge table. In addition, we propose a "parsing-execution-filling" strategy to alleviate the hallucination dilemma of the chain of thought by decoupling logic and numerical computation in each step. Extensive experiments show that our method can effectively leverage decomposed evidence and questions and outperforms the strong baselines on TabFact, WikiTableQuestion, and FetaQA datasets. Notably, our model outperforms human performance for the first time on the TabFact dataset.
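A hypothetical sketch of the decompose-then-reason loop described above is given below; the prompts, helper names, and the call_llm and execute_externally placeholders are ours, not the paper's.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a large language model call."""
    raise NotImplementedError

def execute_externally(program: str, table: str) -> str:
    """Placeholder for the external executor (e.g. an SQL engine)."""
    raise NotImplementedError

def answer_table_question(table: str, question: str) -> str:
    # (i) Evidence decomposition: keep only the rows/columns relevant to the question.
    sub_table = call_llm(
        f"Table:\n{table}\nQuestion: {question}\n"
        "Return only the rows and columns needed to answer the question."
    )
    # (ii) Question decomposition with 'parsing-execution-filling': the model writes
    #      simpler sub-questions with [MASK] for any number that should be computed
    #      by an external executor rather than by the model itself.
    program = call_llm(
        f"Question: {question}\n"
        "Rewrite it as simpler sub-questions, using [MASK] for any number "
        "that should be computed externally."
    )
    filled = program.replace("[MASK]", execute_externally(program, sub_table))
    # Final reasoning step over the reduced evidence and the filled sub-questions.
    return call_llm(f"Table:\n{sub_table}\n{filled}\nAnswer:")
```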

This paper considers the learning of logical (Boolean) functions with focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We then study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree-interpolator (MDI) is learned on the unseen. We also provide evidence that other instances with larger learning rates or mean-field networks reach leaky MDIs. These findings lead to two implications: (1) we provide an explanation to the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.
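A toy instance (our own illustration, not taken from the paper) makes the notion concrete: take the target $f(x_1,x_2)=x_1x_2$ on $\{-1,+1\}^2$ and let the unseen domain be $x_1=-1$. On the seen half $x_1=+1$ the target coincides with $x_2$, so the minimum-degree interpolator of the seen data is $g(x_1,x_2)=x_2$ (degree 1 rather than 2); it predicts $x_2$ on the unseen half, where the true value is $-x_2$.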

The study of spoken languages comprises phonology, morphology, and grammar. Analysis of a language can be based on its syntax, semantics, and pragmatics. Languages can be classified as root languages, inflectional languages, and stem languages. All these factors lead to the formation of vocabulary, which has commonality/similarity as well as distinct and subtle differences across languages. In this paper, we make use of the Paninian system of sounds to construct a phonetic map, and words are then represented as state transitions on the phonetic map. Each group of related words that cuts across languages is represented by an m-language (morphological language). Morphological Finite Automata (MFA) are defined that accept the words belonging to a given m-language. This exercise can enable us to better understand the inter-relationships between words in spoken languages in both a language-agnostic and a language-cognizant manner.
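A toy Python sketch of the idea follows: words are runs of an automaton over phoneme symbols, and the automaton accepts one m-language. The alphabet, states, and word family below are invented for illustration and are not the paper's Paninian phonetic map.

```python
def make_mfa(transitions, start, accepting):
    """Build a deterministic automaton as a closure over its transition table."""
    def accepts(word):
        state = start
        for symbol in word:
            state = transitions.get((state, symbol))
            if state is None:
                return False
        return state in accepting
    return accepts

# Toy m-language: related forms built on the consonant skeleton m-t-r,
# modeled as paths through phoneme states.
transitions = {
    ("q0", "m"): "q1",
    ("q1", "a"): "q2", ("q1", "i"): "q2",
    ("q2", "t"): "q3",
    ("q3", "r"): "q4",
    ("q4", "a"): "q5",
    ("q5", "m"): "q6",
}
mfa = make_mfa(transitions, "q0", {"q4", "q5", "q6"})

for w in ["matr", "mitra", "mitram", "patra"]:
    print(w, mfa(w))   # the first three are accepted, the last is not
```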

Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can the inner decision process of neural networks be captured symbolically in some familiar logic? We show that any fixed-precision transformer neural network can be translated into an equivalent fixed-size $\mathsf{FO}(\mathsf{M})$ formula, i.e., a first-order logic formula that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. The proof idea is to design highly uniform boolean threshold circuits that can simulate transformers, and then leverage known theoretical connections between circuits and logic. Our results reveal a surprisingly simple formalism for capturing the behavior of transformers, show that simple problems like integer division are "transformer-hard", and provide valuable insights for comparing transformers to other models like RNNs. Our results suggest that first-order logic with majority may be a useful language for expressing programs extracted from transformers.
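For illustration (our own example, not one from the paper): writing $Q_a(i)$ for "position $i$ holds token $a$", the majority quantifier lets one state that more than half of the positions hold $a$, as in $\mathsf{M}\,i.\; Q_a(i)$, and it can be freely combined with ordinary quantifiers, e.g. $\forall i.\,(Q_b(i) \rightarrow \exists j.\,(j<i \wedge Q_a(j)))$, "every $b$ is preceded by some $a$".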

Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, they have not been well visualized or understood. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts using a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. We examine the contextual relationship between these units and their surroundings by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in a scene. We provide open source interpretation tools to help researchers and practitioners better understand their GAN models.
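As a rough illustration of the intervention step (not the paper's released tools), one can ablate a single channel of an intermediate generator layer with a PyTorch forward hook and compare the two outputs; the generator, layer, and unit index below are placeholders.

```python
import torch

def ablate_unit(generator, layer, unit, z):
    """Generate with and without channel `unit` of `layer` set to zero."""
    def zero_unit(module, inputs, output):
        output = output.clone()
        output[:, unit] = 0.0          # intervention: turn the unit off
        return output                  # returned value replaces the layer's output

    with torch.no_grad():
        original = generator(z)
        handle = layer.register_forward_hook(zero_unit)
        ablated = generator(z)
        handle.remove()
    return original, ablated           # compare to measure the unit's causal effect
```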

Recurrent neural nets (RNN) and convolutional neural nets (CNN) are widely used on NLP tasks to capture the long-term and local dependencies, respectively. Attention mechanisms have recently attracted enormous interest due to their highly parallelizable computation, significantly less training time, and flexibility in modeling dependencies. We propose a novel attention mechanism in which the attention between elements from input sequence(s) is directional and multi-dimensional (i.e., feature-wise). A light-weight neural net, "Directional Self-Attention Network (DiSAN)", is then proposed to learn sentence embedding, based solely on the proposed attention without any RNN/CNN structure. DiSAN is only composed of a directional self-attention with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation. Despite its simple form, DiSAN outperforms complicated RNN models on both prediction quality and time efficiency. It achieves the best test accuracy among all sentence encoding methods and improves the most recent best result by 1.02% on the Stanford Natural Language Inference (SNLI) dataset, and shows state-of-the-art test accuracy on the Stanford Sentiment Treebank (SST), Multi-Genre natural language inference (MultiNLI), Sentences Involving Compositional Knowledge (SICK), Customer Review, MPQA, TREC question-type classification and Subjectivity (SUBJ) datasets.
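A deliberately simplified NumPy sketch of the two ingredients, a forward directional mask and feature-wise (multi-dimensional) attention that compresses the sequence into a single vector, is shown below; the actual DiSAN scoring functions and parameterization differ.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def directional_self_attention(h):
    """Token-to-token attention where position i only attends to j <= i (forward mask).
    h: (n, d) token features; returns (n, d) context-aware features."""
    n, _ = h.shape
    scores = h @ h.T                              # (n, n) similarity scores
    mask = np.triu(np.ones((n, n), dtype=bool), 1)
    scores = np.where(mask, -np.inf, scores)      # block attention to future positions
    return softmax(scores, axis=1) @ h

def multi_dim_attention(h):
    """Feature-wise attention compressing (n, d) to (d,): a separate softmax
    over positions for every feature dimension."""
    weights = softmax(h, axis=0)                  # (n, d) per-feature weights
    return (weights * h).sum(axis=0)

def sentence_embedding(h):
    return multi_dim_attention(directional_self_attention(h))

# Example: 5 tokens with 8-dimensional features.
print(sentence_embedding(np.random.randn(5, 8)).shape)   # (8,)
```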
