
Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. First, the action space is vast and dynamic, making it difficult for LLM agents to maintain an up-to-date understanding and deliver accurate responses. Second, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Third, agents need to identify optimal solutions that align with user constraints, such as security concerns and preferences. These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system. To address the high cost of manpower, we design a scalable and semi-automated method to construct the benchmark. In the task evaluation, AndroidArena incorporates accurate and adaptive metrics to address the issue of non-unique solutions. Our findings reveal that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. Additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as primary reasons for the failure of LLM agents. Furthermore, we provide empirical analysis on the failure of reflection, and improve the success rate by 27% with our proposed exploration strategy. This work is the first to present valuable insights into the fine-grained weaknesses of LLM agents, and offers a path forward for future research in this area. Environment, benchmark, and evaluation code for AndroidArena are released at //github.com/AndroidArenaAgent/AndroidArena.

Related content

大語言(yan)模(mo)型


Large language models are deep learning models trained on massive amounts of text data. They can not only generate natural language text but also understand the meaning of text in depth and handle a variety of natural language tasks, such as text summarization, question answering, and translation. In 2023, large language models and their applications in artificial intelligence became a focus of technology research worldwide. Their growth in scale is especially striking: parameter counts have risen from just over a billion at the outset to around one trillion today. The increase in parameters lets models capture the subtleties of human language more finely and understand its complexity more deeply. Over the past year, large language models improved markedly in absorbing new knowledge, decomposing complex tasks, and aligning images with text. As the technology matures, they will keep expanding their range of applications, providing more intelligent and personalized services and further improving how people live and work.


March 22, 2024

Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a Haskell Case Study

Tim van Dam,Frank van der Heijden,Philippe de Bekker,Berend Nieuwschepen,Marc Otten,Maliheh Izadi

from arxiv, To appear in the First Special Event on AI Foundation Models and Software Engineering (FORGE 2024)

Language model-based code completion models have quickly grown in use, helping thousands of developers write code in many different programming languages. However, research on code completion models typically focuses on imperative languages such as Python and JavaScript, which results in a lack of representation for functional programming languages. Consequently, these models often perform poorly on functional languages such as Haskell. To investigate whether this can be alleviated, we evaluate the performance of two language models for code, CodeGPT and UniXcoder, on the functional programming language Haskell. We fine-tune and evaluate the models on Haskell functions sourced from a publicly accessible Haskell dataset on HuggingFace. Additionally, we manually evaluate the models using our novel translated HumanEval dataset. Our automatic evaluation shows that knowledge of imperative programming languages in the pre-training of LLMs may not transfer well to functional languages, but that code completion on functional languages is feasible. Consequently, this shows the need for more high-quality Haskell datasets. A manual evaluation on HumanEval-Haskell indicates CodeGPT frequently generates empty predictions and extra comments, while UniXcoder more often produces incomplete or incorrect predictions. Finally, we release HumanEval-Haskell, along with the fine-tuned models and all code required to reproduce our experiments on GitHub (//github.com/AISE-TUDelft/HaskellCCEval).


March 22, 2024

Construction of a Japanese Financial Benchmark for Large Language Models

Masanori Hirano

from arxiv, 9 pages, Joint Workshop of the 7th Financial Technology and Natural Language Processing (FinNLP), the 5th Knowledge Discovery from Unstructured Data in Financial Services (KDF), and The 4th Workshop on Economics and Natural Language Processing (ECONLP) In conjunction with LREC-COLING-2024

With the recent development of large language models (LLMs), the need for models that focus on specific domains and languages has been widely discussed. There is also a growing need for benchmarks to evaluate the performance of current LLMs in each domain. Therefore, in this study, we constructed a benchmark comprising multiple tasks specific to the Japanese language and the financial domain, and measured the performance of several models on it. We confirmed that GPT-4 is currently outstanding and that the constructed benchmark functions effectively. According to our analysis, by combining tasks of different difficulty, our benchmark can differentiate scores among models across all performance ranges.


March 21, 2024

Large Language Models for Multi-Choice Question Classification of Medical Subjects

Víctor Ponce-López

The aim of this paper is to evaluate whether large language models trained on multi-choice question data can be used to discriminate between medical subjects. This is an important and challenging task for automatic question answering. To achieve this goal, we train deep neural networks for multi-class classification of questions into the inferred medical subjects. Using our Multi-Question (MQ) Sequence-BERT method, we outperform the state-of-the-art results on the MedMCQA dataset with an accuracy of 0.68 and 0.60 on its development and test sets, respectively. In this sense, we show the capability of AI, and LLMs in particular, for multi-class classification tasks in the healthcare domain.


March 21, 2024

Exploring the Potential of Large Language Models in Graph Generation

Yang Yao,Xin Wang,Zeyang Zhang,Yijian Qin,Ziwei Zhang,Xu Chu,Yuekui Yang,Wenwu Zhu,Hong Mei

Large language models (LLMs) have achieved great success in many fields, and recent works have explored LLMs for graph discriminative tasks such as node classification. However, the abilities of LLMs for graph generation remain unexplored in the literature. Graph generation requires the LLM to generate graphs with given properties, which has valuable real-world applications such as drug discovery but tends to be more challenging. In this paper, we propose LLM4GraphGen to explore the ability of LLMs for graph generation with systematic task designs and extensive experiments. Specifically, we propose several tasks, together with comprehensive experiments, to address key questions regarding LLMs' understanding of different graph structure rules, their ability to capture structural type distributions, and their utilization of domain knowledge for property-based graph generation. Our evaluations demonstrate that LLMs, particularly GPT-4, exhibit preliminary abilities in graph generation tasks, including rule-based and distribution-based generation. We also observe that popular prompting methods, such as few-shot and chain-of-thought prompting, do not consistently enhance performance. In addition, LLMs show potential in generating molecules with specific properties. These findings may serve as foundations for designing good LLM-based models for graph generation and provide valuable insights for further research.


March 21, 2024

Impact Assessment of Missing Data in Model Predictions for Earth Observation Applications

Francisco Mena,Diego Arenas,Marcela Charfuelan,Marlon Nuske,Andreas Dengel

from arxiv, Accepted at IEEE International Geoscience and Remote Sensing Symposium 2024

Earth observation (EO) applications involving complex and heterogeneous data sources are commonly approached with machine learning models. However, there is a common assumption that data sources will be persistently available. Different situations could affect the availability of EO sources, such as noise, clouds, or satellite mission failures. In this work, we assess the impact of missing temporal and static EO sources in trained models across four datasets with classification and regression tasks. We compare the predictive quality of different methods and find that some are naturally more robust to missing data. The Ensemble strategy, in particular, achieves prediction robustness of up to 100%. We show that missing-data scenarios are significantly more challenging in regression than in classification tasks. Finally, we find that the optical view is the most critical view when it is missing individually.


March 21, 2024

Improving the Robustness of Large Language Models via Consistency Alignment

Zhao Yukun,Yan Lingyong,Sun Weiwei,Xing Guoliang,Wang Shuaiqiang,Meng Chong,Cheng Zhicong,Ren Zhaochun,Yin Dawei

from arxiv, Accepted by LREC-COLING 2024

Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses due to minor changes in the verbalized instructions. Recent literature has explored this inconsistency issue, highlighting the importance of continued improvement in the robustness of response generation. However, systematic analysis and solutions are still lacking. In this paper, we quantitatively define the inconsistency problem and propose a two-stage training framework consisting of instruction-augmented supervised fine-tuning and consistency alignment training. The first stage helps a model generalize on following instructions via similar instruction augmentations. In the second stage, we improve the diversity and help the model understand which responses are more aligned with human expectations by differentiating subtle differences in similar responses. The training process is accomplished by self-rewards inferred from the trained model at the first stage without referring to external human preference resources. We conduct extensive experiments on recent publicly available LLMs on instruction-following tasks and demonstrate the effectiveness of our training framework.
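The abstract does not give its formal definition of inconsistency. As a rough illustration only, one plausible way to quantify it is the fraction of response pairs that agree when the same task is posed through paraphrased instructions. The function names, the normalization step, and the exact-match agreement test below are all assumptions made for this sketch, not the authors' metric.

```python
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace (an assumed, simplistic normalization)."""
    return " ".join(text.lower().split())


def consistency_rate(responses, agree=None):
    """Fraction of response pairs judged to agree.

    `responses` holds one model response per paraphrased instruction.
    `agree` is a hypothetical pairwise judge; by default it is an exact
    match after normalization. A real study would likely use a stronger judge.
    """
    if agree is None:
        agree = lambda a, b: normalize(a) == normalize(b)
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # zero or one response is trivially consistent
    return sum(1 for a, b in pairs if agree(a, b)) / len(pairs)


# Three responses to three paraphrases of the same question.
rate = consistency_rate(["Paris", "paris ", "Lyon"])
print(round(rate, 3))  # prints 0.333: one of three pairs agrees
```

A pairwise rate is used here because it needs no reference answer: it measures agreement among responses, not correctness.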


March 21, 2024

RoleInteract: Evaluating the Social Interaction of Role-Playing Agents

Hongzhan Chen,Hehong Chen,Ming Yan,Wenshen Xu,Xing Gao,Weizhou Shen,Xiaojun Quan,Chenliang Li,Ji Zhang,Fei Huang,Jingren Zhou

Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both individual and group levels of social interactions. The benchmark is constructed from a variety of sources and covers 500 characters, over 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that an agent excelling at the individual level is not necessarily proficient at the group level. Moreover, the behavior of individuals may drift as a result of the influence exerted by other agents within the group. Experimental results on RoleInteract confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at //github.com/X-PLUG/RoleInteract.


March 21, 2024

Self-Improving for Zero-Shot Named Entity Recognition with Large Language Models

Tingyu Xie,Qi Li,Yan Zhang,Zuozhu Liu,Hongwei Wang

from arxiv, Accepted to NAACL 2024 (Main Conference)

Exploring the application of powerful large language models (LLMs) on the named entity recognition (NER) task has drawn much attention recently. This work pushes the performance boundary of zero-shot NER with LLMs by proposing a training-free self-improving framework, which utilizes an unlabeled corpus to stimulate the self-learning ability of LLMs. First, we use the LLM to make predictions on the unlabeled corpus using self-consistency and obtain a self-annotated dataset. Second, we explore various strategies to select reliable annotations to form a reliable self-annotated dataset. Finally, for each test input, we retrieve demonstrations from the reliable self-annotated dataset and perform inference via in-context learning. Experiments on four benchmarks show substantial performance improvements achieved by our framework. Through comprehensive experimental analysis, we find that increasing the size of unlabeled corpus or iterations of self-improving does not guarantee further improvement, but the performance might be boosted via more advanced strategies for reliable annotation selection. Code and data are publicly available at //github.com/Emma1066/Self-Improve-Zero-Shot-NER
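The first two steps described above (self-consistency annotation, then selection of reliable annotations) can be sketched as a majority vote over repeated LLM samples followed by a threshold filter. This is a minimal illustration under stated assumptions: the function names, the `(mention, type)` pair representation, and the 0.5 threshold are hypothetical, and the paper explores several selection strategies that are not reproduced here.

```python
from collections import Counter


def self_consistency_scores(samples):
    """Score each (mention, type) pair by the share of samples that predict it.

    `samples` is a list of sets, one per sampled LLM answer for the same
    sentence; each set contains (mention, entity_type) tuples.
    """
    n = len(samples)
    if n == 0:
        return {}
    counts = Counter(pair for sample in samples for pair in set(sample))
    return {pair: count / n for pair, count in counts.items()}


def select_reliable(scores, threshold=0.5):
    """Keep pairs whose vote share exceeds the (assumed) threshold."""
    return {pair for pair, score in scores.items() if score > threshold}


# Three sampled answers for one unlabeled sentence (made-up example data).
samples = [
    {("Paris", "LOC"), ("Apple", "ORG")},
    {("Paris", "LOC")},
    {("Paris", "LOC"), ("Apple", "LOC")},
]
scores = self_consistency_scores(samples)
print(select_reliable(scores))  # prints {('Paris', 'LOC')}
```

The surviving annotations would then form the pool from which demonstrations are retrieved for in-context learning at test time; that retrieval step is left out of the sketch.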


March 20, 2024

Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

Federico Cassano,Luisa Li,Akul Sethi,Noah Shinn,Abby Brennan-Jones,Jacob Ginesin,Edward Berman,George Chakhnashvili,Anton Lozhkov,Carolyn Jane Anderson,Arjun Guha

A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided with a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at //github.com/nuprl/CanItEdit.


September 3, 2023

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang,Yafu Li,Leyang Cui,Deng Cai,Lemao Liu,Tingchen Fu,Xinting Huang,Enbo Zhao,Yu Zhang,Yulong Chen,Longyue Wang,Anh Tuan Luu,Wei Bi,Freda Shi,Shuming Shi

from arxiv, work in progress; 32 pages

While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.