
Large language models (LLMs) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and analyze the tool-utilization capability of LLMs is still under-explored. In contrast to previous works that evaluate models holistically, we comprehensively decompose tool utilization into multiple sub-processes, including instruction following, planning, reasoning, retrieval, understanding, and review. Based on that, we further introduce T-Eval to evaluate the tool-utilization capability step by step. T-Eval disentangles the tool-utilization evaluation into several sub-domains along model capabilities, facilitating a deeper understanding of both the holistic and the isolated competencies of LLMs. We conduct extensive experiments on T-Eval and in-depth analysis of various LLMs. T-Eval not only exhibits consistency with the outcome-oriented evaluation but also provides a more fine-grained analysis of the capabilities of LLMs, providing a new perspective in LLM evaluation on tool-utilization ability. The benchmark will be available at https://github.com/open-compass/T-Eval.

Related content

Large language models


A large language model is a deep learning model trained on massive amounts of text data. It can not only generate natural-language text but also understand the meaning of text in depth and handle a variety of natural-language tasks, such as summarization, question answering, and translation. In 2023, large language models and their applications in artificial intelligence became a focus of technology research worldwide. Their growth in scale has been especially striking: parameter counts have risen from just over a billion at the outset to one trillion today. Larger parameter counts allow models to capture the subtleties of human language more finely and to understand its complexity more deeply. Over the past year, large language models have improved markedly in absorbing new knowledge, decomposing complex tasks, and aligning images with text. As the technology matures, it will continue to expand its range of applications, providing more intelligent and personalized services and further improving how people live and work.

Directed · Controller · Diversity · Large language models · MoDELS

February 28, 2024

Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards

Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang

From arXiv. The code and model are released at https://github.com/Haoxiang-Wang/directional-preference-alignment

Fine-grained control over large language models (LLMs) remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles. Additionally, DPA models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. Our method involves training a multi-objective reward model and then fine-tuning the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2. This method enjoys a better performance trade-off across various reward objectives. In comparison with the scalar-reward RLHF, DPA offers users intuitive control over LLM generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). We also validate the effectiveness of DPA with real-world alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as Direct Preference Optimization (DPO).
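
As a rough illustration of the idea that a preference is a unit vector in reward space, the sketch below scores candidate responses by the dot product between a two-dimensional reward vector (helpfulness, verbosity) and a user-chosen direction, then keeps the top-scoring candidate in the spirit of rejection sampling. The function names, the angle parameterization, and the reward values are assumptions made for illustration; they are not taken from the paper's released code.

```python
import math


def unit_direction(angle_degrees):
    """Return a unit vector (helpfulness weight, verbosity weight).

    Hypothetical parameterization: 0 degrees means 'only helpfulness matters';
    negative angles penalize verbosity.
    """
    theta = math.radians(angle_degrees)
    return (math.cos(theta), math.sin(theta))


def directional_score(rewards, direction):
    """Project a multi-objective reward vector onto the preference direction."""
    return sum(r * v for r, v in zip(rewards, direction))


def pick_best(candidates, direction):
    """Keep the candidate with the highest projected reward."""
    return max(candidates, key=lambda c: directional_score(c["rewards"], direction))


# Made-up reward values in the order (helpfulness, verbosity).
candidates = [
    {"text": "short answer", "rewards": (0.6, 0.1)},
    {"text": "long answer", "rewards": (0.9, 0.9)},
]

concise_user = unit_direction(-45)   # more helpfulness, less verbosity
neutral_user = unit_direction(0)     # helpfulness only

print(pick_best(candidates, concise_user)["text"])
print(pick_best(candidates, neutral_user)["text"])
```

Changing the angle changes which candidate wins, which is the kind of arithmetic trade-off control the abstract describes; a real system would obtain the reward vectors from a trained multi-objective reward model rather than hard-coding them.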

Language modeling · Large language models · INFORMS · Knowledge · MoDELS

February 27, 2024

Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, Yu Su

From arXiv. ICLR 2024 (Spotlight)

By providing external information to large language models (LLMs), tool augmentation (including retrieval augmentation) has emerged as a promising solution for addressing the limitations of LLMs' static parametric memory. However, how receptive are LLMs to such external evidence, especially when the evidence conflicts with their parametric memory? We present the first comprehensive and controlled investigation into the behavior of LLMs when encountering knowledge conflicts. We propose a systematic framework to elicit high-quality parametric memory from LLMs and construct the corresponding counter-memory, which enables us to conduct a series of controlled experiments. Our investigation reveals seemingly contradicting behaviors of LLMs. On the one hand, different from prior wisdom, we find that LLMs can be highly receptive to external evidence even when it conflicts with their parametric memory, given that the external evidence is coherent and convincing. On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information that is consistent with their parametric memory, despite being presented with conflicting evidence at the same time. These results have important implications that are worth careful consideration for the further development and deployment of tool- and retrieval-augmented LLMs. Resources are available at https://github.com/OSU-NLP-Group/LLM-Knowledge-Conflict.

Performer · Language modeling · Sequence labeling · MoDELS · state-of-the-art

February 27, 2024

Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling

David Dukić, Jan Šnajder

Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs' poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs' performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs.
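
To make the idea of layer-wise causal mask removal concrete, here is a minimal sketch that builds one boolean attention mask per layer: layers listed in `unmasked_layers` may attend in both directions, while the rest keep the usual lower-triangular causal mask. The helper names and the choice of which layers to unmask are hypothetical; the abstract does not say how the authors implement or select them.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when token i may attend to token j (j <= i only)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def full_mask(seq_len):
    """Bidirectional mask: every token may attend to every other token."""
    return [[True] * seq_len for _ in range(seq_len)]


def layer_masks(num_layers, seq_len, unmasked_layers):
    """Return one attention mask per layer.

    `unmasked_layers` is a hypothetical set of layer indices whose causal
    mask is removed; all other layers stay causal.
    """
    return [
        full_mask(seq_len) if layer in unmasked_layers else causal_mask(seq_len)
        for layer in range(num_layers)
    ]


# Example: a 3-layer model where only the last layer sees tokens to the right.
masks = layer_masks(num_layers=3, seq_len=3, unmasked_layers={2})
for layer, mask in enumerate(masks):
    print(layer, mask[0])  # what the first token may attend to in each layer
```

In a real decoder these masks would be passed to each block's attention call during fine-tuning; the sketch only shows how the per-layer choice could be represented.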

Task-oriented dialogue systems · MoDELS · Language modeling · Comprehensibility · Extensibility

February 27, 2024

KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark

Seongbo Jang, Seonghyeon Lee, Hwanjo Yu

From arXiv. LREC-COLING 2024

As language models are often deployed as chatbot assistants, it becomes a virtue for models to engage in conversations in a user's first language. While these models are trained on a wide range of languages, a comprehensive evaluation of their proficiency in low-resource languages such as Korean has been lacking. In this work, we introduce KoDialogBench, a benchmark designed to assess language models' conversational capabilities in Korean. To this end, we collect native Korean dialogues on daily topics from public sources, or translate dialogues from other languages. We then structure these conversations into diverse test datasets, spanning from dialogue comprehension to response selection tasks. Leveraging the proposed benchmark, we conduct extensive evaluations and analyses of various language models to measure a foundational understanding of Korean dialogues. Experimental results indicate that there exists significant room for improvement in models' conversation skills. Furthermore, our in-depth comparisons across different language models highlight the effectiveness of recent training techniques in enhancing conversational proficiency. We anticipate that KoDialogBench will promote the progress towards conversation-aware Korean language models.

Large language models · Language modeling · MoDELS · Model evaluation · CASES

February 26, 2024

PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Jad Kabbara, Deb Roy

From arXiv. First version uploaded at IC2S2 in May 2023. Full paper submitted in Nov. 2023 and updated in Feb. 2024.

Despite the many use cases for large language models (LLMs) in creating personalized chatbots, there has been limited research on evaluating the extent to which the behaviors of personalized LLMs accurately and consistently reflect specific personality traits. We study the behavior of LLM-based agents, which we refer to as LLM personas, and present a case study with GPT-3.5 and GPT-4 to investigate whether LLMs can generate content that aligns with their assigned personality profiles. To this end, we simulate distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their essays with automatic and human evaluations. Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across five traits. Additionally, LLM personas' writings have emerging representative linguistic patterns for personality traits when compared with a human writing corpus. Furthermore, human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%. Interestingly, the accuracy drops significantly when the annotators are informed of the AI's authorship.

Large language models · Language modeling · Extensibility · MoDELS · Agent

February 26, 2024

LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments

Junzhe Chen, Xuming Hu, Shuodi Liu, Shiyu Huang, Wei-Wei Tu, Zhaofeng He, Lijie Wen

Recent advancements in large language models (LLMs) have revealed their potential for achieving autonomous agents possessing human-level intelligence. However, existing benchmarks for evaluating LLM agents either use static datasets, potentially leading to data leakage, or focus only on single-agent scenarios, overlooking the complexities of multi-agent interactions. There is a lack of a benchmark that evaluates the diverse capabilities of LLM agents in multi-agent, dynamic environments. To this end, we introduce LLMArena, a novel and easily extensible framework for evaluating the diverse capabilities of LLMs in multi-agent dynamic environments. LLMArena encompasses seven distinct gaming environments, employing TrueSkill scoring to assess crucial abilities in LLM agents, including spatial reasoning, strategic planning, numerical reasoning, risk assessment, communication, opponent modeling, and team collaboration. We conduct an extensive experiment and human evaluation among different sizes and types of LLMs, showing that LLMs still have a significant journey ahead in their development towards becoming fully autonomous agents, especially in opponent modeling and team collaboration. We hope LLMArena could guide future research towards enhancing these capabilities in LLMs, ultimately leading to more sophisticated and practical applications in dynamic, multi-agent settings. The code and data will be available.

Language modeling · Large language models · MoDELS · Processing (programming language) · Identifiable

February 26, 2024

Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

Tianyi Tang, Wenyang Luo, Haoyang Huang, Dongdong Zhang, Xiaolei Wang, Xin Zhao, Furu Wei, Ji-Rong Wen

Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora. It remains a challenging problem to explain the underlying mechanisms by which LLMs process multilingual texts. In this paper, we delve into the composition of Transformer architectures in LLMs to pinpoint language-specific regions. Specifically, we propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs. Based on LAPE, we conduct comprehensive experiments on two representative LLMs, namely LLaMA-2 and BLOOM. Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons, primarily situated in the models' top and bottom layers. Furthermore, we showcase the feasibility of "steering" the output language of LLMs by selectively activating or deactivating language-specific neurons. Our research provides important evidence for the understanding and exploration of the multilingual capabilities of LLMs.
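
The abstract names LAPE without giving a formula, so the following is only one plausible reading of the name: for each neuron, take its activation probability on text from each language, normalize those probabilities into a distribution over languages, and compute the entropy. Under this reading, a low value means the neuron fires mostly for one language. The function name, the handling of never-active neurons, and the example numbers are all assumptions for illustration.

```python
import math


def lape(activation_probs):
    """Entropy of a neuron's normalized per-language activation probabilities.

    `activation_probs[k]` is the (assumed) fraction of tokens in language k
    on which the neuron is active. Low entropy suggests language specificity.
    """
    total = sum(activation_probs)
    if total == 0:
        # A neuron that never fires carries no language signal; treat it as
        # maximally unspecific (assumption made for this sketch).
        return math.log(len(activation_probs))
    normalized = [p / total for p in activation_probs]
    return -sum(p * math.log(p) for p in normalized if p > 0)


# Made-up activation probabilities over three languages.
specific_neuron = [0.80, 0.01, 0.01]   # fires almost only for language 0
generic_neuron = [0.30, 0.30, 0.30]    # fires equally for all languages

print(lape(specific_neuron), lape(generic_neuron))
```

Ranking neurons by this score and keeping the lowest-entropy ones would give a candidate set of language-specific neurons; any thresholds used in practice would need to come from the paper itself.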

Extensibility · Reducible · Less · Language modeling · Processing (programming language)

February 23, 2024

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

Heegyu Kim, Sehyun Yuk, Hyunsouk Cho

From arXiv. Under review

Caution: This paper includes offensive words that could potentially cause unpleasantness. Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment is extensive and makes it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our findings can achieve less safety risk with fewer computational costs, allowing non-safety-aligned LMs to be easily utilized in real-world services.

MoDELS · CoT · Large language models · Language modeling · Prompt

February 23, 2024

Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models

Che Zhang, Zhenyang Xiao, Chengcheng Han, Yixin Lian, Yuejian Fang

Large language models (LLMs) have made significant strides in reasoning capabilities, with ongoing efforts to refine their reasoning through self-correction. However, recent studies suggest that self-correction can be limited or even counterproductive without external accurate knowledge, raising questions about the limits and effectiveness of self-correction. In this paper, we aim to enhance LLMs' self-checking capabilities by meticulously designing training data, thereby improving the accuracy of self-correction. We conduct a detailed analysis of error types in mathematical reasoning and develop a tailored prompt, termed "Step CoT Check". Then we construct a checking-correction dataset for training models. After integrating the original CoT data and checking-correction data for training, we observe that models can improve their self-checking capabilities, thereby enhancing their self-correction capacity and eliminating the need for external feedback or ground truth labels to ascertain the endpoint of correction. We compare the performance of models fine-tuned with the "Step CoT Check" prompt against those refined using other prompts within the context of checking-correction data. The "Step CoT Check" outperforms the other two check formats in models with larger parameter counts, providing more precise feedback and thus achieving a higher rate of correctness. For reproducibility, all the datasets and code are provided at https://github.com/bammt/Learn-to-check.

Language modeling · MoDELS · Taxonomy · AIM · Divergence

September 3, 2023

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi

From arXiv. Work in progress; 32 pages

While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.