Due to the cumbersome nature of human evaluation and the limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to ``validate the validators'' -- aligning LLM-generated evaluation functions (be they prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub \emph{criteria drift}: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear \emph{dependent} on the specific LLM outputs observed (rather than independent criteria that can be defined \emph{a priori}), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
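The selection step in the abstract (choosing candidate implementations that agree with human grades) can be illustrated with a minimal sketch. This is not EvalGen's actual algorithm: it assumes each candidate is a plain Python predicate over an output string, that human grades are booleans, and that alignment is measured as a simple agreement rate. The names `alignment` and `select_best`, the candidate assertions, and the graded outputs are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: an assertion maps an LLM output to pass (True) / fail (False).
Assertion = Callable[[str], bool]


def alignment(candidate: Assertion, graded: List[Tuple[str, bool]]) -> float:
    """Fraction of human-graded outputs on which the candidate agrees with the human.

    Assumption: plain agreement rate; a real system may weight false passes
    and false failures differently.
    """
    if not graded:
        return 0.0
    agree = sum(candidate(output) == grade for output, grade in graded)
    return agree / len(graded)


def select_best(candidates: Dict[str, Assertion],
                graded: List[Tuple[str, bool]]) -> str:
    """Return the name of the candidate with the highest agreement rate."""
    return max(candidates, key=lambda name: alignment(candidates[name], graded))


# Made-up human grades for a hypothetical "response is concise" criterion.
graded = [
    ("Short answer.", True),
    ("A very long rambling answer " * 20, False),
    ("Concise reply", True),
]

# Two made-up candidate implementations of that criterion.
candidates = {
    "under_100_chars": lambda s: len(s) < 100,
    "ends_with_period": lambda s: s.endswith("."),
}

best = select_best(candidates, graded)
print(best, alignment(candidates[best], graded))
```

Because the grades come from only a subset of outputs, the agreement rate is an estimate; grading more outputs can change which candidate is selected.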

Related content

Large Language Models

Large language models are deep learning models trained on massive amounts of text data. They can not only generate natural-language text but also understand its meaning in depth and handle a variety of natural language tasks, such as text summarization, question answering, and translation. In 2023, large language models and their applications in artificial intelligence became a focus of technology research worldwide. Their growth in scale has been especially striking, with parameter counts jumping from just over a billion initially to a trillion today. The increase in parameters lets models capture the subtleties of human language more finely and understand its complexity more deeply. Over the past year, large language models have improved markedly in absorbing new knowledge, decomposing complex tasks, and aligning images with text. As the technology matures, they will continue to expand their range of applications, providing more intelligent and personalized services and further improving how people live and work.

Confidence · MoDELS · Language modeling · Large language models · Comprehensibility

June 3, 2024

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models

Abhishek Kumar,Robert Morabito,Sanzhar Umbet,Jad Kabbara,Ali Emami

from arxiv, 9 pages (excluding references), accepted to ACL 2024 Main Conference

As the use of Large Language Models (LLMs) becomes more widespread, understanding their self-evaluation of confidence in generated responses becomes increasingly important, as it is integral to the reliability of these models' output. We introduce the concept of Confidence-Probability Alignment, which connects an LLM's internal confidence, quantified by token probabilities, to the confidence conveyed in the model's response when explicitly asked about its certainty. Using various datasets and prompting techniques that encourage model introspection, we probe the alignment between models' internal and expressed confidence. These techniques encompass using structured evaluation scales to rate confidence, including answer options when prompting, and eliciting the model's confidence level for outputs it does not recognize as its own. Notably, among the models analyzed, OpenAI's GPT-4 showed the strongest confidence-probability alignment, with an average Spearman's $\hat{\rho}$ of 0.42 across a wide range of tasks. Our work contributes to the ongoing efforts to facilitate risk assessment in the application of LLMs and to further our understanding of model trustworthiness.
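The alignment measure named in the abstract, Spearman's rank correlation, is the Pearson correlation computed on ranks. The sketch below shows that computation on made-up data and is not the authors' evaluation code. It assumes internal confidence is a single token probability per answer and that expressed confidence is a 1-5 self-rating; `average_ranks`, `pearson`, `spearman`, and both data lists are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data: probability of the answer token vs. the model's stated confidence.
token_prob = [0.91, 0.40, 0.75, 0.55, 0.98]
stated_conf = [4, 2, 5, 3, 5]  # hypothetical 1-5 self-rating scale

rho = spearman(token_prob, stated_conf)
print(f"Spearman rho = {rho:.3f}")
```

A value near 1 would mean the model's stated confidence orders answers the same way its token probabilities do; a value near 0 would mean the two orderings are unrelated.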

Sample · Estimation/Estimator · HTTPS · Discriminator · Machine Learning

June 3, 2024

Feature Attribution with Necessity and Sufficiency via Dual-stage Perturbation Test for Causal Explanation

Xuexin Chen,Ruichu Cai,Zhengting Huang,Yuxuan Zhu,Julien Horwood,Zhifeng Hao,Zijian Li,Jose Miguel Hernandez-Lobato

from arxiv, Accepted in the Proceedings of the 41st International Conference on Machine Learning (ICML2024)

We investigate the problem of explainability for machine learning models, focusing on Feature Attribution Methods (FAMs) that evaluate feature importance through perturbation tests. Despite their utility, FAMs struggle to distinguish the contributions of different features when their prediction changes after perturbation are similar. To enhance FAMs' discriminative power, we introduce Feature Attribution with Necessity and Sufficiency (FANS), which finds a neighborhood of the input such that perturbing samples within this neighborhood has a high Probability of being a Necessity and Sufficiency (PNS) cause for the change in predictions, and uses this PNS as the importance of the feature. Specifically, FANS computes this PNS via a heuristic strategy for estimating the neighborhood and a perturbation test involving two stages (factual and interventional) for counterfactual reasoning. To generate counterfactual samples, we use a resampling-based approach on the observed samples to approximate the required conditional distribution. We demonstrate that FANS outperforms existing attribution methods on six benchmarks. Please refer to the source code via \url{//github.com/DMIRLAB-Group/FANS}.

Agent · Prompt · Output · MoDELS · Operation

June 3, 2024

The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative

Zhen Tan,Chengshuai Zhao,Raha Moraffah,Yifan Li,Yu Kong,Tianlong Chen,Huan Liu

from arxiv, Accepted to workshop on ReGenAI@CVPR 2024

Due to their unprecedented ability to process and respond to various types of data, Multimodal Large Language Models (MLLMs) are constantly defining the new boundary of Artificial General Intelligence (AGI). As these advanced generative models increasingly form collaborative networks for complex tasks, the integrity and security of these systems are crucial. Our paper, ``The Wolf Within'', explores a novel vulnerability in MLLM societies: the indirect propagation of malicious content. Unlike direct harmful output generation for MLLMs, our research demonstrates how a single MLLM agent can be subtly influenced to generate prompts that, in turn, induce other MLLM agents in the society to output malicious content. Our findings reveal that an MLLM agent, when manipulated to produce specific prompts or instructions, can effectively ``infect'' other agents within a society of MLLMs. This infection leads to the generation and circulation of harmful outputs, such as dangerous instructions or misinformation, across the society. We also show the transferability of these indirectly generated prompts, highlighting their potential to propagate malice through inter-agent communication. This research provides a critical insight into a new dimension of threat posed by MLLMs, where a single agent can act as a catalyst for widespread malevolent influence. Our work underscores the urgent need for developing robust mechanisms to detect and mitigate such covert manipulations within MLLM societies, ensuring their safe and ethical utilization in societal applications.

Language modeling · Large language models · MoDELS · GPT-4 · Guidance

June 2, 2024

NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

Miao Li,Ming-Bin Chen,Bo Tang,Shengbin Hou,Pengyu Wang,Haiying Deng,Zhiyu Li,Feiyu Xiong,Keming Mao,Peng Cheng,Yi Luo

from arxiv, Long paper, ACL 2024 Main

We present NewsBench, a novel evaluation framework to systematically assess the editorial capabilities of Large Language Models (LLMs) in Chinese journalism. Our benchmark dataset focuses on four facets of writing proficiency and six facets of safety adherence, and comprises 1,267 manually and carefully designed test samples, in the form of multiple-choice and short-answer questions, for five editorial tasks in 24 news domains. To measure performance, we propose separate GPT-4-based automatic evaluation protocols that assess LLM generations for short-answer questions in terms of writing proficiency and safety adherence; both are validated by high correlations with human evaluations. Based on the systematic evaluation framework, we conduct a comprehensive analysis of ten popular LLMs that can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as top performers, yet reveal a relative deficiency in journalistic safety adherence in creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.

Junction tree algorithm · Inference · Extensibility · Conditionally independent · Mutually independent

May 31, 2024

On the Completeness and Complexity of the Lifted Dynamic Junction Tree Algorithm

Marcel Gehrke

from arxiv, StaRAI 2021

For static lifted inference algorithms, completeness, i.e., domain liftability, is extensively studied. However, so far no domain liftability results for temporal lifted inference algorithms exist. In this paper, we close this gap. More precisely, we contribute the first completeness and complexity analysis for a temporal lifted algorithm, the so-called lifted dynamic junction tree algorithm (LDJT), which is the only exact lifted temporal inference algorithm currently available. To handle temporal aspects efficiently, LDJT uses conditional independences to proceed in time, leading to restrictions w.r.t. elimination orders. We show that these restrictions influence the domain liftability results and that one particular case, which arises while proceeding in time, has to be excluded from FO12. Additionally, for the complexity of LDJT, we prove that the lifted width is in even more cases smaller than the corresponding treewidth in comparison to static inference.

Shuffle · Functional · Convex function · Performer · ONCE

May 30, 2024

On the Last-Iterate Convergence of Shuffling Gradient Methods

Zijian Liu,Zhengyuan Zhou

from arxiv, ICML 2024

Shuffling gradient methods are widely implemented in practice, particularly including three popular algorithms: Random Reshuffle (RR), Shuffle Once (SO), and Incremental Gradient (IG). Despite their empirical success, the theoretical guarantees of shuffling gradient methods were not well understood for a long time. Until recently, convergence rates had only been established for the average iterate for convex functions and the last iterate for strongly convex problems (using squared distance as the metric). However, when using the function value gap as the convergence criterion, existing theories cannot explain the good performance of the last iterate in different settings (e.g., constrained optimization). To bridge this gap between practice and theory, we prove the first last-iterate convergence rates for shuffling gradient methods with respect to the objective value even without strong convexity. Our new results either (nearly) match the existing last-iterate lower bounds or are as fast as the previous best upper bounds for the average iterate.
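The three schemes named in the abstract differ only in how each epoch orders the component functions: RR draws a fresh permutation every epoch, SO draws one permutation and reuses it, and IG always uses the natural order. The sketch below illustrates this on a toy objective and returns the last iterate; it is not the paper's analysis or code. The objective (a sum of terms `0.5 * (x - t_i)**2`), the constant step size, and the names `epoch_orders` and `shuffling_gd` are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Sequence


def epoch_orders(n: int, epochs: int, scheme: str, seed: int = 0) -> Iterator[List[int]]:
    """Yield the component order used in each epoch for scheme RR, SO, or IG."""
    if scheme not in ("RR", "SO", "IG"):
        raise ValueError(f"unknown scheme: {scheme}")
    rng = random.Random(seed)
    fixed = list(range(n))
    if scheme == "SO":
        rng.shuffle(fixed)  # Shuffle Once: one permutation, reused every epoch
    for _ in range(epochs):
        if scheme == "RR":
            perm = list(range(n))
            rng.shuffle(perm)  # Random Reshuffle: fresh permutation per epoch
            yield perm
        else:
            yield list(fixed)  # SO: the shuffled order; IG: the natural order


def shuffling_gd(targets: Sequence[float], epochs: int, scheme: str,
                 lr: float = 0.1, seed: int = 0) -> float:
    """Run one pass per epoch over f_i(x) = 0.5 * (x - t_i)**2 and return the last iterate."""
    x = 0.0
    for perm in epoch_orders(len(targets), epochs, scheme, seed):
        for i in perm:
            x -= lr * (x - targets[i])  # gradient of f_i at x is (x - t_i)
    return x


targets = [1.0, 2.0, 3.0]  # made-up data; the average objective is minimized at x = 2
for scheme in ("RR", "SO", "IG"):
    print(scheme, shuffling_gd(targets, epochs=50, scheme=scheme))
```

With a constant step size, the last iterate here settles near the minimizer but retains an order-dependent offset, which is why analyses of these methods typically tie the step size to the number of epochs.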

MoDELS · Large language models · INFORMS · Identifiable · Ground truth

May 30, 2024

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan,John Hughes,Dan Valentine,Laura Ruis,Kshitij Sachan,Ansh Radhakrishnan,Edward Grefenstette,Samuel R. Bowman,Tim Rocktäschel,Ethan Perez

from arxiv, For code please check: //github.com/ucl-dark/llm_debate

Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.

Knowledge · Question answering · Basis · Knowledge base · Language modeling

May 30, 2024

ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models

Haoran Luo,Haihong E,Zichen Tang,Shiyao Peng,Yikai Guo,Wentai Zhang,Chenghao Ma,Guanting Dong,Meina Song,Wei Lin,Yifan Zhu,Luu Anh Tuan

from arxiv, Accepted by Findings of ACL 2024

Knowledge Base Question Answering (KBQA) aims to answer natural language questions over large-scale knowledge bases (KBs), which can be summarized into two crucial steps: knowledge retrieval and semantic parsing. However, three core challenges remain: inefficient knowledge retrieval, retrieval mistakes adversely impacting semantic parsing, and the complexity of previous KBQA methods. To tackle these challenges, we introduce ChatKBQA, a novel and simple generate-then-retrieve KBQA framework, which first generates the logical form with fine-tuned LLMs, then retrieves and replaces entities and relations with an unsupervised retrieval method, to improve both generation and retrieval more directly. Experimental results show that ChatKBQA achieves new state-of-the-art performance on the standard KBQA datasets WebQSP and CWQ. This work can also be regarded as a new paradigm for combining LLMs with knowledge graphs (KGs) for interpretable and knowledge-required question answering. Our code is publicly available.

Knowledge · Processing (programming language) · Graph · NLP · Knowledge graph

September 30, 2022

A Decade of Knowledge Graphs in Natural Language Processing: A Survey

Phillip Schneider,Tim Schopf,Juraj Vladika,Mikhail Galkin,Elena Simperl,Florian Matthes

from arxiv, Accepted to AACL-IJCNLP 2022

In pace with developments in the research field of artificial intelligence, knowledge graphs (KGs) have attracted a surge of interest from both academia and industry. As a representation of semantic relations between entities, KGs have proven to be particularly relevant for natural language processing (NLP), experiencing a rapid spread and wide adoption within recent years. Given the increasing amount of research work in this area, several KG-related approaches have been surveyed in the NLP research community. However, a comprehensive study that categorizes established topics and reviews the maturity of individual research streams remains absent to this day. Contributing to closing this gap, we systematically analyzed 507 papers from the literature on KGs in NLP. Our survey encompasses a multifaceted review of tasks, research types, and contributions. As a result, we present a structured overview of the research landscape, provide a taxonomy of tasks, summarize our findings, and highlight directions for future work.

entity · MINE · Reducible · Normalized · Entity alignment

March 29, 2021

Boosting the Speed of Entity Alignment 10×: Dual Attention Matching Network with Normalized Hard Sample Mining

Xin Mao,Wenting Wang,Yuanbin Wu,Man Lan

from arxiv, 12 pages; Accepted by TheWebConf(WWW) 2021

Seeking equivalent entities among multi-source Knowledge Graphs (KGs) is the pivotal step in KG integration, also known as \emph{entity alignment} (EA). However, most existing EA methods are inefficient and scale poorly. A recent summary points out that some of them even require several days to deal with a dataset containing 200,000 nodes (DWY100K). We believe overly complex graph encoders and inefficient negative sampling strategies are the two main reasons. In this paper, we propose a novel KG encoder -- Dual Attention Matching Network (Dual-AMN), which not only models both intra-graph and cross-graph information smartly, but also greatly reduces computational complexity. Furthermore, we propose the Normalized Hard Sample Mining Loss to smoothly select hard negative samples with reduced loss shift. The experimental results on widely used public datasets indicate that our method achieves both high accuracy and high efficiency. On DWY100K, the whole running process of our method can be finished in 1,100 seconds, at least 10× faster than previous work. Our method also outperforms previous work across all datasets, with Hits@1 and MRR improvements ranging from 6% to 13%.
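The two metrics cited at the end of the abstract, Hits@1 and MRR, are both computed from the rank of the true counterpart among all candidates. The following sketch shows the standard definitions on a made-up similarity matrix; it is not the Dual-AMN evaluation code. It assumes that row `i` scores source entity `i` against every target entity, that the true match of entity `i` is target `i`, and that ties count against the model only when a score is strictly higher. The names `true_ranks`, `hits_at_k`, and `mrr` are hypothetical.

```python
from typing import List, Sequence


def true_ranks(sim: Sequence[Sequence[float]]) -> List[int]:
    """Rank of the true target (assumed to be index i for row i), 1 = best."""
    ranks = []
    for i, row in enumerate(sim):
        higher = sum(score > row[i] for score in row)  # candidates scored strictly above the true one
        ranks.append(1 + higher)
    return ranks


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Share of source entities whose true target is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mrr(ranks: Sequence[int]) -> float:
    """Mean reciprocal rank of the true targets."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up similarity scores between 3 source and 3 target entities.
sim = [
    [0.9, 0.1, 0.3],  # true target ranked 1st
    [0.4, 0.2, 0.8],  # true target ranked 3rd
    [0.2, 0.3, 0.7],  # true target ranked 1st
]

ranks = true_ranks(sim)
print("ranks:", ranks)
print("Hits@1:", hits_at_k(ranks, 1))
print("MRR:", mrr(ranks))
```

Hits@1 only rewards exact top-ranked matches, while MRR also gives partial credit when the true counterpart is ranked near the top.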