
As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely cited barrier to the adoption of LLMs is their sensitivity to prompt wording -- but interestingly, humans also display sensitivities to instruction changes in the form of response biases. As such, we argue that if LLMs are going to be used to approximate human opinions, it is necessary to investigate the extent to which LLMs also reflect human response biases, if at all. In this work, we use survey design as a case study, where human response biases caused by permutations in wordings of "prompts" have been extensively studied. Drawing from prior work in social psychology, we design a dataset and propose a framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior. These inconsistencies tend to be more prominent in models that have been instruction fine-tuned. Furthermore, even if a model shows a significant change in the same direction as humans, we find that perturbations that are not meant to elicit significant changes in humans may also result in a similar change, suggesting that such a result could be partially due to other spurious correlations. These results highlight the potential pitfalls of using LLMs to substitute for humans in parts of the annotation pipeline, and further underscore the importance of finer-grained characterizations of model behavior. Our code, dataset, and collected samples are available at //github.com/lindiatjuatja/BiasMonkey

Related content

MoDELS

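The ACM/IEEE 23rd International Conference on Model Driven Engineering Languages and Systems is the premier conference series for model-driven software and systems engineering, organized with the support of ACM-SIGSOFT and IEEE-TCSE. Since 1998, MODELS has covered all aspects of modeling, from languages and methods to tools and applications. MODELS attendees come from diverse backgrounds, including researchers, academics, engineers, and industry professionals. MODELS 2019 is a forum where participants can exchange cutting-edge research results and innovative practical experience around modeling and model-driven software and systems. This year's edition will give the modeling community an opportunity to further advance the foundations of modeling, and to propose innovative applications of modeling in emerging areas such as cyber-physical systems, embedded systems, socio-technical systems, cloud computing, big data, machine learning, security, open source, and sustainability. Official website link:

SimPLe · Criterion · Approximation ·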

December 25, 2023

Frege's theory of types

Bruno Bentzen

It is often claimed that the theory of function levels proposed by Frege in Grundgesetze der Arithmetik anticipates the hierarchy of types that underlies Church's simple theory of types. This claim roughly states that Frege presupposes a type of functions in the sense of simple type theory in the expository language of Grundgesetze. However, this view makes it hard to accommodate function names of two arguments and to view functions as incomplete entities. I propose and defend an alternative interpretation of first-level function names in Grundgesetze into simple type-theoretic open terms rather than into closed terms of a function type. This interpretation offers a still unhistorical but more faithful type-theoretic approximation of Frege's theory of levels and can be naturally extended to accommodate second-level functions. It is made possible by two key observations: that Frege's Roman markers behave essentially like open terms, and that Frege lacks a clear criterion for distinguishing between Roman markers and function names.

Large language models · Language modeling · GPT-3.5 · MoDELS · Knowledge ·

December 24, 2023

Can large language models reason about medical questions?

Valentin Liévin,Christoffer Egeberg Hother,Andreas Geert Motzfeldt,Ole Winther

from arxiv, 37 pages, 23 figures. v1: results using InstructGPT, v2.0: added the Codex experiments, v2.1: added the missing test MedMCQA results for Codex 5-shot CoT and using k=100 samples, v3.0: added results for open source models -- ready for publication (final version)

Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama-2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), few-shot and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions, but also reaches the passing score on three datasets: MedQA-USMLE 60.2%, MedMCQA 62.7% and PubMedQA 78.2%. Open-source models are closing the gap: Llama-2 70B also passed the MedQA-USMLE with 62.5% accuracy.

ChatGPT · Performer · ReQuEST · Examples · Prompt ·

December 23, 2023

ChatGPT and post-test probability

Samuel J. Weisenthal

from arxiv, 138 pages, 4 tables

Reinforcement learning-based large language models, such as ChatGPT, are believed to have potential to aid human experts in many domains, including healthcare. There is, however, little work on ChatGPT's ability to perform a key task in healthcare: formal, probabilistic medical diagnostic reasoning. This type of reasoning is used, for example, to update a pre-test probability to a post-test probability. In this work, we probe ChatGPT's ability to perform this task. In particular, we ask ChatGPT to give examples of how to use Bayes' rule for medical diagnosis. Our prompts range from queries that use terminology from pure probability (e.g., requests for a posterior of A given B and C) to queries that use terminology from medical diagnosis (e.g., requests for a posterior probability of Covid given a test result and cough). We show how the introduction of medical variable names leads to an increase in the number of errors that ChatGPT makes. Given our results, we also show how one can use prompt engineering to facilitate ChatGPT's partial avoidance of these errors. We discuss our results in light of recent commentaries on sensitivity and specificity. We also discuss how our results might inform new research directions for large language models.
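To make the pre-test to post-test update concrete, here is a minimal sketch of Bayes' rule for a single binary diagnostic test. The function name `post_test_probability` and the numbers in the usage note are illustrative assumptions, not taken from the paper.

```python
def post_test_probability(pre_test: float, sensitivity: float,
                          specificity: float, positive: bool = True) -> float:
    """Update a pre-test disease probability with one binary test result.

    sensitivity = P(test positive | disease)
    specificity = P(test negative | no disease)
    """
    for name, value in (("pre_test", pre_test),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    if positive:
        likelihood_disease = sensitivity           # P(T+ | D)
        likelihood_healthy = 1.0 - specificity     # P(T+ | not D)
    else:
        likelihood_disease = 1.0 - sensitivity     # P(T- | D)
        likelihood_healthy = specificity           # P(T- | not D)

    numerator = likelihood_disease * pre_test
    denominator = numerator + likelihood_healthy * (1.0 - pre_test)
    if denominator == 0.0:
        raise ValueError("the observed result has probability zero under these inputs")
    return numerator / denominator
```

With a hypothetical pre-test probability of 0.1, sensitivity 0.9 and specificity 0.8, a positive result gives 0.09 / (0.09 + 0.18) = 1/3, while a negative result gives 0.01 / (0.01 + 0.72), roughly 0.014.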

Mutually independent · Normalized · Performer · HTTPS · Statistic ·

December 23, 2023

Testing multivariate normality by testing independence

Povilas Daniušis

from arxiv, 6 pages, 1 figure

We propose a simple multivariate normality test based on Kac-Bernstein's characterization, which can be conducted by utilising existing statistical independence tests for sums and differences of data samples. We also investigate the test empirically, which reveals that for high-dimensional data, the proposed approach may be more efficient than the alternative ones. The accompanying code repository is provided at //shorturl.at/rtuy5.
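The Kac-Bernstein characterization says that if X and Y are independent and X + Y and X - Y are also independent, then X and Y are normally distributed. The idea can be sketched as follows: split the sample into two halves, form sums and differences, and test them for independence. The abstract does not say which independence test the author uses, so the permutation test on distance covariance below is a stand-in assumption, and all function names are hypothetical.

```python
import numpy as np


def _centered_distances(a: np.ndarray) -> np.ndarray:
    """Double-centered matrix of pairwise Euclidean distances between rows of `a`."""
    d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_covariance_sq(u: np.ndarray, v: np.ndarray) -> float:
    """Squared sample distance covariance of two paired 2-D samples."""
    return float((_centered_distances(u) * _centered_distances(v)).mean())


def kac_bernstein_pvalue(data, n_permutations: int = 200, seed: int = 0) -> float:
    """Permutation p-value for independence of sums and differences.

    A small p-value is evidence against normality of the rows of `data`.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data[:, None]
    data = rng.permutation(data)          # shuffle rows before splitting
    half = data.shape[0] // 2
    if half < 2:
        raise ValueError("need at least 4 observations")
    x, y = data[:half], data[half:2 * half]   # two independent halves
    a = _centered_distances(x + y)            # sums
    b = _centered_distances(x - y)            # differences
    observed = (a * b).mean()
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(half)
        # Re-pair each sum with a randomly chosen difference.
        if (a * b[np.ix_(perm, perm)]).mean() >= observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)
```

Calling `kac_bernstein_pvalue(sample)` on an `(n, d)` array returns a value in (0, 1]. The pairwise distance matrices need memory quadratic in n / 2, so this sketch is only suitable for moderate sample sizes.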

Performer · Scenario · Monte Carlo · Statistic · Decomposed ·

December 22, 2023

Combining support for hypotheses over heterogeneous studies with Bayesian Evidence Synthesis: A simulation study

Thom Benjamin Volker,Irene Klugkist

Scientific claims gain credibility by replicability, especially if replication under different circumstances and varying designs yields equivalent results. Aggregating results over multiple studies is, however, not straightforward, and when the heterogeneity between studies increases, conventional methods such as (Bayesian) meta-analysis and Bayesian sequential updating become infeasible. *Bayesian Evidence Synthesis*, built upon the foundations of the Bayes factor, allows aggregating support for conceptually similar hypotheses over studies, regardless of methodological differences. We assess the performance of Bayesian Evidence Synthesis over multiple effect and sample sizes, with a broad set of (inequality-constrained) hypotheses using Monte Carlo simulations, focusing explicitly on the complexity of the hypotheses under consideration. The simulations show that this method can evaluate complex (informative) hypotheses regardless of methodological differences between studies, and performs adequately if the set of studies considered has sufficient statistical power. Additionally, we pinpoint challenging conditions that can lead to unsatisfactory results, and provide suggestions on handling these situations. Ultimately, we show that Bayesian Evidence Synthesis is a promising tool that can be used when traditional research synthesis methods are not applicable due to insurmountable between-study heterogeneity.
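One common way to combine Bayes factors over independent studies is to multiply them per hypothesis (that is, add their logarithms) and then normalize into posterior model probabilities. The sketch below illustrates that idea under the assumption of equal prior model probabilities. It is not necessarily the exact procedure evaluated in the paper, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def aggregate_log_bayes_factors(log_bfs_per_study: Sequence[Sequence[float]]) -> List[float]:
    """Sum log Bayes factors per hypothesis over studies.

    log_bfs_per_study[s][h] is the log Bayes factor of hypothesis h
    (against a common reference) in study s.
    """
    n_hypotheses = len(log_bfs_per_study[0])
    if any(len(study) != n_hypotheses for study in log_bfs_per_study):
        raise ValueError("every study must report the same hypotheses")
    return [sum(study[h] for study in log_bfs_per_study) for h in range(n_hypotheses)]


def posterior_model_probabilities(total_log_bfs: Sequence[float]) -> List[float]:
    """Normalize aggregated log Bayes factors, assuming equal prior probabilities."""
    largest = max(total_log_bfs)                       # subtract max for numerical stability
    weights = [math.exp(value - largest) for value in total_log_bfs]
    total = sum(weights)
    return [weight / total for weight in weights]
```

For example, if two hypothetical studies favor a hypothesis over the reference with Bayes factors 2 and 3, the aggregated Bayes factor is 6, so with equal priors the hypothesis receives probability 6/7 and the reference 1/7.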

Large language models · Language modeling · Biased · CASES · MoDELS ·

December 22, 2023

Use large language models to promote equity

Emma Pierson,Divya Shanmugam,Rajiv Movva,Jon Kleinberg,Monica Agrawal,Mark Dredze,Kadija Ferryman,Judy Wawira Gichoya,Dan Jurafsky,Pang Wei Koh,Karen Levy,Sendhil Mullainathan,Ziad Obermeyer,Harini Suresh,Keyon Vafa

Advances in large language models (LLMs) have driven an explosion of interest in their societal impacts. Much of the discourse around how they will impact social equity has been cautionary or negative, focusing on questions like "how might LLMs be biased and how would we mitigate those biases?" This is a vital discussion: the ways in which AI generally, and LLMs specifically, can entrench biases have been well-documented. But equally vital, and much less discussed, is the more opportunity-focused counterpoint: "what promising applications do LLMs enable that could promote equity?" If LLMs are to enable a more equitable world, it is not enough just to play defense against their biases and failure modes. We must also go on offense, applying them positively to equity-enhancing use cases to increase opportunities for underserved groups and reduce societal discrimination. There are many choices that determine the impact of AI, and a fundamental choice very early in the pipeline is the problems we choose to apply it to. If we focus only later in the pipeline -- making LLMs marginally more fair as they facilitate use cases which intrinsically entrench power -- we will miss an important opportunity to guide them to equitable impacts. Here, we highlight the emerging potential of LLMs to promote equity by presenting four newly possible, promising research directions, while keeping risks and cautionary points in clear view.

Normalized · CASE · Tokenization · Pair · MoDELS ·

December 22, 2023

Text normalization for low-resource languages: the case of Ligurian

Stefano Lusito,Edoardo Ferrante,Jean Maillard

Text normalization is a crucial technology for low-resource languages that lack rigid spelling conventions or have undergone multiple spelling reforms. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian. We show that, in spite of the small amounts of data available, a compact transformer-based model can be trained to achieve very low error rates through backtranslation and appropriate tokenization.

prototype · Manifold · DNN · Separated · Robustness ·

December 22, 2023

GROOD: GRadient-aware Out-Of-Distribution detection in interpolated manifolds

Mostafa ElAraby,Sabyasachi Sahoo,Yann Pequignot,Paul Novello,Liam Paull

from arxiv, 11 pages, 5 figures, preprint under review

Deep neural networks (DNNs) often fail silently with over-confident predictions on out-of-distribution (OOD) samples, posing risks in real-world deployments. Existing techniques predominantly emphasize either the feature representation space or the gradient norms computed with respect to DNN parameters, yet they overlook the intricate gradient distribution and the topology of classification regions. To address this gap, we introduce GRadient-aware Out-Of-Distribution detection in interpolated manifolds (GROOD), a novel framework that relies on the discriminative power of gradient space to distinguish between in-distribution (ID) and OOD samples. To build this space, GROOD relies on class prototypes together with a prototype that specifically captures OOD characteristics. Uniquely, our approach incorporates a targeted mix-up operation at an early intermediate layer of the DNN to refine the separation of gradient spaces between ID and OOD samples. We quantify OOD detection efficacy using the distance to the nearest neighbor gradients derived from the training set, yielding a robust OOD score. Experimental evaluations substantiate that the introduction of targeted input mix-up amplifies the separation between ID and OOD in the gradient space, yielding impressive results across diverse datasets. Notably, when benchmarked against ImageNet-1k, GROOD surpasses the established robustness of state-of-the-art baselines. Through this work, we establish the utility of leveraging gradient spaces and class prototypes for enhanced OOD detection for DNN in image classification.
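The final scoring step described above, the distance from a sample's gradient to its nearest neighbor among training-set gradients, can be sketched in a few lines. This covers only the scoring step: how GROOD computes the gradient vectors (prototypes, mix-up, choice of layer) is not reproduced here, so the arrays below are placeholder stand-ins and the function name is hypothetical.

```python
import numpy as np


def nearest_neighbor_score(train_vectors: np.ndarray, query: np.ndarray) -> float:
    """Euclidean distance from `query` to its closest row of `train_vectors`.

    A larger distance means the query lies farther from everything seen in
    training, which is read as more likely out-of-distribution.
    """
    train_vectors = np.asarray(train_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    distances = np.linalg.norm(train_vectors - query, axis=1)
    return float(distances.min())


# Placeholder "gradient" vectors, assumed for illustration only.
train = np.array([[0.0, 0.0], [3.0, 4.0]])
near_score = nearest_neighbor_score(train, np.array([0.0, 1.0]))    # closest row is [0, 0]
far_score = nearest_neighbor_score(train, np.array([10.0, 10.0]))   # far from both rows
```

In practice a threshold on this score would be chosen on held-out data. For large training sets, an approximate nearest-neighbor index would replace the brute-force scan used here.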

binary · SLIP · AI · GROUP · Social computing ·

December 21, 2023

Don't slip into binary thinking about AI

Thorin Bristow,Luke Thorburn

from arxiv, 19 pages

In discussions about the development and governance of AI, a false binary is often drawn between two groups: those most concerned about the existing, social impacts of AI, and those most concerned about possible future risks of powerful AI systems taking actions that don't align with human interests. In this piece, we (i) describe the emergence of this false binary, (ii) explain why the seemingly clean distinctions drawn between these two groups don't hold up under scrutiny and (iii) highlight efforts to bridge this divide.

Learned · Performer · Deep learning · Processing (programming language) · Image processing ·

July 31, 2018

Deep learning in agriculture: A survey

Andreas Kamilaris,Francesc X. Prenafeta-Boldu

Deep learning constitutes a recent, modern technique for image processing and data analysis, with promising results and large potential. As deep learning has been successfully applied in various domains, it has recently also entered the domain of agriculture. In this paper, we perform a survey of 40 research efforts that employ deep learning techniques, applied to various agricultural and food production challenges. We examine the particular agricultural problems under study, the specific models and frameworks employed, the sources, nature and pre-processing of data used, and the overall performance achieved according to the metrics used in each work under study. Moreover, we study comparisons of deep learning with other existing popular techniques, with respect to differences in classification or regression performance. Our findings indicate that deep learning provides high accuracy, outperforming existing commonly used image processing techniques.