As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely-cited barrier to the adoption of LLMs is their sensitivity to prompt wording -- but interestingly, humans also display sensitivities to instruction changes in the form of response biases. As such, we argue that if LLMs are going to be used to approximate human opinions, it is necessary to investigate the extent to which LLMs also reflect human response biases, if at all. In this work, we use survey design as a case study, where human response biases caused by permutations in wordings of ``prompts'' have been extensively studied. Drawing from prior work in social psychology, we design a dataset and propose a framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior. These inconsistencies tend to be more prominent in models that have been instruction fine-tuned. Furthermore, even if a model shows a significant change in the same direction as humans, we find that perturbations that are not meant to elicit significant changes in humans may also result in a similar change, suggesting that such a result could be partially due to other spurious correlations. These results highlight the potential pitfalls of using LLMs to substitute humans in parts of the annotation pipeline, and further underscore the importance of finer-grained characterizations of model behavior. Our code, dataset, and collected samples are available at //github.com/lindiatjuatja/BiasMonkey
It is often claimed that the theory of function levels proposed by Frege in Grundgesetze der Arithmetik anticipates the hierarchy of types that underlies Church's simple theory of types. This claim roughly states that Frege presupposes a type of functions in the sense of simple type theory in the expository language of Grundgesetze. However, this view makes it hard to accommodate function names of two arguments and view functions as incomplete entities. I propose and defend an alternative interpretation of first-level function names in Grundgesetze into simple type-theoretic open terms rather than into closed terms of a function type. This interpretation offers a still unhistorical but more faithful type-theoretic approximation of Frege's theory of levels and can be naturally extended to accommodate second-level functions. It is made possible by two key observations that Frege's Roman markers behave essentially like open terms and that Frege lacks a clear criterion for distinguishing between Roman markers and function names.
Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether close- and open-source models (GPT-3.5, LLama-2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), few-shot and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions, but also reaches the passing score on three datasets: MedQA-USMLE 60.2%, MedMCQA 62.7% and PubMedQA 78.2%. Open-source models are closing the gap: Llama-2 70B also passed the MedQA-USMLE with 62.5% accuracy.
Reinforcement learning-based large language models, such as ChatGPT, are believed to have potential to aid human experts in many domains, including healthcare. There is, however, little work on ChatGPT's ability to perform a key task in healthcare: formal, probabilistic medical diagnostic reasoning. This type of reasoning is used, for example, to update a pre-test probability to a post-test probability. In this work, we probe ChatGPT's ability to perform this task. In particular, we ask ChatGPT to give examples of how to use Bayes rule for medical diagnosis. Our prompts range from queries that use terminology from pure probability (e.g., requests for a posterior of A given B and C) to queries that use terminology from medical diagnosis (e.g., requests for a posterior probability of Covid given a test result and cough). We show how the introduction of medical variable names leads to an increase in the number of errors that ChatGPT makes. Given our results, we also show how one can use prompt engineering to facilitate ChatGPT's partial avoidance of these errors. We discuss our results in light of recent commentaries on sensitivity and specificity. We also discuss how our results might inform new research directions for large language models.
We propose a simple multivariate normality test based on Kac-Bernstein's characterization, which can be conducted by utilising existing statistical independence tests for sums and differences of data samples. We also perform its empirical investigation, which reveals that for high-dimensional data, the proposed approach may be more efficient than the alternative ones. The accompanying code repository is provided at \url{//shorturl.at/rtuy5}.
Scientific claims gain credibility by replicability, especially if replication under different circumstances and varying designs yields equivalent results. Aggregating results over multiple studies is, however, not straightforward, and when the heterogeneity between studies increases, conventional methods such as (Bayesian) meta-analysis and Bayesian sequential updating become infeasible. *Bayesian Evidence Synthesis*, built upon the foundations of the Bayes factor, allows to aggregate support for conceptually similar hypotheses over studies, regardless of methodological differences. We assess the performance of Bayesian Evidence Synthesis over multiple effect and sample sizes, with a broad set of (inequality-constrained) hypotheses using Monte Carlo simulations, focusing explicitly on the complexity of the hypotheses under consideration. The simulations show that this method can evaluate complex (informative) hypotheses regardless of methodological differences between studies, and performs adequately if the set of studies considered has sufficient statistical power. Additionally, we pinpoint challenging conditions that can lead to unsatisfactory results, and provide suggestions on handling these situations. Ultimately, we show that Bayesian Evidence Synthesis is a promising tool that can be used when traditional research synthesis methods are not applicable due to insurmountable between-study heterogeneity.
Advances in large language models (LLMs) have driven an explosion of interest about their societal impacts. Much of the discourse around how they will impact social equity has been cautionary or negative, focusing on questions like "how might LLMs be biased and how would we mitigate those biases?" This is a vital discussion: the ways in which AI generally, and LLMs specifically, can entrench biases have been well-documented. But equally vital, and much less discussed, is the more opportunity-focused counterpoint: "what promising applications do LLMs enable that could promote equity?" If LLMs are to enable a more equitable world, it is not enough just to play defense against their biases and failure modes. We must also go on offense, applying them positively to equity-enhancing use cases to increase opportunities for underserved groups and reduce societal discrimination. There are many choices which determine the impact of AI, and a fundamental choice very early in the pipeline is the problems we choose to apply it to. If we focus only later in the pipeline -- making LLMs marginally more fair as they facilitate use cases which intrinsically entrench power -- we will miss an important opportunity to guide them to equitable impacts. Here, we highlight the emerging potential of LLMs to promote equity by presenting four newly possible, promising research directions, while keeping risks and cautionary points in clear view.
Text normalization is a crucial technology for low-resource languages which lack rigid spelling conventions or that have undergone multiple spelling reforms. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian. We show that, in spite of the small amounts of data available, a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization.
Deep neural networks (DNNs) often fail silently with over-confident predictions on out-of-distribution (OOD) samples, posing risks in real-world deployments. Existing techniques predominantly emphasize either the feature representation space or the gradient norms computed with respect to DNN parameters, yet they overlook the intricate gradient distribution and the topology of classification regions. To address this gap, we introduce GRadient-aware Out-Of-Distribution detection in interpolated manifolds (GROOD), a novel framework that relies on the discriminative power of gradient space to distinguish between in-distribution (ID) and OOD samples. To build this space, GROOD relies on class prototypes together with a prototype that specifically captures OOD characteristics. Uniquely, our approach incorporates a targeted mix-up operation at an early intermediate layer of the DNN to refine the separation of gradient spaces between ID and OOD samples. We quantify OOD detection efficacy using the distance to the nearest neighbor gradients derived from the training set, yielding a robust OOD score. Experimental evaluations substantiate that the introduction of targeted input mix-upamplifies the separation between ID and OOD in the gradient space, yielding impressive results across diverse datasets. Notably, when benchmarked against ImageNet-1k, GROOD surpasses the established robustness of state-of-the-art baselines. Through this work, we establish the utility of leveraging gradient spaces and class prototypes for enhanced OOD detection for DNN in image classification.
In discussions about the development and governance of AI, a false binary is often drawn between two groups: those most concerned about the existing, social impacts of AI, and those most concerned about possible future risks of powerful AI systems taking actions that don't align with human interests. In this piece, we (i) describe the emergence of this false binary, (ii) explain why the seemingly clean distinctions drawn between these two groups don't hold up under scrutiny and (iii) highlight efforts to bridge this divide.
Deep learning constitutes a recent, modern technique for image processing and data analysis, with promising results and large potential. As deep learning has been successfully applied in various domains, it has recently entered also the domain of agriculture. In this paper, we perform a survey of 40 research efforts that employ deep learning techniques, applied to various agricultural and food production challenges. We examine the particular agricultural problems under study, the specific models and frameworks employed, the sources, nature and pre-processing of data used, and the overall performance achieved according to the metrics used at each work under study. Moreover, we study comparisons of deep learning with other existing popular techniques, in respect to differences in classification or regression performance. Our findings indicate that deep learning provides high accuracy, outperforming existing commonly used image processing techniques.