As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely cited barrier to the adoption of LLMs is their sensitivity to prompt wording, but interestingly, humans also display sensitivities to instruction changes in the form of response biases. We therefore argue that if LLMs are going to be used to approximate human opinions, it is necessary to investigate the extent to which LLMs also reflect human response biases, if at all. In this work, we use survey design as a case study, where human response biases caused by changes in the wording of "prompts" have been extensively studied. Drawing from prior work in social psychology, we design a dataset and propose a framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior, and these inconsistencies tend to be more prominent in models that have been instruction fine-tuned. Furthermore, even when a model shifts significantly in the same direction as humans, we find that perturbations not meant to elicit significant changes in humans can produce a similar shift. These results highlight the potential pitfalls of using LLMs to substitute for humans in parts of the annotation pipeline, and further underscore the importance of finer-grained characterizations of model behavior. Our code, dataset, and collected samples are available at https://github.com/lindiatjuatja/BiasMonkey.
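A minimal sketch of the kind of comparison such an evaluation framework can perform (the responses, option labels, and assumed human bias direction below are entirely hypothetical, not the paper's protocol):

    # Hypothetical sketch: compare how often a "target" option is chosen under an
    # original survey question versus a perturbed wording, and check whether the
    # shift matches the direction documented for humans. All counts are made up.
    from collections import Counter

    def shift_in_target_rate(orig_responses, perturbed_responses, target_option):
        """Change in the fraction of responses selecting target_option."""
        p_orig = Counter(orig_responses)[target_option] / len(orig_responses)
        p_pert = Counter(perturbed_responses)[target_option] / len(perturbed_responses)
        return p_pert - p_orig

    # Toy LLM responses to the same question asked in two wordings.
    orig = ["agree"] * 62 + ["disagree"] * 38
    pert = ["agree"] * 48 + ["disagree"] * 52

    delta = shift_in_target_rate(orig, pert, "agree")
    human_bias_direction = -1  # assume humans agree less under this perturbation
    matches_human = (delta * human_bias_direction) > 0
    print(f"shift = {delta:+.2f}, matches human direction: {matches_human}")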
The aim of this work is to develop and apply a methodological approach for predicting gas traps from 3D seismic data and gas well testing. The paper formalizes the construction of a training dataset by selecting volumes of the seismic wavefield with established gas saturation and filtration (flow) properties. This training dataset is then used in a processing stack that sequentially applies data processing methods and ensemble machine learning algorithms, yielding a cube of calibrated probabilities that each point of the study volume belongs to a gas reservoir. The effectiveness of the approach is demonstrated on a delayed test sample of three blind wells, where the final F1-score for gas reservoir prediction was 0.893846.
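A minimal sketch of the calibrated-ensemble step on synthetic stand-ins for seismic attributes (the feature construction, the choice of gradient boosting, and the isotonic calibration below are illustrative assumptions, not the paper's exact processing stack):

    # Illustrative only: gradient-boosted ensemble with probability calibration,
    # evaluated with F1 on a held-out "blind well" split. Data are synthetic.
    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 6))            # stand-in seismic attributes per voxel
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0.8).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
    clf.fit(X_tr, y_tr)

    proba = clf.predict_proba(X_te)[:, 1]     # calibrated gas-reservoir probabilities
    print("F1 on held-out split:", round(f1_score(y_te, proba > 0.5), 3))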
Network meta-analysis combines aggregate data (AgD) from multiple randomised controlled trials, assuming that any effect modifiers are balanced across populations. Individual patient data (IPD) meta-regression is the "gold standard" method to relax this assumption; however, IPD are frequently available in only a subset of studies. Multilevel network meta-regression (ML-NMR) extends IPD meta-regression to incorporate AgD studies whilst avoiding aggregation bias, but currently requires the aggregate-level likelihood to have a known closed form; notably, this prevents application to time-to-event outcomes. We extend ML-NMR to individual-level likelihoods of any form by integrating the individual-level likelihood function over the AgD covariate distributions to obtain the respective marginal likelihood contributions. We illustrate with two examples of time-to-event outcomes: a simulated comparison showing that ML-NMR performs well with little loss of precision relative to a full IPD analysis, and an analysis of synthetic data on newly diagnosed multiple myeloma demonstrating flexible modelling of baseline hazards using cubic M-splines. ML-NMR is a general method for synthesising individual and aggregate level data in networks of all sizes, and the extension to general likelihoods, including for survival outcomes, greatly increases the applicability of the method. R and Stan code is provided, and the methods are implemented in the multinma R package.
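Schematically, the core construction can be sketched as follows (the notation is chosen here purely for illustration; see the paper and the multinma documentation for the exact formulation). For an AgD study j with covariate distribution f_j(x), the marginal likelihood contribution of an aggregate outcome y_j is obtained by integrating the individual-level likelihood over the covariates, typically approximated numerically over integration points drawn from f_j:

    \[
      \tilde{L}_j(\theta)
      \;=\; \int L\bigl(y_j \mid x, \theta\bigr)\, f_j(x)\, \mathrm{d}x
      \;\approx\; \frac{1}{N} \sum_{n=1}^{N} L\bigl(y_j \mid \tilde{x}_{j,n}, \theta\bigr),
      \qquad \tilde{x}_{j,n} \sim f_j(x).
    \]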
A crucial challenge in conflict research is leveraging the semi-supervised nature of the data that arise. Observed response data, such as counts of battle deaths over time, indicate latent processes of interest such as the intensity and duration of conflicts, but defining and labeling instances of these unobserved processes is a nuanced and imprecise task. The availability of such labels, however, would make it possible to study the effect of intervention-related predictors, such as ceasefires, directly on conflict dynamics (e.g., latent intensity) rather than through an intermediate proxy like observed counts of battle deaths. Motivated by this problem and the new availability of the ETH-PRIO Civil Conflict Ceasefires data set, we propose a Bayesian autoregressive (AR) hidden Markov model (HMM) framework as a sufficiently flexible machine learning approach for semi-supervised regime labeling with uncertainty quantification. We motivate our approach by illustrating how it can be used to study the role that ceasefires play in shaping conflict dynamics. The ceasefires data set is the first systematic and globally comprehensive data set on ceasefires, and our work is the first to analyze these new data and to explore the effect of ceasefires on conflict dynamics in a comprehensive, cross-country manner.
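A minimal generative sketch of the kind of model involved, with two latent conflict regimes and Poisson counts whose log-mean carries an autoregressive term (the parameterisation below is an illustrative assumption, not the paper's specification):

    # Illustrative 2-state AR(1) hidden Markov model for conflict-intensity counts.
    # Latent regimes follow a Markov chain; observed battle-death counts are Poisson
    # with a log-mean that depends on the regime and the previous observation.
    import numpy as np

    rng = np.random.default_rng(42)
    T = 200
    trans = np.array([[0.95, 0.05],      # regime 0: "low intensity"
                      [0.10, 0.90]])     # regime 1: "high intensity"
    intercept = np.array([0.5, 3.0])     # regime-specific baseline log-intensity
    ar_coef = np.array([0.3, 0.2])       # regime-specific AR(1) weight

    z = np.zeros(T, dtype=int)
    y = np.zeros(T)
    for t in range(1, T):
        z[t] = rng.choice(2, p=trans[z[t - 1]])
        log_mu = intercept[z[t]] + ar_coef[z[t]] * np.log1p(y[t - 1])
        y[t] = rng.poisson(np.exp(log_mu))

    print("fraction of time in high-intensity regime:", (z == 1).mean())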
Insurers increasingly use AI. We distinguish two situations in which insurers use AI: (i) data-intensive underwriting and (ii) behaviour-based insurance. First, insurers can use AI for data analysis to assess risks: data-intensive underwriting. Underwriting is, in short, calculating risks and adjusting the insurance premium accordingly. Second, insurers can use AI to monitor the behaviour of consumers in real time: behaviour-based insurance. For example, some car insurers give a discount if a consumer agrees to be tracked by the insurer and drives safely. While the two trends bring many advantages, they may also have discriminatory effects. This paper focuses on the following question: which discrimination-related effects may occur if insurers use data-intensive underwriting and behaviour-based insurance? We focus on two types of discrimination-related effects: discrimination and other unfair differentiation. (i) Discrimination harms certain groups who are protected by non-discrimination law, for instance people with certain ethnicities. (ii) Unfair differentiation does not harm groups that are protected by non-discrimination law, but it still seems unfair. We introduce four factors to consider when assessing the fairness of insurance practices. The paper builds on literature from various disciplines, including law, philosophy, and computer science.
Methods for estimating heterogeneous treatment effects (HTE) from observational data have largely focused on continuous or binary outcomes, with less attention paid to survival outcomes and almost none to settings with competing risks. In this work, we develop censoring unbiased transformations (CUTs) for survival outcomes both with and without competing risks. After converting time-to-event outcomes using these CUTs, direct application of HTE learners for continuous outcomes yields consistent estimates of heterogeneous cumulative incidence effects, total effects, and separable direct effects. Our CUTs enable application of a much larger set of state-of-the-art HTE learners to censored outcomes than had previously been available, especially in competing risks settings. We provide generic, model-free, learner-specific oracle inequalities bounding the finite-sample excess risk; the oracle efficiency results depend on the oracle selector and the estimated nuisance functions from all steps involved in the transformation. We demonstrate the empirical performance of the proposed methods in simulation studies.
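For intuition, one classical example of a censoring unbiased transformation (an inverse-probability-of-censoring-weighted construction in the spirit of Koul, Susarla, and Van Ryzin; not necessarily the transformations developed here) replaces a possibly censored outcome with

    \[
      Y^{*} \;=\; \frac{\Delta\, m(\widetilde{T})}{G(\widetilde{T} \mid X)},
      \qquad\text{so that}\qquad
      \mathbb{E}\bigl[\,Y^{*} \mid X\,\bigr] \;=\; \mathbb{E}\bigl[\,m(T) \mid X\,\bigr],
    \]

where \widetilde{T} = \min(T, C) is the observed follow-up time, \Delta = \mathbb{1}\{T \le C\} the event indicator, G(t \mid X) = P(C \ge t \mid X) the conditional censoring survival function, and m a fixed transformation of the event time; the identity holds when T and C are conditionally independent given X. Regressing Y^{*} on covariates then targets the same conditional mean as if uncensored outcomes had been observed.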
Named Entity Recognition (NER) seeks to extract substrings within a text that name real-world objects and to determine their type (for example, whether they refer to persons or organizations). In this survey, we first present an overview of recent popular approaches, including graph- and transformer-based methods and Large Language Models (LLMs) that have not had much coverage in other surveys. Second, we focus on methods designed for datasets with scarce annotations. Third, we evaluate the performance of the main NER implementations on a variety of datasets with differing characteristics (domain, size, and number of classes), providing a deep comparison of algorithms that have not previously been considered together. Our experiments shed some light on how the characteristics of datasets affect the behavior of the methods that we compare.
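As a point of reference for the transformer-based family discussed above, a minimal example using the Hugging Face transformers pipeline API (the checkpoint named below is an illustrative public model, not necessarily one of the implementations benchmarked in this survey):

    # Minimal transformer-based NER example; the model checkpoint is illustrative.
    from transformers import pipeline

    ner = pipeline("token-classification",
                   model="dslim/bert-base-NER",
                   aggregation_strategy="simple")

    for ent in ner("Ada Lovelace worked with Charles Babbage in London."):
        print(ent["entity_group"], ent["word"], round(ent["score"], 3))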
Dynamical systems across the sciences, from electrical circuits to ecological networks, undergo qualitative and often catastrophic changes in behavior, called bifurcations, when their underlying parameters cross a threshold. Existing methods predict oncoming catastrophes in individual systems but are primarily time-series-based and struggle both to categorize qualitative dynamical regimes across diverse systems and to generalize to real data. To address this challenge, we propose a data-driven, physically informed deep-learning framework for classifying dynamical regimes and characterizing bifurcation boundaries based on the extraction of topologically invariant features. We focus on the paradigmatic case of the supercritical Hopf bifurcation, which is used to model periodic dynamics across a wide range of applications. Our convolutional attention method is trained with data augmentations that encourage the learning of topological invariants, which can be used to detect bifurcation boundaries in unseen systems and to design models of biological systems such as oscillatory gene regulatory networks. We further demonstrate our method's use in analyzing real data by recovering distinct proliferation and differentiation dynamics along the pancreatic endocrinogenesis trajectory in gene expression space based on single-cell data. Our method provides valuable insights into the qualitative, long-term behavior of a wide range of dynamical systems, and can detect bifurcations or catastrophic transitions in large-scale physical and biological systems.
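For reference, the supercritical Hopf normal form in polar coordinates is dr/dt = mu*r - r^3, with a stable fixed point at the origin for mu < 0 and a stable limit cycle of radius sqrt(mu) for mu > 0. The toy sweep below (a sketch in normal-form coordinates, not the paper's convolutional attention classifier) illustrates the regime change at the bifurcation boundary mu = 0:

    # Supercritical Hopf normal form: dr/dt = mu*r - r**3 (radial part only).
    # A crude Euler integration shows the qualitative change as mu crosses zero.
    import numpy as np

    def asymptotic_amplitude(mu, r0=0.1, dt=0.01, steps=20000):
        r = r0
        for _ in range(steps):
            r += dt * (mu * r - r**3)
        return r

    for mu in [-0.5, -0.1, 0.1, 0.5]:
        amp = asymptotic_amplitude(mu)
        regime = "limit cycle" if amp > 1e-3 else "stable fixed point"
        print(f"mu = {mu:+.1f}: amplitude ~ {amp:.3f} ({regime})")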
Power posteriors "robustify" standard Bayesian inference by raising the likelihood to a constant fractional power, effectively downweighting its influence in the calculation of the posterior. Power posteriors have been shown to be more robust to model misspecification than standard posteriors in many settings. Previous work has shown that power posteriors derived from low-dimensional, parametric locally asymptotically normal models are asymptotically normal (Bernstein-von Mises) even under model misspecification. We extend these results to show that the power posterior moments converge to those of the limiting normal distribution suggested by the Bernstein-von Mises theorem. We then use this result to show that the mean of the power posterior, a point estimator, is asymptotically equivalent to the maximum likelihood estimator.
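For reference, with independent observations x_1, ..., x_n, likelihood p(x_i | theta), prior pi(theta), and a fixed power eta in (0, 1], the power posterior referred to above is

    \[
      \pi_{\eta}(\theta \mid x_{1:n}) \;\propto\; \pi(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)^{\eta},
    \]

which recovers the standard posterior at eta = 1 and downweights the likelihood's influence for smaller eta.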
One critical issue for chat systems is staying consistent about their own preferences, opinions, beliefs, and facts, which has been shown to be a difficult problem. In this work, we study methods to assess and bolster the utterance consistency of chat systems. We first develop a dataset for studying inconsistencies, in which inconsistent dialogue responses, explanations of the inconsistencies, and recovery utterances are authored by annotators; this covers the life span of inconsistencies, namely their introduction, understanding, and resolution. Building on this, we introduce a set of tasks centered on dialogue consistency, specifically focused on its detection and resolution. Our experimental findings indicate that our dataset significantly helps progress on identifying and resolving conversational inconsistencies, and that current popular large language models like ChatGPT, while good at resolving inconsistencies, still struggle with detecting them.
Deep learning constitutes a recent, modern technique for image processing and data analysis, with promising results and large potential. As deep learning has been successfully applied in various domains, it has recently also entered the domain of agriculture. In this paper, we survey 40 research efforts that employ deep learning techniques to address various agricultural and food production challenges. We examine the particular agricultural problems under study, the specific models and frameworks employed, the sources, nature, and pre-processing of the data used, and the overall performance achieved according to the metrics used in each work under study. Moreover, we study comparisons of deep learning with other existing popular techniques with respect to differences in classification or regression performance. Our findings indicate that deep learning provides high accuracy, outperforming existing commonly used image processing techniques.