Machine translation often suffers from biased data and algorithms that can lead to unacceptable errors in system output. While bias in gender norms has been investigated, less is known about whether MT systems encode bias about social relationships, e.g. sentences such as "the lawyer kissed her wife." We investigate the degree of bias against same-gender relationships in MT systems, using generated template sentences drawn from several noun-gender languages (e.g. Spanish). We find that three popular MT services consistently fail to accurately translate sentences concerning relationships between nouns of the same gender. The error rate varies considerably based on the context, e.g. same-gender sentences referencing high female-representation occupations are translated with lower accuracy. We provide this work as a case study in the evaluation of intrinsic bias in NLP systems, with respect to social relationships.
In this paper I will develop a lambda-term calculus, lambda-2Int, for a bi-intuitionistic logic and discuss its implications for the notions of sense and denotation of derivations in a bilateralist setting. Thus, I will use the Curry-Howard correspondence, which has been well-established between the simply typed lambda-calculus and natural deduction systems for intuitionistic logic, and apply it to a bilateralist proof system displaying two derivability relations, one for proving and one for refuting. The basis will be the natural deduction system of Wansing's bi-intuitionistic logic 2Int, which I will turn into a term-annotated form. Therefore, we need a type theory that extends to a two-sorted typed lambda-calculus. I will present such a term-annotated proof system for 2Int and prove a Dualization Theorem relating proofs and refutations in this system. On the basis of these formal results I will argue that this gives us interesting insights into questions about sense and denotation as well as synonymy and identity of proofs from a bilateralist point of view.
We present Surjective Sequential Neural Likelihood (SSNL) estimation, a novel method for simulation-based inference in models where the evaluation of the likelihood function is not tractable and only a simulator that can generate synthetic data is available. SSNL fits a dimensionality-reducing surjective normalizing flow model and uses it as a surrogate likelihood function which allows for conventional Bayesian inference using either Markov chain Monte Carlo methods or variational inference. By embedding the data in a low-dimensional space, SSNL solves several issues previous likelihood-based methods had when applied to high-dimensional data sets that, for instance, contain non-informative data dimensions or lie along a lower-dimensional manifold. We evaluate SSNL on a wide variety of experiments and show that it generally outperforms contemporary methods used in simulation-based inference, for instance, on a challenging real-world example from astrophysics which models the magnetic field strength of the sun using a solar dynamo model.
Byzantine consensus allows n processes to decide on a common value, in spite of arbitrary failures. The seminal Dolev-Reischuk bound states that any deterministic solution to Byzantine consensus exchanges Omega(n^2) bits. In recent years, great advances have been made in deterministic Byzantine agreement for partially synchronous networks, with state-of-the-art cryptographic solutions achieving O(n^2 \kappa) bits (where $\kappa$ is the security parameter) and nearly matching the lower bound. In contrast, for synchronous networks, optimal solutions with O(n^2) bits, with no cryptography and the same failure tolerance, have been known for more than three decades. Can this gap in network models be closed? In this paper, we present Repeater, the first generic transformation of Byzantine agreement algorithms from synchrony to partial synchrony. Repeater is modular, relying on existing and novel algorithms for its sub-modules. With the right choice of modules, Repeater requires no additional cryptography, is optimally resilient (n = 3t+1, where t is the maximum number of failures) and, for constant-size inputs, preserves the worst-case per-process bit complexity of the transformed synchronous algorithm. Leveraging Repeater, we present the first partially synchronous algorithm that (1) achieves optimal bit complexity (O(n^2) bits), (2) resists a computationally unbounded adversary (no cryptography), and (3) is optimally-resilient (n = 3t+1), thus showing that the Dolev-Reischuk bound is tight in partial synchrony. Moreover, we adapt Repeater for long inputs, introducing several new algorithms with improved complexity and weaker (or completely absent) cryptographic assumptions.
Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
We use the illness-death model (IDM) for chronic conditions to derive a new analytical relation between the transition rates between the states of the IDM. The transition rates are the incidence rate (i) and the mortality rates of people without disease (m0) and with disease (m1). For the most generic case, the rates depend on age, calendar time and in case of m1 also on the duration of the disease. In this work, we show that the prevalence-odds can be expressed as a convolution-like product of the incidence rate and an exponentiated linear combination of i, m0 and m1. The analytical expression can be used as the basis for a maximum likelihood estimation (MLE) and associated large sample asymptotics. In a simulation study where a cross-sectional trial about a chronic condition is mimicked, we estimate the duration dependency of the mortality rate m1 based on aggregated current status data using the ML estimator. For this, the number of study participants and the number of diseased people in eleven age groups are considered. The ML estimator provides reasonable estimates for the parameters including their large sample confidence bounds.
There is increasing interest in the application large language models (LLMs) to the medical field, in part because of their impressive performance on medical exam questions. While promising, exam questions do not reflect the complexity of real patient-doctor interactions. In reality, physicians' decisions are shaped by many complex factors, such as patient compliance, personal experience, ethical beliefs, and cognitive bias. Taking a step toward understanding this, our hypothesis posits that when LLMs are confronted with clinical questions containing cognitive biases, they will yield significantly less accurate responses compared to the same questions presented without such biases. In this study, we developed BiasMedQA, a benchmark for evaluating cognitive biases in LLMs applied to medical tasks. Using BiasMedQA we evaluated six LLMs, namely GPT-4, Mixtral-8x70B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and the medically specialized PMC Llama 13B. We tested these models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3, modified to replicate common clinically-relevant cognitive biases. Our analysis revealed varying effects for biases on these LLMs, with GPT-4 standing out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias. Our findings highlight the critical need for bias mitigation in the development of medical LLMs, pointing towards safer and more reliable applications in healthcare.
We present a unification and generalization of sequentially and hierarchically semi-separable (SSS and HSS) matrices called tree semi-separable (TSS) matrices. Our main result is to show that any dense matrix can be expressed in a TSS format. Here, the dimensions of the generators are specified by the ranks of the Hankel blocks of the matrix. TSS matrices satisfy a graph-induced rank structure (GIRS) property. It is shown that TSS matrices generalize the algebraic properties of SSS and HSS matrices under addition, products, and inversion. Subsequently, TSS matrices admit linear time matrix-vector multiply, matrix-matrix multiply, matrix-matrix addition, inversion, and solvers.
New biological assays like Perturb-seq link highly parallel CRISPR interventions to a high-dimensional transcriptomic readout, providing insight into gene regulatory networks. Causal gene regulatory networks can be represented by directed acyclic graph (DAGs), but learning DAGs from observational data is complicated by lack of identifiability and a combinatorial solution space. Score-based structure learning improves practical scalability of inferring DAGs. Previous score-based methods are sensitive to error variance structure; on the other hand, estimation of error variance is difficult without prior knowledge of structure. Accordingly, we present $\texttt{dotears}$ [doo-tairs], a continuous optimization framework which leverages observational and interventional data to infer a single causal structure, assuming a linear Structural Equation Model (SEM). $\texttt{dotears}$ exploits structural consequences of hard interventions to give a marginal estimate of exogenous error structure, bypassing the circular estimation problem. We show that $\texttt{dotears}$ is a provably consistent estimator of the true DAG under mild assumptions. $\texttt{dotears}$ outperforms other methods in varied simulations, and in real data infers edges that validate with higher precision and recall than state-of-the-art methods through differential expression tests and high-confidence protein-protein interactions.
In large-scale systems there are fundamental challenges when centralised techniques are used for task allocation. The number of interactions is limited by resource constraints such as on computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system, sharing tasks across many agents. However, this also increases the resource cost of communications and synchronisation, and is difficult to scale. In this paper we present four algorithms to solve these problems. The combination of these algorithms enable each agent to improve their task allocation strategy through reinforcement learning, while changing how much they explore the system in response to how optimal they believe their current strategy is, given their past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, limiting agents to local rather than system-wide knowledge. We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, to then carry out those tasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to 6.7% of the theoretical optimal within the system configurations considered. It provides 5x better performance recovery over no-knowledge retention approaches when system connectivity is impacted, and is tested against systems up to 100 agents with less than a 9% impact on the algorithms' performance.
We hypothesize that due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain on the accuracy when the model has access to it in addition to another modality. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.