Emergent chain-of-thought (CoT) reasoning capabilities promise to improve performance and explainability of large language models (LLMs). However, uncertainties remain about how reasoning strategies formulated for previous model generations generalize to new model generations and different datasets. In this small-scale study, we compare different reasoning strategies induced by zero-shot prompting across six recently released LLMs (davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl and Cohere command-xlarge) on a mixture of six question-answering datasets, including datasets from scientific and medical domains. Our findings demonstrate that while some variations in effectiveness occur, gains from CoT reasoning strategies remain robust across different models and datasets. GPT-4 has the most benefit from current state-of-the-art reasoning strategies and exhibits the best performance by applying a prompt previously discovered through automated discovery.
The vast majority of approaches to speaker anonymization involve the extraction of fundamental frequency estimates, linguistic features and a speaker embedding which is perturbed to obfuscate the speaker identity before an anonymized speech waveform is resynthesized using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within fundamental frequency and linguistic features are re-entangled upon vocoding, meaning that anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information: we demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
We construct a family of Markov decision processes for which the policy iteration algorithm needs an exponential number of improving switches with Dantzig's rule, with Bland's rule, and with the Largest Increase pivot rule. This immediately translates to a family of linear programs for which the simplex algorithm needs an exponential number of pivot steps with the same three pivot rules. Our results yield a unified construction that simultaneously reproduces well-known lower bounds for these classical pivot rules, and we are able to infer that any (deterministic or randomized) combination of them cannot avoid an exponential worst-case behavior. Regarding the policy iteration algorithm, pivot rules typically switch multiple edges simultaneously and our lower bound for Dantzig's rule and the Largest Increase rule, which perform only single switches, seem novel. Regarding the simplex algorithm, the individual lower bounds were previously obtained separately via deformed hypercube constructions. In contrast to previous bounds for the simplex algorithm via Markov decision processes, our rigorous analysis is reasonably concise.
The numerical solution of continuum damage mechanics (CDM) problems suffers from convergence-related challenges during the material softening stage, and consequently existing iterative solvers are subject to a trade-off between computational expense and solution accuracy. In this work, we present a novel unified arc-length (UAL) method, and we derive the formulation of the analytical tangent matrix and governing system of equations for both local and non-local gradient damage problems. Unlike existing versions of arc-length solvers that monolithically scale the external force vector, the proposed method treats the latter as an independent variable and determines the position of the system on the equilibrium path based on all the nodal variations of the external force vector. This approach renders the proposed solver substantially more efficient and robust than existing solvers used in CDM problems. We demonstrate the considerable advantages of the proposed algorithm through several benchmark 1D problems with sharp snap-backs and 2D examples under various boundary conditions and loading scenarios. The proposed UAL approach exhibits a superior ability of overcoming critical increments along the equilibrium path. Moreover, the proposed UAL method is 1-2 orders of magnitude faster than force-controlled arc-length and monolithic Newton-Raphson solvers.
Dye experimentation is a widely used method in experimental fluid mechanics for flow analysis or for the study of the transport of particles within a fluid. This technique is particularly useful in biomedical diagnostic applications ranging from hemodynamic analysis of cardiovascular systems to ocular circulation. However, simulating dyes governed by convection-diffusion partial differential equations (PDEs) can also be a useful post-processing analysis approach for computational fluid dynamics (CFD) applications. Such simulations can be used to identify the relative significance of different spatial subregions in particular time intervals of interest in an unsteady flow field. Additionally, dye evolution is closely related to non-discrete particle residence time (PRT) calculations that are governed by similar PDEs. This contribution introduces a pseudo-spectral method based on Fourier continuation (FC) for conducting dye simulations and non-discrete particle residence time calculations without numerical diffusion errors. Convergence and error analyses are performed with both manufactured and analytical solutions. The methodology is applied to three distinct physical/physiological cases: 1) flow over a two-dimensional (2D) cavity; 2) pulsatile flow in a simplified partially-grafted aortic dissection model; and 3) non-Newtonian blood flow in a Fontan graft. Although velocity data is provided in this work by numerical simulation, the proposed approach can also be applied to velocity data collected through experimental techniques such as from particle image velocimetry.
Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a fine-tuned model for the STS-B from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.
Recently established, directed dependence measures for pairs $(X,Y)$ of random variables build upon the natural idea of comparing the conditional distributions of $Y$ given $X=x$ with the marginal distribution of $Y$. They assign pairs $(X,Y)$ values in $[0,1]$, the value is $0$ if and only if $X,Y$ are independent, and it is $1$ exclusively for $Y$ being a function of $X$. Here we show that comparing randomly drawn conditional distributions with each other instead or, equivalently, analyzing how sensitive the conditional distribution of $Y$ given $X=x$ is on $x$, opens the door to constructing novel families of dependence measures $\Lambda_\varphi$ induced by general convex functions $\varphi: \mathbb{R} \rightarrow \mathbb{R}$, containing, e.g., Chatterjee's coefficient of correlation as special case. After establishing additional useful properties of $\Lambda_\varphi$ we focus on continuous $(X,Y)$, translate $\Lambda_\varphi$ to the copula setting, consider the $L^p$-version and establish an estimator which is strongly consistent in full generality. A real data example and a simulation study illustrate the chosen approach and the performance of the estimator. Complementing the afore-mentioned results, we show how a slight modification of the construction underlying $\Lambda_\varphi$ can be used to define new measures of explainability generalizing the fraction of explained variance.
We consider a one-dimensional singularly perturbed 4th order problem with the additional feature of a shift term. An expansion into a smooth term, boundary layers and an inner layer yields a formal solution decomposition, and together with a stability result we have estimates for the subsequent numerical analysis. With classical layer adapted meshes we present a numerical method, that achieves supercloseness and optimal convergence orders in the associated energy norm. We also consider coarser meshes in view of the weak layers. Some numerical examples conclude the paper and support the theory.
Grammatical cues are sometimes redundant with word meanings in natural language. For instance, English word order rules constrain the word order of a sentence like "The dog chewed the bone" even though the status of "dog" as subject and "bone" as object can be inferred from world knowledge and plausibility. Quantifying how often this redundancy occurs, and how the level of redundancy varies across typologically diverse languages, can shed light on the function and evolution of grammar. To that end, we performed a behavioral experiment in English and Russian and a cross-linguistic computational analysis measuring the redundancy of grammatical cues in transitive clauses extracted from corpus text. English and Russian speakers (n=484) were presented with subjects, verbs, and objects (in random order and with morphological markings removed) extracted from naturally occurring sentences and were asked to identify which noun is the subject of the action. Accuracy was high in both languages (~89% in English, ~87% in Russian). Next, we trained a neural network machine classifier on a similar task: predicting which nominal in a subject-verb-object triad is the subject. Across 30 languages from eight language families, performance was consistently high: a median accuracy of 87%, comparable to the accuracy observed in the human experiments. The conclusion is that grammatical cues such as word order are necessary to convey subjecthood and objecthood in a minority of naturally occurring transitive clauses; nevertheless, they can (a) provide an important source of redundancy and (b) are crucial for conveying intended meaning that cannot be inferred from the words alone, including descriptions of human interactions, where roles are often reversible (e.g., Ray helped Lu/Lu helped Ray), and expressing non-prototypical meanings (e.g., "The bone chewed the dog.").
We propose an approach to compute inner and outer-approximations of the sets of values satisfying constraints expressed as arbitrarily quantified formulas. Such formulas arise for instance when specifying important problems in control such as robustness, motion planning or controllers comparison. We propose an interval-based method which allows for tractable but tight approximations. We demonstrate its applicability through a series of examples and benchmarks using a prototype implementation.