Outcome phenotype measurement error is rarely corrected in comparative effect estimation studies in observational pharmacoepidemiology. Quantitative bias analysis (QBA) is a misclassification correction method that algebraically adjusts person counts in exposure-outcome contingency tables to reflect the magnitude of misclassification. The extent to which QBA minimizes bias is unclear because few systematic evaluations have been reported. We empirically evaluated the impact of QBA on odds ratios (ORs) in several comparative effect estimation scenarios. We estimated non-differential and differential phenotype errors with internal validation studies using a probabilistic reference. Further, we synthesized an analytic space defined by outcome incidence, uncorrected ORs, and phenotype errors to identify which combinations produce invalid results indicative of input errors. We evaluated impact with relative bias [(OR - OR_QBA)/OR × 100%]. Results were considered invalid if any contingency table cell was corrected to a negative number. Empirical bias correction was greatest in lower-incidence scenarios where uncorrected ORs were larger. Similarly, synthetic bias correction was greater in lower-incidence settings with larger uncorrected estimates. The invalid proportion of synthetic scenarios increased as uncorrected estimates increased. Results were invalid in common, low-incidence scenarios, indicating problematic inputs. This demonstrates the importance of accurately and precisely estimating phenotype errors before implementing QBA in comparative effect estimation studies.
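The algebraic adjustment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the textbook outcome-misclassification correction, in which the corrected case count in each exposure group is (observed cases - (1 - specificity) × group size) / (sensitivity + specificity - 1). The function and parameter names are hypothetical, and the numbers in the usage lines are made up for illustration. Passing the same sensitivity and specificity to both groups gives the non-differential case; passing different values gives the differential case.

```python
def corrected_cases(cases, total, sensitivity, specificity):
    """Back-calculate the outcome count in one exposure group from the
    observed count, assuming known phenotype sensitivity and specificity."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    return (cases - (1.0 - specificity) * total) / denom


def odds_ratio(a, n1, b, n0):
    """Odds ratio from cases a among n1 target persons and cases b among
    n0 comparator persons."""
    return (a * (n0 - b)) / (b * (n1 - a))


def qba(a, n1, b, n0, se1, sp1, se0, sp0):
    """Return uncorrected OR, corrected OR, and relative bias in percent.

    (se1, sp1) apply to the target group and (se0, sp0) to the comparator.
    """
    or_obs = odds_ratio(a, n1, b, n0)
    a_c = corrected_cases(a, n1, se1, sp1)
    b_c = corrected_cases(b, n0, se0, sp0)
    cells = (a_c, n1 - a_c, b_c, n0 - b_c)
    # The abstract calls a result invalid when a cell becomes negative.
    # Assumption of this sketch: a zero cell is also rejected, because the
    # corrected OR would be undefined or degenerate.
    if min(cells) <= 0:
        return {"or": or_obs, "or_qba": None,
                "relative_bias_pct": None, "valid": False}
    or_qba = odds_ratio(a_c, n1, b_c, n0)
    relative_bias = (or_obs - or_qba) / or_obs * 100.0
    return {"or": or_obs, "or_qba": or_qba,
            "relative_bias_pct": relative_bias, "valid": True}


# Illustrative (made-up) counts: a higher-incidence and a low-incidence table.
higher_incidence = qba(200, 1000, 100, 1000, 0.8, 0.95, 0.8, 0.95)
low_incidence = qba(10, 1000, 5, 1000, 0.8, 0.95, 0.8, 0.95)
print(higher_incidence)
print(low_incidence)
```

In the low-incidence call, the expected number of false positives (5% of 1000 persons) exceeds the observed case count, so the corrected cell is negative and the result is flagged invalid, which is the kind of input problem the abstract describes.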
Internet measurements are a crucial foundation of IPv6-related research. Because full address space scans are infeasible for IPv6, however, those measurements rely on collections of reliably responsive, unbiased addresses, as provided, e.g., by the IPv6 Hitlist service. Although it serves a variety of use cases, the hitlist provides an unfiltered list of responsive addresses, the hosts behind which can come from a range of different networks and devices, such as web servers, customer-premises equipment (CPE) devices, and Internet infrastructure. In this paper, we demonstrate the importance of tailoring hitlists in accordance with the research goal in question. Using PeeringDB, we classify hitlist addresses into six different network categories, uncovering that 42% of hitlist addresses are in ISP networks. Moreover, we show that the behavior of those addresses differs by category, e.g., ISP addresses exhibit a relatively short lifetime. Furthermore, we analyze different Target Generation Algorithms (TGAs), which are used to increase the coverage of IPv6 measurements by generating new responsive targets for scans. We evaluate their performance under various conditions and find that the responsiveness of generated addresses differs vastly between TGAs.
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability. Specifically, most well-performing metrics must be trained on evaluation datasets for specific NLG tasks and evaluation dimensions, which may cause over-fitting to task-specific datasets. Furthermore, existing metrics only provide an evaluation score for each dimension without revealing the evidence needed to interpret how this score is obtained. To deal with these challenges, we propose a simple yet effective metric called DecompEval. This metric formulates NLG evaluation as an instruction-style question answering task and utilizes instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, aiming to enhance the generalization ability. To make the evaluation process more interpretable, we decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence. The subquestions, together with their answers generated by PLMs, are then recomposed as evidence to obtain the evaluation result. Experimental results show that DecompEval achieves state-of-the-art performance among untrained metrics for evaluating text summarization and dialogue generation, and that it also exhibits strong dimension-level / task-level generalization ability and interpretability.
Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and noise level $\sigma^2$, i.e., to quantify how much variation in the response variable can be explained by the covariates, versus how much of the variation is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given with no responses $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, while linear regression is used, no sparsity assumptions on the coefficients are made. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data, whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate the optimal improvement. In the second framework, such optimality no longer holds, and we suggest two zero-estimators that improve the naive estimator, although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
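To illustrate why subtracting a zero-estimator can reduce variance, the standard single-statistic argument is sketched below. The symbols $T$ (an initial estimator), $Z$ (a zero-estimator, so $E[Z] = 0$) and $c$ (a fixed constant) are notation assumed here for illustration; the thesis's own construction and its notion of optimality may differ.

```latex
% Subtracting cZ leaves the mean unchanged because E[Z] = 0:
E[T - cZ] = E[T]

% The variance is a quadratic function of c:
\mathrm{Var}(T - cZ) = \mathrm{Var}(T) - 2c\,\mathrm{Cov}(T, Z) + c^2\,\mathrm{Var}(Z)

% It is minimized at
c^{*} = \frac{\mathrm{Cov}(T, Z)}{\mathrm{Var}(Z)},
\qquad
\mathrm{Var}(T - c^{*}Z) = \mathrm{Var}(T) - \frac{\mathrm{Cov}(T, Z)^{2}}{\mathrm{Var}(Z)}
```

Under these assumptions the improvement is strict whenever $T$ and $Z$ are correlated. In practice $c^{*}$ typically has to be estimated, which is one reason an optimal improvement may only be approximated.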
We develop a statistical toolbox for the quantitative model evaluation of stochastic reaction-diffusion systems that model the space-time evolution of biophysical quantities at the intracellular level. Starting from space-time data $X_N(t,x)$, as provided, e.g., by fluorescence microscopy recordings, we discuss basic modelling principles for the conditional mean trend and fluctuations in the class of stochastic reaction-diffusion systems, and subsequently develop statistical inference methods for parameter estimation. With a view towards application to real data, we discuss estimation errors and confidence intervals, in particular their dependence on the spatial resolution of measurements, and investigate the impact of misspecified reaction terms and noise coefficients. We also briefly touch on implementation issues of the statistical estimators. As a proof of concept, we apply our toolbox to statistical inference on intracellular actin concentration in the social amoeba Dictyostelium discoideum.
Software testing is a mandatory activity in any serious software development process, as bugs are a reality in software development. This raises the question of quality: good tests are effective in finding bugs, but until a test case actually finds a bug, its effectiveness remains unknown. Therefore, determining what constitutes a good or bad test is necessary. This is not a simple task, and there are a number of studies that identify different characteristics of a good test case. A previous study evaluated 29 hypotheses regarding what constitutes a good test case, but the findings are based on developers' beliefs, which are subjective and biased. In this paper, we investigate eight of these hypotheses through an extensive empirical study based on open software repositories. Despite our best efforts, we were unable to find evidence that supports these beliefs. This indicates that, although these hypotheses represent good software engineering advice, following them is not necessarily enough to produce the desired outcome of good test code.
The number of modes in a probability density function is representative of the model's complexity and can also be viewed as the number of existing subpopulations. Despite its relevance, little research has been devoted to its estimation. Focusing on the univariate setting, we propose a novel approach targeting prediction accuracy inspired by some overlooked aspects of the problem. We argue for the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view blending global and local density properties. Our method builds upon a combination of flexible kernel estimators and parsimonious compositional splines. Feature exploration, model selection and mode testing are implemented in the Bayesian inference paradigm, providing soft solutions and allowing expert judgement to be incorporated in the process. The usefulness of our proposal is illustrated through a case study in sports analytics, showcasing multiple companion visualisation tools. A thorough simulation study demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, our method emerges as a top-tier alternative offering innovative solutions for analysts.
This paper explores the space of (propositional) probabilistic logical languages, ranging from a purely `qualitative' comparative language to a highly `quantitative' language involving arbitrary polynomials over probability terms. While talk of qualitative vs. quantitative may be suggestive, we identify a robust and meaningful boundary in the space by distinguishing systems that encode (at most) additive reasoning from those that encode additive and multiplicative reasoning. The latter includes not only languages with explicit multiplication but also languages expressing notions of dependence and conditionality. We show that the distinction tracks a divide in computational complexity: additive systems remain complete for $\mathsf{NP}$, while multiplicative systems are robustly complete for $\exists\mathbb{R}$. We also address axiomatic questions, offering several new completeness results as well as a proof of non-finite-axiomatizability for comparative probability. Repercussions of our results for conceptual and empirical questions are addressed, and open problems are discussed.
The optimal prediction strategy for out-of-distribution (OOD) setups is a fundamental question in machine learning. In this paper, we address this question and present several contributions. We propose three reject option models for OOD setups: the Cost-based model, the Bounded TPR-FPR model, and the Bounded Precision-Recall model. These models extend the standard reject option models used in non-OOD setups and define the notion of an optimal OOD selective classifier. We establish that all the proposed models, despite their different formulations, share a common class of optimal strategies. Motivated by the optimal strategy, we introduce double-score OOD methods that leverage uncertainty scores from two chosen OOD detectors: one focused on OOD/ID discrimination and the other on misclassification detection. The experimental results consistently demonstrate the superior performance of this simple strategy compared to state-of-the-art methods. Additionally, we propose novel evaluation metrics derived from the definition of the optimal strategy under the proposed OOD rejection models. These new metrics provide a comprehensive and reliable assessment of OOD methods without the deficiencies observed in existing evaluation approaches.
In this paper, we provide a novel framework for the analysis of the generalization error of first-order optimization algorithms for statistical learning when the gradient can only be accessed through partial observations given by an oracle. Our analysis relies on the regularity of the gradient w.r.t. the data samples, and allows us to derive nearly matching upper and lower bounds for the generalization error of multiple learning problems, including supervised learning, transfer learning, robust learning, distributed learning and communication-efficient learning using gradient quantization. These results hold for smooth and strongly convex optimization problems, as well as smooth non-convex optimization problems satisfying a Polyak-Lojasiewicz assumption. In particular, our upper and lower bounds depend on a novel quantity that extends the notion of conditional standard deviation, and is a measure of the extent to which the gradient can be approximated by having access to the oracle. As a consequence, our analysis gives a precise meaning to the intuition that optimization of the statistical learning objective is as hard as the estimation of its gradient. Finally, we show that, in the case of standard supervised learning, mini-batch gradient descent with increasing batch sizes and a warm start can reach a generalization error that is optimal up to a multiplicative factor, thus motivating the use of this optimization scheme in practical applications.
Matching and weighting methods for observational studies involve the choice of an estimand, the causal effect with reference to a specific target population. Commonly used estimands include the average treatment effect in the treated (ATT), the average treatment effect in the untreated (ATU), the average treatment effect in the population (ATE), and the average treatment effect in the overlap (i.e., equipoise population; ATO). Each estimand has its own assumptions, interpretation, and statistical methods that can be used to estimate it. This article provides guidance on selecting and interpreting an estimand to help medical researchers correctly implement statistical methods used to estimate causal effects in observational studies and to help audiences correctly interpret the results and limitations of these studies. The interpretations of the estimands resulting from regression and instrumental variable analyses are also discussed. Choosing an estimand carefully is essential for making valid inferences from the analysis of observational data and ensuring results are replicable and useful for practitioners.
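For weighting methods, the choice of estimand translates directly into the weight each unit receives. The sketch below uses the commonly cited propensity-score weights for the four estimands named above; it is an illustration under that assumption rather than a procedure taken from the article, and the function name `balancing_weight` and its arguments are hypothetical. Here `z` is the treatment indicator (1 = treated, 0 = untreated) and `e` is the unit's estimated propensity score.

```python
def balancing_weight(z, e, estimand):
    """Weight for one unit with treatment indicator z and propensity score e."""
    if z not in (0, 1):
        raise ValueError("z must be 0 or 1")
    if not 0.0 < e < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    if estimand == "ATE":
        # Inverse probability of the treatment actually received.
        return 1.0 / e if z == 1 else 1.0 / (1.0 - e)
    if estimand == "ATT":
        # Treated units keep weight 1; untreated are weighted by the odds.
        return 1.0 if z == 1 else e / (1.0 - e)
    if estimand == "ATU":
        # Mirror image of ATT: untreated units keep weight 1.
        return (1.0 - e) / e if z == 1 else 1.0
    if estimand == "ATO":
        # Overlap weights: probability of the opposite treatment.
        return 1.0 - e if z == 1 else e
    raise ValueError("unknown estimand: " + str(estimand))


# Usage: one treated and one untreated unit, both with e = 0.25.
for estimand in ("ATE", "ATT", "ATU", "ATO"):
    print(estimand,
          balancing_weight(1, 0.25, estimand),
          balancing_weight(0, 0.25, estimand))
```

The ATE, ATT and ATU weights grow without bound as `e` approaches 0 or 1, whereas the overlap weights stay between 0 and 1, which is one way to see that the ATO emphasizes units whose treatment assignment was least certain.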