The problem of regression extrapolation, or out-of-distribution generalization, arises when predictions are required at test points outside the range of the training data. In such cases, the non-parametric guarantees for regression methods from both statistics and machine learning typically fail. Based on the theory of tail dependence, we propose a novel statistical extrapolation principle. After a suitable, data-adaptive marginal transformation, it assumes a simple relationship between predictors and the response at the boundary of the training predictor samples. This assumption holds for a wide range of models, including non-parametric regression functions with additive noise. Our semi-parametric method, progression, leverages this extrapolation principle and offers guarantees on the approximation error beyond the training data range. We demonstrate how this principle can be effectively integrated with existing approaches, such as random forests and additive models, to improve extrapolation performance on out-of-distribution samples.
Background: The standard regulatory approach to assess replication success is the two-trials rule, requiring both the original and the replication study to be significant with effect estimates in the same direction. The sceptical p-value was recently presented as an alternative method for the statistical assessment of the replicability of study results. Methods: We compare the statistical properties of the sceptical p-value and the two-trials rule. We illustrate the performance of the different methods using real-world evidence emulations of randomized, controlled trials (RCTs) conducted within the RCT DUPLICATE initiative. Results: The sceptical p-value depends not only on the two p-values, but also on sample size and effect size of the two studies. It can be calibrated to have the same Type-I error rate as the two-trials rule, but has larger power to detect an existing effect. In the application to the results from the RCT DUPLICATE initiative, the sceptical p-value leads to results that are qualitatively similar to those of the two-trials rule, but tends to show more evidence for treatment effects. Conclusion: The sceptical p-value represents a valid statistical measure to assess the replicability of study results and is especially useful in the context of real-world evidence emulations.
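The two-trials rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the effect estimates are taken to be approximately normal, the two-sided level `alpha=0.05` is a conventional choice not given in the abstract, and the function names are hypothetical. The sceptical p-value itself is not implemented here.

```python
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p-value under a normal approximation (an assumption)."""
    z = estimate / std_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def two_trials_rule(est_orig, se_orig, est_rep, se_rep, alpha=0.05):
    """Return True if both studies are significant with effects in the same direction.

    alpha is a hypothetical default (conventional two-sided 5% level).
    """
    same_direction = est_orig * est_rep > 0
    both_significant = (
        two_sided_p(est_orig, se_orig) < alpha
        and two_sided_p(est_rep, se_rep) < alpha
    )
    return same_direction and both_significant
```

Note that the rule uses only the two p-values and the signs of the estimates, which is the contrast the abstract draws with the sceptical p-value.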
The goal of uplift modeling is to recommend actions that optimize specific outcomes by determining which entities should receive treatment. One common approach involves two steps: first, an inference step that estimates conditional average treatment effects (CATEs), and second, an optimization step that ranks entities based on their CATE values and assigns treatment to the top k within a given budget. While uplift modeling typically focuses on binary treatments, many real-world applications are characterized by continuous-valued treatments, i.e., a treatment dose. This paper presents a predict-then-optimize framework to allow for continuous treatments in uplift modeling. First, in the inference step, conditional average dose responses (CADRs) are estimated from data using causal machine learning techniques. Second, in the optimization step, we frame the assignment task of continuous treatments as a dose-allocation problem and solve it using integer linear programming (ILP). This approach allows decision-makers to efficiently and effectively allocate treatment doses while balancing resource availability, with the possibility of adding extra constraints like fairness considerations or adapting the objective function to take into account instance-dependent costs and benefits to maximize utility. The experiments compare several CADR estimators and illustrate the trade-offs between policy value and fairness, as well as the impact of an adapted objective function. This showcases the framework's advantages and flexibility across diverse applications in healthcare, lending, and human resource management. All code is available on github.com/SimonDeVos/UMCT.
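The optimization step can be illustrated on a toy instance. The sketch below is not the paper's ILP formulation: it swaps in exhaustive search, which is only feasible for a handful of entities, and it assumes that the estimated dose responses are already available as a table. The names `allocate_doses`, `responses`, `dose_costs`, and `budget` are hypothetical.

```python
from itertools import product


def allocate_doses(responses, dose_costs, budget):
    """Pick one dose level per entity to maximize total estimated response.

    responses[i][d]: estimated response of entity i at dose level d (assumed given).
    dose_costs[d]:   cost of dose level d.
    budget:          maximum total cost.
    Exhaustive search stands in for an ILP solver; toy sizes only.
    """
    n_levels = len(dose_costs)
    best_value, best_plan = float("-inf"), None
    for plan in product(range(n_levels), repeat=len(responses)):
        cost = sum(dose_costs[d] for d in plan)
        if cost > budget:
            continue  # infeasible: exceeds the budget
        value = sum(responses[i][d] for i, d in enumerate(plan))
        if value > best_value:
            best_value, best_plan = value, plan
    return best_plan, best_value


# Three entities, dose levels 0/1/2 with costs 0/1/2, total budget 3.
toy_responses = [[0.0, 0.4, 0.5], [0.0, 0.1, 0.6], [0.0, 0.3, 0.35]]
plan, value = allocate_doses(toy_responses, dose_costs=[0, 1, 2], budget=3)
```

In this toy case the best plan gives the high dose to the entity whose response rises most steeply, rather than simply ranking entities by a single treatment effect.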
We consider the problem of causal inference based on observational data (or the related missing data problem) with a binary or discrete treatment variable. In that context, we study inference for the counterfactual density functions and contrasts thereof, which can provide more nuanced information than counterfactual means and the average treatment effect. We impose the shape-constraint of log-concavity, a type of unimodality constraint, on the counterfactual densities, and then develop doubly robust estimators of the log-concave counterfactual density based on augmented inverse-probability weighted pseudo-outcomes. We provide conditions under which the estimator is consistent in various global metrics. We also develop asymptotically valid pointwise confidence intervals for the counterfactual density functions and differences and ratios thereof, which serve as a building block for more comprehensive analyses of distributional differences. We also present a method for using our estimator to implement density confidence bands.
An extremely schematic model of the forces acting on a sailing yacht equipped with a system of foils is presented and discussed here. The role of the foils is to raise the hull from the water in order to reduce the total resistance and thereby increase the speed. CFD simulations provide the total resistance of the bare hull at several values of speed and displacement, as well as the characteristics (drag and lift coefficients) of the 2D foil sections used for the appendages. A parametric study has been performed to characterize a foil of finite dimensions. The equilibrium of the vertical forces and longitudinal moments, as well as a reduced displacement, is obtained by controlling the pitch angle of the foils. The total resistance of the yacht with foils is then compared with the case without foils, highlighting the speed regime where an advantage is obtained, if any.
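The vertical equilibrium can be illustrated with the standard lift formula $L = \tfrac{1}{2}\rho V^2 S C_L$. The sketch below is not the model of the paper: it ignores moments, drag, and finite-span corrections, and all function and parameter names are hypothetical.

```python
import math


def foil_lift(rho, speed, area, lift_coeff):
    """Lift in newtons from L = 0.5 * rho * V^2 * S * C_L (SI units)."""
    return 0.5 * rho * speed**2 * area * lift_coeff


def speed_for_full_lift(mass, rho, area, lift_coeff, g=9.81):
    """Speed at which foil lift alone balances the weight m * g."""
    return math.sqrt(2.0 * mass * g / (rho * area * lift_coeff))
```

Below this speed the foils carry only part of the weight, which corresponds to the reduced-displacement regime mentioned above.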
Automatic differentiation is everywhere, but there exists only minimal documentation of how it works in complex arithmetic beyond stating "derivatives in $\mathbb{C}^d$" $\cong$ "derivatives in $\mathbb{R}^{2d}$" and, at best, shallow references to Wirtinger calculus. Unfortunately, the equivalence $\mathbb{C}^d \cong \mathbb{R}^{2d}$ becomes insufficient as soon as we need to derive custom gradient rules, e.g., to avoid differentiating "through" expensive linear algebra functions or differential equation simulators. To combat such a lack of documentation, this article surveys forward- and reverse-mode automatic differentiation with complex numbers, covering topics such as Wirtinger derivatives, a modified chain rule, and different gradient conventions while explicitly avoiding holomorphicity and the Cauchy--Riemann equations (which would be far too restrictive). To be precise, we will derive, explain, and implement a complex version of Jacobian-vector and vector-Jacobian products almost entirely with linear algebra without relying on complex analysis or differential geometry. This tutorial is a call to action, for users and developers alike, to take complex values seriously when implementing custom gradient propagation rules -- the manuscript explains how.
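To make the Wirtinger derivatives concrete, the following sketch approximates $\partial f/\partial z = \tfrac{1}{2}(\partial_x f - i\,\partial_y f)$ and $\partial f/\partial \bar z = \tfrac{1}{2}(\partial_x f + i\,\partial_y f)$ by central finite differences. It is a numerical check, not an automatic-differentiation implementation, and the function name and step size `h` are assumptions made for illustration.

```python
def wirtinger_derivatives(f, z, h=1e-6):
    """Approximate (df/dz, df/dzbar) at z by central differences.

    f maps a complex number to a complex (or real) number and need not
    be holomorphic. h is a hypothetical step size.
    """
    df_dx = (f(z + h) - f(z - h)) / (2 * h)            # derivative along the real axis
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # derivative along the imaginary axis
    d_dz = 0.5 * (df_dx - 1j * df_dy)
    d_dzbar = 0.5 * (df_dx + 1j * df_dy)
    return d_dz, d_dzbar


def squared_modulus(z):
    """|z|^2 = z * conj(z), a non-holomorphic example."""
    return z * z.conjugate()
```

For `squared_modulus`, the two derivatives are $\bar z$ and $z$ respectively, so the derivative with respect to $\bar z$ does not vanish; for a holomorphic function such as $z^2$ it does.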
For several classes of neural PDE solvers (Deep Ritz, PINNs, DeepONets), the ability to approximate the solution or solution operator to a partial differential equation (PDE) hinges on the ability of a neural network to approximate the solution in the spatial variables. We analyze the capacity of neural networks to approximate solutions to an elliptic PDE assuming that the boundary condition can be approximated efficiently. Our focus is on the Laplace operator with Dirichlet boundary condition on a half space and on neural networks with a single hidden layer and an activation function that is a power of the popular ReLU activation function.
Weighting with the inverse probability of censoring is an approach to deal with censoring in regression analyses where the outcome may be missing due to right-censoring. In this paper, three approaches based on this idea are compared in a setting where the Kaplan--Meier estimator is used to estimate the censoring probability. Specifically, the three approaches are weighted regression, regression with a weighted outcome, and regression of a jack-knife pseudo-observation based on a weighted estimator. Expressions for the asymptotic variances are given in each case and are compared to each other and to the uncensored case. In terms of low asymptotic variance, a clear winner cannot be found: which approach has the lowest asymptotic variance depends on the censoring distribution. Expressions for the limit of the standard sandwich variance estimator in the three cases are also provided, revealing an overestimation under the implied assumptions.
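The weights themselves can be sketched as follows. This is a minimal illustration, not the estimators analyzed in the paper: it assumes the weight of an uncensored observation is $1/\hat G(T_i-)$, where $\hat G$ is the Kaplan--Meier estimator of the censoring survival function, and it gives censored observations weight zero. The function and argument names are hypothetical.

```python
def ipcw_weights(times, events):
    """Inverse-probability-of-censoring weights from a Kaplan-Meier fit.

    times[i]:  observed time (event or censoring) for subject i.
    events[i]: True/1 if the event was observed, False/0 if censored.
    Uncensored subjects get 1 / G(T_i-), where G is the Kaplan-Meier
    estimate of the censoring survival function; censored subjects get 0.
    """
    n = len(times)
    weights = []
    for i in range(n):
        if not events[i]:
            weights.append(0.0)
            continue
        g = 1.0
        # Distinct censoring times strictly before T_i (left limit G(T_i-)).
        earlier = sorted({times[j] for j in range(n)
                          if not events[j] and times[j] < times[i]})
        for s in earlier:
            at_risk = sum(1 for t in times if t >= s)
            censored = sum(1 for j in range(n)
                           if times[j] == s and not events[j])
            g *= 1.0 - censored / at_risk
        weights.append(1.0 / g)
    return weights


# Four subjects; the second is censored at time 2.
w = ipcw_weights([1, 2, 3, 4], [1, 0, 1, 1])
```

In the toy data, the weight of the censored subject is redistributed to the two subjects still at risk after time 2, so the weights again sum to the sample size.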
Static analysis by abstract interpretation is generally designed to be "sound", that is, it should not claim to establish properties that do not hold; in other words, it should not produce "false negatives" about possible bugs. A rarer requirement is that it should be "complete", meaning that it should be able to infer certain properties if they hold. This paper describes a number of practical issues and questions related to completeness that I have come across over the years.
Data scarcity and data imbalance have attracted considerable attention in many fields. Data augmentation, explored as an effective approach to tackle them, can improve the robustness and efficiency of classification models by generating new samples. This paper presents REPRINT, a simple and effective hidden-space data augmentation method for imbalanced data classification. Given hidden-space representations of samples in each class, REPRINT extrapolates, in a randomized fashion, augmented examples for the target class by using subspaces spanned by principal components to summarize the distribution structure of both the source and target classes. Consequently, the generated examples diversify the target class while maintaining the original geometry of the target distribution. In addition, the method includes a label refinement component that synthesizes new soft labels for the augmented examples. Compared with different NLP data augmentation approaches under a range of imbalanced-data scenarios on four text classification benchmarks, REPRINT shows prominent improvements. Moreover, through comprehensive ablation studies, we show that label refinement is better than label preservation for augmented examples, and that our method yields stable and consistent improvements across suitable choices of principal components. Finally, REPRINT is appealing for its ease of use, since it has only one hyperparameter, which determines the dimension of the subspace, and requires few computational resources.
Heteroskedasticity testing in nonparametric regression is a classic statistical problem with important practical applications, yet fundamental limits are unknown. Adopting a minimax perspective, this article considers the testing problem in the context of an $\alpha$-H\"{o}lder mean and a $\beta$-H\"{o}lder variance function. For $\alpha > 0$ and $\beta \in (0, 1/2)$, the sharp minimax separation rate $n^{-4\alpha} + n^{-4\beta/(4\beta+1)} + n^{-2\beta}$ is established. To achieve the minimax separation rate, a kernel-based statistic using first-order squared differences is developed. Notably, the statistic estimates a proxy rather than a natural quadratic functional (the squared distance between the variance function and its best $L^2$ approximation by a constant) suggested in previous work. The setting where no smoothness is assumed on the variance function is also studied; the variance profile across the design points can be arbitrary. Despite the lack of structure, consistent testing turns out to still be possible by using the Gaussian character of the noise, and the minimax rate is shown to be $n^{-4\alpha} + n^{-1/2}$. Exploiting noise information happens to be a fundamental necessity as consistent testing is impossible if nothing more than zero mean and unit variance is known about the noise distribution. Furthermore, in the setting where the variance function is $\beta$-H\"{o}lder but heteroskedasticity is measured only with respect to the design points, the minimax separation rate is shown to be $n^{-4\alpha} + n^{-\left((1/2) \vee (4\beta/(4\beta+1))\right)}$ when the noise is Gaussian and $n^{-4\alpha} + n^{-4\beta/(4\beta+1)} + n^{-2\beta}$ when the noise distribution is unknown.
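The building block behind first-order squared differences can be sketched briefly. Assuming independent noise and responses ordered by design point, $\tfrac{1}{2}(Y_{i+1} - Y_i)^2$ has expectation close to the average of the two neighboring variances when the mean function is smooth, because the mean nearly cancels in the difference. The code below computes only these raw quantities; it is not the kernel-based test statistic of the article, and the function name is hypothetical.

```python
def half_squared_differences(y):
    """Return 0.5 * (y[i+1] - y[i])**2 for consecutive responses.

    y is assumed to be ordered by design point. Each entry is a crude
    local variance proxy; averaging them gives a difference-based
    estimate of the overall noise variance.
    """
    return [0.5 * (y[i + 1] - y[i]) ** 2 for i in range(len(y) - 1)]


proxies = half_squared_differences([0.0, 2.0, 0.0, 2.0])
mean_proxy = sum(proxies) / len(proxies)
```

Under homoskedasticity these proxies fluctuate around a common level, so systematic variation across the design points is what a test would look for.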