A new approach to local and global explanation is proposed. It is based on selecting a convex hull constructed for a finite number of points around an explained instance. The convex hull allows us to consider a dual representation of instances in the form of convex combinations of extreme points of the resulting polytope. Instead of perturbing new instances in the Euclidean feature space, vectors of convex combination coefficients are uniformly generated from the unit simplex, and they form a new dual dataset. A dual linear surrogate model is trained on the dual dataset. The feature importance values of the explanation are computed by means of simple matrix calculations. The approach can be regarded as a modification of the well-known LIME method. The dual representation inherently allows us to obtain an example-based explanation. The neural additive model is also considered as a tool for implementing the example-based explanation approach. Numerical experiments with real datasets are performed to study the approach. The code of the proposed algorithms is available.
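The dual sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the triangle used as the polytope, and the linear `black_box` are all hypothetical. Uniform sampling on the unit simplex is done by normalising i.i.d. exponential draws, which yields a Dirichlet(1, ..., 1) sample.

```python
import random


def sample_simplex(k, rng):
    """Draw one point uniformly from the unit simplex in R^k.

    Normalising k i.i.d. Exp(1) draws gives a Dirichlet(1, ..., 1) sample,
    which is the uniform distribution on the simplex.
    """
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def to_feature_space(coeffs, vertices):
    """Map convex-combination coefficients to a point inside the polytope."""
    dim = len(vertices[0])
    return [sum(c * v[j] for c, v in zip(coeffs, vertices)) for j in range(dim)]


def make_dual_dataset(vertices, black_box, n_samples, seed=0):
    """Build (coefficient vector, black-box output) pairs: the dual dataset."""
    rng = random.Random(seed)
    dual_X, y = [], []
    for _ in range(n_samples):
        lam = sample_simplex(len(vertices), rng)
        dual_X.append(lam)
        y.append(black_box(to_feature_space(lam, vertices)))
    return dual_X, y


# Hypothetical setup: a triangle as the polytope and a linear black-box model.
vertices = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
black_box = lambda x: 3.0 * x[0] - x[1]
dual_X, y = make_dual_dataset(vertices, black_box, n_samples=5)
```

Each row of `dual_X` is a coefficient vector on which a linear surrogate could then be fitted. Fitting the surrogate and converting its coefficients to feature importances are not shown here.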
We introduce a causal regularisation extension to anchor regression (AR) for improved out-of-distribution (OOD) generalisation. We present anchor-compatible losses, aligning with the anchor framework to ensure robustness against distribution shifts. Various multivariate analysis (MVA) algorithms, such as (Orthonormalized) PLS, RRR, and MLR, fall within the anchor framework. We observe that simple regularisation enhances robustness in OOD settings. Estimators for selected algorithms are provided, showcasing consistency and efficacy in synthetic and real-world climate science problems. The empirical validation highlights the versatility of anchor regularisation, emphasizing its compatibility with MVA approaches and its role in enhancing replicability while guarding against distribution shifts. The extended AR framework advances causal inference methodologies, addressing the need for reliable OOD generalisation.
In decision-making, maxitive functions are used for worst-case and best-case evaluations. Maxitivity gives rise to a rich structure that is well-studied in the context of the pointwise order. In this article, we investigate maxitivity with respect to general preorders and provide a representation theorem for such functionals. The results are illustrated for different stochastic orders in the literature, including the usual stochastic order, the increasing convex/concave order, and the dispersive order.
Data sets tend to live in low-dimensional non-linear subspaces. Ideal data analysis tools for such data sets should therefore account for such non-linear geometry. The symmetric Riemannian geometry setting can be suitable for a variety of reasons. First, it comes with a rich mathematical structure to account for a wide range of non-linear geometries that has been shown to be able to capture the data geometry through empirical evidence from classical non-linear embedding. Second, many standard data analysis tools initially developed for data in Euclidean space can also be generalised efficiently to data on a symmetric Riemannian manifold. A conceptual challenge comes from the lack of guidelines for constructing a symmetric Riemannian structure on the data space itself and the lack of guidelines for modifying successful algorithms on symmetric Riemannian manifolds for data analysis to this setting. This work considers these challenges in the setting of pullback Riemannian geometry through a diffeomorphism. The first part of the paper characterises diffeomorphisms that result in proper, stable and efficient data analysis. The second part then uses these best practices to guide construction of such diffeomorphisms through deep learning. As a proof of concept, different types of pullback geometries -- among which the proposed construction -- are tested on several data analysis tasks and on several toy data sets. The numerical experiments confirm the predictions from theory, i.e., that the diffeomorphisms generating the pullback geometry need to map the data manifold into a geodesic subspace of the pulled back Riemannian manifold while preserving local isometry around the data manifold for proper, stable and efficient data analysis, and that pulling back positive curvature can be problematic in terms of stability.
The purpose of anonymizing structured data is to protect the privacy of individuals in the data while retaining the statistical properties of the data. There is a large body of work that examines anonymization vulnerabilities. Focusing on strong anonymization mechanisms, this paper examines a number of prominent attack papers and finds several problems, all of which lead to overstating risk. First, some papers fail to establish a correct statistical inference baseline (or any at all), leading to incorrect measures. Notably, the reconstruction attack from the US Census Bureau that led to a redesign of its disclosure method made this mistake. We propose the non-member framework, an improved method for computing a more accurate inference baseline, and give examples of its operation. Second, some papers do not use a realistic membership base rate, leading to incorrect precision measures if precision is reported. Third, some papers unnecessarily report measures in such a way that it is difficult or impossible to assess risk. Virtually the entire literature on membership inference attacks, dozens of papers, makes one or both of these errors. We propose that membership inference papers report precision/recall values using a representative range of base rates.
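To see why the membership base rate matters for precision, the following sketch applies Bayes' rule to a fixed attack and varies the base rate. The function name and the attack's true and false positive rates are hypothetical values chosen for illustration, not figures from any of the papers discussed.

```python
def precision_at_base_rate(tpr, fpr, base_rate):
    """Precision of a membership inference attack via Bayes' rule.

    tpr: fraction of members flagged as members (this is also the recall)
    fpr: fraction of non-members flagged as members
    base_rate: prior probability that a target is a member
    """
    true_pos = tpr * base_rate
    false_pos = fpr * (1.0 - base_rate)
    if true_pos + false_pos == 0.0:
        return float("nan")  # the attack never flags anyone
    return true_pos / (true_pos + false_pos)


# Hypothetical attack: 60% recall, 5% false positive rate.
tpr, fpr = 0.60, 0.05
for base_rate in (0.5, 0.1, 0.01, 0.001):
    p = precision_at_base_rate(tpr, fpr, base_rate)
    print(f"base rate {base_rate:>6}: precision {p:.3f}, recall {tpr:.2f}")
```

The same attack looks strong under a balanced 50% base rate and much weaker when members are rare, which is why a range of base rates is worth reporting.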
We consider bounded linear operators acting in Banach spaces with a basis; such operators can be represented by an infinite matrix. We prove that for an invertible operator there exists a sequence of invertible finite-dimensional operators such that the family of norms of their inverses is uniformly bounded. It follows that the solutions of the finite-dimensional equations converge to the solution of the initial operator equation with an infinite-dimensional matrix.
A higher-order change-of-measure multilevel Monte Carlo (MLMC) method is developed for computing weak approximations of the invariant measures of SDEs with drift coefficients that do not satisfy the contractivity condition. This is achieved by introducing a spring term in the pairwise coupling of the MLMC trajectories employing the order 1.5 strong It\^o--Taylor method. Through this, we can recover the contractivity property of the drift coefficient while still retaining the telescoping sum property needed for implementing the MLMC method. We show that the variance of the change-of-measure MLMC method grows linearly in time $T$ for all $T > 0$, and for all sufficiently small timestep sizes $h > 0$. For a given error tolerance $\epsilon > 0$, we prove that the method achieves a mean-square-error accuracy of $O(\epsilon^2)$ with a computational cost of $O(\epsilon^{-2} \big\vert \log \epsilon \big\vert^{3/2} (\log \big\vert \log \epsilon \big\vert)^{1/2})$ for uniformly Lipschitz continuous payoff functions and $O \big( \epsilon^{-2} \big\vert \log \epsilon \big\vert^{5/3 + \xi} \big)$ for discontinuous payoffs, respectively, where $\xi > 0$. We also observe that the constant associated with the computational cost of the higher-order change-of-measure MLMC method improves on that of the Milstein change-of-measure method in the seminal work by M. Giles and W. Fang. Several numerical tests were performed to verify the theoretical results and assess the robustness of the method.
This article is concerned with the multilevel Monte Carlo (MLMC) methods for approximating expectations of some functions of the solution to the Heston 3/2-model from mathematical finance, which takes values in $(0, \infty)$ and possesses superlinearly growing drift and diffusion coefficients. To discretize the SDE model, a new Milstein-type scheme is proposed to produce independent sample paths. The proposed scheme can be explicitly solved and is positivity-preserving unconditionally, i.e., for any time step-size $h>0$. This positivity-preserving property for large discretization time steps is particularly desirable in the MLMC setting. Furthermore, a mean-square convergence rate of order one is proved in the non-globally Lipschitz regime, which is not trivial, as the diffusion coefficient grows super-linearly. The obtained order-one convergence in turn promises the desired relevant variance of the multilevel estimator and justifies the optimal complexity $\mathcal{O}(\epsilon^{-2})$ for the MLMC approach, where $\epsilon > 0$ is the required target accuracy. Numerical experiments are finally reported to confirm the theoretical findings.
In this manuscript, we combine non-intrusive reduced order models (ROMs) with space-dependent aggregation techniques to build a mixed-ROM. The prediction of the mixed formulation is given by a convex combination of the predictions of some previously trained ROMs, where we assign to each model a space-dependent weight. The ROMs taken into account to build the mixed model exploit different reduction techniques, such as Proper Orthogonal Decomposition (POD) and AutoEncoders (AE), and/or different approximation techniques, namely Radial Basis Function (RBF) interpolation, Gaussian Process Regression (GPR), or a feed-forward Artificial Neural Network (ANN). Each model receives a higher weight in the regions where it performs best and a smaller weight where it is less accurate than the other models. Finally, a regression technique, namely a Random Forest, is exploited to evaluate the weights for unseen conditions. The performance of the aggregated model is evaluated on two different test cases: the 2D flow past a NACA 4412 airfoil at an angle of attack of 5 degrees, with the Reynolds number as parameter, varying between 1e5 and 1e6, and a transonic flow over a NACA 0012 airfoil, with the angle of attack as parameter. In both cases, the mixed-ROM provides improved accuracy with respect to each individual ROM technique.
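The space-dependent convex aggregation can be illustrated with a small sketch. The weighting rule used here, normalised inverse squared error at each spatial point, is an assumption made for illustration; the abstract does not specify how the weights are computed. The function names and toy data are likewise hypothetical, and the regression step that predicts weights for unseen parameters is not shown.

```python
def convex_weights(errors, eps=1e-12):
    """Per-point convex weights from per-point absolute errors.

    errors[m][i] is the error of model m at spatial point i.
    Assumed rule: weight proportional to 1 / (error^2 + eps), normalised so
    that the weights of all models sum to one at every point.
    """
    n_models, n_points = len(errors), len(errors[0])
    weights = [[0.0] * n_points for _ in range(n_models)]
    for i in range(n_points):
        inv = [1.0 / (errors[m][i] ** 2 + eps) for m in range(n_models)]
        total = sum(inv)
        for m in range(n_models):
            weights[m][i] = inv[m] / total
    return weights


def mix(predictions, weights):
    """Pointwise convex combination of the model predictions."""
    n_models, n_points = len(predictions), len(predictions[0])
    return [
        sum(weights[m][i] * predictions[m][i] for m in range(n_models))
        for i in range(n_points)
    ]


# Toy field on three spatial points and two hypothetical ROM predictions.
truth = [1.0, 2.0, 3.0]
preds = [[1.0, 2.5, 3.0],   # model A: accurate at points 0 and 2
         [1.4, 2.0, 3.6]]   # model B: accurate at point 1
errors = [[abs(p - t) for p, t in zip(model, truth)] for model in preds]
weights = convex_weights(errors)
mixed = mix(preds, weights)
```

Because the weights are non-negative and sum to one at every point, the mixed prediction always lies between the smallest and largest individual prediction at that point.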
To date, most methods for simulating conditioned diffusions are limited to the Euclidean setting. The conditioned process can be constructed using a change of measure known as Doob's $h$-transform. The specific type of conditioning depends on a function $h$ which is typically unknown in closed form. To resolve this, we extend the notion of guided processes to a manifold $M$, where one replaces $h$ by a function based on the heat kernel on $M$. We consider the case of a Brownian motion with drift, constructed using the frame bundle of $M$, conditioned to hit a point $x_T$ at time $T$. We prove equivalence of the laws of the conditioned process and the guided process with a tractable Radon-Nikodym derivative. Subsequently, we show how one can obtain guided processes on any manifold $N$ that is diffeomorphic to $M$ without assuming knowledge of the heat kernel on $N$. We illustrate our results with numerical simulations and an example of parameter estimation where a diffusion process on the torus is observed discretely in time.
The purpose of this study is to introduce a new approach to feature ranking for classification tasks, referred to below as greedy feature selection. In statistical learning, feature selection is usually realized by means of methods that are independent of the classifier applied to perform the prediction using the reduced set of features. Instead, greedy feature selection identifies the most important feature at each step according to the selected classifier. In the paper, the benefits of such a scheme are investigated theoretically in terms of model capacity indicators, such as the Vapnik-Chervonenkis (VC) dimension or the kernel alignment, and tested numerically by considering its application to the problem of predicting geo-effective manifestations of the active Sun.
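A classifier-dependent greedy ranking of the kind described above can be sketched as a forward selection loop. The choices below are assumptions made to keep the example self-contained: a nearest-centroid classifier stands in for the selected classifier, training accuracy stands in for the selection criterion, and the toy data and function names are hypothetical.

```python
def centroid_accuracy(X, y, features):
    """Training accuracy of a nearest-centroid classifier on `features` only."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    correct = 0
    for row, label in zip(X, y):
        # Predict the class whose centroid is closest in squared distance.
        pred = min(
            centroids,
            key=lambda c: sum(
                (row[f] - centroids[c][k]) ** 2 for k, f in enumerate(features)
            ),
        )
        correct += pred == label
    return correct / len(y)


def greedy_feature_ranking(X, y, score=centroid_accuracy):
    """Rank features by adding, at each step, the one that scores best
    together with the features already selected."""
    remaining = list(range(len(X[0])))
    ranking = []
    while remaining:
        best = max(remaining, key=lambda f: score(X, y, ranking + [f]))
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Toy data: only feature 1 separates the two classes.
X = [(0.0, 0.0, 5.0), (1.0, 0.1, 1.0), (0.0, 0.2, 3.0), (1.0, 0.1, 4.0),
     (0.0, 1.0, 2.0), (1.0, 0.9, 5.0), (0.0, 1.1, 1.0), (1.0, 1.0, 4.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ranking = greedy_feature_ranking(X, y)
```

On this toy data, feature 1 is ranked first because it is the only single feature that classifies every sample correctly. Swapping `score` for a different classifier changes the ranking criterion without changing the loop.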