Data generation remains a bottleneck in training surrogate models to predict molecular properties. We demonstrate that multitask Gaussian process regression overcomes this limitation by leveraging both expensive and cheap data sources. In particular, we consider training sets constructed from coupled-cluster (CC) and density functional theory (DFT) data. We report that multitask surrogates can predict at CC-level accuracy while reducing the data generation cost by over an order of magnitude. Of note, our approach allows the training set to include DFT data generated by a heterogeneous mix of exchange-correlation functionals without imposing any artificial hierarchy on functional accuracy. More generally, the multitask framework can accommodate a wider range of training set structures -- including full disparity between the different levels of fidelity -- than existing kernel approaches based on $\Delta$-learning, though we show that the accuracy of the two approaches can be similar. Consequently, multitask regression can be a tool for reducing data generation costs even further by opportunistically exploiting existing data sources.
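As a rough illustration of the multitask idea (not the paper's actual model or data), the sketch below fits an intrinsic-coregionalization-model GP in plain NumPy: many cheap, biased observations and a few expensive ones share an input kernel, and a task covariance matrix `B` lets the expensive task borrow strength from the cheap one. The kernel, the toy data, and the value of `B` are all illustrative assumptions.

```python
# Minimal multitask GP sketch (intrinsic coregionalization model).
import numpy as np

def rbf(X1, X2, ell=1.0):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(0)
# Toy 1-D "molecular descriptor": many cheap (DFT-like) labels,
# few expensive (CC-like) labels of the same underlying trend.
x_cheap = np.linspace(0, 5, 40)
x_exp = np.array([0.5, 2.0, 4.5])
truth = np.sin                                             # "true" expensive property
y_cheap = truth(x_cheap) + 0.3 + 0.05 * rng.standard_normal(40)  # biased, noisy
y_exp = truth(x_exp)

X = np.concatenate([x_cheap, x_exp])
t = np.array([0] * 40 + [1] * 3)          # task index: 0 = cheap, 1 = expensive
y = np.concatenate([y_cheap, y_exp])

# ICM kernel: K((x,t),(x',t')) = B[t,t'] * k(x,x'); B encodes task correlation.
B = np.array([[1.0, 0.9], [0.9, 1.0]])
K = B[np.ix_(t, t)] * rbf(X, X) + 1e-4 * np.eye(len(X))

# Predict the *expensive* task on a dense grid using all the data.
xs = np.linspace(0, 5, 200)
Ks = B[np.ix_(np.ones(200, int), t)] * rbf(xs, X)
mean = Ks @ np.linalg.solve(K, y)
```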
The paper presents a numerical method for simulating flow and mechanics in fractured rock. The governing equations, which couple the effects in the rock mass and in the fractures, are obtained using the discrete fracture-matrix approach. The fracture flow is governed by the cubic law, and the contact conditions prevent the fractures from self-penetration. A stable finite element discretization is proposed for the displacement-pressure-flux formulation. The resulting nonlinear algebraic system of equations and inequalities is decoupled using a robust iterative splitting into a linearized flow subproblem and a quadratic programming problem for the mechanical part. The non-penetration conditions are handled by means of dualization and an optimal quadratic programming algorithm. The capability of the numerical scheme is demonstrated on a benchmark problem for tunnel excavation with hundreds of fractures in 3D. The paper's novelty consists in the combination of three crucial ingredients: (i) application of the discrete fracture-matrix approach to poroelasticity, (ii) a robust iterative splitting of the resulting nonlinear algebraic system that works for real-world 3D problems, and (iii) efficient solution of its mechanical quadratic programming part, with a large number of fractures in mutual contact, by means of our own solvers implemented in an in-house software library.
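To make the splitting idea concrete, here is a heavily simplified sketch of an iterative decoupling loop on a generic two-field block system: the mechanics-like block is solved with the pressure frozen, then the flow-like block with the displacement frozen, until the updates stagnate. This is a toy stand-in only; in the paper the mechanical step is a constrained quadratic program enforcing non-penetration, not the plain linear solve used here.

```python
# Toy block Gauss-Seidel splitting for a coupled mechanics-flow system.
import numpy as np

rng = np.random.default_rng(1)
n = 20
A = np.eye(n) * 4 + 0.1 * rng.standard_normal((n, n))   # mechanics-like block
A = 0.5 * (A + A.T)                                      # symmetrize
C = np.eye(n) * 3                                        # flow-like block
Bc = 0.2 * rng.standard_normal((n, n))                   # coupling block
f, g = rng.standard_normal(n), rng.standard_normal(n)    # right-hand sides

u = np.zeros(n)   # displacement-like unknown
p = np.zeros(n)   # pressure-like unknown
for it in range(50):
    u_new = np.linalg.solve(A, f - Bc.T @ p)    # mechanics step, pressure frozen
    p_new = np.linalg.solve(C, g - Bc @ u_new)  # flow step, displacement frozen
    if max(np.linalg.norm(u_new - u), np.linalg.norm(p_new - p)) < 1e-10:
        break
    u, p = u_new, p_new
```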
Signed graphs are an emerging way of representing data in a variety of contexts where conflicting interactions exist. These include data from biological, ecological, and social systems. Here we propose the concept of communicability geometry for signed graphs, proving that metrics in this space, such as the communicability distance and angles, are Euclidean and spherical. We then apply these metrics to solve several problems in the analysis of signed-graph data in a unified way, including the partitioning of signed graphs, dimensionality reduction, finding hierarchies of alliances in signed networks, and quantifying the degree of polarization between the existing factions in systems represented by this type of graph.
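For readers unfamiliar with communicability, the quantities involved are directly computable from the matrix exponential of the (signed) adjacency matrix. A minimal sketch follows, using an arbitrary 4-node signed graph as input.

```python
# Communicability distances and angles on a small signed graph.
import numpy as np
from scipy.linalg import expm

# Signed adjacency matrix: +1 friendly edge, -1 antagonistic edge.
A = np.array([[ 0,  1, -1,  0],
              [ 1,  0,  1, -1],
              [-1,  1,  0,  1],
              [ 0, -1,  1,  0]], dtype=float)

G = expm(A)  # communicability matrix

# Distance:  xi_pq    = sqrt(G_pp + G_qq - 2 G_pq)
# Angle:     cos(t_pq) = G_pq / sqrt(G_pp * G_qq)
n = len(A)
dist = np.zeros((n, n))
ang = np.zeros((n, n))
for p in range(n):
    for q in range(n):
        dist[p, q] = np.sqrt(G[p, p] + G[q, q] - 2 * G[p, q])
        ang[p, q] = np.degrees(
            np.arccos(np.clip(G[p, q] / np.sqrt(G[p, p] * G[q, q]), -1, 1)))
```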
We establish a near-optimality guarantee for the full orthogonalization method (FOM), showing that the overall convergence of FOM is nearly as good as GMRES. In particular, we prove that at every iteration $k$, there exists an iteration $j\leq k$ for which the FOM residual norm at iteration $j$ is no more than $\sqrt{k+1}$ times larger than the GMRES residual norm at iteration $k$. This bound is sharp, and it has implications for algorithms for approximating the action of a matrix function on a vector.
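The bound is easy to probe numerically. The sketch below builds a shared Arnoldi decomposition, extracts the FOM and GMRES iterates from the square and rectangular Hessenberg systems respectively, and checks the stated inequality on a random well-conditioned test matrix; this is an illustrative experiment, not part of the proof.

```python
# Numerical check of: min_{j<=k} ||r_j^FOM|| <= sqrt(k+1) * ||r_k^GMRES||.
import numpy as np

rng = np.random.default_rng(2)
n = 60
A = np.eye(n) + 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)

def arnoldi(A, b, m):
    # Build orthonormal Krylov basis Q and Hessenberg matrix H.
    Q = np.zeros((len(b), m + 1))
    H = np.zeros((m + 1, m))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(m):
        v = A @ Q[:, k]
        for j in range(k + 1):
            H[j, k] = Q[:, j] @ v
            v = v - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        Q[:, k + 1] = v / H[k + 1, k]
    return Q, H

m = 30
beta = np.linalg.norm(b)
Q, H = arnoldi(A, b, m)
fom_res, gmres_res = [], []
for k in range(1, m + 1):
    e1 = np.zeros(k + 1); e1[0] = beta
    y_fom = np.linalg.solve(H[:k, :k], e1[:k])          # FOM: square solve
    fom_res.append(np.linalg.norm(b - A @ (Q[:, :k] @ y_fom)))
    y_gm = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)[0]  # GMRES: least squares
    gmres_res.append(np.linalg.norm(b - A @ (Q[:, :k] @ y_gm)))

for k in range(1, m + 1):   # k is the 1-based iteration index
    assert min(fom_res[:k]) <= np.sqrt(k + 1) * gmres_res[k - 1] + 1e-12
```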
Maximum entropy (Maxent) models are a class of statistical models that use the maximum entropy principle to estimate probability distributions from data. Due to the size of modern data sets, Maxent models need efficient optimization algorithms to scale well for big data applications. State-of-the-art algorithms for Maxent models, however, were not originally designed to handle big data sets; these algorithms either rely on technical devices that may yield unreliable numerical results, scale poorly, or require smoothness assumptions that many practical Maxent models lack. In this paper, we present novel optimization algorithms that overcome the shortcomings of state-of-the-art algorithms for training large-scale, non-smooth Maxent models. Our proposed first-order algorithms leverage the Kullback--Leibler divergence to train large-scale and non-smooth Maxent models efficiently. For Maxent models over a discrete probability distribution on $n$ elements built from samples each containing $m$ features, estimating the stepsize parameters and computing each iteration of our algorithms require on the order of $O(mn)$ operations, and both tasks can be trivially parallelized. Moreover, the strong convexity of the Kullback--Leibler divergence with respect to the $\ell_{1}$ norm allows for larger stepsize parameters, thereby speeding up the convergence rate of our algorithms. To illustrate their efficiency, we consider the problem of estimating probabilities of fire occurrences as a function of ecological features in the Western US MTBS-Interagency wildfire data set. Our numerical results show that our algorithms outperform the state of the art by one order of magnitude and yield results that agree with physical models of wildfire occurrence and with previous statistical analyses of wildfire drivers.
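As a point of reference for the problem class (not the proposed algorithms themselves), the following sketch trains a discrete Maxent model by plain first-order descent on its smooth dual, the log-partition function minus the empirical feature means; each iteration costs on the order of $O(mn)$ operations, the scaling mentioned above. The data, dimensions, and stepsize are toy assumptions.

```python
# Baseline first-order training of a discrete Maxent model via its dual.
import numpy as np

rng = np.random.default_rng(3)
n, m = 1000, 10                  # n grid cells, m features per cell
F = rng.standard_normal((n, m))  # feature matrix
mu_hat = F[rng.integers(0, n, 200)].mean(axis=0)  # empirical feature means

lam = np.zeros(m)                # dual variables (one per feature)
for it in range(500):
    logits = F @ lam
    logits -= logits.max()               # stabilize the softmax
    p = np.exp(logits); p /= p.sum()     # model distribution; O(mn) per step
    grad = F.T @ p - mu_hat              # E_p[features] - empirical means
    lam -= 0.5 * grad
```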
In decision-making, maxitive functions are used for worst-case and best-case evaluations. Maxitivity gives rise to a rich structure that is well-studied in the context of the pointwise order. In this article, we investigate maxitivity with respect to general preorders and provide a representation theorem for such functionals. The results are illustrated for different stochastic orders in the literature, including the usual stochastic order, the increasing convex/concave order, and the dispersive order.
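For concreteness, recall the defining property in the classical pointwise case; the article's generalization replaces the pointwise supremum below by one taken with respect to a general preorder (this display states only the standard special case):
\[
\phi(f \vee g) \;=\; \max\{\phi(f),\, \phi(g)\} \qquad \text{for all } f, g,
\]
where $f \vee g$ denotes the pointwise maximum of $f$ and $g$.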
Data sets tend to live in low-dimensional non-linear subspaces, so ideal tools for analysing such data should account for this non-linear geometry. The setting of symmetric Riemannian geometry is suitable for a variety of reasons. First, it comes with a rich mathematical structure that accounts for a wide range of non-linear geometries and that, as empirical evidence from classical non-linear embeddings suggests, can capture the geometry of data. Second, many standard data analysis tools initially developed for data in Euclidean space can be generalised efficiently to data on a symmetric Riemannian manifold. A conceptual challenge arises from the lack of guidelines for constructing a symmetric Riemannian structure on the data space itself, and from the lack of guidelines for adapting data analysis algorithms that are successful on symmetric Riemannian manifolds to this setting. This work addresses these challenges in the setting of pullback Riemannian geometry through a diffeomorphism. The first part of the paper characterises diffeomorphisms that result in proper, stable and efficient data analysis. The second part then uses these best practices to guide the construction of such diffeomorphisms through deep learning. As a proof of concept, different types of pullback geometries -- among which the proposed construction -- are tested on several data analysis tasks and several toy data sets. The numerical experiments confirm the predictions from theory: for proper, stable and efficient data analysis, the diffeomorphism generating the pullback geometry needs to map the data manifold into a geodesic subspace of the pulled-back Riemannian manifold while preserving local isometry around the data manifold, and pulling back positive curvature can be problematic in terms of stability.
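A tiny example of the pullback mechanism, with a hand-picked diffeomorphism rather than a learned one and Euclidean geometry as the structure being pulled back: once the diffeomorphism maps the data manifold into a geodesic subspace (a straight line here), operations such as geodesic interpolation reduce to Euclidean ones in the transformed coordinates.

```python
# Pullback geometry sketch: phi "unbends" the parabola y = x^2.
import numpy as np

def phi(x):        # diffeomorphism of R^2 (illustrative choice)
    return np.array([x[0], x[1] - x[0] ** 2])

def phi_inv(z):    # its inverse
    return np.array([z[0], z[1] + z[0] ** 2])

def geodesic(x0, x1, t):
    # Pullback geodesic = straight line in phi-coordinates, mapped back.
    return phi_inv((1 - t) * phi(x0) + t * phi(x1))

# Two points on the data manifold y = x^2; the pullback geodesic stays on it.
x0, x1 = np.array([-1.0, 1.0]), np.array([2.0, 4.0])
midpoint = geodesic(x0, x1, 0.5)   # -> [0.5, 0.25], still on the parabola
```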
The presented methodology for testing the goodness of fit of an autoregressive Hilbertian model (ARH(1) model) provides an infinite-dimensional formulation of the approach proposed in Koul and Stute (1999), which is based on an empirical process marked by residuals. Applying central and functional central limit results for Hilbert-valued martingale difference sequences, the asymptotic behavior of the formulated H-valued empirical process, also indexed by H, is obtained under the null hypothesis. The limiting process is an H-valued generalized (i.e., indexed by H) Wiener process, leading to an asymptotically distribution-free test. Consistency of the test is also proved. The case of a misspecified autocorrelation operator of the ARH(1) process is addressed, and the asymptotic equivalence in probability, uniformly in the norm of H, of the empirical processes formulated under known and unknown autocorrelation operators is obtained. Beyond the Euclidean setting, this approach allows goodness-of-fit testing to be implemented in the context of manifold and spherical functional autoregressive processes.
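For orientation, in the scalar AR(1) special case the residual-marked empirical process of Koul and Stute (1999) reads (notation adapted for this summary)
\[
V_n(x) \;=\; \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left( X_i - \widehat{\rho}\, X_{i-1} \right) \mathbf{1}\{X_{i-1} \le x\}, \qquad x \in \mathbb{R};
\]
the formulation above replaces the real index $x$ by an element of the Hilbert space H and the scalar residuals by H-valued ones.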
This article is concerned with multilevel Monte Carlo (MLMC) methods for approximating expectations of some functions of the solution to the Heston 3/2-model from mathematical finance, which takes values in $(0, \infty)$ and possesses superlinearly growing drift and diffusion coefficients. To discretize the SDE model, a new Milstein-type scheme is proposed to produce independent sample paths. The proposed scheme can be solved explicitly and is unconditionally positivity-preserving, i.e., for any time step-size $h>0$. This positivity-preserving property for large discretization time steps is particularly desirable in the MLMC setting. Furthermore, a mean-square convergence rate of order one is proved in the non-globally Lipschitz regime, which is non-trivial, as the diffusion coefficient grows superlinearly. The obtained order-one convergence in turn guarantees the required decay of the variance of the multilevel estimator and justifies the optimal complexity $\mathcal{O}(\epsilon^{-2})$ for the MLMC approach, where $\epsilon > 0$ is the required target accuracy. Numerical experiments are finally reported to confirm the theoretical findings.
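The MLMC construction being relied on can be summarized in a few lines: couple fine and coarse paths through shared Brownian increments and telescope the level corrections. The sketch below does this with Euler-Maruyama on geometric Brownian motion purely for illustration; the paper's contribution is a positivity-preserving Milstein-type scheme that makes the same construction rigorous for the 3/2-model's superlinear coefficients.

```python
# MLMC telescoping estimator with coupled fine/coarse paths (toy SDE).
import numpy as np

rng = np.random.default_rng(4)
T, x0, mu, sigma = 1.0, 1.0, 0.05, 0.2
payoff = lambda x: np.maximum(x - 1.0, 0.0)

def level_estimator(level, n_paths, M=2):
    nf = M ** level                          # number of fine steps
    hf = T / nf
    dW = np.sqrt(hf) * rng.standard_normal((n_paths, nf))
    Xf = np.full(n_paths, x0)
    for k in range(nf):                      # fine path
        Xf = Xf + mu * Xf * hf + sigma * Xf * dW[:, k]
    if level == 0:
        return payoff(Xf).mean()
    hc = M * hf
    Xc = np.full(n_paths, x0)
    for k in range(nf // M):                 # coarse path, summed increments
        dWc = dW[:, M * k: M * (k + 1)].sum(axis=1)
        Xc = Xc + mu * Xc * hc + sigma * Xc * dWc
    return (payoff(Xf) - payoff(Xc)).mean()  # level-l correction

# Telescoping sum: E[payoff(X_L)] ~ sum of level corrections.
# (Uniform path counts here; MLMC theory prescribes level-dependent counts.)
estimate = sum(level_estimator(l, 20000) for l in range(5))
```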
Machine learning classification methods usually assume that all possible classes are sufficiently present within the training set. Due to their inherent rarity, extreme events are always under-represented, and classifiers tailored for predicting extremes need to be carefully designed to handle this under-representation. In this paper, we address the question of how to assess and compare classifiers with respect to their capacity to capture extreme occurrences. This is also related to the topic of scoring rules used in the forecasting literature. In this context, we propose and study a risk function adapted to extremal classifiers. The inferential properties of our empirical risk estimator are derived under the framework of multivariate regular variation and hidden regular variation. A simulation study compares different classifiers and assesses their performance with respect to our risk function. To conclude, we apply our framework to the analysis of extreme river discharges in the Danube river basin. The application compares different predictive algorithms and tests their capacity to forecast river discharges from those observed at other river stations.
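A schematic version of such a comparison (not the paper's risk functional, whose exact form is derived under regular variation): estimate misclassification risk only over observations whose radius exceeds a high empirical quantile, so that classifiers are ranked by their behavior in the tail.

```python
# Comparing classifiers on extreme observations only (toy illustration).
import numpy as np

rng = np.random.default_rng(5)
n = 10000
X = rng.pareto(3.0, size=(n, 2)) + 1.0        # heavy-tailed features
y = (X[:, 0] > X[:, 1]).astype(int)           # toy ground-truth labels

def tail_risk(predict, X, y, q=0.95):
    r = np.linalg.norm(X, axis=1)
    mask = r > np.quantile(r, q)              # keep extreme points only
    return np.mean(predict(X[mask]) != y[mask])

clf_a = lambda X: (X[:, 0] > X[:, 1]).astype(int)     # oracle rule
clf_b = lambda X: (X.sum(axis=1) > 2.5).astype(int)   # naive rule
print(tail_risk(clf_a, X, y), tail_risk(clf_b, X, y))
```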
Most state-of-the-art machine learning techniques revolve around the optimisation of loss functions. Defining appropriate loss functions is therefore critical to successfully solving problems in this field. We present a survey of the most commonly used loss functions for a wide range of applications, divided into classification, regression, ranking, sample generation and energy-based modelling. Overall, we introduce 33 different loss functions and organise them into an intuitive taxonomy. Each loss function is given a theoretical backing, and we describe where it is best used. This survey aims to provide a reference of the most essential loss functions for both beginner and advanced machine learning practitioners.
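As a flavor of the material such a survey covers, here are minimal NumPy versions of three of the most common losses (conventions assumed here: labels $y \in \{0,1\}$ with predicted probabilities for the log loss, labels $y \in \{-1,+1\}$ with raw scores for the hinge loss).

```python
# Three common loss functions in minimal NumPy form.
import numpy as np

def mse(y, p):
    # Regression: mean squared error.
    return np.mean((y - p) ** 2)

def cross_entropy(y, p, eps=1e-12):
    # Binary classification: log loss; clip to avoid log(0).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge(y, s):
    # Max-margin classification; y in {-1, +1}, s is the raw score.
    return np.mean(np.maximum(0.0, 1.0 - y * s))
```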