Data complexity analysis quantifies the hardness of constructing a predictive model on a given dataset. However, the effectiveness of existing data complexity measures can be challenged by the presence of irrelevant features and feature interactions in biological microarray data. We propose a novel data complexity measure, depth, which leverages an evolutionary-inspired feature selection algorithm to quantify the complexity of microarray data. By examining feature subsets of varying sizes, the approach offers a novel perspective on data complexity analysis. Unlike traditional metrics, depth is robust to irrelevant features and effectively captures complexity stemming from feature interactions. On synthetic microarray data, depth outperforms existing methods both in robustness to irrelevant features and in identifying complexity arising from feature interactions. When applied to case-control genotype and gene-expression microarray datasets, depth reveals that a single feature of gene-expression data can account for over 90% of the performance of a multi-feature model, confirming the adequacy of the commonly used differentially expressed gene (DEG) feature selection method for gene-expression data. Our study also demonstrates that constructing predictive models for genotype data is harder than for gene-expression data. These results provide evidence for the use of interpretable machine learning algorithms on microarray data.
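The abstract does not give the algorithmic details; the following minimal sketch (illustrative dataset, classifier, and hyper-parameters are assumptions, not the paper's implementation) shows an evolutionary-style search over feature subsets of a fixed size k, whose best cross-validated score per k could be read as a depth-like curve.

```python
# Minimal sketch, NOT the paper's algorithm: evolutionary-style search over
# feature subsets of fixed size k, scored by cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=100, n_informative=5, random_state=0)

def fitness(subset):
    """Cross-validated accuracy of a simple model restricted to `subset`."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()

def evolve_subset(k, pop_size=20, generations=15):
    """Mutation-only evolutionary search for a good subset of k features."""
    pop = [rng.choice(X.shape[1], size=k, replace=False) for _ in range(pop_size)]
    scores = [fitness(s) for s in pop]
    for _ in range(generations):
        for i, parent in enumerate(pop):
            child = parent.copy()
            out = rng.integers(k)                                   # feature to drop
            candidates = np.setdiff1d(np.arange(X.shape[1]), child)
            child[out] = rng.choice(candidates)                     # feature to add
            s = fitness(child)
            if s > scores[i]:                                       # greedy replacement
                pop[i], scores[i] = child, s
    best = int(np.argmax(scores))
    return pop[best], scores[best]

# Best attainable score as a function of subset size k: if large k is needed to
# reach a given score, the dataset's complexity plausibly stems from interactions.
for k in (1, 2, 5, 10):
    _, score = evolve_subset(k)
    print(f"k={k}: best CV accuracy ~ {score:.3f}")
```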
Approximate forms of the $R_{II}$ and $R_{III}$ redistribution matrices are frequently applied to simplify the numerical solution of the radiative transfer problem for polarized radiation, taking partial frequency redistribution (PRD) effects into account. A widely used approximation for $R_{III}$ is to consider its expression under the assumption of complete frequency redistribution (CRD) in the observer frame ($R_{III}^{\rm CRD}$). The adequacy of this approximation for modeling intensity profiles has been firmly established. By contrast, its suitability for modeling scattering polarization signals has only been analyzed in a few studies, considering simplified settings. In this work, we aim to quantitatively assess the impact and the range of validity of the $R_{III}^{\rm CRD}$ approximation in the modeling of scattering polarization. We first present an analytic comparison between $R_{III}$ and $R_{III}^{\rm CRD}$. We then compare the results of radiative transfer calculations, out of local thermodynamic equilibrium, performed with $R_{III}$ and $R_{III}^{\rm CRD}$ in realistic one-dimensional atmospheric models. We focus on the chromospheric Ca I line at 4227 Å and on the photospheric Sr I line at 4607 Å.
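Schematically, and in our notation (the precise expressions, including the full angular and frequency dependence, are given in the paper), the CRD assumption in the observer frame factorizes the redistribution function into a product of profiles,
$$ R_{III}^{\rm CRD}(\nu', \vec{\Omega}'; \nu, \vec{\Omega}) \;\approx\; \varphi(\nu')\,\varphi(\nu)\, P(\vec{\Omega}', \vec{\Omega}), $$
where $\varphi$ is the absorption profile and $P$ a scattering phase matrix, so that the frequencies of the incoming and outgoing photons are fully decorrelated.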
This article is concerned with a regularity analysis of parametric operator equations with a perspective on uncertainty quantification. We study the regularity of mappings between Banach spaces near branches of isolated solutions that are implicitly defined by a residual equation. Under $s$-Gevrey assumptions on the residual equation, we establish $s$-Gevrey bounds on the Fr\'echet derivatives of the local data-to-solution mapping. This abstract framework is illustrated by proving regularity bounds for a semilinear elliptic partial differential equation with parametric and random field input.
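For reference, the generic form of an $s$-Gevrey bound (stated in our notation; the paper's precise formulation of the parametric directions may differ) on the derivatives of a map $u$ is
$$ \big\| u^{(k)}(\lambda) \big\| \;\le\; C\,\rho^{k}\,(k!)^{s} \qquad \text{for all } k \in \mathbb{N}_{0}, $$
for constants $C, \rho > 0$; the case $s = 1$ recovers analyticity, while $s > 1$ allows the faster derivative growth typical of smooth, non-analytic dependence.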
Nowadays, numerical models are widely used in most engineering fields to simulate the behaviour of complex systems, such as power plants or wind turbines in the energy sector. These models are nevertheless affected by uncertainties of different kinds (numerical, epistemic) which can affect the reliability of their predictions. We develop here a new method for quantifying conditional parameter uncertainty within a chain of two numerical models in the context of multiphysics simulation. More precisely, we aim to calibrate the parameters $\theta$ of the second model of the chain conditionally on the value of the parameters $\lambda$ of the first model, while assuming that the probability distribution of $\lambda$ is known. This conditional calibration is carried out from the available experimental data of the second model. In doing so, we aim to quantify as accurately as possible the impact of the uncertainty of $\lambda$ on the uncertainty of $\theta$. To perform this conditional calibration, we set out a nonparametric Bayesian formalism to estimate the functional dependence between $\theta$ and $\lambda$, denoted by $\theta(\lambda)$. First, each component of $\theta(\lambda)$ is assumed to be the realization of a Gaussian process prior. Then, if the second model is written as a linear function of $\theta(\lambda)$, the Bayesian machinery allows us to compute the posterior predictive distribution of $\theta(\lambda)$ analytically for any set of realizations of $\lambda$. The effectiveness of the proposed method is illustrated on several analytical examples.
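As a toy illustration of the functional-dependence idea only (not the paper's analytic posterior for the linear case), one can place an independent GP prior on each component of $\theta(\lambda)$ and predict $\theta$ at new values of $\lambda$ from a few calibrated pairs; the data and kernel choices below are assumptions.

```python
# Minimal sketch with synthetic calibration pairs (lambda_i, theta_i): each
# component of theta(lambda) gets an independent GP prior and is predicted at
# new lambda values. This is NOT the paper's analytic posterior derivation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
lam_train = rng.uniform(0.0, 1.0, size=(15, 1))              # realizations of lambda
theta_train = np.hstack([np.sin(3 * lam_train),              # two components of theta(lambda)
                         0.5 * lam_train ** 2]) + 0.02 * rng.standard_normal((15, 2))

kernel = RBF(length_scale=0.3) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(lam_train, theta_train)

lam_new = np.linspace(0, 1, 5).reshape(-1, 1)
theta_mean, theta_std = gp.predict(lam_new, return_std=True)
print(theta_mean)   # posterior predictive mean of theta(lambda) at new lambda
print(theta_std)    # componentwise predictive standard deviation
```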
Semitopologies model consensus in distributed systems by equating the notion of a quorum -- a set of participants sufficient to make local progress -- with that of an open set. This yields a topology-like theory of consensus, but semitopologies generalise topologies, since the intersection of two quorums need not itself be a quorum. The semitopological model of consensus is naturally heterogeneous and local, just as topologies can be heterogeneous and local, and for the same reasons: points may have different quorums, and there is no restriction that open sets / quorums be uniformly generated (e.g. open sets can be something other than two-thirds majorities of the points in the space). Semiframes are an algebraic abstraction of semitopologies: they are to semitopologies as frames are to topologies. We give a notion of semifilter, which plays a role analogous to that of filters, and show how to build a semiframe out of the open sets of a semitopology, and a semitopology out of the semifilters of a semiframe. We define suitable notions of category and morphism and prove a categorical duality between (sober) semiframes and (spatial) semitopologies, and we investigate well-behavedness properties of semitopologies and semiframes across the duality. Surprisingly, the structure of semiframes is not what one might initially expect just from looking at semitopologies, and the canonical structure required for the duality result -- a compatibility relation *, generalising set intersection -- is also canonical for expressing well-behavedness properties. Overall, we deliver a new categorical, algebraic, abstract framework within which to study consensus in distributed systems, one which is also simply interesting to consider as a mathematical theory in its own right.
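On our reading (a hedged sketch of the axioms; see the paper for the precise definition), the underlying structure is: a semitopology on a set $P$ is a collection $\mathsf{Open} \subseteq \mathcal{P}(P)$ that contains $P$ and is closed under arbitrary (possibly empty) unions; unlike a topology, closure under finite intersections is not required, which is exactly the quorum behaviour described above.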
Network motifs are small-scale patterns of interactions that recur frequently across a system. They shed light on the interplay between the topology and the dynamics of complex networks across various domains. In this work, we focus on the problem of counting occurrences of small sub-hypergraph patterns in very large hypergraphs, where higher-order interactions connect arbitrary numbers of system units. We show how directly exploiting higher-order structures speeds up the counting process compared to traditional data mining techniques for exact motif discovery. Moreover, with hyperedge sampling, performance is further improved at the cost of small errors in the estimation of motif frequencies. We evaluate our method on several real-world datasets describing face-to-face interactions, co-authorship and human communication. We show that our approximate algorithm allows us to extract higher-order motifs faster and on a larger scale, beyond the computational limits of an exact approach.
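As a toy illustration of the sampling idea only (not the paper's motif-counting algorithm), the sketch below estimates the count of one simple pattern, pairs of hyperedges sharing at least two nodes, by uniformly sampling hyperedges and rescaling by the inclusion probability; the hypergraph and pattern are illustrative assumptions.

```python
# Minimal sketch of hyperedge sampling for approximate pattern counting.
import itertools
import random

random.seed(0)
# Toy hypergraph: each hyperedge is a frozenset of node ids.
hyperedges = [frozenset(random.sample(range(300), random.randint(2, 5))) for _ in range(2000)]

def exact_count(edges):
    """Exact number of hyperedge pairs sharing >= 2 nodes (quadratic, for reference)."""
    return sum(1 for e, f in itertools.combinations(edges, 2) if len(e & f) >= 2)

def sampled_estimate(edges, m):
    """Sample m hyperedges without replacement, count the pattern among them,
    and rescale by the probability that both hyperedges of a pair are sampled."""
    sample = random.sample(edges, m)
    n = len(edges)
    pair_prob = (m * (m - 1)) / (n * (n - 1))
    return exact_count(sample) / pair_prob

print("exact   :", exact_count(hyperedges))
print("estimate:", round(sampled_estimate(hyperedges, m=400)))
```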
The alpha complex is a fundamental data structure from computational geometry, which encodes the topological type of a union of balls $B(x; r) \subset \mathbb{R}^m$ for $x \in S$, including a weighted version that allows for varying radii. It consists of the collection of "simplices" $\sigma = \{x_0, \ldots, x_k\} \subset S$, which correspond to nonempty $(k+1)$-fold intersections of cells in a radius-restricted version of the Voronoi diagram. Existing algorithms for computing the alpha complex require the points to reside in low dimension because they begin by computing the entire Delaunay complex, which rapidly becomes intractable, even when the alpha complex itself is of reasonable size. This paper presents a method for computing the alpha complex without computing the full Delaunay triangulation by applying Lagrangian duality, specifically an algorithm based on dual quadratic programming that seeks to rule simplices out rather than to rule them in.
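In standard notation consistent with the description above (not necessarily the exact program used in the paper), deciding whether $\sigma = \{x_0, \ldots, x_k\}$ belongs to the alpha complex at radius $r$ amounts to the convex feasibility problem
$$ \text{find } z \in \mathbb{R}^m \ \text{ such that } \ \lVert z - x_i \rVert^2 \le r^2 \ \text{ and } \ \lVert z - x_i \rVert^2 \le \lVert z - y \rVert^2 \quad \text{for all } x_i \in \sigma,\ y \in S; $$
dual (quadratic-programming) certificates of infeasibility are then natural objects for ruling a simplex out without certifying membership.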
Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well justified in overparameterized settings, where the number of parameters exceeds the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description length (MDL) and define a novel MDL-based complexity (MDL-COMP) that remains valid for overparameterized models. MDL-COMP is defined via an optimality criterion over the encodings induced by a good ridge estimator class. We provide an extensive theoretical characterization of MDL-COMP for linear models and kernel methods and show that it is not just a function of parameter count, but rather a function of the singular values of the design or kernel matrix and of the signal-to-noise ratio. For a linear model with $n$ observations, $d$ parameters, and i.i.d. Gaussian predictors, MDL-COMP scales linearly with $d$ when $d<n$, but the scaling is exponentially smaller, namely $\log d$, for $d>n$. For kernel methods, we show that MDL-COMP informs minimax in-sample error, and can decrease as the dimensionality of the input increases. We also prove that MDL-COMP upper bounds the in-sample mean squared error (MSE). Via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP informs hyper-parameter tuning for optimizing test MSE with ridge regression in limited-data settings, sometimes improving upon cross-validation and (always) saving computational costs. Finally, our findings suggest that the recently observed double descent phenomenon in overparameterized models might be a consequence of the choice of non-ideal estimators.
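For background (standard ridge notation, not the paper's definition of MDL-COMP), the encodings are built from ridge estimators
$$ \hat{\theta}_\lambda \;=\; (X^\top X + \lambda I)^{-1} X^\top y, \qquad X = U \,\mathrm{diag}(d_1, d_2, \ldots)\, V^\top, $$
which is consistent with the statement above that the resulting complexity depends on the design only through its singular values $d_i$, together with the signal-to-noise ratio.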
The covXtreme software provides functionality for estimation of marginal and conditional extreme value models, non-stationary with respect to covariates, and of environmental design contours. Generalised Pareto (GP) marginal models for peaks over threshold are estimated, using a piecewise-constant representation for the variation of the GP threshold and scale parameters on the (potentially multidimensional) covariate domain of interest. The conditional variation of one or more associated variates, given a large value of a single conditioning variate, is described using the conditional extremes model of Heffernan and Tawn (2004), the slope term of which is also assumed to vary in a piecewise-constant manner with covariates. Optimal smoothness of marginal and conditional extreme value model parameters with respect to covariates is estimated using cross-validated roughness-penalised maximum likelihood estimation. Uncertainties in model parameter estimates due to the choice of marginal and conditional extreme value thresholds, and to sample size, are quantified using a bootstrap resampling scheme. Estimates of environmental contours using various schemes, including the direct sampling approach of Huseby et al. (2013), are calculated by simulation or numerical integration under the fitted models. The software was developed in MATLAB for metocean applications, but is applicable generally to multivariate samples of peaks over threshold. The software can be downloaded from GitHub, together with an accompanying user guide.
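covXtreme itself is MATLAB; purely as an illustration of the generalised Pareto peaks-over-threshold step (the threshold choice, data, and return-level definition below are toy assumptions, not part of the package), a marginal fit looks like:

```python
# Minimal illustration of a generalised Pareto peaks-over-threshold fit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.gumbel(loc=5.0, scale=2.0, size=5000)        # synthetic environmental variable
u = np.quantile(x, 0.95)                             # peaks-over-threshold threshold
exceedances = x[x > u] - u

# Fit GP shape (xi) and scale (sigma) to exceedances, location fixed at 0.
xi, _, sigma = stats.genpareto.fit(exceedances, floc=0)

# N-observation return level: the value exceeded on average once per N observations.
p_exceed_u = exceedances.size / x.size
N = 10000
return_level = u + stats.genpareto.ppf(1 - 1 / (N * p_exceed_u), xi, loc=0, scale=sigma)
print(f"xi={xi:.3f}, sigma={sigma:.3f}, {N}-obs return level ~ {return_level:.2f}")
```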
We present an efficient preconditioner for linear problems $A x = y$. It guarantees monotonic convergence of the memory-efficient fixed-point iteration for all accretive systems of the form $A = L + V$, where $L$ is an approximation of $A$ and the system is scaled so that the discrepancy satisfies $\lVert V \rVert<1$. In contrast to common splitting preconditioners, our approach is not restricted to any particular splitting. Therefore, the approximate problem can be chosen so that an analytic solution is available to efficiently evaluate the preconditioner. We prove that the only preconditioner with this property has the form $(L+I)(I - V)^{-1}$. This unique form moreover permits the elimination of the forward problem from the preconditioned system, often halving the time required per iteration. We demonstrate and evaluate our approach for wave problems, diffusion problems, and pantograph delay differential equations. With the latter, we show how the method extends to general, not necessarily accretive, linear systems.
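A toy dense sketch of the setup, under our reading of the abstract (an assumption, not the paper's implementation): the preconditioned system is obtained by applying $(I - V)(L + I)^{-1}$, the inverse of the stated form, to both sides, followed by a plain fixed-point (Richardson) iteration.

```python
# Toy sketch of a preconditioned fixed-point iteration for A = L + V, with the
# system scaled so that ||V|| < 1. Matrices below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 50
B = rng.standard_normal((n, n))
A = 3.0 * np.eye(n) + (B + B.T) / np.sqrt(2 * n)   # symmetric positive definite, hence accretive
y = rng.standard_normal(n)

L = np.diag(np.diag(A))                            # cheap approximation of A
V = A - L
s = 0.9 / np.linalg.norm(V, 2)                     # scale the system so that ||sV|| < 1
A_s, y_s, L_s, V_s = s * A, s * y, s * L, s * V

I = np.eye(n)
Gamma = (I - V_s) @ np.linalg.inv(L_s + I)         # applied to both sides of A_s x = y_s

x = np.zeros(n)
res = []
for _ in range(200):
    r = Gamma @ (y_s - A_s @ x)                    # preconditioned residual
    res.append(np.linalg.norm(r))
    x = x + r                                      # fixed-point update

x_ref = np.linalg.solve(A, y)
print("relative error :", np.linalg.norm(x - x_ref) / np.linalg.norm(x_ref))
print("residual decay :", res[0], "->", res[-1])
```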
Gaussian processes (GPs) are popular nonparametric statistical models for learning unknown functions and quantifying the spatiotemporal uncertainty in data. Recent works have extended GPs to model scalar and vector quantities distributed over non-Euclidean domains, including smooth manifolds appearing in numerous fields such as computer vision, dynamical systems, and neuroscience. However, these approaches assume that the manifold underlying the data is known, limiting their practical utility. We introduce RVGP, a generalisation of GPs for learning vector signals over latent Riemannian manifolds. Our method uses positional encoding with eigenfunctions of the connection Laplacian associated with the tangent bundle, readily derived from common graph-based approximations of data. We demonstrate that RVGP possesses global regularity over the manifold, which allows it to super-resolve and inpaint vector fields while preserving singularities. Furthermore, we use RVGP to reconstruct high-density neural dynamics from low-density EEG recordings in healthy individuals and Alzheimer's patients. We show that vector field singularities are important disease markers and that their reconstruction yields classification accuracy of disease states comparable to that of high-density recordings. Thus, our method overcomes a significant practical limitation in experimental and clinical applications.
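A heavily simplified sketch of the positional-encoding idea (using the scalar graph Laplacian as a stand-in for the connection Laplacian, and a generic GP regressor; everything below is an illustrative assumption, not the RVGP model):

```python
# Graph-Laplacian eigenvectors as positional encodings for GP regression of a
# vector field on a point cloud. Scalar Laplacian stand-in; NOT the RVGP model.
import numpy as np
from scipy.sparse import csgraph
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(4)
# Toy point cloud on a circle (a 1D latent manifold embedded in 2D).
t = rng.uniform(0, 2 * np.pi, 300)
pts = np.column_stack([np.cos(t), np.sin(t)]) + 0.01 * rng.standard_normal((300, 2))
vec = np.column_stack([-np.sin(t), np.cos(t)])        # tangential vector field to learn

# kNN graph -> symmetric adjacency -> Laplacian eigenvectors as positional encodings.
W = kneighbors_graph(pts, n_neighbors=8, mode="connectivity")
W = 0.5 * (W + W.T)
Lap = csgraph.laplacian(W, normed=True)
_, evecs = np.linalg.eigh(Lap.toarray())
phi = evecs[:, :20] * np.sqrt(len(pts))               # 20 smallest eigenvectors, rescaled

# Train on a sparse subset of points, predict the vector field everywhere else.
train = rng.choice(300, size=60, replace=False)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(phi[train], vec[train])
pred = gp.predict(phi)
print("mean reconstruction error:", np.mean(np.linalg.norm(pred - vec, axis=1)))
```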