Modeling symptom progression to identify informative subjects for a new Huntington's disease clinical trial is problematic since time to diagnosis, a key covariate, can be heavily censored. Imputation is an appealing strategy where censored covariates are replaced with their conditional means, but existing methods saw over 200% bias under heavy censoring. Calculating these conditional means well requires estimating and then integrating over the survival function of the censored covariate from the censored value to infinity. To estimate the survival function flexibly, existing methods use the semiparametric Cox model with Breslow's estimator, leaving the integrand for the conditional means (the estimated survival function) undefined beyond the observed data. The integral is then estimated up to the largest observed covariate value, and this approximation can cut off the tail of the survival function and lead to severe bias, particularly under heavy censoring. We propose a hybrid approach that splices together the semiparametric survival estimator with a parametric extension, making it possible to approximate the integral up to infinity. In simulation studies, our proposed approach of extrapolation then imputation substantially reduces the bias seen with existing imputation methods, even when the parametric extension was misspecified. We further demonstrate how imputing with corrected conditional means helps to prioritize patients for future clinical trials.
Neural ordinary differential equations (neural ODEs) have emerged as a natural tool for supervised learning from a control perspective, yet a complete understanding of their optimal architecture remains elusive. In this work, we examine the interplay between their width $p$ and number of layer transitions $L$ (effectively the depth $L+1$). Specifically, we assess the model expressivity in terms of its capacity to interpolate either a finite dataset $D$ comprising $N$ pairs of points or two probability measures in $\mathbb{R}^d$ within a Wasserstein error margin $\varepsilon>0$. Our findings reveal a balancing trade-off between $p$ and $L$, with $L$ scaling as $O(1+N/p)$ for dataset interpolation, and $L=O\left(1+(p\varepsilon^d)^{-1}\right)$ for measure interpolation. In the autonomous case, where $L=0$, a separate study is required, which we undertake focusing on dataset interpolation. We address the relaxed problem of $\varepsilon$-approximate controllability and establish an error decay of $\varepsilon\sim O(\log(p)p^{-1/d})$. This decay rate is a consequence of applying a universal approximation theorem to a custom-built Lipschitz vector field that interpolates $D$. In the high-dimensional setting, we further demonstrate that $p=O(N)$ neurons are likely sufficient to achieve exact control.
The interpretability of models has become a crucial issue in Machine Learning because of algorithmic decisions' growing impact on real-world applications. Tree ensemble methods, such as Random Forests or XgBoost, are powerful learning tools for classification tasks. However, while combining multiple trees may provide higher prediction quality than a single one, it sacrifices the interpretability property resulting in "black-box" models. In light of this, we aim to develop an interpretable representation of a tree-ensemble model that can provide valuable insights into its behavior. First, given a target tree-ensemble model, we develop a hierarchical visualization tool based on a heatmap representation of the forest's feature use, considering the frequency of a feature and the level at which it is selected as an indicator of importance. Next, we propose a mixed-integer linear programming (MILP) formulation for constructing a single optimal multivariate tree that accurately mimics the target model predictions. The goal is to provide an interpretable surrogate model based on oblique hyperplane splits, which uses only the most relevant features according to the defined forest's importance indicators. The MILP model includes a penalty on feature selection based on their frequency in the forest to further induce sparsity of the splits. The natural formulation has been strengthened to improve the computational performance of {mixed-integer} software. Computational experience is carried out on benchmark datasets from the UCI repository using a state-of-the-art off-the-shelf solver. Results show that the proposed model is effective in yielding a shallow interpretable tree approximating the tree-ensemble decision function.
Multivariate spatio-temporal models are widely applicable, but specifying their structure is complicated and may inhibit wider use. We introduce the R package tinyVAST from two viewpoints: the software user and the statistician. From the user viewpoint, tinyVAST adapts a widely used formula interface to specify generalized additive models, and combines this with arguments to specify spatial and spatio-temporal interactions among variables. These interactions are specified using arrow notation (from structural equation models), or an extended arrow-and-lag notation that allows simultaneous, lagged, and recursive dependencies among variables over time. The user also specifies a spatial domain for areal (gridded), continuous (point-count), or stream-network data. From the statistician viewpoint, tinyVAST constructs sparse precision matrices representing multivariate spatio-temporal variation, and parameters are estimated by specifying a generalized linear mixed model (GLMM). This expressive interface encompasses vector autoregressive, empirical orthogonal functions, spatial factor analysis, and ARIMA models. To demonstrate, we fit to data from two survey platforms sampling corals, sponges, rockfishes, and flatfishes in the Gulf of Alaska and Aleutian Islands. We then compare eight alternative model structures using different assumptions about habitat drivers and survey detectability. Model selection suggests that towed-camera and bottom trawl gears have spatial variation in detectability but sample the same underlying density of flatfishes and rockfishes, and that rockfishes are positively associated with sponges while flatfishes are negatively associated with corals. We conclude that tinyVAST can be used to test complicated dependencies representing alternative structural assumptions for research and real-world policy evaluation.
The rise of AI in human contexts places new demands on automated systems to be transparent and explainable. We examine some anthropomorphic ideas and principles relevant to such accountablity in order to develop a theoretical framework for thinking about digital systems in complex human contexts and the problem of explaining their behaviour. Structurally, systems are made of modular and hierachical components, which we abstract in a new system model using notions of modes and mode transitions. A mode is an independent component of the system with its own objectives, monitoring data, and algorithms. The behaviour of a mode, including its transitions to other modes, is determined by functions that interpret each mode's monitoring data in the light of its objectives and algorithms. We show how these belief functions can help explain system behaviour by visualising their evaluation as trajectories in higher-dimensional geometric spaces. These ideas are formalised mathematically by abstract and concrete simplicial complexes. We offer three techniques: a framework for design heuristics, a general system theory based on modes, and a geometric visualisation, and apply them in three types of human-centred systems.
We provide a fine classification of bisimilarities between states of possibly different labelled Markov processes (LMP). We show that a bisimilarity relation proposed by Panangaden that uses direct sums coincides with "event bisimilarity" from his joint work with Danos, Desharnais, and Laviolette. We also extend Giorgio Bacci's notions of bisimilarity between two different processes to the case of nondeterministic LMP and generalize the game characterization of state bisimilarity by Clerc et al. for the latter.
Using well-known mathematical problems for encryption is a widely used technique because they are computationally hard and provide security against potential attacks on the encryption method. The subset sum problem (SSP) can be defined as finding a subset of integers from a given set, whose sum is equal to a specified integer. The classic SSP has various variants, one of which is the multiple-subset problem (MSSP). In the MSSP, the goal is to select items from a given set and distribute them among multiple bins, en-suring that the capacity of each bin is not exceeded while maximizing the total weight of the selected items. This approach addresses a related problem with a different perspective. Here a related different kind of problem is approached: given a set of sets A={A1, A2..., An}, find an integer s for which every subset of the given sets is summed up to, if such an integer exists. The problem is NP-complete when considering it as a variant of SSP. However, there exists an algorithm that is relatively efficient for known pri-vate keys. This algorithm is based on dispensing non-relevant values of the potential sums. In this paper we present the encryption scheme based on MSSP and present its novel usage and implementation in communication.
Auditory spatial attention detection (ASAD) aims to decode the attended spatial location with EEG in a multiple-speaker setting. ASAD methods are inspired by the brain lateralization of cortical neural responses during the processing of auditory spatial attention, and show promising performance for the task of auditory attention decoding (AAD) with neural recordings. In the previous ASAD methods, the spatial distribution of EEG electrodes is not fully exploited, which may limit the performance of these methods. In the present work, by transforming the original EEG channels into a two-dimensional (2D) spatial topological map, the EEG data is transformed into a three-dimensional (3D) arrangement containing spatial-temporal information. And then a 3D deep convolutional neural network (DenseNet-3D) is used to extract temporal and spatial features of the neural representation for the attended locations. The results show that the proposed method achieves higher decoding accuracy than the state-of-the-art (SOTA) method (94.3% compared to XANet's 90.6%) with 1-second decision window for the widely used KULeuven (KUL) dataset, and the code to implement our work is available on Github: //github.com/xuxiran/ASAD_DenseNet
The optimal one-sided parametric polynomial approximants of a circular arc are considered. More precisely, the approximant must be entirely in or out of the underlying circle of an arc. The natural restriction to an arc's approximants interpolating boundary points is assumed. However, the study of approximants, which additionally interpolate corresponding tangent directions and curvatures at the boundary of an arc, is also considered. Several low-degree polynomial approximants are studied in detail. When several solutions fulfilling the interpolation conditions exist, the optimal one is characterized, and a numerical algorithm for its construction is suggested. Theoretical results are demonstrated with several numerical examples and a comparison with general (i.e. non-one-sided) approximants are provided.
Open sets are central to mathematics, especially analysis and topology, in ways few notions are. In most, if not all, computational approaches to mathematics, open sets are only studied indirectly via their 'codes' or 'representations'. In this paper, we study how hard it is to compute, given an arbitrary open set of reals, the most common representation, i.e. a countable set of open intervals. We work in Kleene's higher-order computability theory, which was historically based on the S1-S9 schemes and which now has an intuitive lambda calculus formulation due to the authors. We establish many computational equivalences between on one hand the 'structure' functional that converts open sets to the aforementioned representation, and on the other hand functionals arising from mainstream mathematics, like basic properties of semi-continuous functions, the Urysohn lemma, and the Tietze extension theorem. We also compare these functionals to known operations on regulated and bounded variation functions, and the Lebesgue measure restricted to closed sets. We obtain a number of natural computational equivalences for the latter involving theorems from mainstream mathematics.
Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. Newly desirable wheat variety traits include disease resistance to reduce pesticide use, adaptation to climate change, resistance to heat and drought stresses, or low gluten content of grains. Wheat breeding experiments are documented by a large body of scientific literature and observational data obtained in-field and under controlled conditions. The cross-referencing of complementary information from the literature and observational data is essential to the study of the genotype-phenotype relationship and to the improvement of wheat selection. The scientific literature on genetic marker-assisted selection describes much information about the genotype-phenotype relationship. However, the variety of expressions used to refer to traits and phenotype values in scientific articles is a hinder to finding information and cross-referencing it. When trained adequately by annotated examples, recent text mining methods perform highly in named entity recognition and linking in the scientific domain. While several corpora contain annotations of human and animal phenotypes, currently, no corpus is available for training and evaluating named entity recognition and entity-linking methods in plant phenotype literature. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 540 PubMed references fully annotated for trait, phenotype, and species named entities using the Wheat Trait and Phenotype Ontology and the species taxonomy of the National Center for Biotechnology Information. A study of the performance of tools trained on the Triticum aestivum trait Corpus shows that the corpus is suitable for the training and evaluation of named entity recognition and linking.