Introduction: Even in effectively conducted randomised trials, the probability of a successful study remains relatively low. With recent advances in the next-generation sequencing technologies, there is a rapidly growing number of high-dimensional data, including genetic, molecular and phenotypic information, that have improved our understanding of driver genes, drug targets, and drug mechanisms of action. The leveraging of high-dimensional data holds promise for increased success of clinical trials. Methods: We provide an overview of methods for utilising high-dimensional data in clinical trials. We also investigate the use of these methods in practice through a review of recently published randomised clinical trials that utilise high-dimensional genetic data. The review includes articles that were published between 2019 and 2021, identified through the PubMed database. Results: Out of 174 screened articles, 100 (57.5%) were randomised clinical trials that collected high-dimensional data. The most common clinical area was oncology (30%), followed by chronic diseases (28%), nutrition and ageing (18%) and cardiovascular diseases (7%). The most common types of data analysed were gene expression data (70%), followed by DNA data (21%). The most common method of analysis (36.3%) was univariable analysis. Articles that described multivariable analyses used standard statistical methods. Most of the clinical trials had two arms. Discussion: New methodological approaches are required for more efficient analysis of the increasing amount of high-dimensional data collected in randomised clinical trials. We highlight the limitations and barriers to the current use of high-dimensional data in trials, and suggest potential avenues for improvement and future work.
A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others were focused on how to integrate homogeneous (or only mildly heterogeneous) sets of data. More recently, as data are becoming more accessible, the question of if data sets from different sources should be integrated is becoming more relevant. The current literature treats this as a question with only two answers: integrate or don't. Here we take a different approach, motivated by information-sharing principles coming from the shrinkage estimation literature. In particular, we deviate from the do/don't perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial parameter should be turned is shown to depend, for example, on the informativeness of the different data sources as measured by Fisher information. In the context of generalized linear models, this more nuanced data integration framework leads to relatively simple parameter estimates and valid tests/confidence intervals. Moreover, we demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation compared to other binary data integration schemes.
We consider the fundamental task of optimising a real-valued function defined in a potentially high-dimensional Euclidean space, such as the loss function in many machine-learning tasks or the logarithm of the probability distribution in statistical inference. We use Riemannian geometry notions to redefine the optimisation problem of a function on the Euclidean space to a Riemannian manifold with a warped metric, and then find the function's optimum along this manifold. The warped metric chosen for the search domain induces a computational friendly metric-tensor for which optimal search directions associated with geodesic curves on the manifold becomes easier to compute. Performing optimization along geodesics is known to be generally infeasible, yet we show that in this specific manifold we can analytically derive Taylor approximations up to third-order. In general these approximations to the geodesic curve will not lie on the manifold, however we construct suitable retraction maps to pull them back onto the manifold. Therefore, we can efficiently optimize along the approximate geodesic curves. We cover the related theory, describe a practical optimization algorithm and empirically evaluate it on a collection of challenging optimisation benchmarks. Our proposed algorithm, using 3rd-order approximation of geodesics, tends to outperform standard Euclidean gradient-based counterparts in term of number of iterations until convergence.
Recent progress in research on Deep Graph Networks (DGNs) has led to a maturation of the domain of learning on graphs. Despite the growth of this research field, there are still important challenges that are yet unsolved. Specifically, there is an urge of making DGNs suitable for predictive tasks on realworld systems of interconnected entities, which evolve over time. With the aim of fostering research in the domain of dynamic graphs, at first, we survey recent advantages in learning both temporal and spatial information, providing a comprehensive overview of the current state-of-the-art in the domain of representation learning for dynamic graphs. Secondly, we conduct a fair performance comparison among the most popular proposed approaches on node and edge-level tasks, leveraging rigorous model selection and assessment for all the methods, thus establishing a sound baseline for evaluating new architectures and approaches
In this work, we propose an adaptive geometric multigrid method for the solution of large-scale finite cell flow problems. The finite cell method seeks to circumvent the need for a boundary-conforming mesh through the embedding of the physical domain in a regular background mesh. As a result of the intersection between the physical domain and the background computational mesh, the resultant systems of equations are typically numerically ill-conditioned, rendering the appropriate treatment of cutcells a crucial aspect of the solver. To this end, we propose a smoother operator with favorable parallel properties and discuss its memory footprint and parallelization aspects. We propose three cache policies that offer a balance between cached and on-the-fly computation and discuss the optimization opportunities offered by the smoother operator. It is shown that the smoother operator, on account of its additive nature, can be replicated in parallel exactly with little communication overhead, which offers a major advantage in parallel settings as the geometric multigrid solver is consequently independent of the number of processes. The convergence and scalability of the geometric multigrid method is studied using numerical examples. It is shown that the iteration count of the solver remains bounded independent of the problem size and depth of the grid hierarchy. The solver is shown to obtain excellent weak and strong scaling using numerical benchmarks with more than 665 million degrees of freedom. The presented geometric multigrid solver is, therefore, an attractive option for the solution of large-scale finite cell problems in massively parallel high-performance computing environments.
Symplectic integrators are widely implemented numerical integrators for Hamiltonian mechanics, which preserve the Hamiltonian structure (symplecticity) of the system. Although the symplectic integrator does not conserve the energy of the system, it is well known that there exists a conserving modified Hamiltonian, called the shadow Hamiltonian. For the Nambu mechanics, which is one of the generalized Hamiltonian mechanics, we can also construct structure-preserving integrators by the same procedure used to construct the symplectic integrators. In the structure-preserving integrator, however, the existence of shadow Hamiltonians is non-trivial. This is because the Nambu mechanics is driven by multiple Hamiltonians and it is non-trivial whether the time evolution by the integrator can be cast into the Nambu mechanical time evolution driven by multiple shadow Hamiltonians. In the present paper we construct structure-preserving integrators for a simple Nambu mechanical system, and derive the shadow Hamiltonians in two ways. This is the first attempt to derive shadow Hamiltonians of structure-preserving integrators for Nambu mechanics. We show that the fundamental identity, which corresponds to the Jacobi identity in Hamiltonian mechanics, plays an important role to calculate the shadow Hamiltonians using the Baker-Campbell-Hausdorff formula. It turns out that the resulting shadow Hamiltonians have indefinite forms depending on how the fundamental identities are used. This is not a technical artifact, because the exact shadow Hamiltonians obtained independently have the same indefiniteness.
Continuous glucose monitoring (CGM) is a minimally invasive technology that allows continuous monitoring of an individual's blood glucose. We focus on a large clinical trial that collected CGM data every few minutes for 26 weeks and assumes that the basic observation unit is the distribution of CGM observations in a four-week interval. The resulting data structure is multilevel (because each individual has multiple months of data) and distributional (because the data for each four-week interval is represented as a distribution). The scientific goals are to: (1) identify and quantify the effects of factors that affect glycemic control in type 1 diabetes (T1D) patients; and (2) identify and characterize the patients who respond to treatment. To address these goals, we propose a new multilevel functional model that treats the CGM distributions as a response. Methods are motivated by and applied to data collected by The Juvenile Diabetes Research Foundation Continuous Glucose Monitoring Group. Reproducible code for the methods introduced here is available on GitHub.
Large-scale applications of Visual Place Recognition (VPR) require computationally efficient approaches. Further, a well-balanced combination of data-based and training-free approaches can decrease the required amount of training data and effort and can reduce the influence of distribution shifts between the training and application phases. This paper proposes a runtime and data-efficient hierarchical VPR pipeline that extends existing approaches and presents novel ideas. There are three main contributions: First, we propose Local Positional Graphs (LPG), a training-free and runtime-efficient approach to encode spatial context information of local image features. LPG can be combined with existing local feature detectors and descriptors and considerably improves the image-matching quality compared to existing techniques in our experiments. Second, we present Attentive Local SPED (ATLAS), an extension of our previous local features approach with an attention module that improves the feature quality while maintaining high data efficiency. The influence of the proposed modifications is evaluated in an extensive ablation study. Third, we present a hierarchical pipeline that exploits hyperdimensional computing to use the same local features as holistic HDC-descriptors for fast candidate selection and for candidate reranking. We combine all contributions in a runtime and data-efficient VPR pipeline that shows benefits over the state-of-the-art method Patch-NetVLAD on a large collection of standard place recognition datasets with 15$\%$ better performance in VPR accuracy, 54$\times$ faster feature comparison speed, and 55$\times$ less descriptor storage occupancy, making our method promising for real-world high-performance large-scale VPR in changing environments. Code will be made available with publication of this paper.
There are now many explainable AI methods for understanding the decisions of a machine learning model. Among these are those based on counterfactual reasoning, which involve simulating features changes and observing the impact on the prediction. This article proposes to view this simulation process as a source of creating a certain amount of knowledge that can be stored to be used, later, in different ways. This process is illustrated in the additive model and, more specifically, in the case of the naive Bayes classifier, whose interesting properties for this purpose are shown.
Graph-based approximation methods are of growing interest in many areas, including transportation, biological and chemical networks, financial models, image processing, network flows, and more. In these applications, often a basis for the approximation space is not available analytically and must be computed. We propose perturbations of Lagrange bases on graphs, where the Lagrange functions come from a class of functions analogous to classical splines. The basis functions we consider have local support, with each basis function obtained by solving a small energy minimization problem related to a differential operator on the graph. We present $\ell_\infty$ error estimates between the local basis and the corresponding interpolatory Lagrange basis functions in cases where the underlying graph satisfies a mild assumption on the connections of vertices where the function is not known, and the theoretical bounds are examined further in numerical experiments. Included in our analysis is a mixed-norm inequality for positive definite matrices that is tighter than the general estimate $\|A\|_{\infty} \leq \sqrt{n} \|A\|_{2}$.
Predictions in the form of probability distributions are crucial for decision-making. Quantile regression enables this within spatial interpolation settings for merging remote sensing and gauge precipitation data. However, ensemble learning of quantile regression algorithms remains unexplored in this context. Here, we address this gap by introducing nine quantile-based ensemble learners and applying them to large precipitation datasets. We employed a novel feature engineering strategy, reducing predictors to distance-weighted satellite precipitation at relevant locations, combined with location elevation. Our ensemble learners include six stacking and three simple methods (mean, median, best combiner), combining six individual algorithms: quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). These algorithms serve as both base learners and combiners within different stacking methods. We evaluated performance against QR using quantile scoring functions in a large dataset comprising 15 years of monthly gauge-measured and satellite precipitation in contiguous US (CONUS). Stacking with QR and QRNN yielded the best results across quantile levels of interest (0.025, 0.050, 0.075, 0.100, 0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800, 0.900, 0.925, 0.950, 0.975), surpassing the reference method by 3.91% to 8.95%. This demonstrates the potential of stacking to improve probabilistic predictions in spatial interpolation and beyond.