High-dimensional data are routinely collected in many areas. We are particularly interested in Bayesian classification models in which one or more variables are imbalanced. Current Markov chain Monte Carlo algorithms for posterior computation are inefficient as $n$ and/or $p$ increase due to worsening time per step and mixing rates. One strategy is to use a gradient-based sampler to improve mixing while using data sub-samples to reduce per-step computational complexity. However, usual sub-sampling breaks down when applied to imbalanced data. Instead, we generalize piece-wise deterministic Markov chain Monte Carlo algorithms to include importance-weighted and mini-batch sub-sampling. These approaches maintain the correct stationary distribution with arbitrarily small sub-samples, and substantially outperform current competitors. We provide theoretical support and illustrate gains in simulated and real data applications.
Data depth functions have been intensively studied for normed vector spaces. However, a discussion on depth functions on data where one specific data structure cannot be presupposed is lacking. In this article, we introduce a notion of depth functions for data types that are not given in statistical standard data formats and therefore we do not have one specific data structure. We call such data in general non-standard data. To achieve this, we represent the data via formal concept analysis which leads to a unified data representation. Besides introducing depth functions for non-standard data using formal concept analysis, we give a systematic basis by introducing structural properties. Furthermore, we embed the generalised Tukey depth into our concept of data depth and analyse it using the introduced structural properties. Thus, this article provides the mathematical formalisation of centrality and outlyingness for non-standard data and therefore increases the spaces centrality is currently discussed. In particular, it gives a basis to define further depth functions and statistical inference methods for non-standard data.
We introduce a fine-grained framework for uncertainty quantification of predictive models under distributional shifts. This framework distinguishes the shift in covariate distributions from that in the conditional relationship between the outcome ($Y$) and the covariates ($X$). We propose to reweight the training samples to adjust for an identifiable covariate shift while protecting against worst-case conditional distribution shift bounded in an $f$-divergence ball. Based on ideas from conformal inference and distributionally robust learning, we present an algorithm that outputs (approximately) valid and efficient prediction intervals in the presence of distributional shifts. As a use case, we apply the framework to sensitivity analysis of individual treatment effects with hidden confounding. The proposed methods are evaluated in simulation studies and three real data applications, demonstrating superior robustness and efficiency compared with existing benchmarks.
In practical applications, effectively segmenting cracks in large-scale computed tomography (CT) images holds significant importance for understanding the structural integrity of materials. However, classical methods and Machine Learning algorithms often incur high computational costs when dealing with the substantial size of input images. Hence, a robust algorithm is needed to pre-detect crack regions, enabling focused analysis and reducing computational overhead. The proposed approach addresses this challenge by offering a streamlined method for identifying crack regions in CT images with high probability. By efficiently identifying areas of interest, our algorithm allows for a more focused examination of potential anomalies within the material structure. Through comprehensive testing on both semi-synthetic and real 3D CT images, we validate the efficiency of our approach in enhancing crack segmentation while reducing computational resource requirements.
Testing for independence between two random vectors is a fundamental problem in statistics. It is observed from empirical studies that many existing omnibus consistent tests may not work well for some strongly nonmonotonic and nonlinear relationships. To explore the reasons behind this issue, we novelly transform the multivariate independence testing problem equivalently into checking the equality of two bivariate means. An important observation we made is that the power loss is mainly due to cancellation of positive and negative terms in dependence metrics, making them very close to zero. Motivated by this observation, we propose a class of consistent metrics with a positive integer $\gamma$ that exactly characterize independence. Theoretically, we show that the metrics with even and infinity $\gamma$ can effectively avoid the cancellation, and have high powers under the alternatives that two mean differences offset each other. Since we target at a wide range of dependence scenarios in practice, we further suggest to combine the p-values of test statistics with different $\gamma$'s through the Fisher's method. We illustrate the advantages of our proposed tests through extensive numerical studies.
Interpolation of data on non-Euclidean spaces is an active research area fostered by its numerous applications. This work considers the Hermite interpolation problem: finding a sufficiently smooth manifold curve that interpolates a collection of data points on a Riemannian manifold while matching a prescribed derivative at each point. We propose a novel procedure relying on the general concept of retractions to solve this problem on a large class of manifolds, including those for which computing the Riemannian exponential or logarithmic maps is not straightforward, such as the manifold of fixed-rank matrices. We analyze the well-posedness of the method by introducing and showing the existence of retraction-convex sets, a generalization of geodesically convex sets. We extend to the manifold setting a classical result on the asymptotic interpolation error of Hermite interpolation. We finally illustrate these results and the effectiveness of the method with numerical experiments on the manifold of fixed-rank matrices and the Stiefel manifold of matrices with orthonormal columns.
In a regression model with multiple response variables and multiple explanatory variables, if the difference of the mean vectors of the response variables for different values of explanatory variables is always in the direction of the first principal eigenvector of the covariance matrix of the response variables, then it is called a multivariate allometric regression model. This paper studies the estimation of the first principal eigenvector in the multivariate allometric regression model. A class of estimators that includes conventional estimators is proposed based on weighted sum-of-squares matrices of regression sum-of-squares matrix and residual sum-of-squares matrix. We establish an upper bound of the mean squared error of the estimators contained in this class, and the weight value minimizing the upper bound is derived. Sufficient conditions for the consistency of the estimators are discussed in weak identifiability regimes under which the difference of the largest and second largest eigenvalues of the covariance matrix decays asymptotically and in ``large $p$, large $n$" regimes, where $p$ is the number of response variables and $n$ is the sample size. Several numerical results are also presented.
Randomized matrix algorithms have become workhorse tools in scientific computing and machine learning. To use these algorithms safely in applications, they should be coupled with posterior error estimates to assess the quality of the output. To meet this need, this paper proposes two diagnostics: a leave-one-out error estimator for randomized low-rank approximations and a jackknife resampling method to estimate the variance of the output of a randomized matrix computation. Both of these diagnostics are rapid to compute for randomized low-rank approximation algorithms such as the randomized SVD and randomized Nystr\"om approximation, and they provide useful information that can be used to assess the quality of the computed output and guide algorithmic parameter choices.
Digital credentials represent a cornerstone of digital identity on the Internet. To achieve privacy, certain functionalities in credentials should be implemented. One is selective disclosure, which allows users to disclose only the claims or attributes they want. This paper presents a novel approach to selective disclosure that combines Merkle hash trees and Boneh-Lynn-Shacham (BLS) signatures. Combining these approaches, we achieve selective disclosure of claims in a single credential and creation of a verifiable presentation containing selectively disclosed claims from multiple credentials signed by different parties. Besides selective disclosure, we enable issuing credentials signed by multiple issuers using this approach.
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
Mendelian randomization uses genetic variants as instrumental variables to make causal inferences about the effects of modifiable risk factors on diseases from observational data. One of the major challenges in Mendelian randomization is that many genetic variants are only modestly or even weakly associated with the risk factor of interest, a setting known as many weak instruments. Many existing methods, such as the popular inverse-variance weighted (IVW) method, could be biased when the instrument strength is weak. To address this issue, the debiased IVW (dIVW) estimator, which is shown to be robust to many weak instruments, was recently proposed. However, this estimator still has non-ignorable bias when the effective sample size is small. In this paper, we propose a modified debiased IVW (mdIVW) estimator by multiplying a modification factor to the original dIVW estimator. After this simple correction, we show that the bias of the mdIVW estimator converges to zero at a faster rate than that of the dIVW estimator under some regularity conditions. Moreover, the mdIVW estimator has smaller variance than the dIVW estimator.We further extend the proposed method to account for the presence of instrumental variable selection and balanced horizontal pleiotropy. We demonstrate the improvement of the mdIVW estimator over the dIVW estimator through extensive simulation studies and real data analysis.