Missing data can lead to inefficiencies and biases in analyses, in particular when data are missing not at random (MNAR). It is thus vital to understand and correctly identify the missing data mechanism. Recovering missing values through a follow-up sample allows researchers to conduct hypothesis tests for MNAR, which are not possible when using only the original incomplete data. How the design of the follow-up sample affects the properties of these tests has been little explored in the literature. Our results provide comprehensive insight into the properties of one such test, based on the commonly used selection model framework. We determine conditions for recovery samples that allow the test to be applied appropriately and effectively, i.e. with known Type I error rates and optimized with respect to power. We thus provide an integrated framework for testing for the presence of MNAR and designing follow-up samples in an efficient, cost-effective way. The performance of our methodology is evaluated through simulation studies as well as on a real data sample.
We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we demonstrate good practical performance on small data sets: at epsilon = 1 we need only 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework of this type, as well as to several individually-designed private hypothesis tests. Our framework has higher power than other generic solutions and is at least competitive with (and often better than) individually-designed tests.
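To make the idea of a black-box private test concrete, here is a minimal sketch of one generic construction, subsample-and-aggregate with Laplace noise. It is not claimed to be the framework of the abstract. The names `private_test` and `z_test_pvalue`, the parameters `m` and `alpha0`, and the Monte Carlo calibration are assumptions made for illustration, and the sketch presumes the public test has uniformly distributed p-values under the null.

```python
import math
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def z_test_pvalue(xs):
    # Hypothetical public test: two-sided z-test of mean 0 with known variance 1.
    z = sum(xs) / math.sqrt(len(xs))
    return math.erfc(abs(z) / math.sqrt(2.0))


def private_test(data, pvalue_fn, epsilon, m=20, alpha0=0.1, alpha=0.05,
                 null_draws=20000, rng=None):
    """Return True if the noisy count of sub-test rejections is significant."""
    rng = rng or random.Random()
    # Data-independent partition: each record influences exactly one subset,
    # so the rejection count below has sensitivity 1.
    subsets = [data[i::m] for i in range(m)]
    count = sum(pvalue_fn(s) <= alpha0 for s in subsets)
    noisy = count + laplace(1.0 / epsilon, rng)
    # Null reference distribution (assumes uniform null p-values):
    # a Binomial(m, alpha0) count plus the same Laplace noise.
    null = sorted(
        sum(rng.random() < alpha0 for _ in range(m)) + laplace(1.0 / epsilon, rng)
        for _ in range(null_draws)
    )
    critical = null[int((1.0 - alpha) * null_draws)]
    return noisy > critical
```

The public test is passed in as a function and never inspected, which is what makes the wrapper black-box. A production implementation would also need a secure noise source and protection against floating-point attacks on the Laplace mechanism, both of which this sketch ignores.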
The two-sample problem, which consists in testing whether independent samples on $\mathbb{R}^d$ are drawn from the same (unknown) distribution, finds applications in many areas. Its study in high dimension is the subject of much attention, especially because the information acquisition processes at work in the Big Data era often involve various, poorly controlled sources, leading to datasets possibly exhibiting a strong sampling bias. While classic methods relying on the computation of a discrepancy measure between the empirical distributions face the curse of dimensionality, we develop an alternative approach based on statistical learning that extends rank tests, which, when appropriately designed, are capable of detecting small departures from the null assumption in the univariate case. Overcoming the lack of natural order on $\mathbb{R}^d$ when $d\geq 2$, it is implemented in two steps. Each sample is assigned a label (positive vs. negative) and divided into two parts; a preorder on $\mathbb{R}^d$ defined by a real-valued scoring function is learned by means of a bipartite ranking algorithm applied to the first part, and a rank test is then applied to the scores of the remaining observations to detect possible differences in distribution. Because it learns how to project the data onto the real line nearly like (any monotone transform of) the likelihood ratio between the original multivariate distributions would do, the approach is not much affected by the dimensionality, ignoring ranking model bias issues, and preserves the advantages of univariate rank tests. Nonasymptotic error bounds are proved based on recent concentration results for two-sample linear rank-processes, and an experimental study shows that the approach promoted surpasses alternative methods standing as natural competitors.
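The two-step procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the bipartite ranking algorithm is replaced by a deliberately simple stand-in (a linear score along the difference of the two training means), the rank test is a one-sided Mann-Whitney test with a normal approximation, and the function names are hypothetical.

```python
import math
import random


def mean_vec(rows):
    d = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(d)]


def ranking_two_sample_test(x, y):
    """One-sided p-value; x is the 'positive' sample, y the 'negative' one."""
    # Step 1: split each sample and learn a scoring function on the first halves.
    hx, hy = len(x) // 2, len(y) // 2
    x_train, x_test = x[:hx], x[hx:]
    y_train, y_test = y[:hy], y[hy:]
    # Stand-in scorer (assumption): project onto the difference of means.
    w = [a - b for a, b in zip(mean_vec(x_train), mean_vec(y_train))]

    def score(r):
        return sum(wi * ri for wi, ri in zip(w, r))

    # Step 2: Mann-Whitney rank test on the scores of the held-out halves.
    sx = [score(r) for r in x_test]
    sy = [score(r) for r in y_test]
    u = sum((a > b) + 0.5 * (a == b) for a in sx for b in sy)
    m, n = len(sx), len(sy)
    z = (u - m * n / 2.0) / math.sqrt(m * n * (m + n + 1) / 12.0)
    # Upper-tail p-value: the scorer is oriented so positives should rank higher.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def gaussian_sample(n, d, shift, rng):
    return [[rng.gauss(shift, 1.0) for _ in range(d)] for _ in range(n)]
```

Because the scorer is fitted on one half and the rank statistic is computed on the other, the held-out scores are exchangeable under the null hypothesis, so the univariate rank test keeps its level whatever scorer is plugged in.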
As the availability of omics data has increased in the last few years, more multi-omics data have been generated, that is, high-dimensional molecular data consisting of several types such as genomic, transcriptomic, or proteomic data, all obtained from the same patients. Such data lend themselves to being used as covariates in automatic outcome prediction because each omics type may contribute unique information, possibly improving predictions compared to using only one omics data type. Frequently, however, in the training data and the data to which automatic prediction rules should be applied, the test data, the different omics data types are not available for all patients. We refer to this type of data as block-wise missing multi-omics data. First, we provide a literature review on existing prediction methods applicable to such data. Subsequently, using a collection of 13 publicly available multi-omics data sets, we compare the predictive performances of several of these approaches for different block-wise missingness patterns. Finally, we discuss the results of this empirical comparison study and draw some tentative conclusions.
We discuss two approaches to solving the parametric (or stochastic) eigenvalue problem. One of them uses a Taylor expansion and the other a Chebyshev expansion. The parametric eigenvalue problem assumes that the matrix $A$ depends on a parameter $\mu$, where $\mu$ might be a random variable. Consequently, the eigenvalues and eigenvectors are also functions of $\mu$. We compute a Taylor approximation of these functions about $\mu_{0}$ by iteratively computing the Taylor coefficients. The complexity of this approach is $O(n^{3})$ for all eigenpairs, if the derivatives of $A(\mu)$ at $\mu_{0}$ are given. The Chebyshev expansion works similarly. We first find an initial approximation iteratively, which we then refine with Newton's method. This second method is more expensive but provides a good approximation over the whole interval of the expansion instead of only around a single point. We present numerical experiments confirming the complexity and demonstrating that the approaches are capable of tracking eigenvalues at intersection points. Further experiments shed light on the limitations of the Taylor expansion approach with respect to the distance from the expansion point $\mu_{0}$.
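As a small illustration of the Taylor idea, the sketch below computes only the first-order coefficient of one eigenvalue, using the standard perturbation formula $\lambda'(\mu_0) = v^\top A'(\mu_0) v$ for a simple eigenvalue of a symmetric matrix with unit eigenvector $v$. The $2\times 2$ example matrix $A(\mu)$ and all function names are assumptions chosen for illustration; the iterative computation of higher-order coefficients mentioned in the abstract is not reproduced here.

```python
import math


def sym2_eig_max(a, b, d):
    """Largest eigenvalue and unit eigenvector of the matrix [[a, b], [b, d]]."""
    lam = 0.5 * (a + d) + math.hypot(0.5 * (a - d), b)
    if b != 0.0:
        v = (b, lam - a)  # solves (A - lam * I) v = 0
    else:
        v = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)


def A(mu):
    # Hypothetical parametric matrix A(mu) = A0 + mu * A1, stored as (a, b, d).
    return (2.0 + mu, 1.0, 3.0)


DA = (1.0, 0.0, 0.0)  # entries (a, b, d) of the derivative A'(mu) = A1


def exact(mu):
    return sym2_eig_max(*A(mu))[0]


def taylor1(mu, mu0=0.0):
    """First-order Taylor approximation of the largest eigenvalue about mu0."""
    lam0, (v0, v1) = sym2_eig_max(*A(mu0))
    slope = v0 * v0 * DA[0] + 2.0 * v0 * v1 * DA[1] + v1 * v1 * DA[2]
    return lam0 + (mu - mu0) * slope


near_error = abs(taylor1(0.01) - exact(0.01))
far_error = abs(taylor1(0.5) - exact(0.5))
```

Comparing `near_error` with `far_error` shows the approximation degrading as $\mu$ moves away from the expansion point $\mu_{0}$, which is the limitation the abstract refers to.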
In Bayesian analysis, the selection of a prior distribution is typically done by considering each parameter in the model. While this can be convenient, in many scenarios it may be desirable to place a prior on a summary measure of the model instead. In this work, we propose a prior on the model fit, as measured by a Bayesian coefficient of determination (R2), which then induces a prior on the individual parameters. We achieve this by placing a beta prior on R2 and then deriving the induced prior on the global variance parameter for generalized linear mixed models. We derive closed-form expressions in many scenarios and present several approximation strategies when an analytic form is not possible and/or to allow for easier computation. In these situations, we suggest approximating the prior by using a generalized beta prime distribution and provide a simple default prior construction scheme. This approach is quite flexible and can be easily implemented in standard Bayesian software. Lastly, we demonstrate the performance of the method on simulated data, where it particularly shines in high-dimensional examples, as well as real-world data, which shows its ability to model spatial correlation in the random effects.
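The way a prior on R2 induces a prior on a variance-type parameter can be checked numerically in the simplest case. The sketch assumes, for illustration only, the relation R2 = W / (1 + W) between R2 and a global variance parameter W (a relation of this form arises in Gaussian linear models); under that assumption a Beta(a, b) prior on R2 gives W = R2 / (1 - R2) a beta prime distribution with the same parameters. The function names are hypothetical.

```python
import random


def sample_induced_w(a, b, n, seed=0):
    """Draw R2 ~ Beta(a, b) and return the induced draws W = R2 / (1 - R2)."""
    rng = random.Random(seed)
    draws = (rng.betavariate(a, b) for _ in range(n))
    return [r2 / (1.0 - r2) for r2 in draws]


def beta_prime_mean(a, b):
    # Mean of a beta prime distribution with parameters (a, b); finite only for b > 1.
    if b <= 1.0:
        raise ValueError("the mean is finite only for b > 1")
    return a / (b - 1.0)


w_draws = sample_induced_w(2.0, 5.0, 100000)
empirical_mean = sum(w_draws) / len(w_draws)
theoretical_mean = beta_prime_mean(2.0, 5.0)
```

The empirical mean of the induced draws should be close to the beta prime mean a / (b - 1). For the generalized linear mixed models of the abstract the link between R2 and the variance parameter is more involved, which is where the approximations it mentions come in.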
We consider the problem of optimizing the decisions of a preemptively capable transmitter to minimize the Age of Incorrect Information (AoII) when the communication channel has a random delay. In the system, a transmitter observes a Markovian source and makes decisions based on the system status. Time is slotted and normalized. In each time slot, the transmitter decides whether to preempt or skip when the channel is busy. When the channel is idle, the transmitter decides whether to send a new update. At the other end of the channel is a receiver that estimates the state of the Markovian source based on the update it receives. We consider a generic transmission delay and assume that the transmission delay is independent and identically distributed for each update. This paper aims to optimize the transmitter's decision in each time slot to minimize the AoII with generic time penalty functions. To this end, we first use the Markov decision process to formulate the optimization problem and derive the analytical expressions of the expected AoIIs achieved by two canonical preemptive policies. Then, we prove the existence of the optimal policy and provide a feasible value iteration algorithm to approximate the optimal policy. However, the value iteration algorithm will be computationally expensive if we want considerable confidence in the approximation. Therefore, we analyze the system characteristics under two canonical delay distributions and theoretically obtain the corresponding optimal policies using the policy improvement theorem. Finally, numerical results are presented to illustrate the performance improvements brought about by the preemption capability.
Diagnostic tests are almost never perfect. Studies quantifying their performance use knowledge of the true health status, measured with a reference diagnostic test. Researchers commonly assume that the reference test is perfect, which is not the case in practice. When the assumption fails, conventional studies identify "apparent" performance or performance with respect to the reference, but not true performance. This paper provides the smallest possible bounds on the measures of true performance - sensitivity (true positive rate) and specificity (true negative rate), or equivalently false positive and negative rates, in standard settings. Implied bounds on policy-relevant parameters are derived: 1) Prevalence in screened populations; 2) Predictive values. Methods for inference based on moment inequalities are used to construct uniformly consistent confidence sets in level over a relevant family of data distributions. Emergency Use Authorization (EUA) and independent study data for the BinaxNOW COVID-19 antigen test demonstrate that the bounds can be very informative. Analysis reveals that the estimated false negative rates for symptomatic and asymptomatic patients are up to 3.17 and 4.59 times higher than the frequently cited "apparent" false negative rate.
Model fairness is an essential element for Trustworthy AI. While many techniques for model fairness have been proposed, most of them assume that the training and deployment data distributions are identical, which is often not true in practice. In particular, when the bias between labels and sensitive groups changes, the fairness of the trained model is directly influenced and can worsen. We make two contributions for solving this problem. First, we analytically show that existing in-processing fair algorithms have fundamental limits in accuracy and group fairness. We introduce the notion of correlation shifts, which can explicitly capture the change of the above bias. Second, we propose a novel pre-processing step that samples the input data to reduce correlation shifts and thus enables the in-processing approaches to overcome their limitations. We formulate an optimization problem for adjusting the data ratio among labels and sensitive groups to reflect the shifted correlation. A key benefit of our approach lies in decoupling the roles of pre- and in-processing approaches: correlation adjustment via pre-processing and unfairness mitigation on the processed data via in-processing. Experiments show that our framework effectively improves existing in-processing fair algorithms w.r.t. accuracy and fairness, both on synthetic and real datasets.
Interconnected smart devices and industrial internet of things devices require low-latency communication to fulfill control objectives despite limited resources. Such devices are time-critical by nature, but they also require highly accurate input data, with the required accuracy depending on the significance of the data. In this paper, we investigate various coordinated and distributed semantic scheduling schemes from a data significance perspective. In particular, novel algorithms are proposed to analyze the benefit of such schemes for the significance in terms of estimation accuracy. Then, we derive bounds on the achievable estimation accuracy. Our numerical results showcase the superiority of semantic scheduling policies that adopt an integrated control and communication strategy: such policies can reduce the weighted sum of mean squared errors compared to traditional policies.
The integration of Visual Inertial Odometry (VIO) methods into a modular control system designed for deployment of Unmanned Aerial Vehicles (UAVs) and teams of cooperating UAVs in real-world conditions is presented in this paper. A reliability analysis and a fair performance comparison of several methods integrated into a control pipeline for achieving full autonomy in real conditions are provided. Although most VIO algorithms achieve excellent localization precision and negligible drift on artificially created datasets, the aspects of reliability in non-ideal situations, robustness to degraded sensor data, and the effects of external disturbances and feedback control coupling are not well studied. These imperfections, which are inherently present in real-world deployment of UAVs, negatively affect the ability of the most widely used VIO approaches to output a sensible pose estimate. We identify the conditions that are critical for a reliable flight under VIO localization and propose workarounds and compensations for situations in which such conditions cannot be achieved. The performance of the UAV system with integrated VIO methods is quantitatively analyzed w.r.t. RTK ground truth, and the ability to provide reliable pose estimation for the feedback control is demonstrated onboard a UAV that is tracking dynamic trajectories under challenging illumination.