Multistate current status (CS) data present a severe form of censoring, because each study participant transitioning through a sequence of well-defined disease states is observed only once, at a random inspection time. Moreover, these data may be clustered within specified groups, and the cluster sizes may be informative when a latent relationship exists between the transition outcomes and the cluster sizes. Failure to adjust for this informativeness may lead to biased inference. Motivated by a clinical study of periodontal disease (PD), we propose an extension of the pseudo-value approach to estimate covariate effects on the state occupation probabilities (SOP) for clustered multistate CS data with informative cluster or intra-cluster group sizes. The proposed pseudo-value technique first computes marginal estimators of the SOP using nonparametric regression. Next, the estimating equations based on the corresponding pseudo-values are reweighted by functions of the cluster sizes to adjust for the informativeness. We perform a variety of simulation studies to examine the properties of our pseudo-value regression based on the nonparametric marginal estimators under different scenarios of informativeness. For illustration, the method is applied to the motivating PD dataset, which encapsulates this complex data-generation mechanism.
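As a rough illustration of the reweighting step, the sketch below fits pseudo-values with a working-independence estimating equation weighted by inverse cluster size, a standard correction for informative cluster size; the jackknife construction and all function names are illustrative assumptions, not the authors' exact estimator.

```python
import numpy as np

def jackknife_pseudovalues(theta_hat, theta_loo, n):
    """Pseudo-values n*theta_hat - (n-1)*theta_(-i) from a marginal
    SOP estimator theta_hat and its leave-one-out versions theta_loo."""
    return n * theta_hat - (n - 1) * np.asarray(theta_loo)

def weighted_pseudo_value_fit(y_pv, X, cluster_sizes):
    """Solve the working-independence estimating equation
        sum_i (1/n_i) X_i' (y_i - X_i beta) = 0   (identity link),
    i.e. weighted least squares with inverse-cluster-size weights
    (cluster_sizes holds one entry per observation)."""
    w = 1.0 / np.asarray(cluster_sizes, dtype=float)
    XtW = X.T * w                      # scale each observation's column
    return np.linalg.solve(XtW @ X, XtW @ y_pv)
```

The inverse-cluster-size weights make every cluster contribute equally regardless of its size, which is the usual device for removing the bias induced by informative cluster sizes.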
A general a posteriori error analysis applies to five lowest-order finite element methods for two fourth-order semi-linear problems with trilinear non-linearity and a general source. A quasi-optimal smoother extends the source term to the discrete trial space, and more importantly, modifies the trilinear term in the stream-function vorticity formulation of the incompressible 2D Navier-Stokes and the von K\'{a}rm\'{a}n equations. This enables the first efficient and reliable a posteriori error estimates for the 2D Navier-Stokes equations in the stream-function vorticity formulation for Morley, two discontinuous Galerkin, $C^0$ interior penalty, and WOPSIP discretizations with piecewise quadratic polynomials.
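For orientation, "reliable and efficient" means the computable estimator $\eta(u_h)$ bounds the true error from above and below; schematically (generic constants and data-oscillation term, not the paper's precise statement):

```latex
% reliability and efficiency of an a posteriori error estimator \eta(u_h)
\| u - u_h \| \;\le\; C_{\mathrm{rel}}\, \eta(u_h),
\qquad
\eta(u_h) \;\le\; C_{\mathrm{eff}}\, \bigl( \| u - u_h \| + \operatorname{osc}(f) \bigr).
```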
Unmeasured confounding presents a common challenge in observational studies, potentially rendering standard causal parameters unidentifiable without additional assumptions. Given the increasing availability of diverse data sources, exploiting data linkage offers a potential solution for mitigating unmeasured confounding within a primary study of interest. However, this approach often introduces selection bias, as data linkage is feasible only for a subset of the study population. To address this concern, we explore three nonparametric identification strategies under the assumption that a unit's inclusion in the linked cohort is determined solely by the observed confounders, while acknowledging that the ignorability assumption may depend on some partially unobserved covariates. The existence of multiple identification strategies motivates the development of estimators that effectively capture distinct components of the observed data distribution. Appropriately combining these estimators yields triply robust estimators of the average treatment effect, which remain consistent if at least one of the three distinct parts of the observed data law is correctly specified. Moreover, they are locally efficient if all the models are correctly specified. We evaluate the proposed estimators through simulation studies and a real data analysis.
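The triply robust construction extends the familiar doubly robust (AIPW) estimator by additionally modeling the selection into the linked cohort. As a building-block sketch (standard AIPW with illustrative argument names, not the paper's three-part estimator):

```python
import numpy as np

def aipw_ate(y, a, mu1, mu0, e):
    """Augmented IPW estimate of the average treatment effect.
    y: outcomes; a: binary treatment indicator;
    mu1, mu0: fitted outcome regressions E[Y|A=1,X], E[Y|A=0,X];
    e: fitted propensity score P(A=1|X).
    Consistent if either the outcome model or the propensity model
    is correctly specified (double robustness)."""
    psi1 = mu1 + a * (y - mu1) / e
    psi0 = mu0 + (1 - a) * (y - mu0) / (1 - e)
    return np.mean(psi1 - psi0)
```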
In this paper, we address the challenge of Nash equilibrium (NE) seeking in non-cooperative convex games with partial-decision information. We propose a distributed algorithm in which each agent refines its strategy through projected-gradient steps and an averaging procedure, using estimates of its competitors' actions obtained solely from local neighbor interactions over a directed communication network. Unlike previous approaches that rely on (strong) monotonicity assumptions, this work establishes convergence to a NE under a diagonal dominance property of the pseudo-gradient mapping that can be checked locally by the agents. Furthermore, this condition is physically interpretable and relevant for many applications, as it indicates that an agent's objective function is primarily influenced by its own strategic decisions rather than by the actions of its competitors. By virtue of a novel block-infinity-norm convergence argument, we provide explicit bounds on the constant step size that are independent of the communication structure and can be computed in a fully decentralized way. Numerical simulations on an optical network's power control problem validate the algorithm's effectiveness.
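A minimal sketch of the scheme on a toy diagonally dominant quadratic game: each agent averages its neighbors' estimates of the joint action (consensus) and then takes a projected-gradient step on its own coordinate. The game, graph, and step size below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                     # number of agents
q = rng.uniform(2.0, 3.0, N)              # own-action curvature
c = rng.uniform(-1.0, 1.0, N)
eps = 0.2                                 # weak coupling => diagonal dominance
lb, ub = -2.0, 2.0                        # local action constraints

# row-stochastic weights for a directed ring (strongly connected)
W = 0.5 * np.eye(N) + 0.5 * np.roll(np.eye(N), 1, axis=1)

def grad_i(i, x_est):
    # dJ_i/dx_i with J_i = 0.5*q_i*x_i^2 + x_i*(c_i + eps*sum_{j!=i} x_j)
    return q[i] * x_est[i] + c[i] + eps * (x_est.sum() - x_est[i])

Z = rng.uniform(lb, ub, (N, N))           # Z[i] = agent i's estimate of x
step = 0.1
for _ in range(2000):
    Z = W @ Z                             # consensus averaging over in-neighbors
    for i in range(N):                    # projected gradient on own coordinate
        Z[i, i] = np.clip(Z[i, i] - step * grad_i(i, Z[i]), lb, ub)

# interior NE solves (diag(q) + eps*(11' - I)) x = -c; compare with np.diag(Z)
x_star = np.linalg.solve(np.diag(q) + eps * (np.ones((N, N)) - np.eye(N)), -c)
```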
Several fundamental problems in science and engineering consist of global optimization tasks involving unknown high-dimensional (black-box) functions that map a set of controllable variables to the outcomes of an expensive experiment. Bayesian Optimization (BO) techniques are known to be effective in tackling global optimization problems using a relatively small number of objective function evaluations, but their performance suffers when dealing with high-dimensional outputs. To overcome the major challenge of dimensionality, here we propose a deep learning framework for BO and sequential decision making based on bootstrapped ensembles of neural architectures with randomized priors. Using appropriate architecture choices, we show that the proposed framework can approximate functional relationships between design variables and quantities of interest, even in cases where the latter take values in high-dimensional vector spaces or even infinite-dimensional function spaces. In the context of BO, we augment the proposed probabilistic surrogates with re-parameterized Monte Carlo approximations of multiple-point (parallel) acquisition functions, as well as methodological extensions for accommodating black-box constraints and multi-fidelity information sources. We test the proposed framework against state-of-the-art methods for BO and demonstrate superior performance across several challenging tasks with high-dimensional outputs, including a constrained multi-fidelity optimization task involving shape optimization of rotor blades in turbo-machinery.
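A minimal sketch of a bootstrapped ensemble with randomized priors, using random Fourier features in place of deep networks to stay self-contained: each member trains a head on a bootstrap resample on top of a frozen random prior, and the ensemble spread serves as epistemic uncertainty for an acquisition function. All names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def rff(x, Wf, b):
    """Random Fourier features: a cheap stand-in for a deep network."""
    return np.cos(x @ Wf + b)

class RandomizedPriorMember:
    """Trainable ridge head g(x) plus a frozen random prior p(x); the
    member predicts g(x) + p(x), trained on a bootstrap resample."""
    def __init__(self, d, m=200, beta=1.0, lam=1e-3):
        self.Wf = rng.normal(size=(d, m))
        self.b = rng.uniform(0, 2 * np.pi, m)
        self.wp = beta * rng.normal(size=m) / np.sqrt(m)  # frozen prior weights
        self.lam = lam

    def prior(self, X):
        return rff(X, self.Wf, self.b) @ self.wp

    def fit(self, X, y):
        idx = rng.integers(0, len(X), len(X))             # bootstrap resample
        Phi = rff(X[idx], self.Wf, self.b)
        r = y[idx] - self.prior(X[idx])                   # fit residual to prior
        A = Phi.T @ Phi + self.lam * np.eye(Phi.shape[1])
        self.w = np.linalg.solve(A, Phi.T @ r)

    def predict(self, X):
        return rff(X, self.Wf, self.b) @ self.w + self.prior(X)

# Ensemble spread acts as epistemic uncertainty for a BO acquisition function.
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
members = [RandomizedPriorMember(d=1) for _ in range(10)]
for mbr in members:
    mbr.fit(X, y)
Xq = np.linspace(-3, 3, 200)[:, None]
preds = np.stack([mbr.predict(Xq) for mbr in members])
mean, std = preds.mean(axis=0), preds.std(axis=0)         # surrogate posterior
```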
This paper introduces a comprehensive framework for adjusting a discrete test statistic to improve its hypothesis testing procedure. The adjustment minimizes the Wasserstein distance to a null-approximating continuous distribution, tackling some fundamental challenges inherent in combining statistical significances derived from discrete distributions. The related theory justifies Lancaster's mid-p and mean-value chi-squared statistics for Fisher's combination as special cases. To counter the conservative nature of Lancaster's testing procedures, however, we propose an updated null-approximating distribution, obtained by further minimizing the Wasserstein distance to the adjusted statistics within a proper distribution family. Specifically, in the context of Fisher's combination, we propose an optimal gamma distribution as a substitute for the traditionally used chi-squared distribution. This new approach yields an asymptotically consistent test that significantly improves type I error control and enhances statistical power.
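To make the two ingredients concrete, the sketch below computes Lancaster's mid-p for a discrete statistic and calibrates Fisher's combination against a gamma distribution fitted by moment matching to simulated null draws; the paper's gamma is instead chosen by Wasserstein minimization, so this is a simplified stand-in.

```python
import numpy as np
from scipy import stats

def mid_p(x, support, probs):
    """Lancaster mid-p for a discrete statistic X observed at x:
    P(X > x) + 0.5 * P(X = x), given the null pmf (support, probs)."""
    support, probs = np.asarray(support), np.asarray(probs)
    return probs[support > x].sum() + 0.5 * probs[support == x].sum()

def gamma_calibrated_fisher(mid_ps, null_draws):
    """Fisher combination T = -2 * sum(log p) of mid-p values, referred
    to a gamma null fitted by matching the mean and variance of
    null_draws (draws of T simulated under the null)."""
    T = -2.0 * np.sum(np.log(mid_ps))
    m, v = null_draws.mean(), null_draws.var()
    shape, scale = m**2 / v, v / m
    return T, stats.gamma.sf(T, a=shape, scale=scale)
```

Here `null_draws` would be generated by simulating the combined statistic from the known discrete null distributions of the individual tests.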
Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It first leverages the multivariate responses to separate marginal and uncorrelated confounding effects, recovering the column space of the confounding coefficients. Subsequently, latent factors and primary effects are jointly estimated, utilizing $\ell_1$-regularization for sparsity while imposing orthogonality on the confounding coefficients. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish identification conditions for the various effects along with non-asymptotic error bounds, and we show effective Type-I error control of asymptotic $z$-tests as the sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate via the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting for confounding effects when significant covariates are absent from the model.
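As a simplified linear-model analogue of the three-stage idea (illustrative only, not the paper's GLM estimator), one can estimate latent confounders from a residual SVD and re-fit with them included:

```python
import numpy as np

def factor_adjusted_effects(Y, X, k):
    """Toy linear analogue of confounder adjustment:
    1) regress multivariate responses Y (n x q) on observed X (n x p);
    2) take the top-k left singular vectors of the residuals as
       estimated latent factors;
    3) re-fit with [X, factors] and return the primary effects."""
    B = np.linalg.lstsq(X, Y, rcond=None)[0]
    R = Y - X @ B
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    F = U[:, :k] * s[:k]                  # estimated confounders (n x k)
    B_adj = np.linalg.lstsq(np.hstack([X, F]), Y, rcond=None)[0]
    return B_adj[: X.shape[1]]            # confounder-adjusted effects of X
```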
We establish tight bi-Lipschitz bounds certifying quasi-universality (universality up to a constant factor) for various distances between Reeb graphs: the interleaving distance, the functional distortion distance, and the functional contortion distance. The definition of the latter distance is a novel contribution, and for the special case of contour trees we also prove strict universality of this distance. Furthermore, we prove that for the special case of merge trees the functional contortion distance coincides with the interleaving distance, yielding universality of all four distances in this case.
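In schematic terms (notation assumed here, not the paper's), quasi-universality amounts to a bi-Lipschitz equivalence between a given distance $d'$ and a universal distance $d$ on Reeb graphs:

```latex
% bi-Lipschitz bounds with constant c >= 1 certify quasi-universality
d(R_f, R_g) \;\le\; d'(R_f, R_g) \;\le\; c \, d(R_f, R_g)
\qquad \text{for all Reeb graphs } R_f,\, R_g,
% with strict universality corresponding to c = 1.
```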
This paper explores the generalization characteristics of iterative learning algorithms with bounded updates for non-convex loss functions, employing information-theoretic techniques. Our key contribution is a novel bound on the generalization error of these algorithms, extending beyond the scope of previous works that focused only on Stochastic Gradient Descent (SGD). Our approach introduces two main novelties: 1) we reformulate the mutual information as the uncertainty of updates, providing a new perspective, and 2) instead of using the chaining rule of mutual information, we employ a variance decomposition technique to decompose information across iterations, allowing for a simpler surrogate process. We analyze our generalization bound under various settings and demonstrate improved bounds when the model dimension increases at the same rate as the number of training data samples. To bridge the gap between theory and practice, we also examine the previously observed scaling behavior in large language models. Ultimately, our work takes a further step toward developing practical generalization theories.
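For context, a touchstone in this line of work is the classical information-theoretic generalization bound of Xu and Raginsky (2017): for a $\sigma$-sub-Gaussian loss, a training set $S$ of $n$ i.i.d. samples, and learned weights $W$,

```latex
\bigl| \mathbb{E}\!\left[ \operatorname{gen}(S, W) \right] \bigr|
\;\le\; \sqrt{ \frac{2\sigma^{2}}{n}\, I(S; W) },
```

where $I(S;W)$ is the mutual information between the data and the output hypothesis; the bounded-update analysis refines how this information term is controlled across iterations.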
While several previous studies have devised methods for the segmentation of polyps, most of these methods are not rigorously assessed on multi-center datasets. Variability in polyp appearance from one center to another, differences in endoscopic instrument grades, and acquisition quality result in methods that perform well on in-distribution test data but poorly on out-of-distribution or underrepresented samples. Unfair models have serious implications and pose a critical challenge to clinical applications. We adapt an implicit bias mitigation method that leverages Bayesian epistemic uncertainties during training to encourage the model to focus on underrepresented sample regions. We demonstrate the potential of this approach to improve generalisability without sacrificing state-of-the-art performance on a challenging multi-center polyp segmentation dataset (PolypGen) with different centers and image modalities.
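One plausible instantiation of the idea (illustrative, not necessarily the adapted method's exact weighting): reweight the per-pixel loss by predictive entropy estimated from stochastic forward passes, so that uncertain, often underrepresented, regions contribute more to training.

```python
import numpy as np

def uncertainty_weighted_ce(mc_probs, targets, eps=1e-8):
    """Per-pixel cross-entropy reweighted by epistemic uncertainty.
    mc_probs: (T, N, C) softmax outputs from T stochastic forward
    passes (e.g. MC dropout); targets: (N,) integer class labels."""
    p_bar = mc_probs.mean(axis=0)                          # (N, C) mean prediction
    entropy = -(p_bar * np.log(p_bar + eps)).sum(axis=1)   # predictive entropy
    w = entropy / (entropy.mean() + eps)                   # normalized weights
    ce = -np.log(p_bar[np.arange(len(targets)), targets] + eps)
    return (w * ce).mean()
```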
This study focuses on the use of model and data fusion for improving the Spalart-Allmaras (SA) closure model for Reynolds-averaged Navier-Stokes solutions of separated flows. In particular, our goal is to develop models that not only assimilate sparse experimental data to improve performance in computational models, but also generalize to unseen cases by recovering classical SA behavior. We achieve these goals using data assimilation, namely the ensemble Kalman filter (EnKF), to calibrate the coefficients of the SA model for separated flows. A holistic calibration strategy is implemented via a parameterization of the production, diffusion, and destruction terms. This calibration relies on the assimilation of experimental data comprising velocity profiles, skin friction, and pressure coefficients for separated flows. Despite using observational data from a single flow condition around a backward-facing step (BFS), the recalibrated SA model demonstrates generalization to other separated flows, including cases such as the 2D bump and modified BFS. Significant improvement is observed in the quantities of interest, i.e., the skin friction coefficient ($C_f$) and pressure coefficient ($C_p$), for each flow tested. Finally, we demonstrate that the newly proposed model recovers SA proficiency for external, unseparated flows, such as the flow around a NACA-0012 airfoil, without any danger of extrapolation, and that the individually calibrated terms in the SA model target specific flow physics: the calibrated production term improves the recirculation zone, while the destruction term improves the recovery zone.
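A minimal sketch of one stochastic EnKF analysis step for coefficient calibration; the array names and the perturbed-observation variant are illustrative assumptions. In practice, `obs_ens` would come from RANS solves of the BFS case, one per ensemble member.

```python
import numpy as np

def enkf_update(theta_ens, obs_ens, y, obs_noise_cov, seed=0):
    """One stochastic EnKF analysis step for parameter calibration.
    theta_ens: (Ne, p) ensemble of model coefficients;
    obs_ens:   (Ne, m) model-predicted observables (e.g. Cf, Cp profiles)
               for each member; y: (m,) experimental data."""
    Ne = theta_ens.shape[0]
    dtheta = theta_ens - theta_ens.mean(axis=0)
    dobs = obs_ens - obs_ens.mean(axis=0)
    C_tg = dtheta.T @ dobs / (Ne - 1)          # parameter-observable covariance
    C_gg = dobs.T @ dobs / (Ne - 1) + obs_noise_cov
    K = C_tg @ np.linalg.inv(C_gg)             # Kalman gain
    rng = np.random.default_rng(seed)          # perturbed observations
    y_pert = y + rng.multivariate_normal(np.zeros(len(y)), obs_noise_cov, Ne)
    return theta_ens + (y_pert - obs_ens) @ K.T
```

Iterating this update while re-running the forward model pulls the coefficient ensemble toward values consistent with the assimilated measurements.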