Developing an efficient computational scheme for high-dimensional Bayesian variable selection in generalised linear models and survival models has always been a challenging problem due to the absence of closed-form solutions for the marginal likelihood. The RJMCMC approach can be employed to sample models and coefficients jointly, but effective design of its transdimensional jumps can be challenging, making it hard to implement. Alternatively, the marginal likelihood can be derived using a data-augmentation scheme (e.g. Pólya-Gamma data augmentation for logistic regression) or through other estimation methods. However, suitable data-augmentation schemes are not available for every generalised linear model and survival model, and using estimation methods such as the Laplace approximation or the correlated pseudo-marginal approach to derive the marginal likelihood within a locally informed proposal can be computationally expensive in "large n, large p" settings. In this paper, three main contributions are presented. Firstly, we present an extended Point-wise implementation of Adaptive Random Neighbourhood Informed proposal (PARNI) to efficiently sample models directly from the marginal posterior distribution in both generalised linear models and survival models. Secondly, in light of the approximate Laplace approximation, we describe an efficient and accurate estimation method for the marginal likelihood that involves adaptive parameters. Thirdly, we describe a new method to adapt the algorithmic tuning parameters of the PARNI proposal by replacing the Rao-Blackwellised estimates with a combination of a warm-start estimate and an ergodic average. We present numerous numerical results from simulated data and eight high-dimensional gene fine-mapping datasets to showcase the efficiency of the novel PARNI proposal compared to the baseline add-delete-swap proposal.
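As a rough illustration of the kind of marginal-likelihood approximation such samplers rely on (not the adaptive estimator developed in the paper), the following is a minimal sketch of a Laplace approximation for a logistic regression submodel with a Gaussian ridge prior; the prior variance, toy data, and function names are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def log_marginal_laplace(X, y, tau2=1.0):
    """Laplace approximation to log p(y | model) for Bayesian logistic
    regression with a N(0, tau2 I) prior on the included coefficients.
    Illustrative sketch only, not the estimator used in the paper."""
    n, p = X.shape

    def neg_log_post(beta):
        eta = X @ beta
        loglik = y @ eta - np.logaddexp(0.0, eta).sum()      # logistic log-likelihood
        logprior = -0.5 * beta @ beta / tau2 - 0.5 * p * np.log(2 * np.pi * tau2)
        return -(loglik + logprior)

    fit = minimize(neg_log_post, np.zeros(p), method="BFGS")
    beta_hat = fit.x
    w = 1.0 / (1.0 + np.exp(-X @ beta_hat))
    # Hessian of the negative log-posterior at the mode
    H = X.T @ (X * (w * (1 - w))[:, None]) + np.eye(p) / tau2
    _, logdet = np.linalg.slogdet(H)
    # log p(y) ~ log p(y, beta_hat) + (p/2) log(2 pi) - (1/2) log |H|
    return -fit.fun + 0.5 * p * np.log(2 * np.pi) - 0.5 * logdet

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([1.5, 0.0, -1.0]))))
print(log_marginal_laplace(X, y))              # full model
print(log_marginal_laplace(X[:, [0, 2]], y))   # submodel with variables 1 and 3
```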
We propose a method to detect model misspecifications in nonlinear causal additive and potentially heteroscedastic noise models. We aim to identify predictor variables for which we can infer the causal effect even in cases of such misspecification. We first develop a general framework based on knowledge of the multivariate observational data distribution; we then propose an algorithm for finite samples, discuss its asymptotic properties, and illustrate its performance on simulated and real data.
In variable selection, a selection rule that prescribes the permissible sets of selected variables (which together form a "selection dictionary") is desirable due to the inherent structural constraints among the candidate variables. Such selection rules can be complex in real-world data analyses, and failing to incorporate them can not only compromise the interpretability of the model but also lead to decreased prediction accuracy. However, no general framework has been proposed to formalize selection rules and their applications, which poses a significant challenge for practitioners seeking to integrate these rules into their analyses. In this work, we establish a framework for structured variable selection that can incorporate universal structural constraints. We develop a mathematical language for constructing arbitrary selection rules, in which the selection dictionary is formally defined. We demonstrate that all selection rules can be expressed as combinations of operations on constructs, facilitating the identification of the corresponding selection dictionary. Once this selection dictionary is derived, practitioners can apply their own user-defined criteria to select the optimal model. Additionally, our framework enhances existing penalized regression methods for variable selection by providing guidance on how to appropriately group variables to achieve a desired selection rule. Furthermore, our framework opens the door to new l0-norm-based penalized regression techniques that can be tailored to respect arbitrary selection rules, thereby expanding the possibilities for more robust and tailored model development.
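As a toy illustration of what a selection dictionary is (the rule and variable names below are hypothetical and not taken from the paper), consider three candidates where an interaction term may only be selected together with both of its main effects; the dictionary is then the collection of all subsets respecting that rule.

```python
from itertools import chain, combinations

candidates = ["x1", "x2", "x1x2"]

def respects_rule(subset):
    """Hypothetical selection rule: the interaction 'x1x2' may only be
    selected together with both of its main effects."""
    return "x1x2" not in subset or {"x1", "x2"} <= set(subset)

# The selection dictionary is the collection of all permissible subsets.
all_subsets = chain.from_iterable(
    combinations(candidates, r) for r in range(len(candidates) + 1)
)
selection_dictionary = [frozenset(s) for s in all_subsets if respects_rule(s)]
for s in sorted(selection_dictionary, key=len):
    print(set(s) or "{}")
```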
Meta-analysis is the aggregation of data from multiple studies to identify patterns across a broad range of studies on a particular subject. It is becoming increasingly useful for summarizing studies conducted across various fields. In meta-analysis, it is common to use the mean and standard deviation from each study for comparison. While many studies report the mean and standard deviation as their summary statistics, some report other values, including the minimum, maximum, median, and first and third quartiles. The median and quartiles are often reported when the data are skewed and do not follow a normal distribution. In order to correctly summarize the data and draw conclusions from multiple studies, it is necessary to estimate the mean and standard deviation from each study while accounting for the variation and skewness within each study. Methods have been proposed in past literature to estimate the mean and standard deviation, but they do not accommodate negative values. Data that include negative values are common, and handling them would increase the accuracy and impact of the meta-analysis. We propose a method that implements a generalized Box-Cox transformation to estimate the mean and standard deviation, accommodating such negative values while maintaining similar accuracy.
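The sketch below illustrates the general idea of recovering a mean and standard deviation from reported quartiles via a shifted Box-Cox transform; the shift, the choice of lambda, and the IQR/1.349 normality rule are our own simplifying assumptions, not the authors' estimator.

```python
import numpy as np

def boxcox_shifted(x, lam, shift):
    """Box-Cox transform applied after shifting values to be positive."""
    z = np.asarray(x, dtype=float) + shift
    return np.log(z) if lam == 0 else (z**lam - 1.0) / lam

def inv_boxcox_shifted(y, lam, shift):
    """Inverse transform; values outside the transform's range are clipped."""
    z = np.exp(y) if lam == 0 else np.maximum(lam * y + 1.0, 1e-12) ** (1.0 / lam)
    return z - shift

def mean_sd_from_quartiles(q1, med, q3, lam=0.5, n_sim=100_000, seed=0):
    """Rough estimate of the mean and SD of a skewed sample reported only via
    its quartiles, using a shifted Box-Cox transform.  Illustrative sketch,
    not the estimator proposed in the paper."""
    shift = max(0.0, -2.0 * q1)                     # crude shift to ensure positivity
    tq1, tmed, tq3 = boxcox_shifted([q1, med, q3], lam, shift)
    mu_t, sd_t = tmed, (tq3 - tq1) / 1.349          # assume normality on transformed scale
    draws = inv_boxcox_shifted(
        np.random.default_rng(seed).normal(mu_t, sd_t, n_sim), lam, shift)
    return draws.mean(), draws.std(ddof=1)

print(mean_sd_from_quartiles(q1=-1.2, med=0.4, q3=3.1))
```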
Diffusion models have become a main paradigm for synthetic data generation in many subfields of modern machine learning, including computer vision, language modelling, and speech synthesis. In this paper, we leverage the power of diffusion models to generate synthetic tabular data. The heterogeneous features of tabular data have been a main obstacle to tabular data synthesis, and we tackle this problem by employing an auto-encoder architecture. When compared with state-of-the-art tabular synthesizers, the synthetic tables produced by our model show good statistical fidelity to the real data and perform well in downstream machine learning tasks. We conducted experiments on 15 publicly available datasets. Notably, our model adeptly captures the correlations among features, which has been a long-standing challenge in tabular data synthesis. Our code is available upon request and will be publicly released if the paper is accepted.
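A minimal sketch of the general idea of homogenizing mixed-type tabular features with an auto-encoder so that a diffusion model can be trained in a continuous latent space follows; the architecture, layer sizes, losses, and toy data are illustrative assumptions rather than the model described in the paper.

```python
import torch
import torch.nn as nn

class TabularAutoEncoder(nn.Module):
    """Illustrative auto-encoder mapping rows with numeric columns and one
    categorical column to a continuous latent space on which a diffusion
    model could subsequently be trained."""
    def __init__(self, n_num, n_cat, d_latent=16, d_hidden=64):
        super().__init__()
        self.cat_emb = nn.Embedding(n_cat, 8)
        self.encoder = nn.Sequential(
            nn.Linear(n_num + 8, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_latent))
        self.dec_body = nn.Sequential(nn.Linear(d_latent, d_hidden), nn.ReLU())
        self.num_head = nn.Linear(d_hidden, n_num)   # reconstruct numeric columns
        self.cat_head = nn.Linear(d_hidden, n_cat)   # logits for the categorical column

    def forward(self, x_num, x_cat):
        z = self.encoder(torch.cat([x_num, self.cat_emb(x_cat)], dim=-1))
        h = self.dec_body(z)
        return z, self.num_head(h), self.cat_head(h)

# toy training loop on random data
torch.manual_seed(0)
x_num = torch.randn(256, 4)
x_cat = torch.randint(0, 3, (256,))
model = TabularAutoEncoder(n_num=4, n_cat=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    z, num_rec, cat_logits = model(x_num, x_cat)
    loss = nn.functional.mse_loss(num_rec, x_num) + \
           nn.functional.cross_entropy(cat_logits, x_cat)
    opt.zero_grad(); loss.backward(); opt.step()
# a diffusion model would then be fit to the latent codes z.detach()
```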
We propose a simple method for simulating a general class of non-unitary dynamics as a linear combination of Hamiltonian simulation (LCHS) problems. LCHS does not rely on converting the problem into a dilated linear system problem, or on the spectral mapping theorem. The latter is the mathematical foundation of many quantum algorithms for solving a wide variety of tasks involving non-unitary processes, such as the quantum singular value transformation (QSVT). The LCHS method can achieve optimal cost in terms of state preparation. We also demonstrate an application for open quantum dynamics simulation using the complex absorbing potential method with near-optimal dependence on all parameters.
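For context, the identity underlying LCHS, as we recall it from the LCHS literature and stated here only in the time-independent case, writes a contraction semigroup as a weighted integral of Hamiltonian evolutions: with $A = L + iH$ and $L \succeq 0$,
$$ e^{-At} \;=\; \int_{\mathbb{R}} \frac{1}{\pi(1+k^2)}\, e^{-i(kL + H)t}\, \mathrm{d}k, \qquad t \ge 0, $$
which in the scalar (commuting) case reduces to the elementary Fourier representation $e^{-x} = \int_{\mathbb{R}} \frac{e^{-ikx}}{\pi(1+k^2)}\,\mathrm{d}k$ for $x \ge 0$. The precise kernel and the time-ordered generalization should be taken from the paper itself.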
Derivatives are a key nonparametric functional in wide-ranging applications where the rate of change of an unknown function is of interest. In the Bayesian paradigm, Gaussian processes (GPs) are routinely used as a flexible prior for unknown functions, and are arguably one of the most popular tools in many areas. However, little is known about the optimal modelling strategy and theoretical properties when using GPs for derivatives. In this article, we study a plug-in strategy by differentiating the posterior distribution with GP priors for derivatives of any order. This practically appealing plug-in GP method has previously been perceived as suboptimal and degraded, but this is not necessarily the case. We provide posterior contraction rates for plug-in GPs and establish that they remarkably adapt to derivative orders. We show that the posterior measure of the regression function and its derivatives, with the same choice of hyperparameter that does not depend on the order of derivatives, converges at the minimax optimal rate up to a logarithmic factor for functions in certain classes. We analyze a data-driven hyperparameter tuning method based on empirical Bayes, and show that it satisfies the optimal rate condition while maintaining computational efficiency. To the best of our knowledge, this article provides the first positive result for plug-in GPs in the context of inferring derivative functionals, and it leads to a practically simple nonparametric Bayesian method with optimal and adaptive hyperparameter tuning for simultaneously estimating the regression function and its derivatives. Simulations show competitive finite sample performance of the plug-in GP method. A climate change application for analyzing the global sea-level rise is discussed.
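The plug-in idea is simple to state: since differentiation is a linear operation, differentiating the GP posterior mean amounts to differentiating the kernel in its first argument. The sketch below illustrates this for first derivatives under a squared-exponential kernel; the kernel choice, hyperparameters, and toy data are illustrative assumptions, not the paper's tuning procedure.

```python
import numpy as np

def se_kernel(a, b, ell=0.3):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def se_kernel_dx(a, b, ell=0.3):
    """Derivative of the squared-exponential kernel in its first argument."""
    d = a[:, None] - b[None, :]
    return -(d / ell**2) * np.exp(-0.5 * (d / ell) ** 2)

# toy data: noisy observations of sin(2*pi*x)
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 80))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(80)
sigma2, xs = 0.01, np.linspace(0, 1, 200)

alpha = np.linalg.solve(se_kernel(x, x) + sigma2 * np.eye(80), y)
f_mean = se_kernel(xs, x) @ alpha        # plug-in posterior mean of f
df_mean = se_kernel_dx(xs, x) @ alpha    # plug-in posterior mean of f'
print(np.max(np.abs(df_mean - 2 * np.pi * np.cos(2 * np.pi * xs))))
```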
A numerical method is proposed for the simulation of composite open quantum systems. It is based on Lindblad master equations and adiabatic elimination. Each subsystem is assumed to converge exponentially towards a stationary subspace, to be slightly impacted by some decoherence channels, and to be weakly coupled to the other subsystems. The numerical method relies on a perturbation analysis with an asymptotic expansion, exploits a reduced-dimension formulation of the slow dynamics, and builds on the invariant operators of the local, nominal dissipative dynamics attached to each subsystem. The second-order expansion can be computed using only local numerical calculations, avoiding any computation on the tensor-product Hilbert space attached to the full system. This numerical method is particularly well suited to autonomous quantum error correction schemes. Simulations of such reduced models agree with complete full-model simulations for typical gates acting on one and two cat-qubits (Z, ZZ and CNOT) when the mean photon number of each cat-qubit is less than 8. For larger mean photon numbers and for gates with three cat-qubits (ZZZ and CCNOT), full-model simulations are almost impossible, whereas reduced-model simulations remain accessible. In particular, they capture both the dominant phase-flip error rate and the very small bit-flip error rate with its exponential suppression versus the mean photon number.
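For reference, the Lindblad master equations in question take the standard GKSL form (the notation below is generic, with $\hbar = 1$, and is not the paper's specific model):
$$ \frac{\mathrm{d}\rho}{\mathrm{d}t} \;=\; -i[H,\rho] \;+\; \sum_k \Big( L_k \rho L_k^\dagger - \tfrac{1}{2}\{L_k^\dagger L_k, \rho\} \Big). $$
For a dissipatively stabilised cat-qubit, the dominant stabilising channel is typically the two-photon dissipator $L = \sqrt{\kappa_2}\,(a^2 - \alpha^2)$, which confines the slow dynamics to the span of the coherent states $|\pm\alpha\rangle$ with mean photon number $|\alpha|^2$; adiabatic elimination of the fast decay towards this subspace is what yields the reduced models simulated here.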
Models of complex technological systems inherently contain interactions and dependencies among their input variables that affect their joint influence on the output. Such models are often computationally expensive, and few sensitivity analysis methods can effectively process such complexities. Moreover, the sensitivity analysis field as a whole pays limited attention to the nature of interaction effects, whose understanding can prove critical for the design of safe and reliable systems. In this paper, we introduce and extensively test a simple binning approach for computing sensitivity indices and demonstrate how complementing it with the smart visualization method, simulation decomposition (SimDec), can yield important insights into the behavior of complex engineering models. The simple binning approach computes first- and second-order effects as well as a combined sensitivity index, and it is considerably more computationally efficient than Sobol' indices. Taken together, the sensitivity analysis framework provides an efficient and intuitive way to analyze the behavior of complex systems containing interactions and dependencies.
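A minimal sketch of a binning estimator for a first-order sensitivity index is given below, to illustrate the general idea: the conditional expectation E[Y | X_i] is approximated by the mean of the output within equal-frequency bins of X_i, and the index is the variance of these bin means relative to the total output variance. The bin count and the toy model are our own assumptions, not necessarily the exact estimator of the paper.

```python
import numpy as np

def first_order_binned(x, y, n_bins=20):
    """First-order sensitivity index of one input via binning:
    S_i ~= Var( E[Y | X_i] ) / Var(Y), with E[Y | X_i] approximated by
    the mean of Y within equal-frequency bins of X_i."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    bin_means = np.array([y[bins == b].mean() for b in range(n_bins)])
    bin_weights = np.array([(bins == b).mean() for b in range(n_bins)])
    return np.sum(bin_weights * (bin_means - y.mean()) ** 2) / y.var()

# toy model with an interaction: Y = X1 + X2 + 2*X1*X3
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100_000, 3))
Y = X[:, 0] + X[:, 1] + 2 * X[:, 0] * X[:, 2]
# X3 acts only through its interaction with X1, so its first-order index is near zero
print([round(first_order_binned(X[:, i], Y), 3) for i in range(3)])
```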
We present the conditional determinantal point process (DPP) approach to obtain new (mostly Fredholm determinantal) expressions for various eigenvalue statistics in random matrix theory. It is well known that many (especially $\beta=2$) eigenvalue $n$-point correlation functions are given in terms of $n\times n$ determinants, i.e., they are continuous DPPs. We exploit a derived kernel of the conditional DPP which gives the $n$-point correlation function conditioned on the event that some eigenvalues already exist at fixed locations. Using such kernels we obtain new determinantal expressions for the joint densities of the $k$ largest eigenvalues, the probability density function of the $k^\text{th}$ largest eigenvalue, the density of the first eigenvalue spacing, and more. Our formulae are highly amenable to numerical computation, and we provide various numerical experiments. Several numerical values that previously required hours of computing time can now be computed in seconds with our expressions, demonstrating the effectiveness of our approach. We also demonstrate that our technique can be applied to efficient sampling of DR paths of the Aztec diamond domino tiling. Further extending the conditional DPP sampling technique, we sample Airy processes from the extended Airy kernel. Additionally, we propose a sampling method for non-Hermitian projection DPPs.
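To illustrate the kind of Fredholm-determinant numerics such expressions feed into (this is the classical largest-eigenvalue distribution evaluated by standard quadrature, not one of the new formulae of the paper), the sketch below approximates $F_2(s)=\det(I-K_{\mathrm{Ai}})|_{L^2(s,\infty)}$ with Gauss-Legendre nodes; the truncation length and node count are our own assumptions.

```python
import numpy as np
from scipy.special import airy

def tracy_widom_gue_cdf(s, n=60, cutoff=12.0):
    """F2(s) = det(I - K_Airy) on L^2(s, infinity), via Gauss-Legendre
    quadrature on the truncated interval (s, s + cutoff)."""
    nodes, weights = np.polynomial.legendre.leggauss(n)
    x = s + 0.5 * cutoff * (nodes + 1.0)      # map nodes from (-1, 1) to (s, s + cutoff)
    w = 0.5 * cutoff * weights

    ai, aip, _, _ = airy(x)
    diff = x[:, None] - x[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        K = (ai[:, None] * aip[None, :] - aip[:, None] * ai[None, :]) / diff
    # diagonal by l'Hopital: K(x, x) = Ai'(x)^2 - x Ai(x)^2
    np.fill_diagonal(K, aip**2 - x * ai**2)

    M = np.sqrt(w)[:, None] * K * np.sqrt(w)[None, :]
    return np.linalg.det(np.eye(n) - M)

print(tracy_widom_gue_cdf(-3.0), tracy_widom_gue_cdf(0.0))
```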
Differential geometric approaches are ubiquitous in several fields of mathematics, physics and engineering, and their discretizations enable the development of network-based mathematical and computational frameworks, which are essential for large-scale data science. The Forman-Ricci curvature (FRC) - a statistical measure based on Riemannian geometry and designed for networks - is known for its high capacity for extracting geometric information from complex networks. However, extracting information from dense networks remains challenging due to the combinatorial explosion of high-order network structures. Motivated by this challenge, we develop a set-theoretic representation theory for high-order network cells and FRC, together with their associated concepts and properties, which provides an alternative and efficient formulation for computing high-order FRC in complex networks. We provide pseudo-code, a software implementation coined FastForman, and a benchmark comparison with alternative implementations. Crucially, our representation theory reveals previous computational bottlenecks and accelerates the computation of FRC. As a consequence, our findings open new research possibilities in complex systems where higher-order geometric computations are required.
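For orientation, the simplest (edge-level, unit-weight, no higher-order cells) form of the Forman-Ricci curvature reduces to F(u, v) = 4 - deg(u) - deg(v); the paper's contribution concerns the efficient computation of the high-order, cell-augmented version, which this sketch does not attempt. The example graph is arbitrary.

```python
import networkx as nx

def forman_ricci_edge(G):
    """Edge-level Forman-Ricci curvature of an unweighted graph without
    higher-order cells: F(u, v) = 4 - deg(u) - deg(v).  Only the simplest
    case; not the high-order FRC computed by FastForman."""
    return {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}

G = nx.karate_club_graph()
frc = forman_ricci_edge(G)
print(min(frc.values()), max(frc.values()))
```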