Sample size determination for cluster randomised trials (CRTs) is challenging as it requires robust estimation of the intra-cluster correlation coefficient (ICC). Typically, the sample size is chosen to provide a certain level of power to reject the null hypothesis in a hypothesis test. This relies on the minimal clinically important difference (MCID) and estimates for the standard deviation, ICC and possibly the coefficient of variation of the cluster size. Varying these parameters can have a strong effect on the sample size; in particular, the sample size is sensitive to small differences in the ICC. A relevant ICC estimate is often not available, or the available estimate is imprecise. If the ICC used is far from the unknown true value, the trial can be substantially over- or under-powered. We propose a hybrid approach using Bayesian assurance to find the sample size for a CRT with a frequentist analysis. Assurance is an alternative to power which incorporates parameter uncertainty through a prior distribution. We suggest specifying prior distributions for the standard deviation, ICC and coefficient of variation of the cluster size, while still utilising the MCID. We illustrate the approach through the design of a CRT in post-stroke incontinence. We show that assurance can be used to find a sample size based on an elicited prior distribution for the ICC, whereas a power calculation discards all information in the prior except a single point estimate. Results show that this approach can avoid misspecified sample sizes when prior medians for the ICC are very similar but the prior distributions exhibit quite different behaviour. Assurance provides an understanding of the probability of success of a trial given an MCID and can be used to produce sample sizes that are robust to parameter uncertainty. This is especially useful when reliable parameter estimates are difficult to obtain.
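To make the assurance calculation concrete, the sketch below averages a standard design-effect power formula over draws from an elicited ICC prior. This is a minimal illustration, not the paper's procedure: the Beta prior, the MCID of 0.3 standard deviations, and the cluster configuration are all assumed for the example, and only the ICC (not the standard deviation or the coefficient of variation of cluster size) is given a prior here.

```python
import numpy as np
from scipy import stats

def power_crt(n_clusters, m, delta, sd, icc, alpha=0.05):
    """Approximate power of a two-arm CRT via the design effect."""
    deff = 1 + (m - 1) * icc               # design effect
    n_eff = n_clusters * m / deff          # effective sample size per arm
    se = sd * np.sqrt(2 / n_eff)           # SE of the difference in means
    z = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(abs(delta) / se - z)

def assurance(n_clusters, m, delta, sd, icc_prior, n_sims=100_000, rng=None):
    """Assurance: power averaged over draws from the ICC prior."""
    rng = rng or np.random.default_rng(0)
    icc = icc_prior.rvs(size=n_sims, random_state=rng)
    return power_crt(n_clusters, m, delta, sd, icc).mean()

# Illustrative values: MCID of 0.3 SD, 20 clusters of 30 per arm,
# and an elicited Beta(2, 18) prior for the ICC (prior mean 0.1).
print(assurance(20, 30, delta=0.3, sd=1.0, icc_prior=stats.beta(2, 18)))
```

Replacing `icc_prior` with a point mass recovers the ordinary power calculation, which is exactly the sense in which power discards all prior information except one estimate.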
A key tool for interpreting black-box models is the unique decomposition of square-integrable functions of non-mutually independent random inputs into a sum of functions of every possible subset of variables. However, dealing with dependencies among inputs can be complicated. We propose a novel framework to study this problem, linking three domains of mathematics: probability theory, functional analysis, and combinatorics. We show that, under two reasonable assumptions on the inputs (non-perfect functional dependence and non-degenerate stochastic dependence), it is always possible to uniquely decompose such a function. This "canonical decomposition" is relatively intuitive and unveils the linear nature of non-linear functions of non-linearly dependent inputs. In this framework, we effectively generalize the well-known Hoeffding decomposition, which can be seen as a particular case. Oblique projections of the black-box model allow for novel interpretability indices for evaluation and variance decomposition. Aside from their intuitive nature, the properties of these novel indices are studied and discussed. This result offers a path towards more precise uncertainty quantification, which can benefit sensitivity analyses and interpretability studies whenever the inputs are dependent. The decomposition is illustrated analytically, and the challenges of adopting these results in practice are discussed.
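For reference, the particular case being generalized is the classical Hoeffding decomposition for mutually independent inputs: any square-integrable $f$ admits the unique expansion

```latex
f(X_1,\dots,X_d) = \sum_{A \subseteq \{1,\dots,d\}} f_A(X_A),
\qquad
f_A(X_A) = \sum_{B \subseteq A} (-1)^{|A|-|B|}\, \mathbb{E}\!\left[f(X) \mid X_B\right],
```

where the terms are mutually orthogonal, so that $\operatorname{Var} f(X) = \sum_{A \neq \emptyset} \operatorname{Var} f_A(X_A)$ (the basis of Sobol' indices). It is exactly this orthogonality that fails when the inputs are dependent, which is where the oblique projections of the present framework come into play.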
Private synthetic data sharing is often preferred to releasing summary statistics because it preserves the distribution and nuances of the original data. State-of-the-art methods adopt a select-measure-generate paradigm, but measuring marginals over large domains still incurs substantial error, and iteratively allocating the privacy budget remains difficult. To address these issues, our method employs a partition-based approach that effectively reduces errors and improves the quality of synthetic data, even with a limited privacy budget. Results from our experiments demonstrate the superiority of our method over existing approaches. The synthetic data produced using our approach exhibits improved quality and utility, making it a preferable choice for private synthetic data sharing.
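As background on the "measure" step of the select-measure-generate paradigm, the sketch below adds calibrated Gaussian noise to a one-way marginal under zero-concentrated differential privacy (zCDP). The toy dataset, domain size, and budget value are assumptions for illustration; the partition-based error reduction of the paper is not reproduced here.

```python
import numpy as np

def measure_marginal(data, column, domain_size, rho, rng):
    """Measure a one-way marginal under rho-zCDP with the Gaussian
    mechanism: a histogram has L2 sensitivity 1 under add/remove of one
    record, so N(0, 1/(2*rho)) noise per cell suffices."""
    hist = np.bincount(data[:, column], minlength=domain_size).astype(float)
    noise = rng.normal(0.0, np.sqrt(1.0 / (2.0 * rho)), size=domain_size)
    return np.clip(hist + noise, 0, None)  # post-processing: no extra cost

# Toy data: 1,000 records, one categorical attribute with domain size 8.
rng = np.random.default_rng(0)
data = rng.integers(0, 8, size=(1000, 1))
print(measure_marginal(data, column=0, domain_size=8, rho=0.1, rng=rng))
```

The per-cell noise scale grows as the budget shrinks, which is why marginals over large domains, whose counts are spread thin across many cells, are measured with so much relative error.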
Projection-based testing for mean trajectory differences in two groups of irregularly and sparsely observed functional data has garnered significant attention in the literature because it accommodates a wide spectrum of group differences and (non-stationary) covariance structures. This article derives the theoretical power function and introduces a comprehensive power and sample size (PASS) calculation toolkit tailored to the projection-based testing method developed by Wang (2021), accommodating a wide spectrum of group-difference scenarios and a broad class of covariance structures governing the underlying processes. Through extensive numerical simulations, we demonstrate the robustness of this testing method, showing that its statistical power remains nearly unaffected even when a certain percentage of observations is missing, rendering it 'missing-immune'. Furthermore, we illustrate the practical utility of this test through analyses of two randomized controlled trials in Parkinson's disease. To facilitate implementation, we provide a user-friendly R package, fPASS, complete with a detailed vignette to guide users through its practical application. We anticipate that this article will significantly enhance the usability of this potent statistical tool across a range of biostatistical applications, with a particular focus on its relevance to the design of clinical trials.
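Simulation-based power calculations of the kind such a toolkit automates follow a common template: repeatedly simulate data under the alternative and record the rejection rate. The sketch below shows that template with deliberately simple placeholders (dense trajectories, a naive t-test on subject means); the projection-based test itself, and its handling of sparse designs, live in the authors' R package fPASS.

```python
import numpy as np
from scipy import stats

def empirical_power(simulate_data, test, n_sims=2000, alpha=0.05, rng=None):
    """Generic Monte Carlo power: fraction of simulated datasets rejected."""
    rng = rng or np.random.default_rng(0)
    rejections = 0
    for _ in range(n_sims):
        sample_a, sample_b = simulate_data(rng)
        if test(sample_a, sample_b) < alpha:
            rejections += 1
    return rejections / n_sims

# Placeholder pieces: two groups of noisy trajectories on a common grid,
# compared with a t-test on per-subject means (an assumption for this
# sketch, NOT the projection-based test of Wang (2021)).
def simulate_data(rng, n=50, t=20, effect=0.4):
    group_a = rng.normal(0.0, 1.0, size=(n, t))
    group_b = rng.normal(effect, 1.0, size=(n, t))
    return group_a, group_b

def test(a, b):
    return stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue

print(empirical_power(simulate_data, test))
```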
This work is concerned with the uniform accuracy of implicit-explicit backward differentiation formula (IMEX-BDF) schemes for general linear hyperbolic relaxation systems satisfying the structural stability condition proposed previously by the third author. We prove the uniform stability and accuracy of a class of IMEX-BDF schemes discretized spatially by a Fourier spectral method. The result reveals that the accuracy of the fully discretized schemes is independent of the relaxation time in all regimes. This is verified by numerical experiments on several applications in traffic flow, rarefied gas dynamics and kinetic theory.
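As an illustration of the scheme class (the paper treats general IMEX-BDF formulas; the notation here is generic, not the paper's), a second-order IMEX-BDF discretization of a linear relaxation system $u_t + A u_x = \frac{1}{\varepsilon} Q u$ treats the convection term explicitly by extrapolation and the stiff relaxation term implicitly:

```latex
\frac{3u^{n+1} - 4u^{n} + u^{n-1}}{2\Delta t}
+ A\,\partial_x\!\left(2u^{n} - u^{n-1}\right)
= \frac{1}{\varepsilon}\, Q u^{n+1}.
```

Uniform accuracy means the error of such schemes is bounded independently of the relaxation time $\varepsilon$, from the non-stiff regime $\varepsilon = O(1)$ down to the limit $\varepsilon \to 0$.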
In the analysis of cluster-randomized trials, mixed-model analysis of covariance (ANCOVA) is a standard approach for covariate adjustment and handling within-cluster correlations. However, when the normality, linearity, or random-intercept assumption is violated, the validity and efficiency of mixed-model ANCOVA estimators of the average treatment effect remain unclear. Under the potential outcomes framework, we prove that the mixed-model ANCOVA estimators of the average treatment effect are consistent and asymptotically normal under arbitrary misspecification of the working model. If the probability of receiving treatment is 0.5 for each cluster, we further show that the model-based variance estimator under mixed-model ANCOVA1 (ANCOVA without treatment-covariate interactions) remains consistent, clarifying that the confidence interval given by standard software is asymptotically valid even under model misspecification. Beyond robustness, we discuss several insights on precision among classical methods for analyzing cluster-randomized trials, including the mixed-model ANCOVA, individual-level ANCOVA, and cluster-level ANCOVA estimators. These insights may inform the choice of methods in practice. Our analytical results and insights are illustrated via simulation studies and analyses of three cluster-randomized trials.
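For concreteness, the sketch below fits the ANCOVA1 working model (random cluster intercept, no treatment-covariate interaction) to simulated cluster-randomized data using statsmodels; the data-generating values are illustrative. The coefficient on `treat` is the estimator whose robustness the paper establishes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy cluster-randomized trial: 30 clusters of 20 individuals, treatment
# assigned at the cluster level with probability 0.5 (illustrative data).
rng = np.random.default_rng(0)
clusters = np.repeat(np.arange(30), 20)
treat = np.repeat(rng.integers(0, 2, size=30), 20)
x = rng.normal(size=clusters.size)
cluster_effect = np.repeat(rng.normal(0, 0.5, size=30), 20)
y = 0.4 * treat + 0.8 * x + cluster_effect + rng.normal(size=clusters.size)
df = pd.DataFrame({"y": y, "treat": treat, "x": x, "cluster": clusters})

# Mixed-model ANCOVA1: random cluster intercept, no treatment-covariate
# interaction; 'treat' estimates the average treatment effect.
fit = smf.mixedlm("y ~ treat + x", df, groups=df["cluster"]).fit()
print(fit.summary())
```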
Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error based on CV estimates may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV, and as a result the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is to instead estimate the mean squared error of the CV point estimate using nested CV. This approach has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
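The sketch below conveys the nested-CV idea schematically: each outer fold asks how far an inner CV estimate falls from the error actually observed on held-out data, and the spread of those gaps tracks the MSE of the CV estimate. It uses a simple classification setup for brevity and omits the bias corrections of the full procedure; the paper's extension targets the Cox model and survival-appropriate choices of test error.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

def nested_cv_gap(X, y, n_outer=5, n_inner=5, seed=0):
    """Schematic nested CV: for each outer fold, compare an inner-CV
    error estimate with the error observed on the held-out fold; the
    mean squared gap reflects the MSE of the CV point estimate (the
    full procedure adds bias corrections omitted here)."""
    gaps = []
    outer = KFold(n_outer, shuffle=True, random_state=seed)
    for train, test in outer.split(X):
        model = LogisticRegression(max_iter=1000)
        inner_err = 1 - cross_val_score(model, X[train], y[train],
                                        cv=n_inner).mean()
        outer_err = 1 - model.fit(X[train], y[train]).score(X[test], y[test])
        gaps.append(inner_err - outer_err)
    return np.mean(np.square(gaps))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) + rng.normal(size=200) > 0).astype(int)
print(nested_cv_gap(X, y))
```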
A central challenge in the verification of quantum computers is benchmarking their performance as a whole and demonstrating their computational capabilities. In this work, we find a universal model of quantum computation, Bell sampling, that can be used for both of these tasks and thus provides an ideal stepping stone towards fault tolerance. In Bell sampling, we measure two copies of a state prepared by a quantum circuit in the transversal Bell basis. We show that the Bell samples are classically intractable to produce and at the same time constitute what we call a circuit shadow: from the Bell samples we can efficiently extract information about the quantum circuit preparing the state, as well as diagnose circuit errors. In addition to known properties that can be efficiently extracted from Bell samples, we give two new and efficient protocols: a test for the depth of the circuit and an algorithm to estimate a lower bound on the number of T gates in the circuit. With additional measurements, our algorithm learns a full description of states prepared by circuits with low T-count.
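The measurement itself is simple to state: for each qubit i, pair qubit i of the first copy with qubit i of the second and measure the pair in the Bell basis, which is equivalent to a CNOT followed by a Hadamard on the control and a computational-basis readout. The state-vector sketch below (feasible only for a handful of qubits) makes this concrete; the example state is an assumption for illustration.

```python
import numpy as np

def bell_sample(psi, shots=5, rng=None):
    """Bell-sample two copies of an n-qubit state |psi> (length-2**n
    vector). Per qubit pair, CNOT + Hadamard maps the Bell basis to the
    computational basis, so sampling the transformed state yields Bell
    samples: bits (b_i, b_{n+i}) identify the Bell outcome on pair i."""
    rng = rng or np.random.default_rng(0)
    n = int(np.log2(psi.size))
    state = np.kron(psi, psi).reshape((2,) * (2 * n)).astype(complex)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    for i in range(n):
        # CNOT: control qubit i (copy 1), target qubit n+i (copy 2);
        # flip the target axis on the control = 1 slice.
        sel = [slice(None)] * (2 * n)
        sel[i] = 1
        state[tuple(sel)] = np.flip(state[tuple(sel)], axis=n + i - 1)
        # Hadamard on the control qubit.
        state = np.moveaxis(np.tensordot(H, state, axes=([1], [i])), 0, i)
    probs = np.abs(state.reshape(-1)) ** 2
    probs /= probs.sum()
    outcomes = rng.choice(probs.size, size=shots, p=probs)
    return [np.unravel_index(o, (2,) * (2 * n)) for o in outcomes]

# Two copies of the 2-qubit state (|00> + |11>)/sqrt(2), for illustration.
psi = np.array([1, 0, 0, 1]) / np.sqrt(2)
print(bell_sample(psi))
```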
We present a nonlinear interpolation technique for parametric fields that exploits optimal transportation of coherent structures of the solution to achieve accurate interpolation. The approach generalizes the nonlinear interpolation procedure introduced in [Iollo, Taddei, J. Comput. Phys., 2022] to multi-dimensional parameter domains and to datasets of several snapshots. Given a library of high-fidelity simulations, we rely on a scalar testing function and on a point set registration method to identify coherent structures of the solution field in the form of sorted point clouds. Given a new parameter value, we exploit a regression method to predict the new point cloud; then, we resort to a boundary-aware registration technique to define bijective mappings that deform the new point cloud into the point clouds of the neighboring elements of the dataset, while preserving the boundary of the domain; finally, we define the estimate as a weighted combination of modes obtained by composing the neighboring snapshots with the previously built mappings. We present several numerical examples for compressible and incompressible, viscous and inviscid flows to demonstrate the accuracy of the method. Furthermore, we employ the nonlinear interpolation procedure to augment the dataset of simulations for linear-subspace projection-based model reduction: our data augmentation procedure is designed to reduce the offline costs -- which are dominated by snapshot generation -- of model reduction techniques for nonlinear advection-dominated problems.
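The core idea of transporting, rather than blending, coherent structures can be seen in a toy setting: match two point clouds by optimal assignment and interpolate along the matched displacements. The sketch below does exactly this; it is a caricature of the method (no scalar testing function, no boundary-aware registration, a single parameter), meant only to show why transported structures move instead of fading.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def interpolate_clouds(cloud_a, cloud_b, t):
    """Displacement interpolation of two point clouds: match points by
    solving the optimal assignment for squared Euclidean cost, then move
    each matched pair a fraction t of the way."""
    cost = cdist(cloud_a, cloud_b, "sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    return (1 - t) * cloud_a[rows] + t * cloud_b[cols]

# Toy coherent structure: a blob translating across the domain.
rng = np.random.default_rng(0)
blob = rng.normal(scale=0.05, size=(100, 2))
cloud_a, cloud_b = blob + [0.2, 0.5], blob + [0.8, 0.5]
midpoint = interpolate_clouds(cloud_a, cloud_b, t=0.5)
print(midpoint.mean(axis=0))  # ~[0.5, 0.5]: the structure moves, not fades
```

A linear combination of the two snapshots at t = 0.5 would instead show two half-amplitude blobs, which is the failure mode of linear interpolation for advection-dominated fields.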
We present a multigrid algorithm to efficiently solve the large saddle-point systems of equations that typically arise in PDE-constrained optimization under uncertainty. The algorithm is based on a collective smoother that at each iteration sweeps over the nodes of the computational mesh and solves a reduced saddle-point system whose size depends on the number $N$ of samples used to discretize the probability space. We show that this reduced system can be solved with optimal $O(N)$ complexity. We test the multigrid method on three problems: a linear-quadratic problem, possibly with a local or a boundary control, for which the multigrid method is used to solve the linear optimality system directly; a nonsmooth problem with box constraints and $L^1$-norm penalization on the control, in which the multigrid scheme is used within a semismooth Newton iteration; and a risk-averse problem with the smoothed CVaR risk measure, where the multigrid method is called within a preconditioned Newton iteration. In all cases, the multigrid algorithm exhibits excellent performance and robustness with respect to the parameters of interest.
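One way such an $O(N)$ complexity can arise is when a reduced system couples $N$ per-sample unknowns to a shared control unknown, giving a block-arrow structure that block elimination exploits. The sketch below solves a scalar-block caricature of such a system in $O(N)$ via a one-dimensional Schur complement; the paper's actual node systems couple states, adjoints, and control, but the elimination idea is the same.

```python
import numpy as np

def solve_arrowhead(d, c, gamma, rhs):
    """Solve the (N+1)x(N+1) 'arrowhead' system

        [ diag(d)  c   ] [x]   [rhs[:-1]]
        [ c^T    gamma ] [z] = [rhs[-1] ]

    in O(N): eliminate x, solve the scalar Schur complement for z,
    then back-substitute."""
    w = c / d                       # diag(d)^{-1} c
    schur = gamma - c @ w           # scalar Schur complement
    z = (rhs[-1] - w @ rhs[:-1]) / schur
    x = (rhs[:-1] - c * z) / d
    return np.concatenate([x, [z]])

# Check against a dense solve for N = 5 samples.
rng = np.random.default_rng(0)
N = 5
d, c, gamma = rng.uniform(1, 2, N), rng.normal(size=N), 3.0
A = np.block([[np.diag(d), c[:, None]], [c[None, :], np.array([[gamma]])]])
rhs = rng.normal(size=N + 1)
print(np.allclose(solve_arrowhead(d, c, gamma, rhs), np.linalg.solve(A, rhs)))
```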
The emergence of complex structures in systems governed by a simple set of rules is among the most fascinating aspects of Nature. A particularly powerful and versatile model for investigating this phenomenon is provided by cellular automata, with the Game of Life being one of the most prominent examples. However, this simplified model can be too limiting as a tool for modelling real systems. To address this, we introduce and study an extended version of the Game of Life in which a dynamical process governs the rule selection at each step. We show that the introduced modification significantly alters the behaviour of the game. We also demonstrate that the choice of the synchronization policy can be used to control the trade-off between stability and growth in the system.
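A minimal sketch of the construction: a Life-like automaton whose update rule is chosen anew at each step by some process. The rule set and the uniformly random chooser below are assumptions for illustration; the paper studies specific dynamical processes and synchronization policies.

```python
import numpy as np
from scipy.signal import convolve2d

KERNEL = np.ones((3, 3), dtype=int)
KERNEL[1, 1] = 0  # count the 8 neighbours, not the cell itself

def step(grid, born, survive):
    """One step of a Life-like automaton with rule B<born>/S<survive>."""
    n = convolve2d(grid, KERNEL, mode="same", boundary="wrap")
    return (((grid == 0) & np.isin(n, born)) |
            ((grid == 1) & np.isin(n, survive))).astype(int)

def run(grid, rules, chooser, steps=100, rng=None):
    """Dynamic rule selection: 'chooser' picks which rule applies at each
    step (here uniformly at random, purely as a stand-in process)."""
    rng = rng or np.random.default_rng(0)
    for _ in range(steps):
        born, survive = chooser(rules, rng)
        grid = step(grid, born, survive)
    return grid

rules = [([3], [2, 3]),        # B3/S23: Conway's Game of Life
         ([3, 6], [2, 3])]     # B36/S23: HighLife
chooser = lambda rules, rng: rules[rng.integers(len(rules))]
grid = np.random.default_rng(1).integers(0, 2, size=(64, 64))
print(run(grid, rules, chooser).sum(), "live cells after 100 steps")
```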