The finite-population asymptotic theory provides a normal approximation for the sampling distribution of the average treatment effect estimator in stratified randomized experiments. The asymptotic variance is often estimated by a Neyman-type conservative variance estimator. However, the variance estimator can be overly conservative, and the asymptotic theory may fail in small samples. To solve these issues, we propose a sharp variance estimator for the difference-in-means estimator weighted by the proportion of stratum sizes in stratified randomized experiments. Furthermore, we propose two causal bootstrap procedures to more accurately approximate the sampling distribution of the weighted difference-in-means estimator. The first causal bootstrap procedure is based on rank-preserving imputation and we show that it has second-order refinement over normal approximation. The second causal bootstrap procedure is based on sharp null imputation and is applicable in paired experiments. Our analysis is randomization-based or design-based by conditioning on the potential outcomes, with treatment assignment being the sole source of randomness. Numerical studies and real data analyses demonstrate advantages of our proposed methods in finite samples.
As floods are a major and growing source of risk in urban areas, there is a necessity to improve flood risk management frameworks and civil protection through planning interventions that modify surface flow pathways and introduce storage. Despite the complexity of densely urbanised areas, modern flood models can represent urban features and flow characteristics to help researchers, local authorities, and insurance companies to develop and improve efficient flood risk frameworks to achieve resilience in cities. A cost-benefit driven source-receptor flood risk framework is developed in this study to identify (1) locations contributing to surface flooding (sources), (2) buildings and locations at high flood risk (receptors), (3) the cost-benefit nexus between the source and the receptor, and finally (4) ways to mitigate flooding at the receptor by adding Blue-Green Infrastructure (BGI) in critical locations. The analysis is based on five steps to identify the source and the receptor in a study area based on the flood exposure of buildings, damages arising from flooding and available green spaces with the best potential to add sustainable and resilient solutions to reduce flooding. The framework was developed using the detailed hydrodynamic model CityCAT in a case study of the city centre of Newcastle upon Tyne, UK. The novelty of this analysis is that firstly, multiple storm magnitudes (i.e. small and large floods) are used combined with a method to locate the areas and the buildings at flood risk and a prioritized set of best places to add interventions upstream and downstream. Secondly, planning decisions are informed by considering the benefit from reduced damages to properties and the cost to construct resilient BGI options rather than a restricted hydraulic analysis considering only flood depths and storages in isolation from real-world economics.
The purpose of anonymizing structured data is to protect the privacy of individuals in the data while retaining the statistical properties of the data. There is a large body of work that examines anonymization vulnerabilities. Focusing on strong anonymization mechanisms, this paper examines a number of prominent attack papers and finds several problems, all of which lead to overstating risk. First, some papers fail to establish a correct statistical inference baseline (or any at all), leading to incorrect measures. Notably, the reconstruction attack from the US Census Bureau that led to a redesign of its disclosure method made this mistake. We propose the non-member framework, an improved method for how to compute a more accurate inference baseline, and give examples of its operation. Second, some papers don't use a realistic membership base rate, leading to incorrect precision measures if precision is reported. Third, some papers unnecessarily report measures in such a way that it is difficult or impossible to assess risk. Virtually the entire literature on membership inference attacks, dozens of papers, make one or both of these errors. We propose that membership inference papers report precision/recall values using a representative range of base rates.
Change-point detection, detecting an abrupt change in the data distribution from sequential data, is a fundamental problem in statistics and machine learning. CUSUM is a popular statistical method for online change-point detection due to its efficiency from recursive computation and constant memory requirement, and it enjoys statistical optimality. CUSUM requires knowing the precise pre- and post-change distribution. However, post-change distribution is usually unknown a priori since it represents anomaly and novelty. Classic CUSUM can perform poorly when there is a model mismatch with actual data. While likelihood ratio-based methods encounter challenges facing high dimensional data, neural networks have become an emerging tool for change-point detection with computational efficiency and scalability. In this paper, we introduce a neural network CUSUM (NN-CUSUM) for online change-point detection. We also present a general theoretical condition when the trained neural networks can perform change-point detection and what losses can achieve our goal. We further extend our analysis by combining it with the Neural Tangent Kernel theory to establish learning guarantees for the standard performance metrics, including the average run length (ARL) and expected detection delay (EDD). The strong performance of NN-CUSUM is demonstrated in detecting change-point in high-dimensional data using both synthetic and real-world data.
The consistency of the maximum likelihood estimator for mixtures of elliptically-symmetric distributions for estimating its population version is shown, where the underlying distribution $P$ is nonparametric and does not necessarily belong to the class of mixtures on which the estimator is based. In a situation where $P$ is a mixture of well enough separated but nonparametric distributions it is shown that the components of the population version of the estimator correspond to the well separated components of $P$. This provides some theoretical justification for the use of such estimators for cluster analysis in case that $P$ has well separated subpopulations even if these subpopulations differ from what the mixture model assumes.
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy using historical data collected under a different logging policy. OPE methods typically assume overlap between the target and logging policy, enabling solutions based on importance weighting and/or imputation. In this work, we approach OPE without assuming either overlap or a well-specified model by considering a strategy based on partial identification under non-parametric assumptions on the conditional mean function, focusing especially on Lipschitz smoothness. Under such smoothness assumptions, we formulate a pair of linear programs whose optimal values upper and lower bound the contributions of the no-overlap region to the off-policy value. We show that these linear programs have a concise closed form solution that can be computed efficiently and that their solutions converge, under the Lipschitz assumption, to the sharp partial identification bounds on the off-policy value. Furthermore, we show that the rate of convergence is minimax optimal, up to log factors. We deploy our methods on two semi-synthetic examples, and obtain informative and valid bounds that are tighter than those possible without smoothness assumptions.
Distributed control increases system scalability, flexibility, and redundancy. Foundational to such decentralisation is consensus formation, by which decision-making and coordination are achieved. However, decentralised multi-agent systems are inherently vulnerable to disruption. To develop a resilient consensus approach, inspiration is taken from the study of social systems and their dynamics; specifically, the Deffuant Model. A dynamic algorithm is presented enabling efficient consensus to be reached with an unknown number of disruptors present within a multi-agent system. By inverting typical social tolerance, agents filter out extremist non-standard opinions that would drive them away from consensus. This approach allows distributed systems to deal with unknown disruptions, without knowledge of the network topology or the numbers and behaviours of the disruptors. A disruptor-agnostic algorithm is particularly suitable to real-world applications where this information is typically unknown. Faster and tighter convergence can be achieved across a range of scenarios with the social dynamics inspired algorithm, compared with standard Mean-Subsequence-Reduced-type methods.
Multi-contrast (MC) Magnetic Resonance Imaging (MRI) reconstruction aims to incorporate a reference image of auxiliary modality to guide the reconstruction process of the target modality. Known MC reconstruction methods perform well with a fully sampled reference image, but usually exhibit inferior performance, compared to single-contrast (SC) methods, when the reference image is missing or of low quality. To address this issue, we propose DuDoUniNeXt, a unified dual-domain MRI reconstruction network that can accommodate to scenarios involving absent, low-quality, and high-quality reference images. DuDoUniNeXt adopts a hybrid backbone that combines CNN and ViT, enabling specific adjustment of image domain and k-space reconstruction. Specifically, an adaptive coarse-to-fine feature fusion module (AdaC2F) is devised to dynamically process the information from reference images of varying qualities. Besides, a partially shared shallow feature extractor (PaSS) is proposed, which uses shared and distinct parameters to handle consistent and discrepancy information among contrasts. Experimental results demonstrate that the proposed model surpasses state-of-the-art SC and MC models significantly. Ablation studies show the effectiveness of the proposed hybrid backbone, AdaC2F, PaSS, and the dual-domain unified learning scheme.
We develop a theory of asymptotic efficiency in regular parametric models when data confidentiality is ensured by local differential privacy (LDP). Even though efficient parameter estimation is a classical and well-studied problem in mathematical statistics, it leads to several non-trivial obstacles that need to be tackled when dealing with the LDP case. Starting from a standard parametric model $\mathcal P=(P_\theta)_{\theta\in\Theta}$, $\Theta\subseteq\mathbb R^p$, for the iid unobserved sensitive data $X_1,\dots, X_n$, we establish local asymptotic mixed normality (along subsequences) of the model $$Q^{(n)}\mathcal P=(Q^{(n)}P_\theta^n)_{\theta\in\Theta}$$ generating the sanitized observations $Z_1,\dots, Z_n$, where $Q^{(n)}$ is an arbitrary sequence of sequentially interactive privacy mechanisms. This result readily implies convolution and local asymptotic minimax theorems. In case $p=1$, the optimal asymptotic variance is found to be the inverse of the supremal Fisher-Information $\sup_{Q\in\mathcal Q_\alpha} I_\theta(Q\mathcal P)\in\mathbb R$, where the supremum runs over all $\alpha$-differentially private (marginal) Markov kernels. We present an algorithm for finding a (nearly) optimal privacy mechanism $\hat{Q}$ and an estimator $\hat{\theta}_n(Z_1,\dots, Z_n)$ based on the corresponding sanitized data that achieves this asymptotically optimal variance.
We propose some new results on the comparison of the minimum or maximum order statistic from a random number of non-identical random variables. Under the non-identical set-up, with certain conditions, we prove that random minimum (maximum) of one system dominates the other in hazard rate (reversed hazard rate) order. Further, we prove variation diminishing property (Karlin [8]) for all possible restrictions to derive the new results.
We often rely on censuses of triangulations to guide our intuition in $3$-manifold topology. However, this can lead to misplaced faith in conjectures if the smallest counterexamples are too large to appear in our census. Since the number of triangulations increases super-exponentially with size, there is no way to expand a census beyond relatively small triangulations; the current census only goes up to $10$ tetrahedra. Here, we show that it is feasible to search for large and hard-to-find counterexamples by using heuristics to selectively (rather than exhaustively) enumerate triangulations. We use this idea to find counterexamples to three conjectures which ask, for certain $3$-manifolds, whether one-vertex triangulations always have a "distinctive" edge that would allow us to recognise the $3$-manifold.