In cluster-randomized experiments, there is emerging interest in exploring the causal mechanism by which a cluster-level treatment affects the outcome through an intermediate outcome. Despite extensive development of causal mediation methods over the past decade, only a handful of methods have addressed causal mediation in cluster-randomized studies, and all of them depend on parametric model-based estimators. In this article, we develop the formal semiparametric efficiency theory to motivate several doubly robust methods for the mediation effect estimands corresponding to both cluster-average and individual-level treatment effects in cluster-randomized experiments: the natural indirect effect, the natural direct effect, and the spillover mediation effect. We derive the efficient influence function for each mediation effect, and carefully parameterize each efficient influence function to motivate practical strategies for operationalizing each estimator. We consider both parametric working models and data-adaptive machine learners to estimate the nuisance functions, and we obtain semiparametric efficient causal mediation estimators in the latter case. Our methods are illustrated via extensive simulations and two completed cluster-randomized experiments.
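For concreteness, the individual-level versions of the first two estimands can be written in standard potential-outcomes notation as
\[
\mathrm{NIE} = E\{Y(1, M(1)) - Y(1, M(0))\}, \qquad
\mathrm{NDE} = E\{Y(1, M(0)) - Y(0, M(0))\},
\]
where $Y(a, m)$ is the potential outcome under treatment $a$ and mediator value $m$, and $M(a)$ is the potential mediator. The spillover mediation effect analogously contrasts counterfactual values of other cluster members' mediators, and the cluster-average estimands reweight individuals so that each cluster contributes equally. (This is the conventional mediation formulation; the paper's exact notation may differ.)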
We study the identification of binary choice models with fixed effects. We provide a condition called sign saturation and show that it is sufficient for identification of the model. In particular, identification is guaranteed even when all the regressors are bounded, including the case of multiple discrete regressors. We also show that without this condition, the model is not identified unless the error distribution belongs to a special class. The same sign saturation condition is also essential for identifying the sign of treatment effects. We provide a test for the sign saturation condition that can be implemented using existing algorithms for the maximum score estimator.
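As a rough, self-contained illustration of the computational tool named in the last sentence, the maximum score estimator can be approximated by random search over directions on the unit sphere; the data-generating step and all names below are hypothetical, not the paper's procedure.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 1{x'beta + u >= 0} with beta = (1, -0.5).
n = 500
X = rng.normal(size=(n, 2))
y = (X @ np.array([1.0, -0.5]) + rng.logistic(size=n) >= 0).astype(int)

def score(b, X, y):
    """Fraction of observations whose predicted sign 1{x'b >= 0} matches y."""
    return np.mean(y == (X @ b >= 0))

# Random search over unit-norm directions (the scale of b is not identified).
best_b, best_s = None, -np.inf
for _ in range(20000):
    b = rng.normal(size=2)
    b /= np.linalg.norm(b)
    s = score(b, X, y)
    if s > best_s:
        best_b, best_s = b, s

print(best_b, best_s)  # direction close to (1, -0.5)/||(1, -0.5)|| in large samples
\end{verbatim}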
Identifying predictive covariates, which forecast individual treatment effectiveness, is crucial for decision-making in disciplines such as personalized medicine. These covariates, referred to as biomarkers, are extracted from pre-treatment data, often within randomized controlled trials, and should be distinguished from prognostic biomarkers, whose association with the outcome is independent of treatment assignment. Our study focuses on discovering predictive imaging biomarkers, i.e., specific image features, by leveraging pre-treatment images to uncover new causal relationships. Unlike labor-intensive approaches that rely on handcrafted features prone to bias, we present a novel task of directly learning predictive features from images. We propose an evaluation protocol to assess a model's ability to identify predictive imaging biomarkers and differentiate them from purely prognostic ones, employing statistical testing and a comprehensive analysis of image feature attribution. We explore the suitability of deep learning models originally developed for estimating the conditional average treatment effect (CATE) for this task; such models have been assessed primarily for the precision of their CATE estimates, while their value for imaging biomarker discovery has been overlooked. Our proof-of-concept analysis demonstrates the feasibility and potential of our approach in discovering and validating predictive imaging biomarkers from synthetic outcomes and real-world image datasets. Our code is available at \url{https://github.com/MIC-DKFZ/predictive_image_biomarker_analysis}.
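As a minimal sketch of the kind of CATE-based analysis involved, here is a T-learner on synthetic tabular data with one predictive and one prognostic feature; this is a generic illustration under assumed names, not the authors' pipeline.
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic "image features": column 0 is predictive, column 1 is prognostic.
n = 2000
X = rng.normal(size=(n, 5))
t = rng.integers(0, 2, size=n)
y = X[:, 1] + t * X[:, 0] + rng.normal(scale=0.5, size=n)

# T-learner: fit a separate outcome model per arm, take the difference.
m1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[t == 1], y[t == 1])
m0 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[t == 0], y[t == 0])
cate_hat = m1.predict(X) - m0.predict(X)

# A predictive feature should drive the estimated CATE; a prognostic one should not.
print(np.corrcoef(cate_hat, X[:, 0])[0, 1])  # high
print(np.corrcoef(cate_hat, X[:, 1])[0, 1])  # near zero
\end{verbatim}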
We systematically investigate the preservation of differential privacy in functional data analysis, beginning with functional mean estimation and extending to varying coefficient model estimation. Our work introduces a distributed learning framework involving multiple servers, each responsible for collecting several sparsely observed functions. This hierarchical setup introduces a mixed notion of privacy. Within each function, user-level differential privacy is applied to the $m$ discrete observations. At the server level, central differential privacy is deployed to account for the centralised nature of data collection. Across servers, only private information is exchanged, adhering to federated differential privacy constraints. To address this complex hierarchy, we employ minimax theory to reveal several fundamental phenomena: the transition from sparse to dense functional data analysis, the costs of user-level, central, and federated differential privacy, and the intricate interplay between different regimes of functional data analysis and privacy preservation. To the best of our knowledge, this is the first study to rigorously examine functional data estimation under multiple privacy constraints. Our theoretical findings are complemented by efficient private algorithms and extensive numerical evidence, providing a comprehensive exploration of this challenging problem.
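To illustrate the innermost privacy layer, here is a minimal sketch of user-level differentially private mean release via clipping and Laplace noise; the function name and data are hypothetical, and the paper's estimators are considerably more refined.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

def private_user_mean(user_data, clip=1.0, epsilon=1.0, rng=rng):
    """Release the across-user mean of per-user averages with user-level DP.

    Each user contributes the average of their m observations, clipped to
    [-clip, clip]; replacing one user's entire record changes the across-user
    mean by at most 2*clip/n, so Laplace noise at that scale / epsilon suffices.
    """
    user_means = np.clip([np.mean(u) for u in user_data], -clip, clip)
    n = len(user_means)
    sensitivity = 2 * clip / n
    return np.mean(user_means) + rng.laplace(scale=sensitivity / epsilon)

# Hypothetical sparse functional data: each user observes m = 8 points.
users = [rng.normal(loc=0.3, size=8) for _ in range(1000)]
print(private_user_mean(users, epsilon=1.0))
\end{verbatim}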
Methods for analyzing representations in neural systems have become a popular tool in both neuroscience and mechanistic interpretability. Measures that compare how similar neurons' activations are across conditions, architectures, and species give us a scalable way of learning how information is transformed within different neural networks. In contrast to this trend, recent investigations have revealed how some metrics respond to spurious signals and hence give misleading results. To identify the most reliable metrics and understand how they could be improved, it is important to identify specific test cases that can serve as benchmarks. Here we propose that the phenomenon of compositional learning in recurrent neural networks (RNNs) allows us to build such a test case for dynamical representation alignment metrics. Implementing this case enables us to test whether metrics can identify representations that develop gradually throughout learning, and to probe whether the representations identified by metrics are relevant to the computations executed by networks. By building both an attractor-based and an RNN-based test case, we show that the new Dynamical Similarity Analysis (DSA) is more robust to noise and identifies behaviorally relevant representations more reliably than prior metrics (Procrustes, CKA). We also show how test cases can be used beyond evaluating metrics, to study new architectures. Specifically, results from applying DSA to modern (Mamba) state space models suggest that, in contrast to RNNs, these models may not exhibit changes to their recurrent dynamics owing to their expressiveness. Overall, by developing test cases, we show DSA's exceptional ability to detect compositional dynamical motifs, thereby enhancing our understanding of how computations unfold in RNNs.
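For reference, one of the baseline metrics, linear CKA, is simple to state in code (DSA itself involves fitting dynamics operators and is not reproduced here); the toy activation matrices below are illustrative.
\begin{verbatim}
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n x p) and Y (n x q)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 32))
B = A @ rng.normal(size=(32, 32))  # linear mixing of A: high similarity
print(linear_cka(A, B), linear_cka(A, rng.normal(size=(100, 32))))
\end{verbatim}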
In longitudinal observational studies with time-to-event outcomes, a common objective of causal analysis is to estimate the causal survival curve under hypothetical intervention scenarios, and the g-formula is a useful tool for this purpose. To enhance the traditional parametric g-formula, we develop an alternative g-formula estimator that incorporates Bayesian Additive Regression Trees (BART) into the modeling of the time-evolving generative components, aiming to mitigate the bias due to model misspecification. We focus on binary time-varying treatments and introduce a general class of g-formulas for discrete survival data that can incorporate longitudinal balancing scores. The minimal sufficient formulation of these longitudinal balancing scores is linked to the nature of the treatment strategy, i.e., static or dynamic, and we provide posterior sampling algorithms for each type of strategy. We conduct simulations to illustrate the empirical performance of the proposed method and demonstrate its practical utility using data from the Yale New Haven Health System's electronic health records.
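A minimal Monte Carlo g-formula sketch for discrete survival under a static "always treat" strategy is shown below; the model forms and coefficients are illustrative assumptions standing in for fitted components, which the paper models with BART.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
expit = lambda x: 1 / (1 + np.exp(-x))

# Illustrative "fitted" components (in practice estimated from data, e.g. by BART):
hazard = lambda L, A: expit(-3.0 + 0.8 * L - 0.7 * A)  # P(event at t | L_t, A_t)
next_L = lambda L, A: L + rng.normal(0.1 - 0.2 * A)    # covariate evolution

def g_formula_survival(T=10, n_sim=10000, strategy=lambda t, L: 1):
    """Simulate forward under the intervention to get the survival curve."""
    surv = np.zeros(T)
    for _ in range(n_sim):
        L, alive = 0.0, True
        for t in range(T):
            A = strategy(t, L)               # treatment set by the intervention
            if alive and rng.random() < hazard(L, A):
                alive = False
            surv[t] += alive
            L = next_L(L, A)
    return surv / n_sim

print(g_formula_survival()[-1])  # S(T) under "always treat"
\end{verbatim}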
Clustering and outlier detection are two important tasks in data mining. Outliers frequently interfere with clustering algorithms' assessment of the similarity between objects, resulting in unreliable clustering results. Currently, only a few clustering algorithms (e.g., DBSCAN) can detect outliers and thereby eliminate this interference. For other clustering algorithms, introducing a separate outlier detection step before each clustering run is tedious. Equipping more clustering algorithms with outlier detection ability is therefore clearly worthwhile. Although a common strategy lets clustering algorithms detect outliers based on the distance between objects and clusters, this strategy conflicts with improving the performance of clustering algorithms on datasets with outliers. In this paper, we propose a novel outlier detection approach for clustering, called ODAR. ODAR maps outliers and normal objects into two separate clusters via feature transformation, so that any clustering algorithm can detect outliers by identifying these clusters. Experiments show that ODAR is robust across diverse datasets: with its help, the clustering algorithms achieve the best results on 7 out of 10 datasets, with at least a 5% improvement in accuracy.
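The map-then-cluster idea can be illustrated generically as follows: transform each object into a neighborhood-distance feature and cluster the transformed data into two groups. This is a hypothetical stand-in, not ODAR's actual feature transformation.
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)

# Hypothetical data: two Gaussian clusters plus uniform outliers.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2)),
               rng.uniform(-10, 16, (20, 2))])

# Transform each object into a neighborhood-distance feature; outliers and
# normal objects then separate into two clusters in the transformed space.
k = 10
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
feat = dist[:, 1:].mean(axis=1, keepdims=True)  # mean distance to k neighbors

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feat)
outlier_cluster = np.argmax([feat[labels == c].mean() for c in (0, 1)])
print((labels == outlier_cluster).sum(), "objects flagged as outliers")
\end{verbatim}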
Neural operators aim to approximate operators mapping between Banach spaces of functions and have achieved much success in scientific computing. In contrast to deep learning-based solvers such as Physics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), which solve one PDE instance at a time, neural operators can solve an entire class of Partial Differential Equations (PDEs). Although much work has analyzed the approximation and generalization error of neural operators, an analysis of their training error is still lacking. In this work, we conduct a convergence analysis of gradient descent for wide shallow neural operators within the framework of the Neural Tangent Kernel (NTK). The core idea is that over-parameterization and random initialization together ensure that each weight vector remains near its initialization throughout all iterations, yielding linear convergence of gradient descent: under over-parameterization, gradient descent finds the global minimum in both continuous and discrete time.
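The linear-convergence statement follows the standard NTK template: for a sufficiently small step size $\eta$ and sufficient width, the training residual contracts geometrically,
\[
\|u(k+1) - y\|_2^2 \le \Big(1 - \frac{\eta \lambda_0}{2}\Big)\, \|u(k) - y\|_2^2,
\]
where $u(k)$ collects the network outputs at iteration $k$, $y$ the training targets, and $\lambda_0 > 0$ is the smallest eigenvalue of the limiting NTK Gram matrix. (This is the template from the fully connected NTK literature; the paper establishes the analogue for wide shallow neural operators.)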
Multidimensional quaternion arrays (often referred to as ``quaternion tensors'') and their decompositions have recently gained increasing attention in fields such as color and polarimetric imaging and video processing. Despite this growing interest, the theoretical development of quaternion tensors remains limited. This paper introduces a novel multilinear framework for quaternion arrays, which extends classical tensor analysis to multidimensional quaternion data in a rigorous manner. Specifically, we propose a new definition of quaternion tensors as $\mathbb{H}\mathbb{R}$-multilinear forms, addressing the challenges posed by the non-commutativity of quaternion multiplication. Within this framework, we establish the Tucker decomposition for quaternion tensors and develop a quaternion Canonical Polyadic Decomposition (Q-CPD). We thoroughly investigate the properties of the Q-CPD, including trivial ambiguities, complex equivalent models, and sufficient conditions for uniqueness. Additionally, we present two algorithms for computing the Q-CPD and demonstrate their effectiveness through numerical experiments. Our results provide a solid theoretical foundation for further research on quaternion tensor decompositions and offer new computational tools for practitioners working with quaternion multiway data.
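The non-commutativity that motivates the $\mathbb{H}\mathbb{R}$-multilinear formulation is easy to verify numerically with the Hamilton product; the helper below is a self-contained illustration.
\begin{verbatim}
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) arrays."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

i = np.array([0., 1., 0., 0.])
j = np.array([0., 0., 1., 0.])
print(qmul(i, j))  # [0 0 0  1]  -> ij =  k
print(qmul(j, i))  # [0 0 0 -1]  -> ji = -k: multiplication is non-commutative
\end{verbatim}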
The problem of sequential anomaly detection and identification is considered, where multiple data sources are monitored simultaneously and the goal is to identify in real time those, if any, that exhibit ``anomalous'' statistical behavior. An upper bound is postulated on the number of data sources that can be sampled at each sampling instant, but the decision maker selects which ones to sample based on the already collected data. Thus, in this context, a policy consists not only of a stopping rule and a decision rule, which determine when sampling should be terminated and which sources to identify as anomalous upon stopping, but also of a sampling rule, which determines which sources to sample at each time instant subject to the sampling constraint. Two distinct formulations are considered, requiring control of different ``generalized'' error metrics: the first tolerates a certain user-specified number of errors of any kind, whereas the second tolerates distinct, user-specified numbers of false positives and false negatives. For each formulation, a universal asymptotic lower bound on the expected stopping time is established as the error probabilities go to 0, and it is shown to be attained by a policy that combines the stopping and decision rules proposed in the full-sampling case with a probabilistic sampling rule that achieves a specific long-run sampling frequency for each source. Moreover, the expected stopping time that is optimal to a first-order asymptotic approximation is compared in simulation studies with its counterpart in a finite regime, and the impact of the sampling constraint and of the tolerance to errors is assessed.
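One simple deterministic way to realize a sampling rule that attains prescribed long-run sampling frequencies under the per-instant constraint is to always sample the most under-sampled sources; this sketch uses assumed names and is not the paper's probabilistic rule.
\begin{verbatim}
import numpy as np

def sample_sources(target_freq, T, K):
    """At each instant, sample the K sources with the largest sampling deficit
    target_freq[i] * t - N[i], so empirical frequencies track target_freq."""
    n = len(target_freq)
    N = np.zeros(n)                        # times each source has been sampled
    for t in range(1, T + 1):
        deficit = np.asarray(target_freq) * t - N
        chosen = np.argsort(deficit)[-K:]  # the K most under-sampled sources
        N[chosen] += 1
    return N / T

# Empirical frequencies approach the targets as the horizon T grows.
print(sample_sources([0.5, 0.3, 0.1, 0.1], T=10000, K=1))
\end{verbatim}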
Given that distributed systems face adversarial behaviors such as eavesdropping and cyberattacks, ensuring that the evidence fusion result is credible has become a pressing issue. Different from traditional research that assumes nodes are cooperative, we focus on three requirements for evidence fusion: preserving the privacy of evidence, identifying attackers and excluding their evidence, and dissipating the high conflict among evidence caused by random noise and interference. To this end, this paper proposes an algorithm for credible evidence fusion against cyberattacks. First, the fusion strategy is constructed based on conditionalized credibility to avoid counterintuitive fusion results caused by high conflict. Under this strategy, distributed evidence fusion is transformed into the average-consensus problem for the weighted average value by conditional credibility of multi-source evidence (WAVCCME), which implies a more concise consensus process and lower computational complexity than existing algorithms. Second, a state decomposition and reconstruction strategy with weight encryption is designed, and its effectiveness for privacy preservation over directed graphs is guaranteed: states are decomposed into different random sub-states for different neighbors to defend against internal eavesdroppers, and the sub-states' weights are encrypted in the reconstruction to guard against out-of-system eavesdroppers. Finally, attackers and their types are identified by inter-neighbor broadcasting and comparison of nodes' states, and the proposed update rule with state corrections is used to achieve consensus on the WAVCCME. The states of normal nodes are shown to converge to their WAVCCME, while the attackers' evidence is excluded from the fusion, as verified by a simulation of a distributed unmanned reconnaissance swarm.
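The state decomposition idea can be sketched in a few lines: each node splits its state into random sub-states that sum to the true state and sends different sub-states to different neighbors, so no single eavesdropped message reveals the state. The code below is a simplified illustration and omits the weight encryption.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)

def decompose_state(x, n_neighbors, rng=rng):
    """Split a scalar state x into random sub-states, one per out-neighbor,
    that sum to x; each neighbor sees only its own share."""
    shares = rng.normal(size=n_neighbors - 1)
    return np.append(shares, x - shares.sum())

x = 4.2
subs = decompose_state(x, 3)
print(subs, subs.sum())  # three shares; only their sum reveals x
\end{verbatim}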