Empirical regression discontinuity (RD) studies often use covariates to increase the precision of their estimates. In this paper, we propose a novel class of estimators that use such covariate information more efficiently than the linear adjustment estimators that are currently used widely in practice. Our approach can accommodate a possibly large number of either discrete or continuous covariates. It involves running a standard RD analysis with an appropriately modified outcome variable, which takes the form of the difference between the original outcome and a function of the covariates. We characterize the function that leads to the estimator with the smallest asymptotic variance, and show how it can be estimated via modern machine learning, nonparametric regression, or classical parametric methods. The resulting estimator is easy to implement, as tuning parameters can be chosen as in a conventional RD analysis. An extensive simulation study illustrates the performance of our approach.
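As a minimal sketch of the modified-outcome idea (not the authors' implementation; the choice of learner, the cross-fitting scheme, and the fixed bandwidth below are assumptions for illustration), one can adjust the outcome by a cross-fitted prediction from the covariates and then apply a standard local linear RD estimator to the result:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def modified_outcome(y, X, n_splits=5, seed=0):
    """Cross-fitted adjustment: predict y from covariates X out of fold."""
    f_hat = np.zeros_like(y, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in kf.split(X):
        model = GradientBoostingRegressor().fit(X[train], y[train])
        f_hat[test] = model.predict(X[test])
    return y - f_hat  # modified outcome M = Y - f(X)

def local_linear_rd(m, z, cutoff=0.0, h=1.0):
    """Sharp-RD estimate: gap between triangular-kernel local linear fits at the cutoff."""
    fits = {}
    for side, mask in (("right", z >= cutoff), ("left", z < cutoff)):
        d = z[mask] - cutoff
        w = np.clip(1.0 - np.abs(d) / h, 0.0, None)    # triangular kernel weights
        D = np.column_stack([np.ones_like(d), d])      # intercept + local slope
        beta = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * m[mask]))
        fits[side] = beta[0]                           # fitted value at the cutoff
    return fits["right"] - fits["left"]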
By exploiting the theory of skew-symmetric distributions, we generalise existing results in sensitivity analysis by providing an analytic expression for the bias induced by marginalization over an unobserved continuous confounder in a logistic regression model. Under some simplifying assumptions, an approximation of this expression mimics Cochran's formula. Other link functions and error distributions are also considered. A simulation study is performed to assess the properties of the approximation. The derivations can also be applied in causal mediation analysis, thereby enlarging the number of circumstances in which simple parametric formulations can be used to evaluate causal direct and indirect effects. Standard errors of the causal effect estimators are provided via the first-order delta method. Simulations show that our proposed estimators perform as well as others based on numerical methods and that the additional interpretability of the explicit formulas does not compromise their precision. We apply the new estimator to measure the effect of humidity on upper airway diseases mediated by the presence of common aeroallergens in the air.
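For reference, the classical linear-model result (Cochran's formula) that the approximation mimics can be stated in a few lines: if
\[
Y = \beta_0 + \beta_1 X + \beta_2 U + \varepsilon, \qquad U = \delta_0 + \delta_1 X + \eta,
\]
then marginalizing over the unobserved confounder $U$ gives
\[
\mathbb{E}[Y \mid X] = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) X,
\]
so the marginal coefficient of $X$ is biased by $\beta_2 \delta_1$. The logistic link is not collapsible in this simple way, which is why the analogous expression for logistic regression requires additional machinery.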
Reliable probabilistic primality tests are fundamental in public-key cryptography. In adversarial scenarios, a composite with a high probability of passing a specific primality test could be chosen deliberately, so worst-case error estimates for the test are needed. In many scenarios, however, the numbers are chosen randomly and thus have a significantly smaller error probability, which makes average-case error estimates the relevant quantity. In this paper, we establish such bounds for the strong Lucas primality test, for which only worst-case, but no average-case, error bounds are currently available; this allows the test to be used with more confidence. We examine an algorithm that draws odd $k$-bit integers uniformly and independently, runs $t$ independent iterations of the strong Lucas test with randomly chosen parameters, and outputs the first number that passes all $t$ consecutive rounds. We obtain numerical upper bounds on the probability of returning a composite. Furthermore, we consider a modified version of this algorithm that excludes integers divisible by small primes, resulting in improved bounds. Additionally, we classify the numbers that contribute most to our estimate.
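A runnable approximation of the sampling procedure is sketched below. One caveat: sympy's is_strong_lucas_prp fixes the Lucas parameters by Selfridge's Method A, whereas the analyzed algorithm draws parameters at random, so the $t$ rounds below are not truly independent; the helper is used only to make the sketch executable.

import random
from sympy.ntheory.primetest import is_strong_lucas_prp

def sample_candidate(k: int, t: int) -> int:
    """Draw odd k-bit integers uniformly until one passes t strong Lucas rounds."""
    while True:
        n = (1 << (k - 1)) | random.getrandbits(k - 1) | 1  # uniform odd k-bit integer
        if all(is_strong_lucas_prp(n) for _ in range(t)):   # see the caveat above
            return n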
Mendelian randomization (MR) is an instrumental variable (IV) approach to inferring causal relationships between exposures and outcomes from genome-wide association study (GWAS) summary data. However, the multivariable inverse-variance weighting (IVW) approach, which serves as the foundation for most MR methods, cannot yield unbiased causal effect estimates in the presence of many weak IVs. To address this problem, we propose MR using Bias-corrected Estimating Equations (MRBEE), which can infer unbiased causal relationships with many weak IVs while simultaneously accounting for horizontal pleiotropy. While the practical significance of MRBEE was demonstrated in our parallel work (Lorincz-Comi (2023)), this paper establishes the statistical theory of multivariable IVW and MRBEE with many weak IVs. First, we show that the bias of the multivariable IVW estimate is an error-in-variables bias, whose scale is inflated by weak instrument bias and whose direction is influenced by sample overlap between the exposure and outcome GWAS cohorts. Second, we investigate the asymptotic properties of multivariable IVW and MRBEE, showing that MRBEE outperforms multivariable IVW in terms of both unbiasedness of causal effect estimation and asymptotic validity of causal inference. Finally, we apply MRBEE to examine myopia and reveal that education and outdoor activity are causal for myopia whereas indoor activity is not.
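Schematically (with notation assumed for illustration rather than taken from the paper): given per-variant exposure effect estimates $\hat\gamma_j \in \mathbb{R}^q$ and outcome effect estimates $\hat\Gamma_j$, multivariable IVW solves a weighted least-squares problem,
\[
\hat\beta_{\mathrm{IVW}} = \Big(\sum_{j} w_j \, \hat\gamma_j \hat\gamma_j^{\top}\Big)^{-1} \sum_{j} w_j \, \hat\gamma_j \hat\Gamma_j .
\]
Because $\hat\gamma_j = \gamma_j + u_j$ is itself estimated with error, the Gram matrix picks up spurious $\mathbb{E}[u_j u_j^{\top}]$ terms; this is the error-in-variables bias, and a bias-corrected estimating equation removes an estimate of these terms.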
This paper presents a novel approach to Bayesian nonparametric spectral analysis of stationary multivariate time series. Starting from a parametric vector-autoregressive model, the parametric likelihood is nonparametrically adjusted in the frequency domain to account for potential deviations from the parametric assumptions. We show mutual contiguity of the nonparametrically corrected likelihood, the multivariate Whittle likelihood approximation, and the exact likelihood for Gaussian time series. A multivariate extension of the nonparametric Bernstein-Dirichlet process prior for univariate spectral densities to the space of Hermitian positive definite spectral density matrices is specified directly on the correction matrices. An infinite series representation of this prior is then used to develop a Markov chain Monte Carlo algorithm to sample from the posterior distribution. The code is made publicly available for ease of use and reproducibility. With this novel approach we provide a generalization of the multivariate Whittle-likelihood-based method of Meier et al. (2020) as well as an extension of the nonparametrically corrected likelihood for univariate stationary time series of Kirch et al. (2019) to the multivariate case. We demonstrate that the nonparametrically corrected likelihood combines the efficiency of a parametric model with the robustness of a nonparametric one. Its numerical accuracy is illustrated in a comprehensive simulation study, and we illustrate its practical advantages by a spectral analysis of two environmental data sets: a bivariate time series of the Southern Oscillation Index and fish recruitment, and wind speed time series at six locations in California.
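For orientation, the multivariate Whittle approximation referenced above takes, up to additive constants, the form
\[
\ell_W(f) = -\sum_{k} \Big( \log \det f(\lambda_k) + Z(\lambda_k)^{*} f(\lambda_k)^{-1} Z(\lambda_k) \Big),
\]
where $f$ is the spectral density matrix and $Z(\lambda_k)$ is the suitably normalized discrete Fourier transform of the series at the Fourier frequency $\lambda_k$; the nonparametric correction adjusts a parametric working likelihood in this frequency domain rather than replacing it entirely.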
In statistical network analysis it is common to observe so-called interaction data. Such data are characterized by actors forming the vertices of a network and interacting along its edges, where edges are randomly formed and dissolved over the observation horizon. In addition, covariates are observed, and the goal is to model their impact on the interactions. We distinguish two types of covariates: global, system-wide covariates (i.e., covariates taking the same value for all individuals, such as seasonality) and local, dyadic covariates modeling interactions between two individuals in the network. We extend existing continuous-time network models to allow for comparing a completely parametric model with a model that is parametric only in the local covariates but has a global nonparametric time component. This makes it possible, for instance, to test whether global time dynamics can be explained by simple global covariates such as weather or seasonality. The procedure is applied to a bike-sharing network, using weather and weekdays as global covariates and distances between bike stations as local covariates.
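In stylized form (a schematic of the model class, with notation assumed), the interaction intensity of a dyad $(i,j)$ can be written as
\[
\lambda_{ij}(t) = \lambda_0(t) \, \exp\!\big(\beta^{\top} x_{ij}\big),
\]
with a nonparametric global time component $\lambda_0(t)$ and parametric local (dyadic) covariates $x_{ij}$; the fully parametric competitor replaces $\lambda_0(t)$ by $\exp(\gamma^{\top} z(t))$ for global covariates $z(t)$ such as weather and weekday, and the comparison asks whether this parametric baseline suffices.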
The increasing complexity of data requires methods and models that can effectively handle intricate structures, as simplifying them would result in loss of information. While several analytical tools have been developed to work with complex data objects in their original form, these tools are typically limited to single-type variables. In this work, we propose energy trees as a regression and classification model capable of accommodating structured covariates of various types. Energy trees leverage energy statistics to extend the capabilities of conditional inference trees, from which they inherit sound statistical foundations, interpretability, scale invariance, and freedom from distributional assumptions. We specifically focus on functional and graph-structured covariates, while also highlighting the model's flexibility in integrating other variable types. Extensive simulation studies demonstrate the model's competitive performance in terms of variable selection and robustness to overfitting. Finally, we assess the model's predictive ability through two empirical analyses involving human biological data. Energy trees are implemented in the R package etree.
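The model itself is implemented in R (etree); purely to illustrate the energy-statistics building block, the following minimal Python sketch computes the sample energy distance between two groups of vector-valued observations, the kind of two-sample statistic that underlies the split tests:

import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(X, Y):
    """Sample energy distance between two samples (rows are observations)."""
    between = cdist(X, Y).mean()    # mean of ||X_i - Y_j||
    within_x = cdist(X, X).mean()   # mean of ||X_i - X_i'||
    within_y = cdist(Y, Y).mean()   # mean of ||Y_j - Y_j'||
    return 2.0 * between - within_x - within_y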
The design of experiments involves a compromise between covariate balance and robustness. This paper provides a formalization of this trade-off and describes an experimental design that allows experimenters to navigate it. The design is specified by a robustness parameter that bounds the worst-case mean squared error of an estimator of the average treatment effect. Subject to the experimenter's desired level of robustness, the design aims to simultaneously balance all linear functions of potentially many covariates. Less robustness allows for more balance. We show that the mean squared error of the estimator is bounded in finite samples by the minimum of the loss function of an implicit ridge regression of the potential outcomes on the covariates. Asymptotically, the design perfectly balances all linear functions of a growing number of covariates with a diminishing reduction in robustness, effectively allowing experimenters to escape the compromise between balance and robustness in large samples. Finally, we describe conditions that ensure asymptotic normality and provide a conservative variance estimator, which facilitate the construction of asymptotically valid confidence intervals.
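In stylized form (constants and normalizations suppressed; a schematic of the claim rather than the paper's exact bound), the finite-sample guarantee resembles a ridge-regression loss: for a robustness parameter $\phi \in (0, 1)$,
\[
\mathrm{MSE} \le \min_{\beta} \Big[ \tfrac{1}{\phi} \, \|\mu - X\beta\|^2 + \tfrac{1}{1-\phi} \, \|\beta\|^2 \Big],
\]
where $\mu$ collects the potential outcomes and $X$ the covariates, so the robustness parameter governs the implicit trade-off between how well a linear function of the covariates must fit and how heavily its coefficients are penalized.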
This paper develops an approximation to the (effective) $p$-resistance and applies it to multi-class clustering. Spectral methods based on the graph Laplacian and its generalization to the graph $p$-Laplacian have been a backbone of non-Euclidean clustering techniques. The advantage of the $p$-Laplacian is that the parameter $p$ induces a controllable bias on cluster structure. The drawback of $p$-Laplacian eigenvector based methods is that the third and higher eigenvectors are difficult to compute. Thus, instead, we are motivated to use the $p$-resistance induced by the $p$-Laplacian for clustering. For $p$-resistance, small $p$ biases towards clusters with high internal connectivity, while large $p$ biases towards clusters of small ``extent,'' that is, a preference for smaller shortest-path distances between vertices in the cluster. However, the $p$-resistance is expensive to compute. We overcome this by developing an approximation to the $p$-resistance. We prove upper and lower bounds on this approximation and observe that it is exact when the graph is a tree. We also provide theoretical justification for the use of $p$-resistance for clustering. Finally, we provide experiments comparing our approximated $p$-resistance clustering to other $p$-Laplacian based methods.
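For the classical case $p = 2$, the effective resistance has a closed form via the pseudoinverse of the graph Laplacian, which the following minimal numpy sketch computes (the general $p$-resistance is exactly what requires the approximation developed in the paper):

import numpy as np

def effective_resistance(A):
    """All-pairs effective resistances (the p = 2 case) from an adjacency matrix A."""
    L = np.diag(A.sum(axis=1)) - A            # combinatorial graph Laplacian
    Lp = np.linalg.pinv(L)                    # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp   # r_uv = L+_uu + L+_vv - 2 L+_uv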
The reduced-rank vector autoregressive (VAR) model can be interpreted as a supervised factor model, in which factor structures are simultaneously imposed on the response and predictor spaces. This article introduces a new model, called vector autoregression with common response and predictor factors, to further explore the common structure between the response and predictors in the VAR framework. The new model can provide better physical interpretations and improve estimation efficiency. In conjunction with the tensor operation, the model can easily be extended to any finite-order VAR model. A regularization-based method is considered for high-dimensional estimation with a gradient descent algorithm, and its computational and statistical convergence guarantees are established. For data with pervasive cross-sectional dependence, a transformation of the responses is developed to alleviate the diverging-eigenvalue effect. Moreover, we consider an additional sparsity structure in the factor loadings for the case of ultra-high dimension. Simulation experiments confirm our theoretical findings, and a macroeconomic application showcases the appealing properties of the proposed model in structural analysis and forecasting.
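For concreteness, the reduced-rank VAR(1) baseline mentioned above can be written as
\[
y_t = A B^{\top} y_{t-1} + \varepsilon_t, \qquad A, B \in \mathbb{R}^{N \times r}, \quad r \ll N,
\]
where $B^{\top} y_{t-1}$ extracts predictor factors and $A$ loads them onto the response space; the proposed model additionally ties the response and predictor factor spaces together.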
Class Incremental Learning (CIL) aims at learning a multi-class classifier in a phase-by-phase manner, in which only data of a subset of the classes are provided at each phase. Previous works mainly focus on mitigating forgetting in phases after the initial one. However, we find that improving CIL at its initial phase is also a promising direction. Specifically, we experimentally show that directly encouraging the CIL learner at the initial phase to output representations similar to those of the model jointly trained on all classes can greatly boost CIL performance. Motivated by this, we study the difference between a na\"ively-trained initial-phase model and the oracle model. Specifically, since one major difference between these two models is the number of training classes, we investigate how this difference affects the model representations. We find that, with fewer training classes, the data representations of each class lie in a long and narrow region; with more training classes, the representations of each class scatter more uniformly. Inspired by this observation, we propose Class-wise Decorrelation (CwD), which effectively regularizes the representations of each class to scatter more uniformly, thus mimicking the model jointly trained with all classes (i.e., the oracle model). Our CwD is simple to implement and easy to plug into existing methods. Extensive experiments on various benchmark datasets show that CwD consistently and significantly improves the performance of existing state-of-the-art methods by around 1\% to 3\%. Code will be released.
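A plausible minimal rendering of such a decorrelation penalty in PyTorch (an illustration of the idea rather than necessarily the paper's exact loss): for each class in a batch, standardize its representations, form the feature correlation matrix, and penalize its squared entries so that the representations scatter more uniformly. In training, such a penalty would be added to the classification loss with a small coefficient.

import torch

def cwd_penalty(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-class decorrelation: penalize squared entries of each class's
    feature correlation matrix (illustrative; not necessarily the exact CwD loss)."""
    penalty = features.new_zeros(())
    classes = labels.unique()
    for c in classes:
        z = features[labels == c]
        if z.shape[0] < 2:                        # need >= 2 samples per class
            continue
        z = (z - z.mean(dim=0)) / (z.std(dim=0) + 1e-8)
        corr = (z.T @ z) / (z.shape[0] - 1)       # d x d correlation matrix
        penalty = penalty + corr.pow(2).mean()
    return penalty / len(classes)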