The robust Poisson model is becoming increasingly popular for estimating the association of exposures with a binary outcome. Unlike the logistic regression model, the robust Poisson model yields results that can be interpreted as risk or prevalence ratios. In addition, it does not suffer from the frequent non-convergence problems of the log-binomial model. However, using a Poisson distribution to model a binary outcome may seem counterintuitive. Methodological papers have often presented it as a good approximation to the more natural binomial distribution. In this paper, we provide an alternative perspective on the robust Poisson model based on semiparametric theory. This perspective highlights that the robust Poisson model does not require assuming a Poisson distribution for the outcome. In fact, the model can be seen as making no assumption on the distribution of the outcome; only a log-linear relationship between the risk/prevalence of the outcome and the explanatory variables is assumed. This assumption and the consequences of its violation are discussed. Suggestions to reduce the risk of violating the modeling assumption are also provided.
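To make the estimator concrete, here is a minimal sketch assuming the statsmodels GLM API: a Poisson working model for a binary outcome combined with a robust (sandwich) covariance. The simulated data and variable names are purely illustrative, not taken from the paper.

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary outcome with a single binary exposure (illustrative data).
rng = np.random.default_rng(0)
n = 5000
exposure = rng.binomial(1, 0.4, n)
risk = 0.10 * np.exp(np.log(1.5) * exposure)   # true risk ratio = 1.5
y = rng.binomial(1, risk)

X = sm.add_constant(exposure)

# "Robust Poisson": Poisson working model + sandwich (HC) standard errors.
# No Poisson distribution is assumed for y; only the log-linear risk model.
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")

risk_ratio = np.exp(fit.params[1])           # exp(coefficient) = risk ratio
ci_low, ci_high = np.exp(fit.conf_int()[1])  # Wald CI on the ratio scale
print(risk_ratio, ci_low, ci_high)
```

Exponentiating the exposure coefficient yields the risk ratio, and the sandwich covariance protects the standard errors against the deliberately misspecified Poisson variance.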
While in recent years a number of new statistical approaches have been proposed to model group differences under different assumptions about the measurement invariance of the instruments, tools for detecting local misspecification of these models have not yet been fully developed. In this study, we present a novel approach using a Deep Neural Network (DNN). We compared the proposed model with the most popular traditional methods: Modification Indices (MI) and Expected Parameter Change (EPC) indicators from Confirmatory Factor Analysis (CFA) modeling, logistic DIF detection, and the sequential procedure introduced with the CFA alignment approach. Simulation studies show that the proposed method outperformed the traditional methods in almost all scenarios, or was at least as accurate as the best of them. We also provide an empirical example utilizing European Social Survey data, including items known to be mistranslated, which are correctly identified by the presented DNN approach.
Modern deep reinforcement learning (RL) algorithms are motivated by either the general policy improvement (GPI) or trust-region learning (TRL) frameworks. However, algorithms that strictly respect these theoretical frameworks have proven unscalable. Surprisingly, the only known scalable algorithms violate the GPI/TRL assumptions, e.g., due to required regularisation or other heuristics. The current explanation of their empirical success is essentially "by analogy": they are deemed approximate adaptations of theoretically sound methods. Unfortunately, studies have shown that in practice these algorithms differ greatly from their conceptual ancestors. In contrast, in this paper we introduce a novel theoretical framework, named Mirror Learning, which provides theoretical guarantees to a large class of algorithms, including TRPO and PPO. While the latter two exploit the flexibility of our framework, GPI and TRL fit in merely as pathologically restrictive corner cases thereof. This suggests that the empirical performance of state-of-the-art methods is a direct consequence of their theoretical properties, rather than of the aforementioned approximate analogies. Mirror Learning sets us free to boldly explore novel, theoretically sound RL algorithms, a thus far uncharted wonderland.
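To fix ideas, a drift-regularised policy update of the kind such a framework covers can be written schematically as follows (the notation here is ours and only schematic, not taken from the abstract):

```latex
\pi_{k+1} \in \operatorname*{arg\,max}_{\pi \in \mathcal{N}(\pi_k)}
  \mathbb{E}_{s \sim \beta_{\pi_k}}\!\left[
    \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ A^{\pi_k}(s, a) \right]
    - \mathfrak{D}_{\pi_k}(\pi \mid s)
  \right],
```

where the drift $\mathfrak{D}$ is a nonnegative penalty that vanishes at $\pi = \pi_k$ and $\mathcal{N}(\pi_k)$ is a neighbourhood of the current policy. Roughly, TRPO's KL trust region and PPO's clipping can be read as particular choices of these two objects, while GPI and TRL correspond to the restrictive corner cases mentioned above.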
When the distribution of treatment effect modifiers differs between the trial sample and the target population, inverse probability of sampling weighting (IPSW) can be applied to obtain an unbiased estimate of the population average treatment effect in the target population. The statistical validity of IPSW is threatened when there are missing data in the target population, including potential missingness in the trial sample. However, missing data methods have not been adequately discussed in the current literature. We conducted a set of simulation studies to determine how to apply multiple imputation (MI) in the context of IPSW. We specifically addressed questions such as which variables to include in the imputation model and whether they should come from the trial or non-trial portion of the target population. Based on our findings, we recommend including, as main effects in the imputation model, all potential effect modifiers and the trial indicator from both the trial and non-trial populations, as well as the treatment and outcome variables from the trial sample. We also illustrate these ideas by transporting findings from the Frequent Hemodialysis Network (FHN) Daily Trial to the United States Renal Data System (USRDS) population.
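As a rough illustration of the recommended workflow (not the authors' exact implementation), the sketch below imputes the stacked trial/non-trial data with the recommended variables entered as main effects, and then applies IPSW to the trial sample. The file name, column names, and the single-imputation shortcut are assumptions made for the example; a real analysis would generate and pool multiple imputations.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

# df stacks the trial sample and the target (non-trial) population.
# Assumed columns: effect modifiers x1..x3 (possibly missing), trial indicator
# 'S', and treatment 'A' / outcome 'Y' observed only in the trial (S == 1).
df = pd.read_csv("stacked_population.csv")  # hypothetical file

# Impute with effect modifiers, trial indicator, treatment and outcome all
# entered as main effects (one imputation shown for brevity).
imp_cols = ["x1", "x2", "x3", "S", "A", "Y"]
imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df[imp_cols]),
    columns=imp_cols,
)

# IPSW: model P(S = 1 | effect modifiers) and weight trial subjects by the
# inverse odds of trial participation (targeting the non-trial population).
mods = imputed[["x1", "x2", "x3"]]
ps_model = LogisticRegression(max_iter=1000).fit(mods, imputed["S"])
p_trial = ps_model.predict_proba(mods)[:, 1]
trial = imputed["S"] == 1
w = (1 - p_trial[trial]) / p_trial[trial]

# Weighted difference in means among trial subjects as the transported effect.
treated = imputed.loc[trial, "A"] == 1
y_trial = imputed.loc[trial, "Y"]
ate = (np.average(y_trial[treated], weights=w[treated])
       - np.average(y_trial[~treated], weights=w[~treated]))
print(ate)
```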
Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally robust optimization (DRO), attempt to address this problem without introducing additional harms. We conduct a large-scale empirical study of DRO and several variations of standard learning procedures to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations compared to standard approaches for learning predictive models from electronic health records data. In the course of our evaluation, we introduce an extension to DRO approaches that allows for specification of the metric used to assess worst-case performance. We conduct the analysis for models that predict in-hospital mortality, prolonged length of stay, and 30-day readmission for inpatient admissions, and predict in-hospital mortality using intensive care data. We find that, with relatively few exceptions, no approach performs better, for each patient subpopulation examined, than standard learning procedures using the entire training dataset. These results imply that when it is of interest to improve model performance for patient subpopulations beyond what can be achieved with standard practices, it may be necessary to do so via data collection techniques that increase the effective sample size or reduce the level of noise in the prediction problem.
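For readers unfamiliar with DRO over subpopulations, the sketch below shows a standard group-DRO-style update (exponentiated-gradient group weights, in the spirit of Sagawa et al.) in which the per-group objective is a pluggable choice, here the average log loss. It illustrates the general idea of optimizing worst-case subgroup performance, not the specific extension introduced in the paper.

```python
import torch
import torch.nn as nn

def group_dro_step(model, opt, xb, yb, gb, q, eta_q=0.1):
    """One group-DRO update: upweight groups with the worst value of the
    chosen per-group objective (here the average log loss)."""
    logits = model(xb).squeeze(-1)
    losses = nn.functional.binary_cross_entropy_with_logits(
        logits, yb.float(), reduction="none")
    n_groups = q.numel()
    group_loss = torch.stack([
        losses[gb == g].mean() if (gb == g).any()
        else torch.zeros((), device=losses.device)
        for g in range(n_groups)])
    # Exponentiated-gradient update of the group weights toward the worst group.
    with torch.no_grad():
        q *= torch.exp(eta_q * group_loss)
        q /= q.sum()
    robust_loss = (q * group_loss).sum()
    opt.zero_grad()
    robust_loss.backward()
    opt.step()
    return robust_loss.item()

# Usage (illustrative): a linear model over d features and k subgroups.
d, k = 20, 4
model = nn.Linear(d, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
q = torch.ones(k) / k
xb, yb, gb = torch.randn(128, d), torch.randint(0, 2, (128,)), torch.randint(0, k, (128,))
group_dro_step(model, opt, xb, yb, gb, q)
```

Other worst-case metrics (or differentiable surrogates of them) could be substituted for the per-group log loss in the same loop.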
This paper studies Quasi Maximum Likelihood estimation of Dynamic Factor Models for large panels of time series. Specifically, we consider the case in which the autocorrelation of the factors is explicitly accounted for, and therefore the model has a state-space form. Estimation of the factors and their loadings is implemented through the Expectation Maximization (EM) algorithm, jointly with the Kalman smoother. We prove that as both the dimension of the panel $n$ and the sample size $T$ diverge to infinity, up to logarithmic terms: (i) the estimated loadings are $\sqrt T$-consistent and asymptotically normal if $\sqrt T/n\to 0$; (ii) the estimated factors are $\sqrt n$-consistent and asymptotically normal if $\sqrt n/T\to 0$; (iii) the estimated common component is $\min(\sqrt n,\sqrt T)$-consistent and asymptotically normal regardless of the relative rate of divergence of $n$ and $T$. Although the model is estimated as if the idiosyncratic terms were cross-sectionally and serially uncorrelated and normally distributed, we show that these misspecifications do not affect consistency. Moreover, the estimated loadings are asymptotically as efficient as those obtained with the Principal Components estimator, while the estimated factors are more efficient if the idiosyncratic covariance is sparse enough. We then propose robust estimators of the asymptotic covariances, which can be used to conduct inference on the loadings and to compute confidence intervals for the factors and common components. Finally, we study the performance of our estimators and compare them with the traditional Principal Components approach through Monte Carlo simulations and an analysis of US macroeconomic data.
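EM-based quasi-maximum-likelihood estimation of a state-space dynamic factor model of this kind is available off the shelf; the sketch below fits statsmodels' DynamicFactorMQ to simulated data purely as an illustration (the data, settings, and package choice are assumptions, not those of the paper).

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.dynamic_factor_mq import DynamicFactorMQ

# Simulate a small panel driven by one AR(1) factor (illustrative only).
rng = np.random.default_rng(0)
T, n = 200, 10
fac = np.zeros(T)
for t in range(1, T):
    fac[t] = 0.7 * fac[t - 1] + rng.normal()
loadings = rng.normal(size=n)
panel = pd.DataFrame(np.outer(fac, loadings) + rng.normal(size=(T, n)),
                     columns=[f"y{i}" for i in range(n)],
                     index=pd.period_range("2000-01", periods=T, freq="M"))

# QML estimation via the EM algorithm, with the Kalman smoother in the E-step.
model = DynamicFactorMQ(panel, factors=1, factor_orders=1,
                        idiosyncratic_ar1=False, standardize=True)
res = model.fit(disp=False)

print(res.summary())
factors_smoothed = res.factors.smoothed  # smoothed factor estimates
```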
We present a Bayesian nonparametric model for conditional distribution estimation using Bayesian additive regression trees (BART). The generative model we use is based on rejection sampling from a base model. As is typical of BART models, our model is flexible, has a default prior specification, and is computationally convenient. To address the distinguished role of the response in our proposed BART model, we further introduce an approach to targeted smoothing that may be of independent interest for BART models. We study the proposed model theoretically and provide sufficient conditions for the posterior distribution to concentrate at close to the minimax-optimal rate, adaptively over smoothness classes, in the high-dimensional regime in which many predictors are irrelevant. To fit our model, we propose a data augmentation algorithm that allows existing BART samplers to be extended with minimal effort. We illustrate the performance of our methodology on simulated data and use it to study the relationship between education and body mass index using data from the Medical Expenditure Panel Survey (MEPS).
Conventional speech spoofing countermeasures (CMs) are designed to make a binary decision on an input trial. However, a CM trained on a closed-set database is theoretically not guaranteed to perform well on unknown spoofing attacks. In some scenarios, an alternative strategy is to let the CM defer a decision when it is not confident. The question is then how to estimate a CM's confidence regarding an input trial. We investigated a few confidence estimators that can be easily plugged into a CM. On the ASVspoof2019 logical access database, the results demonstrate that an energy-based estimator and a neural-network-based one achieved acceptable performance in identifying unknown attacks in the test set. On a test set with additional unknown attacks and bona fide trials from other databases, the confidence estimators performed moderately well, and the CMs better discriminated between bona fide and spoofed trials that had a high confidence score. Additional results also revealed the difficulty of enhancing a confidence estimator by adding unknown attacks to the training set.
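As an illustration of what an energy-based confidence estimator looks like, the snippet below computes a negative-free-energy score from a CM's output logits and defers the decision when the score falls below a threshold; the two-class setup and the threshold value are illustrative assumptions, not details from the paper.

```python
import torch

def energy_confidence(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Negative free energy of the logits: higher values mean more confident.

    logits: (batch, n_classes) raw CM outputs (e.g. bona fide vs. spoof).
    """
    return temperature * torch.logsumexp(logits / temperature, dim=-1)

# Usage: defer the decision when confidence is below a threshold tuned on
# development data (the threshold value here is arbitrary).
logits = torch.randn(8, 2)            # stand-in for CM output scores
conf = energy_confidence(logits)
decisions = logits.argmax(dim=-1)
deferred = conf < 0.5
print(decisions, deferred)
```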
We propose a new wavelet-based method for density estimation when the data are size-biased. More specifically, we consider a power of the density of interest, where this power exceeds 1/2. Warped wavelet bases are employed, where warping is attained by some continuous cumulative distribution function. This can be seen as a general framework in which conventional orthonormal wavelet estimation is the special case where the warping distribution is the standard uniform c.d.f. We show that both linear and nonlinear wavelet estimators are consistent, with optimal and/or near-optimal rates. Monte Carlo simulations are performed to compare four special settings that are easy to interpret in practice. An application to a real dataset on fatal traffic accidents involving alcohol illustrates the method. We observe that warped bases provide more flexible and superior estimates for both simulated and real data. Moreover, we find that estimating a power of the density (for instance, its square root) further improves the results.
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy but only have access to a dataset generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample upper bound on the policy gradient estimation error that is governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of the FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal with a matching Cramér-Rao lower bound. Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization, using either softmax tabular or ReLU policy networks. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques.
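For orientation, the sketch below shows a much-simplified version of this setting: off-policy evaluation with a linear (here tabular one-hot) value class fitted by LSTD-Q, followed by a plain policy-gradient estimate for a softmax policy under the dataset's state distribution. It is not the paper's double-fitted FPG estimator; the MDP, dataset, and policy are all simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)       # pi[s, a]

# Off-policy dataset (s, a, r, s') collected by a uniform behaviour policy
# on a random MDP (all of this is illustrative).
P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s, a, s']
R = rng.normal(size=(S, A))
N = 20000
s = rng.integers(S, size=N)
a = rng.integers(A, size=N)
r = R[s, a] + 0.1 * rng.normal(size=N)
s_next = np.array([rng.choice(S, p=P[si, ai]) for si, ai in zip(s, a)])

theta = rng.normal(size=(S, A))                   # target softmax policy
pi = softmax_policy(theta)

# Fitted Q-evaluation with one-hot (tabular) features via LSTD-Q:
# solve  M w = b  with  M = E[phi (phi - gamma * E_pi phi')^T],  b = E[phi r].
d = S * A
phi = np.zeros((N, d)); phi[np.arange(N), s * A + a] = 1.0
phi_next = np.zeros((N, d))
for a2 in range(A):
    phi_next[np.arange(N), s_next * A + a2] += pi[s_next, a2]
M = phi.T @ (phi - gamma * phi_next) / N
b = phi.T @ r / N
q = np.linalg.solve(M + 1e-6 * np.eye(d), b).reshape(S, A)

# Naive off-policy gradient estimate: average Q(s, a) * grad log pi(a | s)
# over dataset states, with actions re-drawn from the target policy.
grad = np.zeros_like(theta)
for si in s:
    for ai in range(A):
        glog = -pi[si]; glog[ai] += 1.0           # grad_theta[si] log pi(ai | si)
        grad[si] += pi[si, ai] * q[si, ai] * glog
grad /= N
print(grad)
```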
Training datasets for machine learning often have some form of missingness. For example, to learn a model for deciding whom to give a loan, the available training data includes individuals who were given a loan in the past, but not those who were not. This missingness, if ignored, nullifies any fairness guarantee of the training procedure when the model is deployed. Using causal graphs, we characterize the missingness mechanisms in different real-world scenarios. We show conditions under which various distributions used in popular fairness algorithms can or cannot be recovered from the training data. Our theoretical results imply that many of these algorithms cannot guarantee fairness in practice. Modeling missingness also helps to identify correct design principles for fair algorithms. For example, in multi-stage settings where decisions are made in multiple screening rounds, we use our framework to derive the minimal distributions required to design a fair algorithm. Our proposed algorithm decentralizes the decision-making process and still achieves performance similar to that of the optimal algorithm, which requires centralization and non-recoverable distributions.