Measurements are generally collected as unilateral or bilateral data in clinical trials or observational studies. For example, in ophthalmology studies, the primary outcome is often obtained from one eye or both eyes of an individual. In medical studies, the relative risk is usually the parameter of interest and is commonly used. In this article, we develop three confidence intervals for the relative risk for combined unilateral and bilateral correlated data under the equal dependence assumption. The proposed confidence intervals are based on maximum likelihood estimates of parameters derived using the Fisher scoring method. Simulation studies are conducted to evaluate the performance of proposed confidence intervals with respect to the empirical coverage probability, the mean interval width, and the ratio of mesial non-coverage probability to the distal non-coverage probability. We also compare the proposed methods with the confidence interval based on the method of variance estimates recovery and the confidence interval obtained from the modified Poisson regression model with correlated binary data. We recommend the score confidence interval for general applications because it best controls converge probabilities at the 95% level with reasonable mean interval width. We illustrate the methods with a real-world example.
This paper introduces a Laplace approximation to Bayesian inference in Dirichlet regression models, which can be used to analyze a set of variables on a simplex exhibiting skewness and heteroscedasticity, without having to transform the data. These data, which mainly consist of proportions or percentages of disjoint categories, are widely known as compositional data and are common in areas such as ecology, geology, and psychology. We provide both the theoretical foundations and a description of how Laplace approximation can be implemented in the case of Dirichlet regression. The paper also introduces the package dirinla in the R-language that extends the R-INLA package, which can not deal directly with Dirichlet likelihoods. Simulation studies are presented to validate the good behaviour of the proposed method, while a real data case-study is used to show how this approach can be applied.
We propose two-stage and sequential procedures to construct prescribed proportional closeness confidence intervals for the unknown parameter N of a binomial distribution with unknown parameter p, when we reinforce data with an independent sample of a negative-binomial experiment having the same p
To draw real-world evidence about the comparative effectiveness of multiple time-varying treatment regimens on patient survival, we develop a joint marginal structural proportional hazards model and novel weighting schemes in continuous time to account for time-varying confounding and censoring. Our methods formulate complex longitudinal treatments with multiple ``start/stop'' switches as the recurrent events with discontinuous intervals of treatment eligibility. We derive the weights in continuous time to handle a complex longitudinal dataset on its own terms, without the need to discretize or artificially align the measurement times. We further propose using machine learning models designed for censored survival data with time-varying covariates and the kernel function estimator of the baseline intensity to efficiently estimate the continuous-time weights. Our simulations demonstrate that the proposed methods provide better bias reduction and nominal coverage probability when analyzing observational longitudinal survival data with irregularly spaced time intervals, compared to conventional methods that require aligned measurement time points. We apply the proposed methods to a large-scale COVID-19 dataset to estimate the causal effects of several COVID-19 treatment strategies on in-hospital mortality or ICU admission, and provide new insights relative to findings from randomized trials.
Let $X$ be a random variable with unknown mean and finite variance. We present a new estimator of the mean of $X$ that is robust with respect to the possible presence of outliers in the sample, provides tight sub-Gaussian deviation guarantees without any additional assumptions on the shape or tails of the distribution, and moreover is asymptotically efficient. This is the first estimator that provably combines all these qualities in one package. Our construction is inspired by robustness properties possessed by the self-normalized sums. Theoretical findings are supplemented by numerical simulations highlighting strong performance of the proposed estimator in comparison with previously known techniques.
We develop a new method to estimate an ARMA model in the presence of big time series data. Using the concept of a rolling average, we develop a new efficient algorithm, called Rollage, to estimate the order of an AR model and subsequently fit the model. When used in conjunction with an existing methodology, specifically Durbin's algorithm, we show that our proposed method can be used as a criterion to optimally fit ARMA models. Empirical results on large-scale synthetic time series data support the theoretical results and reveal the efficacy of this new approach, especially when compared to existing methodology.
Variance estimation is important for statistical inference. It becomes non-trivial when observations are masked by serial dependence structures and time-varying mean structures. Existing methods either ignore or sub-optimally handle these nuisance structures. This paper develops a general framework for the estimation of the long-run variance for time series with non-constant means. The building blocks are difference statistics. The proposed class of estimators is general enough to cover many existing estimators. Necessary and sufficient conditions for consistency are investigated. The first asymptotically optimal estimator is derived. Our proposed estimator is theoretically proven to be invariant to arbitrary mean structures, which may include trends and a possibly divergent number of discontinuities.
The one-fifth success rule is one of the best-known and most widely accepted techniques to control the parameters of evolutionary algorithms. While it is often applied in the literal sense, a common interpretation sees the one-fifth success rule as a family of success-based updated rules that are determined by an update strength $F$ and a success rate. We analyze in this work how the performance of the (1+1) Evolutionary Algorithm on LeadingOnes depends on these two hyper-parameters. Our main result shows that the best performance is obtained for small update strengths $F=1+o(1)$ and success rate $1/e$. We also prove that the running time obtained by this parameter setting is, apart from lower order terms, the same that is achieved with the best fitness-dependent mutation rate. We show similar results for the resampling variant of the (1+1) Evolutionary Algorithm, which enforces to flip at least one bit per iteration.
I construct and justify confidence intervals for longitudinal causal parameters estimated with machine learning. Longitudinal parameters include long term, dynamic, and mediated effects. I provide a nonasymptotic theorem for any longitudinal causal parameter estimated with any machine learning algorithm that satisfies a few simple, interpretable conditions. The main result encompasses local parameters defined for specific demographics as well as proximal parameters defined in the presence of unobserved confounding. Formally, I prove consistency, Gaussian approximation, and semiparametric efficiency. The rate of convergence is $n^{-1/2}$ for global parameters, and it degrades gracefully for local parameters. I articulate a simple set of conditions to translate mean square rates into statistical inference. A key feature of the main result is a new multiple robustness to ill posedness for proximal causal inference in longitudinal settings.
Despite the state-of-the-art performance for medical image segmentation, deep convolutional neural networks (CNNs) have rarely provided uncertainty estimations regarding their segmentation outputs, e.g., model (epistemic) and image-based (aleatoric) uncertainties. In this work, we analyze these different types of uncertainties for CNN-based 2D and 3D medical image segmentation tasks. We additionally propose a test-time augmentation-based aleatoric uncertainty to analyze the effect of different transformations of the input image on the segmentation output. Test-time augmentation has been previously used to improve segmentation accuracy, yet not been formulated in a consistent mathematical framework. Hence, we also propose a theoretical formulation of test-time augmentation, where a distribution of the prediction is estimated by Monte Carlo simulation with prior distributions of parameters in an image acquisition model that involves image transformations and noise. We compare and combine our proposed aleatoric uncertainty with model uncertainty. Experiments with segmentation of fetal brains and brain tumors from 2D and 3D Magnetic Resonance Images (MRI) showed that 1) the test-time augmentation-based aleatoric uncertainty provides a better uncertainty estimation than calculating the test-time dropout-based model uncertainty alone and helps to reduce overconfident incorrect predictions, and 2) our test-time augmentation outperforms a single-prediction baseline and dropout-based multiple predictions.
From only positive (P) and unlabeled (U) data, a binary classifier could be trained with PU learning, in which the state of the art is unbiased PU learning. However, if its model is very flexible, empirical risks on training data will go negative, and we will suffer from serious overfitting. In this paper, we propose a non-negative risk estimator for PU learning: when getting minimized, it is more robust against overfitting, and thus we are able to use very flexible models (such as deep neural networks) given limited P data. Moreover, we analyze the bias, consistency, and mean-squared-error reduction of the proposed risk estimator, and bound the estimation error of the resulting empirical risk minimizer. Experiments demonstrate that our risk estimator fixes the overfitting problem of its unbiased counterparts.