A very classical problem in statistics is to test the stochastic superiority of one distribution over another. However, many existing approaches were developed for independent samples and, moreover, do not take censored data into account. We develop a new estimand-driven method to compare the effectiveness of two treatments in the context of right-censored survival data with matched pairs. With the help of competing risks techniques, the so-called relative treatment effect is estimated. It quantifies the probability that the individual undergoing the first treatment outlives the matched individual undergoing the second treatment. Hypothesis tests and confidence intervals are based on a studentized version of the estimator, with resampling-based inference established by means of a randomization method. In a simulation study, the developed test exhibits good power compared to competitors that actually test the simpler null hypothesis of equality of the two marginal survival functions. Finally, we apply the methodology to a well-known benchmark data set from a trial with patients suffering from diabetic retinopathy.
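As a stylized illustration of the estimand only (not the authors' censoring-aware estimator), the relative treatment effect can be computed directly when all paired survival times are fully observed; the sample sizes and exponential distributions below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matched pairs of survival times (no censoring here).
t1 = rng.exponential(scale=2.0, size=10_000)  # treatment 1
t2 = rng.exponential(scale=1.0, size=10_000)  # treatment 2

# Relative treatment effect: probability that the patient under
# treatment 1 outlives the matched patient under treatment 2
# (ties would count as 1/2; they have probability zero here).
theta_hat = np.mean((t1 > t2) + 0.5 * (t1 == t2))
```

For these exponential rates the true value is 2/3; censoring is exactly what makes this naive average unusable in practice and motivates the competing-risks construction.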
We consider the problem of estimating the roughness of the volatility in a stochastic volatility model that arises as a nonlinear function of fractional Brownian motion with drift. To this end, we introduce a new estimator that measures the so-called roughness exponent of a continuous trajectory, based on discrete observations of its antiderivative. We provide conditions on the underlying trajectory under which our estimator converges in a strictly pathwise sense. Then we verify that these conditions are satisfied by almost every sample path of fractional Brownian motion (with drift). As a consequence, we obtain strong consistency theorems in the context of a large class of rough volatility models. Numerical simulations show that our estimation procedure performs well after passing to a scale-invariant modification of our estimator.
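A generic scaling-based estimate of a path's roughness (not the paper's antiderivative-based estimator) can be sketched on Brownian motion, whose roughness exponent is known to be 1/2; grid size and lags are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample path of Brownian motion on [0, 1]; its roughness exponent is 1/2.
n = 2**18
x = np.cumsum(rng.normal(scale=n**-0.5, size=n))

def mean_increment(path, lag):
    # Mean absolute increment at a given lag; scales like lag**H.
    return np.mean(np.abs(path[lag:] - path[:-lag]))

# Estimate the exponent from the scaling between lags 1 and 2.
h_hat = np.log2(mean_increment(x, 2) / mean_increment(x, 1))
```

The point of the paper's setting is that the volatility path itself is not observed, only (a proxy for) its antiderivative, which rules out this direct increment-based approach.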
Despite its importance for insurance, there is almost no literature on statistical hail damage modeling. Statistical models for hailstorms exist (though they are generally not open-source), but no study appears to have developed a stochastic hail impact function. In this paper, we use hail-related insurance claim data to build a Gaussian line process with extreme marks that models both the geographical footprint of a hailstorm and the damage hailstones can cause to buildings. We build a model for the claim counts and claim values and compare it to a benchmark deterministic hail impact function. Our model proves better than the benchmark at capturing hail spatial patterns and allows for localized and extreme damage, as seen in the insurance data. The evaluation of both the claim count and claim value predictions shows improved performance over the benchmark, especially for extreme damage. Our model appears to be the first to provide realistic estimates of hail damage to individual buildings.
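A toy sketch of the ingredients (a random storm track, distance-decaying hit probability, heavy-tailed claim marks) can convey the idea; all region sizes, decay scales, and the Pareto tail index below are invented, and this is not the authors' fitted model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical portfolio: 5000 buildings over a 10 km x 10 km region.
buildings = rng.uniform(0, 10, size=(5000, 2))

# Storm track: a random line through the region (point + direction).
p0 = rng.uniform(0, 10, size=2)
angle = rng.uniform(0, np.pi)
normal = np.array([-np.sin(angle), np.cos(angle)])  # unit normal to track
dist = np.abs((buildings - p0) @ normal)            # distance to the track

# Claim counts: hail intensity decays with distance from the track.
hit = rng.uniform(size=len(buildings)) < np.exp(-0.5 * (dist / 0.8) ** 2)

# Claim values: heavy-tailed (Pareto) marks permit localized extremes.
claims = 1.0 + rng.pareto(2.5, size=hit.sum())
```

The heavy-tailed marks are what allow occasional extreme per-building damage of the kind a deterministic impact function cannot reproduce.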
Large datasets are often affected by cell-wise outliers in the form of missing or erroneous data. However, discarding every sample containing outliers may leave a dataset too small to estimate the covariance matrix accurately. Moreover, most robust procedures designed to address this problem are ineffective on high-dimensional data, as they rely crucially on the invertibility of the covariance operator. In this paper, we propose an unbiased estimator of the covariance in the presence of missing values that requires no imputation step and still achieves minimax statistical accuracy with respect to the operator norm. We also advocate its use in combination with cell-wise outlier detection methods to tackle cell-wise contamination in a high-dimensional, low-rank setting, where state-of-the-art methods may suffer from numerical instability and long computation times. To complement our theoretical findings, we conduct an experimental study demonstrating the superiority of our approach over the state of the art in both low- and high-dimensional settings.
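One standard instance of imputation-free debiasing, under the simplifying assumption of zero-mean data and a known missing-completely-at-random observation probability, rescales the Gram matrix of the zero-filled data; the paper's estimator need not coincide with this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, delta = 20_000, 5, 0.7  # samples, dimension, observation probability

# Zero-mean data with known covariance, then an MCAR mask (missing -> 0).
a = rng.normal(size=(p, p))
sigma = a @ a.T / p
x = rng.multivariate_normal(np.zeros(p), sigma, size=n)
y = np.where(rng.uniform(size=(n, p)) < delta, x, 0.0)

# Debiasing: an off-diagonal pair is jointly observed with probability
# delta**2, a diagonal entry with probability delta, so rescaling the
# Gram matrix accordingly gives an unbiased estimate with no imputation.
g = y.T @ y / n
sigma_hat = g / delta**2
np.fill_diagonal(sigma_hat, np.diag(g) / delta)
```

No missing cell is ever filled in; the correction acts only on the aggregated second moments.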
A new information-theoretic condition is presented for reconstructing a discrete random variable $X$ from the knowledge of a set of discrete functions of $X$. The reconstruction condition is derived from Shannon's 1953 lattice theory together with two entropic metrics, due to Shannon and Rajski. Because this theoretical material is relatively little known and dispersed across different references, we first provide a synthetic description (with complete proofs) of its concepts, such as total, common and complementary information. Definitions and properties of the two entropic metrics are also fully detailed and shown to be compatible with the lattice structure. A new geometric interpretation of this lattice structure is then investigated, leading to a necessary (and sometimes sufficient) condition for reconstructing the discrete random variable $X$ given a set $\{ X_1,\ldots,X_{n} \}$ of elements in the lattice generated by $X$. Finally, this condition is illustrated in five specific examples of perfect reconstruction problems: reconstruction of a symmetric random variable from the knowledge of its sign and absolute value, reconstruction of a word from a set of linear combinations, reconstruction of an integer from its prime signature (fundamental theorem of arithmetic) and from its remainders modulo a set of coprime integers (Chinese remainder theorem), and reconstruction of the sorting permutation of a list from a minimal set of pairwise comparisons.
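The Chinese-remainder example admits a compact concrete form: an integer below the product of pairwise coprime moduli is perfectly reconstructed from its remainders. A minimal sketch:

```python
from math import prod

def crt(remainders, moduli):
    """Reconstruct x modulo prod(moduli) from x mod m_i, m_i coprime."""
    m = prod(moduli)
    x = 0
    for r, mi in zip(remainders, moduli):
        ni = m // mi
        x += r * ni * pow(ni, -1, mi)  # modular inverse (Python >= 3.8)
    return x % m

# x = 23 is determined below 3 * 5 * 7 = 105 by its three remainders.
assert crt([23 % 3, 23 % 5, 23 % 7], [3, 5, 7]) == 23
```

In the lattice language of the abstract, the remainder variables jointly carry the total information of $X$ on the given range, which is exactly the reconstruction condition at work.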
We present a rigorous and precise analysis of the maximum degree and the average degree in a dynamic duplication-divergence graph model introduced by Sol\'e, Pastor-Satorras et al., in which the graph grows according to a duplication-divergence mechanism, i.e., by iteratively creating a copy of some node and then randomly altering the neighborhood of the new node with probability $p$. This model captures the growth of some real-world processes, e.g., biological or social networks. In this paper, we prove that for some $0 < p < 1$ the maximum degree and the average degree of a duplication-divergence graph on $t$ vertices are asymptotically concentrated with high probability around $t^p$ and $\max\{t^{2 p - 1}, 1\}$, respectively, i.e., they are within at most a polylogarithmic factor of these values with probability at least $1 - t^{-A}$ for any constant $A > 0$.
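A simplified pure-duplication variant of the mechanism (copy a node, retain each copied edge independently with probability $p$; divergence details of the original model are omitted) is easy to simulate; for $t = 2000$ and $p = 0.7$ the theorem's scales are $t^p \approx 204$ and $t^{2p-1} \approx 21$ up to polylogarithmic factors.

```python
import numpy as np

rng = np.random.default_rng(4)

def duplication_divergence(t, p):
    """Grow a graph to t nodes: copy a random node, keep each edge w.p. p."""
    adj = [{1}, {0}]  # start from a single edge
    while len(adj) < t:
        v = int(rng.integers(len(adj)))
        kept = {u for u in adj[v] if rng.uniform() < p}
        w = len(adj)
        adj.append(kept)
        for u in kept:
            adj[u].add(w)
    return adj

adj = duplication_divergence(2000, 0.7)
avg_deg = sum(len(nbrs) for nbrs in adj) / len(adj)
max_deg = max(len(nbrs) for nbrs in adj)
```

Edges are never deleted once created, so the degree sequence grows with the graph, consistent with the polynomial concentration scales above.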
This article studies the average conditioning of a random underdetermined polynomial system. The moments of the condition number are compared to the moments of the condition number of random matrices. An expression for these moments is obtained by studying the kernel-finding problem for random matrices. Furthermore, the second moment of the Frobenius condition number is computed.
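The random-matrix side of the comparison can be probed numerically; the Frobenius condition number of a rectangular matrix is taken here as $\|A\|_F \|A^{+}\|_F$ with the Moore-Penrose pseudo-inverse, and the $4 \times 8$ Gaussian ensemble is an arbitrary underdetermined example, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(5)

def cond_frobenius(a):
    # kappa_F(A) = ||A||_F * ||A^+||_F, computed from singular values.
    s = np.linalg.svd(a, compute_uv=False)
    return np.sqrt(np.sum(s**2) * np.sum(s**-2.0))

# Monte Carlo second moment over underdetermined 4 x 8 Gaussian matrices.
samples = [cond_frobenius(rng.normal(size=(4, 8))) for _ in range(2000)]
second_moment = np.mean(np.square(samples))
```

By Cauchy-Schwarz, $\kappa_F \geq 4$ for a rank-4 matrix, so the simulated second moment is at least 16.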
This note presents a refined local approximation for the logarithm of the ratio between the negative multinomial probability mass function and a multivariate normal density, both having the same mean-covariance structure. This approximation, which is derived using Stirling's formula and a meticulous treatment of Taylor expansions, yields an upper bound on the Hellinger distance between the jittered negative multinomial distribution and the corresponding multivariate normal distribution. Upper bounds on the Le Cam distance between negative multinomial and multivariate normal experiments ensue.
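The flavor of such a local approximation can be checked numerically in the univariate special case (a negative binomial, i.e. a negative multinomial with one category) against the normal density with matched mean and variance; the parameters below are arbitrary.

```python
import math

# Univariate case: negative binomial NB(r, p) (failures before r-th success)
# versus the normal density with the same mean and variance.
r, p = 50, 0.4
mean = r * (1 - p) / p      # 75.0
var = r * (1 - p) / p**2    # 187.5

def nb_pmf(k):
    return math.comb(k + r - 1, k) * p**r * (1 - p) ** k

def normal_pdf(t):
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Near the mean, the log-ratio of pmf to density is already small.
log_ratio = math.log(nb_pmf(75) / normal_pdf(75.0))
```

The note's contribution is a refined expansion of exactly this log-ratio, uniformly enough to control the Hellinger distance after jittering.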
Particle methods based on evolving the spatial derivatives of the solution were originally introduced to simulate reaction-diffusion processes, inspired by vortex methods for the Navier--Stokes equations. Such methods, referred to as gradient random walk methods, were extensively studied in the '90s and have several interesting features, such as being grid free, automatically adapting to the solution by concentrating elements where the gradient is large, and significantly reducing the variance of the standard random walk approach. In this work, we revive these ideas by showing how to generalize the approach to a larger class of partial differential equations, including hyperbolic systems of conservation laws. To achieve this goal, we first extend the classical Monte Carlo method to the relaxation approximation of systems of conservation laws, and subsequently consider a novel particle dynamics based on the spatial derivatives of the solution. The methodology, combined with an asymptotic-preserving splitting discretization, yields a way to construct a new class of gradient-based Monte Carlo methods for hyperbolic systems of conservation laws. Several results in one spatial dimension for scalar equations and systems of conservation laws show that the new methods are very promising and yield remarkable improvements over standard Monte Carlo approaches, both in terms of variance reduction and in describing the shock structure.
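The classical random walk starting point referenced above (not the gradient-based scheme of the paper) is easily sketched for the heat equation $u_t = D u_{xx}$: particles of equal mass take independent Brownian steps, and their empirical density tracks the solution. Parameters below are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Random-walk particle method for the heat equation u_t = D * u_xx:
# each particle carries equal mass and takes independent Brownian steps.
D, dt, steps = 0.1, 0.01, 100
x = rng.normal(scale=0.5, size=200_000)  # samples from u(., 0) = N(0, 0.25)

for _ in range(steps):
    x = x + rng.normal(scale=np.sqrt(2 * D * dt), size=x.size)

# The exact solution stays Gaussian, with variance 0.25 + 2*D*t at time t.
var_exact = 0.25 + 2 * D * dt * steps  # here t = 1
```

Gradient random walk methods instead attach particles to the derivative $u_x$, which concentrates computational elements where the solution varies sharply and lowers the variance of such estimates.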
Computing the marginal likelihood (also called the Bayesian model evidence) is an important task in Bayesian model selection, providing a principled quantitative way to compare models. The learned harmonic mean estimator solves the exploding variance problem of the original harmonic mean estimator of the marginal likelihood. The learned harmonic mean estimator learns an importance sampling target distribution that approximates the optimal distribution. While the approximation need not be highly accurate, it is critical that the probability mass of the learned distribution is contained within the posterior in order to avoid the exploding variance problem. In previous work a bespoke optimization problem is introduced when training models in order to ensure this property is satisfied. In the current article we introduce the use of normalizing flows to represent the importance sampling target distribution. A flow-based model is trained on samples from the posterior by maximum likelihood estimation. Then, the probability density of the flow is concentrated by lowering the variance of the base distribution, i.e. by lowering its "temperature", ensuring its probability mass is contained within the posterior. This approach avoids the need for a bespoke optimization problem and careful fine-tuning of parameters, resulting in a more robust method. Moreover, the use of normalizing flows has the potential to scale to high dimensional settings. We present preliminary experiments demonstrating the effectiveness of the use of flows for the learned harmonic mean estimator. The harmonic code implementing the learned harmonic mean, which is publicly available, has been updated to support normalizing flows.
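The harmonic mean identity and the role of "cooling" can be shown in a conjugate toy model where everything is available in closed form; here the cooled posterior stands in for the temperature-lowered flow, and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Conjugate toy model: prior N(0, 1), likelihood y ~ N(theta, 1), datum y = 1.
# The posterior is N(1/2, 1/2) and the evidence z = N(y; 0, 2) is known.
y = 1.0
z_true = np.exp(-y**2 / 4) / np.sqrt(4 * np.pi)

def normal_pdf(t, mu, var):
    return np.exp(-0.5 * (t - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

theta = rng.normal(0.5, np.sqrt(0.5), size=200_000)  # posterior samples

# Importance target: the posterior "cooled" to temperature T < 1, i.e. a
# narrower density whose mass sits inside the posterior (in closed form
# here; the flow version lowers the base distribution's variance instead).
T = 0.8
phi = normal_pdf(theta, 0.5, 0.5 * T)

# Harmonic mean identity: E_posterior[ phi / (likelihood * prior) ] = 1 / z.
inv_z = np.mean(phi / (normal_pdf(y, theta, 1.0) * normal_pdf(theta, 0.0, 1.0)))
z_hat = 1.0 / inv_z
```

With $T \geq 1$ the importance weights can have heavy tails, which is precisely the exploding variance problem the containment condition prevents.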
We used survival analysis to quantify the impact of postdischarge evaluation and management (E/M) services in preventing hospital readmission or death. Our approach avoids a specific pitfall of applying machine learning to this problem, namely an inflated estimate of the effect of interventions due to survivorship bias, whose magnitude may be conditional on heterogeneous confounders in the population. This bias arises simply because, in order to receive an intervention after discharge, a person must not have been readmitted in the intervening period. After deriving an expression for this phantom effect, we controlled for it and other biases within an inherently interpretable Bayesian survival framework. We identified case management services as the most impactful for reducing readmissions overall.
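The phantom effect is easy to reproduce in a null simulation (invented numbers, exponential readmission times): even when the visit does nothing, the visited group looks dramatically better simply because receiving the visit requires surviving until it happens.

```python
import numpy as np

rng = np.random.default_rng(8)

# Null world: a post-discharge visit has NO effect on time to readmission.
n = 100_000
t_readmit = rng.exponential(scale=30.0, size=n)  # days to readmission
t_visit = rng.uniform(0, 60, size=n)             # scheduled visit day

# A patient can only receive the visit if not yet readmitted by then.
visited = t_readmit > t_visit

# The naive group comparison shows a large spurious "benefit": the visited
# group was selected to have survived longer in the intervening period.
naive_gap = t_readmit[visited].mean() - t_readmit[~visited].mean()
```

A survival framework treats the visit as a time-varying exposure, which removes this artifact by conditioning on having survived to the time of the intervention.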