In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. We prove that, if the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, because this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and, thus, the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first estimator of the data manifold dimension based on diffusion models, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.
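A minimal sketch of how such an estimator could be implemented, assuming access to a trained score model `score_fn(x, t)` that returns the approximate score at corruption level `t`; the sampling scheme and the variance threshold are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def estimate_intrinsic_dimension(x0, score_fn, t=1e-3, n_samples=1000, var_threshold=0.99):
    """Estimate the intrinsic dimension of the data manifold near x0.

    Perturb x0 with a small amount of Gaussian noise, evaluate the learned score
    at the perturbed points, and inspect the singular values of the stacked score
    vectors: at low corruption the scores concentrate in the normal space of the
    manifold, so the number of dominant singular directions approximates the
    codimension, and the remainder is the intrinsic (tangent) dimension.
    """
    d_ambient = x0.size
    # n_samples should exceed d_ambient so the score matrix can reach full rank.
    xs = x0[None, :] + np.sqrt(t) * np.random.randn(n_samples, d_ambient)
    scores = np.stack([score_fn(x, t) for x in xs])          # (n_samples, d_ambient)
    s = np.linalg.svd(scores, compute_uv=False)              # singular values, descending
    energy = np.cumsum(s**2) / np.sum(s**2)
    codim = int(np.searchsorted(energy, var_threshold)) + 1  # dominant (normal) directions
    return d_ambient - codim
```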
We introduce a new method for estimating parameters in semiparametric and nonparametric models. The method is based on estimating equations that are $U$-statistics in the observations. The $U$-statistics are based on higher-order influence functions, which extend ordinary linear influence functions of the parameter of interest and represent higher derivatives of this parameter. For parameters for which the representation cannot be perfect, the method leads to a bias-variance trade-off and results in estimators that converge at a slower than $\sqrt n$ rate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at the $\sqrt n$ rate, but we also consider efficient $\sqrt n$-estimation using novel nonlinear estimators. The general approach is applied in detail to the example of estimating a mean response when the response is not always observed.
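Schematically, a second-order version of such an estimator augments the usual one-step correction with a degenerate second-order $U$-statistic (the notation below is generic rather than the paper's):
$$
\hat\psi \;=\; \psi(\hat\eta) \;+\; \frac{1}{n}\sum_{i=1}^{n} \mathrm{IF}_1(O_i;\hat\eta) \;+\; \frac{1}{n(n-1)}\sum_{i \neq j} \mathrm{IF}_2(O_i,O_j;\hat\eta),
$$
where $\mathrm{IF}_1$ is the ordinary linear influence function, $\mathrm{IF}_2$ is a second-order influence function targeting the quadratic remainder of the plug-in bias, and the nuisance estimate $\hat\eta$ is computed on a separate part of the sample. Truncating such an expansion when no exact higher-order influence function exists is what produces the bias-variance trade-off mentioned above.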
Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and noise level $\sigma^2$, i.e., to quantify how much of the variation in the response variable can be explained by the covariates, versus how much of the variation is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given, with no response $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, while linear regression is used, no sparsity assumptions on the coefficients are made. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate it. In the second framework, such optimality no longer holds, and we suggest two zero-estimators that improve the naive estimator, although not necessarily optimally. Furthermore, we show that our approach reduces the variance of general initial estimators, and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
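To illustrate the zero-estimator idea, the correction has the structure of a control variate: subtract from an initial estimator a mean-zero statistic of the unlabeled covariates, scaled by a coefficient chosen to minimize the resulting variance. The sketch below is a generic version of this step; the replicate-based estimate of the coefficient and the function name are illustrative, not the thesis's exact construction:

```python
import numpy as np

def zero_estimator_correction(initial_estimate, initial_reps, zero_reps, zero_value):
    """Improve an initial estimator using a zero-estimator (control-variate style).

    initial_estimate : the initial estimate computed on the labeled sample
    initial_reps     : replicates of the initial estimator (e.g., from sample splits)
    zero_reps        : matching replicates of the zero-estimator
    zero_value       : the zero-estimator on the full unlabeled sample, i.e. a
                       statistic of X alone whose expectation is exactly zero

    The coefficient c* = Cov(initial, zero) / Var(zero) minimizes the variance of
    initial - c * zero, so the correction removes the part of the initial
    estimator's variability that the unlabeled covariates can explain.
    """
    c_star = np.cov(initial_reps, zero_reps)[0, 1] / np.var(zero_reps, ddof=1)
    return initial_estimate - c_star * zero_value
```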
We study the graph alignment problem over two independent Erd\H{o}s-R\'enyi graphs on $n$ vertices, with edge density $p$ falling into two regimes separated by the critical window around $p_c=\sqrt{\log n/n}$. Our result reveals an algorithmic phase transition for this random optimization problem: polynomial-time approximation schemes exist in the sparse regime, while a statistical-computational gap emerges in the dense regime. Additionally, we establish a sharp transition in the performance of online algorithms for this problem when $p$ lies in the dense regime, resulting in a multiplicative gap of $\sqrt{8/9}$ between the solutions achievable online and the optimal ones.
Despite the progress in medical data collection, the actual burden of SARS-CoV-2 remains unknown due to under-ascertainment of cases. This was particularly apparent in the acute phase of the pandemic, and reported deaths have been pointed out as a more reliable source of information, likely less prone to under-reporting. Since daily deaths arise from past infections weighted by their probability of death, one may use the data on reported deaths, accounting for the age distribution of infections, to infer the total number of infections. We adopt this framework and assume that the dynamics generating the total number of infections can be described by a continuous-time transmission model expressed through a system of non-linear ordinary differential equations, where the transmission rate is modelled as a diffusion process, allowing us to reveal both the effect of control strategies and changes in individual behaviour. We develop this flexible Bayesian tool in Stan and study three pairs of European countries, estimating the time-varying reproduction number ($R_t$) as well as the true cumulative number of infected individuals. Because we estimate the true number of infections, we offer a more accurate estimate of $R_t$. We also provide an estimate of the daily reporting ratio and discuss the effects of changes in mobility and testing on the inferred quantities.
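A minimal forward-simulation sketch of the generative structure described above: an SIR-type compartmental model whose log transmission rate follows a Gaussian random walk (an Euler discretization of a diffusion), with expected deaths obtained by thinning past infections by an infection fatality ratio and convolving with an infection-to-death delay. All parameter values and the delay distribution are placeholders, not the fitted model, which is implemented in Stan.

```python
import numpy as np

def simulate_expected_deaths(T=200, N=1e7, ifr=0.01, gamma=1/7, sigma=0.05, seed=0):
    """Forward-simulate expected daily deaths from an SIR-type model in which the
    log transmission rate follows a Gaussian random walk (Euler-discretised diffusion)."""
    rng = np.random.default_rng(seed)
    S, I = N - 100.0, 100.0
    log_beta = np.log(0.3)                           # initial transmission rate (placeholder)
    new_inf, R_t = np.zeros(T), np.zeros(T)
    for t in range(T):
        log_beta += sigma * rng.standard_normal()    # diffusion on the transmission rate
        beta = np.exp(log_beta)
        inf_t = beta * S * I / N                     # new infections on day t
        S, I = S - inf_t, I + inf_t - gamma * I
        new_inf[t] = inf_t
        R_t[t] = beta / gamma * S / N                # time-varying reproduction number
    # discretised infection-to-death delay (placeholder bell-shaped distribution)
    delay = np.exp(-0.5 * ((np.arange(60) - 20.0) / 8.0) ** 2)
    delay /= delay.sum()
    expected_deaths = ifr * np.convolve(new_inf, delay)[:T]
    return new_inf, expected_deaths, R_t
```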
Image noise is ubiquitous in photography. However, image noise is neither compressible nor desirable, so attempting to convey the noise in compressed image bitstreams yields sub-par results in both rate and distortion. We propose to explicitly learn the image denoising task when training a codec. To this end, we leverage the Natural Image Noise Dataset, which offers a wide variety of scenes captured at various ISO settings and therefore exhibiting different noise levels, including insignificant ones. Given this training set, we supervise the codec with noisy-clean image pairs and show that a single model trained on a mixture of images with variable noise levels appears to yield best-in-class results on both noisy and clean images, achieving better rate-distortion than a compression-only model, or even a pair of denoising-then-compression models, with almost one order of magnitude fewer GMac operations.
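The supervision described above amounts to a rate-distortion loss in which the codec is fed the noisy image while the distortion is measured against the clean one. A minimal PyTorch-style training step, with `codec` standing in for any learned compression model that returns a reconstruction and an estimated rate in bits per pixel; both the module and its interface are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def joint_denoise_compress_step(codec, optimizer, noisy, clean, lmbda=0.01):
    """One training step: feed the noisy image to the codec, but measure the
    distortion against the clean image, plus a rate penalty."""
    optimizer.zero_grad()
    x_hat, bpp = codec(noisy)                  # reconstruction and estimated rate (bits/pixel)
    distortion = F.mse_loss(x_hat, clean)      # supervise against the clean target
    loss = distortion + lmbda * bpp
    loss.backward()
    optimizer.step()
    return loss.item()
```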
We consider the estimation of a factor-model-based variance-covariance matrix when the factor loading matrix is assumed to be sparse. To do so, we rely on a system of penalized estimating functions that accounts for the identification issue of the factor loading matrix while fostering sparsity in potentially all of its entries. We prove the oracle property of the penalized estimator for the factor model when the dimension is fixed: the penalization procedure can recover the true sparse support, and the estimator is asymptotically normally distributed. Consistency and recovery of the true zero entries are also established when the number of parameters diverges. These theoretical results are supported by simulation experiments, and the relevance of the proposed method is illustrated by an application to portfolio allocation.
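Schematically, writing the factor-model covariance as $\Sigma(\theta) = \Lambda\Lambda^{\top} + \Psi$ with the loadings $\Lambda$ collected in $\theta$, a penalized system of estimating equations takes the generic form (notation illustrative, not the paper's exact system):
$$
g_n(\theta) \;-\; n\, p'_{\lambda}\bigl(|\theta|\bigr)\,\operatorname{sign}(\theta) \;=\; 0,
$$
where $g_n$ stacks the unpenalized estimating functions together with constraints resolving the rotational indeterminacy of $\Lambda$, and $p_\lambda$ is a sparsity-inducing penalty (e.g., SCAD or $\ell_1$) applied entrywise to the loadings.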
This paper reviews the state of the art on parameter correlation in two-parameter models. The apparent contradictions among authors regarding the ability of D-optimality to simultaneously reduce the correlation and the area of the confidence ellipse in two-parameter models are analyzed. Two main approaches were found: 1) those who consider that optimality criteria alone simultaneously control the precision and the correlation of the parameter estimators; and 2) those who consider a combination of criteria to achieve the same objective. An analytical criterion is provided that combines in its structure both the precision of the parameter estimators and the reduction of the correlation between them. The criterion was tested both in a simple linear regression model, considering all possible design spaces, and in a non-linear model with strongly correlated parameter estimators (Michaelis--Menten) to show its performance. The proposed criterion outperformed the existing strategies and criteria at simultaneously controlling the precision and the correlation.
Matching and weighting methods for observational studies involve the choice of an estimand, the causal effect with reference to a specific target population. Commonly used estimands include the average treatment effect in the treated (ATT), the average treatment effect in the untreated (ATU), the average treatment effect in the population (ATE), and the average treatment effect in the overlap (i.e., equipoise population; ATO). Each estimand has its own assumptions, interpretation, and statistical methods that can be used to estimate it. This article provides guidance on selecting and interpreting an estimand to help medical researchers correctly implement statistical methods used to estimate causal effects in observational studies and to help audiences correctly interpret the results and limitations of these studies. The interpretations of the estimands resulting from regression and instrumental variable analyses are also discussed. Choosing an estimand carefully is essential for making valid inferences from the analysis of observational data and ensuring results are replicable and useful for practitioners.
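For concreteness, these estimands correspond to different balancing weights as functions of the propensity score $e(x) = P(\text{treated} \mid X = x)$. The sketch below lists the standard weight formulas; it illustrates the estimands themselves rather than any specific method discussed in the article.

```python
import numpy as np

def balancing_weights(treated, e, estimand="ATE"):
    """Standard propensity-score weights for common estimands.

    treated : 0/1 treatment indicator
    e       : estimated propensity score P(treated = 1 | X)
    """
    t = np.asarray(treated, dtype=float)
    if estimand == "ATE":      # average effect in the whole population
        return t / e + (1 - t) / (1 - e)
    if estimand == "ATT":      # average effect in the treated
        return t + (1 - t) * e / (1 - e)
    if estimand == "ATU":      # average effect in the untreated
        return t * (1 - e) / e + (1 - t)
    if estimand == "ATO":      # average effect in the overlap (equipoise) population
        return t * (1 - e) + (1 - t) * e
    raise ValueError(f"unknown estimand: {estimand}")
```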
The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points. Yet, they can be inaccurate under high-dimensional noise, especially if the noise magnitude varies considerably across the data, e.g., under heteroskedasticity or outliers. In this work, we investigate a more robust alternative -- the doubly stochastic normalization of the Gaussian kernel. We consider a setting where points are sampled from an unknown density on a low-dimensional manifold embedded in high-dimensional space and corrupted by possibly strong, non-identically distributed, sub-Gaussian noise. We establish that the doubly stochastic affinity matrix and its scaling factors concentrate around certain population forms, and provide corresponding finite-sample probabilistic error bounds. We then utilize these results to develop several tools for robust inference under general high-dimensional noise. First, we derive a robust density estimator that reliably infers the underlying sampling density and can substantially outperform the standard kernel density estimator under heteroskedasticity and outliers. Second, we obtain estimators for the pointwise noise magnitudes, the pointwise signal magnitudes, and the pairwise Euclidean distances between clean data points. Lastly, we derive robust graph Laplacian normalizations that accurately approximate various manifold Laplacians, including the Laplace-Beltrami operator, improving over traditional normalizations in noisy settings. We exemplify our results in simulations and on real single-cell RNA-sequencing data. For the latter, we show that in contrast to traditional methods, our approach is robust to variability in technical noise levels across cell types.
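A minimal sketch of the doubly stochastic normalization via symmetric Sinkhorn scaling of a Gaussian kernel; the bandwidth, damping scheme, and stopping rule are arbitrary illustrative choices:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def doubly_stochastic_kernel(X, eps, n_iter=500, tol=1e-8):
    """Symmetric Sinkhorn scaling: find d > 0 such that
    W = diag(d) K diag(d) has unit row (and column) sums,
    where K_ij = exp(-||x_i - x_j||^2 / eps)."""
    K = np.exp(-squareform(pdist(X, "sqeuclidean")) / eps)
    d = np.ones(len(X))
    for _ in range(n_iter):
        d_new = 1.0 / (K @ d)
        d_new = np.sqrt(d * d_new)   # damped (geometric-mean) update for a symmetric kernel
        if np.max(np.abs(d_new - d)) < tol:
            d = d_new
            break
        d = d_new
    W = d[:, None] * K * d[None, :]
    return W, d
```

At the fixed point, $W = \operatorname{diag}(d)\,K\,\operatorname{diag}(d)$ has unit row and column sums, and $d$ are the scaling factors whose concentration is analyzed above.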
This paper focuses on the expected difference in a borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence their estimation error can be substantial. We therefore propose an alternative approach to constructing the estimators such that the error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the classical and the proposed estimators in terms of their ability to estimate the causal quantities. The comparison is carried out across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under simulated datasets that exhibit different levels of causality, degrees of nonlinearity, and distributional properties. Most importantly, we apply our approach to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are correctly accounted for.