We present a new finite-sample analysis of M-estimators of locations in $\mathbb{R}^d$ using the tool of the influence function. In particular, we show that the deviations of an M-estimator can be controlled thanks to its influence function (or its score function) and then, we use concentration inequality on M-estimators to investigate the robust estimation of the mean in high dimension in a corrupted setting (adversarial corruption setting) for bounded and unbounded score functions. For a sample of size $n$ and covariance matrix $\Sigma$, we attain the minimax speed $\sqrt{Tr(\Sigma)/n}+\sqrt{\|\Sigma\|_{op}\log(1/\delta)/n}$ with probability larger than $1-\delta$ in a heavy-tailed setting. One of the major advantages of our approach compared to others recently proposed is that our estimator is tractable and fast to compute even in very high dimension with a complexity of $O(nd\log(Tr(\Sigma)))$ where $n$ is the sample size and $\Sigma$ is the covariance matrix of the inliers. In practice, the code that we make available for this article proves to be very fast.
The asymptotic behaviour of Linear Spectral Statistics (LSS) of the smoothed periodogram estimator of the spectral coherency matrix of a complex Gaussian high-dimensional time series $(\y_n)_{n \in \mathbb{Z}}$ with independent components is studied under the asymptotic regime where the sample size $N$ converges towards $+\infty$ while the dimension $M$ of $\y$ and the smoothing span of the estimator grow to infinity at the same rate in such a way that $\frac{M}{N} \rightarrow 0$. It is established that, at each frequency, the estimated spectral coherency matrix is close from the sample covariance matrix of an independent identically $\mathcal{N}_{\mathbb{C}}(0,\I_M)$ distributed sequence, and that its empirical eigenvalue distribution converges towards the Marcenko-Pastur distribution. This allows to conclude that each LSS has a deterministic behaviour that can be evaluated explicitly. Using concentration inequalities, it is shown that the order of magnitude of the supremum over the frequencies of the deviation of each LSS from its deterministic approximation is of the order of $\frac{1}{M} + \frac{\sqrt{M}}{N}+ (\frac{M}{N})^{3}$ where $N$ is the sample size. Numerical simulations supports our results.
Bayesian inference allows to obtain useful information on the parameters of models, either in computational statistics or more recently in the context of Bayesian Neural Networks. The computational cost of usual Monte Carlo methods for sampling a posteriori laws in Bayesian inference scales linearly with the number of data points. One option to reduce it to a fraction of this cost is to resort to mini-batching in conjunction with unadjusted discretizations of Langevin dynamics, in which case only a random fraction of the data is used to estimate the gradient. However, this leads to an additional noise in the dynamics and hence a bias on the invariant measure which is sampled by the Markov chain. We advocate using the so-called Adaptive Langevin dynamics, which is a modification of standard inertial Langevin dynamics with a dynamical friction which automatically corrects for the increased noise arising from mini-batching. We investigate the practical relevance of the assumptions underpinning Adaptive Langevin (constant covariance for the estimation of the gradient), which are not satisfied in typical models of Bayesian inference, and quantify the bias induced by minibatching in this case. We also show how to extend AdL in order to systematically reduce the bias on the posterior distribution by considering a dynamical friction depending on the current value of the parameter to sample.
In this work, we focus our attention on the study of the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a theoretical and empirical analysis as to why different properties of the data distribution can contribute to regulating sources of algorithmic instability. First, we revisit theoretical bounds on the performance of approximate dynamic programming algorithms. Second, we provide a novel four-state MDP that highlights the impact of the data distribution in the performance of a Q-learning algorithm with function approximation, both in online and offline settings. Finally, we experimentally assess the impact of the data distribution properties in the performance of an offline deep Q-network algorithm. Our results show that: (i) the data distribution needs to possess certain properties in order to robustly learn in an offline setting, namely low distance to the distributions induced by optimal policies of the MDP and high coverage over the state-action space; and (ii) high entropy data distributions can contribute to mitigating sources of algorithmic instability.
Density ratio estimation (DRE) is a fundamental machine learning technique for comparing two probability distributions. However, existing methods struggle in high-dimensional settings, as it is difficult to accurately compare probability distributions based on finite samples. In this work we propose DRE-\infty, a divide-and-conquer approach to reduce DRE to a series of easier subproblems. Inspired by Monte Carlo methods, we smoothly interpolate between the two distributions via an infinite continuum of intermediate bridge distributions. We then estimate the instantaneous rate of change of the bridge distributions indexed by time (the "time score") -- a quantity defined analogously to data (Stein) scores -- with a novel time score matching objective. Crucially, the learned time scores can then be integrated to compute the desired density ratio. In addition, we show that traditional (Stein) scores can be used to obtain integration paths that connect regions of high density in both distributions, improving performance in practice. Empirically, we demonstrate that our approach performs well on downstream tasks such as mutual information estimation and energy-based modeling on complex, high-dimensional datasets.
It is highly desirable to know how uncertain a model's predictions are, especially for models that are complex and hard to understand as in deep learning. Although there has been a growing interest in using deep learning methods in diffusion-weighted MRI, prior works have not addressed the issue of model uncertainty. Here, we propose a deep learning method to estimate the diffusion tensor and compute the estimation uncertainty. Data-dependent uncertainty is computed directly by the network and learned via loss attenuation. Model uncertainty is computed using Monte Carlo dropout. We also propose a new method for evaluating the quality of predicted uncertainties. We compare the new method with the standard least-squares tensor estimation and bootstrap-based uncertainty computation techniques. Our experiments show that when the number of measurements is small the deep learning method is more accurate and its uncertainty predictions are better calibrated than the standard methods. We show that the estimation uncertainties computed by the new method can highlight the model's biases, detect domain shift, and reflect the strength of noise in the measurements. Our study shows the importance and practical value of modeling prediction uncertainties in deep learning-based diffusion MRI analysis.
In this short note, we provide a sample complexity lower bound for learning linear predictors with respect to the squared loss. Our focus is on an agnostic setting, where no assumptions are made on the data distribution. This contrasts with standard results in the literature, which either make distributional assumptions, refer to specific parameter settings, or use other performance measures.
In this paper, we investigate the construction of polar codes by Gaussian approximation (GA) and develop an approach based on piecewise Gaussian approximation (PGA). In particular, with the piecewise approach we obtain a function that replaces the original GA function with a more accurate approximation, which results in significant gain in performance. The proposed PGA construction of polar codes is presented in its integral form as well as an alternative approximation that does not rely on the integral form. Simulations results show that the proposed PGA construction outperforms the standard GA for several examples of polar codes and rates.
In this paper, we consider the problem of determining the presence of a given signal in a high-dimensional observation with unknown covariance matrix by using an adaptive matched filter. Traditionally such filters are formed from the sample covariance matrix of some given training data, but, as is well-known, the performance of such filters is poor when the number of training data $n$ is not much larger than the data dimension $p$. We thus seek a covariance estimator to replace sample covariance. To account for the fact that $n$ and $p$ may be of comparable size, we adopt the "large-dimensional asymptotic model" in which $n$ and $p$ go to infinity in a fixed ratio. Under this assumption, we identify a covariance estimator that is asymptotically detection-theoretic optimal within a general shrinkage class inspired by C. Stein, and we give consistent estimates for conditional false-alarm and detection rate of the corresponding adaptive matched filter.
Recent advances in the theoretical understanding of SGD led to a formula for the optimal batch size minimizing the number of effective data passes, i.e., the number of iterations times the batch size. However, this formula is of no practical value as it depends on the knowledge of the variance of the stochastic gradients evaluated at the optimum. In this paper we design a practical SGD method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions. Our method does this provably, and in our experiments with synthetic and real data robustly exhibits nearly optimal behaviour; that is, it works as if the optimal batch size was known a-priori. Further, we generalize our method to several new batch strategies not considered in the literature before, including a sampling suitable for distributed implementations.
We propose \textit{average Localisation-Recall-Precision} (aLRP), a unified, bounded, balanced and ranking-based loss function for both classification and localisation tasks in object detection. aLRP extends the Localisation-Recall-Precision (LRP) performance metric (Oksuz et al., 2018) inspired from how Average Precision (AP) Loss extends precision to a ranking-based loss function for classification (Chen et al., 2020). aLRP has the following distinct advantages: (i) aLRP is the first ranking-based loss function for both classification and localisation tasks. (ii) Thanks to using ranking for both tasks, aLRP naturally enforces high-quality localisation for high-precision classification. (iii) aLRP provides provable balance between positives and negatives. (iv) Compared to on average $\sim$6 hyperparameters in the loss functions of state-of-the-art detectors, aLRP Loss has only one hyperparameter, which we did not tune in practice. On the COCO dataset, aLRP Loss improves its ranking-based predecessor, AP Loss, up to around $5$ AP points, achieves $48.9$ AP without test time augmentation and outperforms all one-stage detectors. Code available at: //github.com/kemaloksuz/aLRPLoss .