This paper is concerned with the lossy compression of general random variables, specifically with rate-distortion theory and quantization of random variables taking values in general measurable spaces such as manifolds and fractal sets. Manifold structures are prevalent in data science, e.g., in compressed sensing, machine learning, image processing, and handwritten digit recognition. Fractal sets find application in image compression and in the modeling of Ethernet traffic. Our main contributions are bounds on the rate-distortion function and the quantization error. These bounds are very general and essentially only require the existence of reference measures satisfying certain regularity conditions in terms of small ball probabilities. To illustrate the wide applicability of our results, we particularize them to random variables taking values in i) manifolds, namely, hyperspheres and Grassmannians, and ii) self-similar sets characterized by iterated function systems satisfying the weak separation property.
Semi-definite programs represent a frontier of efficient computation. While there has been much progress on semi-definite optimization, with moderate-sized instances currently solvable in practice by the interior-point method, the basic problem of sampling semi-definite solutions remains a formidable challenge. The direct application of known polynomial-time algorithms for sampling general convex bodies to semi-definite sampling leads to a prohibitively high running time. In addition, known general methods require an expensive rounding phase as pre-processing. Here we analyze the Dikin walk, by first adapting it to general metrics, then devising suitable metrics for the PSD cone with affine constraints. The resulting mixing time and per-step complexity are considerably smaller, and by an appropriate choice of the metric, the dependence on the number of constraints can be made polylogarithmic. We introduce a refined notion of self-concordant matrix functions and give rules for combining different metrics. Along the way, we further develop the theory of interior-point methods for sampling.
Representation learning plays a crucial role in automated feature selection, particularly in the context of high-dimensional data, where non-parametric methods often struggle. In this study, we focus on supervised learning scenarios where the pertinent information resides within a lower-dimensional linear subspace of the data, namely the multi-index model. If this subspace were known, it would greatly enhance prediction, computation, and interpretation. To address this challenge, we propose a novel method for linear feature learning with non-parametric prediction, which simultaneously estimates the prediction function and the linear subspace. Our approach employs empirical risk minimisation, augmented with a penalty on function derivatives, ensuring versatility. Leveraging the orthogonality and rotation invariance properties of Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising alternating minimisation, we iteratively rotate the data to improve alignment with leading directions and accurately estimate the relevant dimension in practical settings. We establish that our method yields a consistent estimator of the prediction function with explicit rates. Additionally, we provide empirical results demonstrating the performance of RegFeaL in various experiments.
Functional sliced inverse regression (FSIR) is one of the most popular algorithms for functional sufficient dimension reduction (FSDR). However, the choice of slice scheme in FSIR is critical but challenging. In this paper, we propose a new method called functional slicing-free inverse regression (FSFIR) to estimate the central subspace in FSDR. FSFIR is based on the martingale difference divergence operator, which is a novel metric introduced to characterize the conditional mean independence of a functional predictor on a multivariate response. We also provide a specific convergence rate for the FSFIR estimator. Compared with existing functional sliced inverse regression methods, FSFIR does not require the selection of a slice number. Simulations demonstrate the efficiency and convenience of FSFIR.
There has been debate on whether the hazard function should be used for causal inference in time-to-event studies. The main criticism is that there is selection bias because the risk sets beyond the first event time consist of subsets of survivors who are no longer balanced in the risk factors, even in the absence of unmeasured confounding, measurement error, and model misspecification. In this short communication we use the potential outcomes framework and the single-world intervention graph to show that there is in fact no selection bias when estimating the average treatment effect, and that the hazard ratio over time can provide a useful interpretation in practical settings.
Convergence rate analyses of random walk Metropolis-Hastings Markov chains on general state spaces have largely focused on establishing sufficient conditions for geometric ergodicity or on analysis of mixing times. Geometric ergodicity is a key sufficient condition for the Markov chain Central Limit Theorem and allows rigorous approaches to assessing Monte Carlo error. The sufficient conditions for geometric ergodicity of the random walk Metropolis-Hastings Markov chain are refined and extended, which allows the analysis of previously inaccessible settings such as Bayesian Poisson regression. The key technical innovation is the development of explicit drift and minorization conditions for random walk Metropolis-Hastings, which allows explicit upper bounds on the geometric rate of convergence. Further, lower bounds on the geometric rate of convergence are also developed using spectral theory. The existing sufficient conditions for geometric ergodicity, to date, have not provided explicit constraints on the geometric rate of convergence because the method used only implies the existence of drift and minorization conditions. The theoretical results are applied to random walk Metropolis-Hastings algorithms for a class of exponential families and generalized linear models that address Bayesian regression problems.
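For readers unfamiliar with the sampler whose convergence is analysed above, the following is a minimal sketch of a one-dimensional random walk Metropolis-Hastings chain with Gaussian proposals. It is an illustration only: the function name `rwmh`, the standard normal target, the chain length, and the step size are assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np

def rwmh(log_target, x0, n_steps, step_size, seed=0):
    """Random walk Metropolis-Hastings in one dimension (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    log_p = log_target(x)
    samples = np.empty(n_steps)
    accepted = 0
    for i in range(n_steps):
        # Symmetric Gaussian proposal centred at the current state.
        proposal = x + step_size * rng.standard_normal()
        log_p_prop = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if np.log(rng.uniform()) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
            accepted += 1
        samples[i] = x
    return samples, accepted / n_steps

# Assumed target for the example: standard normal, up to an additive constant.
samples, acc_rate = rwmh(lambda x: -0.5 * x * x, x0=0.0,
                         n_steps=20000, step_size=2.4)
print(samples.mean(), samples.var(), acc_rate)
```

The step size controls the trade-off between acceptance rate and the size of each move; the geometric rate of convergence discussed in the abstract depends on such tuning choices as well as on the target.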
Perfect synchronization in distributed machine learning problems is inefficient and even impossible due to the existence of latency, packet losses and stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), where each device performs local computation and communication at its own pace without any form of synchronization. Different from existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across devices and allow for packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. More importantly, the proposed method utilizes two spanning-tree graphs for communication so long as both share at least one common root, enabling flexible designs in communication architectures. We show that R-FAST converges in expectation to a neighborhood of the optimum with a geometric rate for smooth and strongly convex objectives; and to a stationary point with a sublinear rate for general non-convex settings. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and outperforms existing asynchronous state-of-the-art algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.
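As background for the gradient tracking idea mentioned above, here is a minimal sketch of the classical synchronous, deterministic gradient tracking iteration on a toy problem. It is not R-FAST: there is no asynchrony, no packet loss, no buffering, and no spanning-tree communication. The local objectives f_i(x) = 0.5 (x - b_i)^2, the four-node ring mixing matrix `W`, and the step size are all assumptions made for illustration.

```python
import numpy as np

def gradient_tracking(b, W, step, n_iters):
    """Synchronous gradient tracking for local losses f_i(x) = 0.5 * (x - b_i)**2.

    x[i] is node i's estimate; y[i] tracks the network-average gradient.
    """
    x = np.zeros_like(b)
    grad = x - b            # local gradients at the current estimates
    y = grad.copy()         # tracker initialised with the local gradients
    for _ in range(n_iters):
        x_new = W @ x - step * y          # mix with neighbours, then descend
        grad_new = x_new - b
        y = W @ y + grad_new - grad       # mix trackers, add gradient change
        x, grad = x_new, grad_new
    return x

# Heterogeneous local data: each node has a different minimiser b_i.
b = np.array([1.0, 2.0, 3.0, 6.0])
# Doubly stochastic mixing matrix for a ring of four nodes (assumed topology).
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
x_final = gradient_tracking(b, W, step=0.1, n_iters=500)
print(x_final, b.mean())
```

In this toy setting every node's estimate approaches the minimiser of the average objective, the mean of the b_i, even though the local minimisers differ; this is the sense in which tracking removes the effect of data heterogeneity.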
We study partially linear models in settings where observations are arranged in independent groups but may exhibit within-group dependence. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximising a (restricted) likelihood from random effects modelling or by using generalised estimating equations. We introduce a new 'sandwich loss' whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in linear parameter estimation accuracy when they are not. Under relatively mild conditions, our estimated coefficients are asymptotically Gaussian and enjoy minimal variance among estimators with weights restricted to a given class of functions, when user-chosen regression methods are used to estimate nuisance functions. We further expand the class of functional forms for the weights that may be fitted beyond parametric models by leveraging the flexibility of modern machine learning methods within a new gradient boosting scheme for minimising the sandwich loss. We demonstrate the effectiveness of both the sandwich loss and what we call 'sandwich boosting' in a variety of settings with simulated and real-world data.
In observational studies, unobserved confounding is a major barrier in isolating the average causal effect (ACE). In these scenarios, two main approaches are often used: confounder adjustment for causality (CAC) and instrumental variable analysis for causation (IVAC). Nevertheless, both are subject to untestable assumptions and, therefore, it may be unclear in which assumption violation scenarios one method is superior in terms of mitigating inconsistency for the ACE. Although general guidelines exist, direct theoretical comparisons of the trade-offs between the CAC and IVAC assumptions are limited. Using ordinary least squares (OLS) for CAC and two-stage least squares (2SLS) for IVAC, we analytically compare the relative inconsistency for the ACE of each approach under a variety of assumption violation scenarios and discuss rules of thumb for practice. Additionally, a sensitivity framework is proposed to guide analysts in determining which approach may result in less inconsistency for estimating the ACE with a given dataset. We demonstrate our findings both through simulation and through an application examining whether maternal stress during pregnancy affects a neonate's birthweight. The implications of our findings for causal inference practice are discussed, providing guidance for analysts in judging whether CAC or IVAC may be more appropriate for a given situation.
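To make the OLS versus 2SLS comparison concrete, the following sketch simulates one scenario in which an unobserved confounder is present and a valid instrument is available. The data-generating process, the true effect of 2.0, the coefficients, and the helper name `slope` are all assumptions for this illustration; they are not the paper's simulation design, and in this scenario the IV assumptions hold by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
u = rng.standard_normal(n)              # unobserved confounder
z = rng.standard_normal(n)              # instrument: affects x, not y directly
x = z + u + rng.standard_normal(n)      # exposure
y = 2.0 * x + 3.0 * u + rng.standard_normal(n)  # outcome; assumed true effect 2.0

def slope(a, b):
    """Least-squares slope of b regressed on a, with an intercept."""
    design = np.column_stack([np.ones_like(a), a])
    coef, *_ = np.linalg.lstsq(design, b, rcond=None)
    return coef[1]

# OLS of y on x ignores u, so it is inconsistent for the causal effect here.
beta_ols = slope(x, y)

# 2SLS: first stage regresses x on z, second stage regresses y on fitted x.
first_stage = np.column_stack([np.ones(n), z])
x_hat = first_stage @ np.linalg.lstsq(first_stage, x, rcond=None)[0]
beta_2sls = slope(x_hat, y)

print(beta_ols, beta_2sls)
```

Under this assumed process the OLS slope converges to 2 + 3 Cov(x, u) / Var(x) = 3, while the 2SLS slope converges to 2. If the instrument were also correlated with u, 2SLS would be inconsistent as well, which is the kind of trade-off the abstract addresses.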
Flexible modeling of how an entire distribution changes with covariates is an important yet challenging generalization of mean-based regression that has seen growing interest over the past decades in both the statistics and machine learning literature. This review outlines selected state-of-the-art statistical approaches to distributional regression, complemented with alternatives from machine learning. Topics covered include the similarities and differences between these approaches, extensions, properties and limitations, estimation procedures, and the availability of software. In view of the increasing complexity and availability of large-scale data, this review also discusses the scalability of traditional estimation methods, current trends, and open challenges. Illustrations are provided using data on childhood malnutrition in Nigeria and Australian electricity prices.
Principal Component Analysis (PCA) is a fundamental tool for data visualization, denoising, and dimensionality reduction. It is widely popular in Statistics, Machine Learning, Computer Vision, and related fields. However, PCA is well-known to fall prey to outliers and often fails to detect the true underlying low-dimensional structure within the dataset. Following the Median of Means (MoM) philosophy, recent supervised learning methods have shown great success in dealing with outlying observations without much compromise to their large sample theoretical properties. This paper proposes a PCA procedure based on the MoM principle. Called the Median of Means Principal Component Analysis (MoMPCA), the proposed method is not only computationally appealing but also achieves optimal convergence rates under minimal assumptions. In particular, we explore the non-asymptotic error bounds of the obtained solution with the aid of Rademacher complexities while making no assumptions on the outlying observations. The derived concentration results are not dependent on the dimension because the analysis is conducted in a separable Hilbert space, and the results only depend on the fourth moment of the underlying distribution in the corresponding norm. The proposal's efficacy is also thoroughly showcased through simulations and real data applications.
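The Median of Means principle underlying the proposal can be illustrated with the basic scalar estimator: split the sample into blocks, average within each block, and take the median of the block averages. The sketch below shows only this scalar mean estimator, not the MoMPCA procedure itself; the function name `median_of_means`, the number of blocks, and the contaminated toy sample are assumptions for the example.

```python
import numpy as np

def median_of_means(x, n_blocks, seed=0):
    """Median of block means of a one-dimensional sample (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Shuffle so that block membership does not depend on the input order.
    shuffled = x[rng.permutation(len(x))]
    blocks = np.array_split(shuffled, n_blocks)
    block_means = np.array([block.mean() for block in blocks])
    return float(np.median(block_means))

# Toy sample: 95 clean observations at 0 and 5 gross outliers at 1000.
data = np.concatenate([np.zeros(95), np.full(5, 1000.0)])
print(data.mean())                           # sample mean is pulled to 50.0
print(median_of_means(data, n_blocks=11))    # median of block means is 0.0
```

With 11 blocks, the 5 outliers can contaminate at most 5 block means, so at least 6 block means equal 0 and the median is unaffected. This is why the number of blocks is usually chosen larger than twice the anticipated number of outliers.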