Variance reduction techniques have been successfully applied to temporal-difference (TD) learning and help improve the sample complexity in policy evaluation. However, existing work has applied variance reduction either to the less popular one time-scale TD algorithm or to the two time-scale GTD algorithm with only a finite number of i.i.d.\ samples, and both algorithms apply only to the on-policy setting. In this work, we develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting and analyze its non-asymptotic convergence rate over both i.i.d.\ and Markovian samples. In the i.i.d.\ setting, our algorithm matches the best-known lower bound $\tilde{O}(\epsilon^{-1})$. In the Markovian setting, our algorithm achieves the state-of-the-art sample complexity $O(\epsilon^{-1} \log {\epsilon}^{-1})$ that is near-optimal. Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller asymptotic convergence error than both the conventional TDC and the variance-reduced TD.
We observe $n$ possibly dependent random variables, the distribution of which is presumed to be stationary even though this might not be true, and we aim at estimating the stationary distribution. We establish a non-asymptotic deviation bound for the Hellinger distance between the target distribution and our estimator. If the dependence within the observations is small, the estimator performs as well as if the data were independent and identically distributed. In addition, our estimator is robust to misspecification and contamination. If the dependence is too high but the observed process is mixing, we can select a subset of observations that is almost independent and retrieve results similar to those obtained in the i.i.d. case. We apply our procedure to the estimation of the invariant distribution of a diffusion process and to finite state space hidden Markov models.
We study multiclass classification in the agnostic adversarial online learning setting. As our main result, we prove that any multiclass concept class is agnostically learnable if and only if its Littlestone dimension is finite. This solves an open problem studied by Daniely, Sabato, Ben-David, and Shalev-Shwartz (2011, 2015), who handled the case when the number of classes (or labels) is bounded. We also prove a separation between online learnability and online uniform convergence by exhibiting an easy-to-learn class whose sequential Rademacher complexity is unbounded. Our learning algorithm uses the multiplicative weights algorithm, with a set of experts defined by executions of the Standard Optimal Algorithm on subsequences whose size equals the Littlestone dimension. We argue that the best expert has regret at most the Littlestone dimension relative to the best concept in the class. This differs from the well-known covering technique of Ben-David, P\'{a}l, and Shalev-Shwartz (2009) for binary classification, where the best expert has regret zero.
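The multiplicative weights step mentioned above can be illustrated with a minimal sketch. It assumes a finite list of experts whose predictions are given in advance as plain lists; in the abstract the experts are instead defined by executions of the Standard Optimal Algorithm, which is not reproduced here. The function name `multiplicative_weights`, the learning rate `eta`, and the 0-1 loss are illustrative assumptions, not details taken from the paper.

```python
import math

def multiplicative_weights(expert_preds, labels, eta=0.5):
    """Run multiplicative weights with 0-1 loss over a fixed set of experts.

    expert_preds[t][i] is the label predicted by expert i at round t
    (hypothetical stand-ins for the experts described in the abstract).
    Returns the final weights and the learner's cumulative expected loss.
    """
    n_experts = len(expert_preds[0])
    weights = [1.0] * n_experts
    expected_loss = 0.0
    for preds, y in zip(expert_preds, labels):
        total = sum(weights)
        # The learner follows expert i with probability proportional to its weight.
        probs = [w / total for w in weights]
        # Expected 0-1 loss of the randomized learner in this round.
        expected_loss += sum(p for p, pred in zip(probs, preds) if pred != y)
        # Shrink the weight of every expert that predicted the wrong label.
        weights = [w * math.exp(-eta) if pred != y else w
                   for w, pred in zip(weights, preds)]
    return weights, expected_loss
```

With a small `eta` the learner tracks the best expert slowly but pays less for the experts' mistakes; the choice `eta=0.5` here is arbitrary.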
Tests based on heteroskedasticity robust standard errors are an important technique in econometric practice. Choosing the right critical value, however, is not simple at all: conventional critical values based on asymptotics often lead to severe size distortions; and so do existing adjustments including the bootstrap. To avoid these issues, we suggest using smallest size-controlling critical values, the generic existence of which we prove in this article for the commonly used test statistics. Furthermore, sufficient and often also necessary conditions for their existence are given that are easy to check. Granted their existence, these critical values are the canonical choice: larger critical values result in unnecessary power loss, whereas smaller critical values lead to over-rejections under the null hypothesis, make spurious discoveries more likely, and thus are invalid. We suggest algorithms to numerically determine the proposed critical values and provide implementations in accompanying software. Finally, we numerically study the behavior of the proposed testing procedures, including their power properties.
In the pairwise weighted spanner problem, the input consists of an $n$-vertex directed graph, where each edge is assigned a cost and a length. Given $k$ vertex pairs and a distance constraint for each pair, the goal is to find a minimum-cost subgraph in which the distance constraints are satisfied. This formulation captures many well-studied connectivity problems, including spanners, distance preservers, and Steiner forests. In the offline setting, we show: 1. An $\tilde{O}(n^{4/5 + \epsilon})$-approximation algorithm for pairwise weighted spanners. When the edges have unit costs and lengths, the best previous algorithm gives an $\tilde{O}(n^{3/5 + \epsilon})$-approximation, due to Chlamt\'a\v{c}, Dinitz, Kortsarz, and Laekhanukit (TALG, 2020). 2. An $\tilde{O}(n^{1/2+\epsilon})$-approximation algorithm for all-pair weighted distance preservers. When the edges have unit costs and arbitrary lengths, the best previous algorithm gives an $\tilde{O}(n^{1/2})$-approximation for all-pair spanners, due to Berman, Bhattacharyya, Makarychev, Raskhodnikova, and Yaroslavtsev (Information and Computation, 2013). In the online setting, we show: 1. An $\tilde{O}(k^{1/2 + \epsilon})$-competitive algorithm for pairwise weighted spanners. The state-of-the-art results are $\tilde{O}(n^{4/5})$-competitive when edges have unit costs and arbitrary lengths, and $\min\{\tilde{O}(k^{1/2 + \epsilon}), \tilde{O}(n^{2/3 + \epsilon})\}$-competitive when edges have unit costs and lengths, due to Grigorescu, Lin, and Quanrud (APPROX, 2021). 2. An $\tilde{O}(k^{\epsilon})$-competitive algorithm for single-source weighted spanners. Without distance constraints, this problem is equivalent to the directed Steiner tree problem. The best previous algorithm for online directed Steiner trees is $\tilde{O}(k^{\epsilon})$-competitive, due to Chakrabarty, Ene, Krishnaswamy, and Panigrahi (SICOMP, 2018).
The increasing availability of high-dimensional, longitudinal measures of genetic expression can facilitate analysis of the biological mechanisms of disease and prediction of future trajectories, as required for precision medicine. Biological knowledge suggests that it may be best to describe complex diseases at the level of underlying pathways, which may interact with one another. We propose a Bayesian approach that allows for characterising such correlation among different pathways through Dependent Gaussian Processes (DGP) and mapping the observed high-dimensional gene expression trajectories into unobserved low-dimensional pathway expression trajectories via Bayesian Sparse Factor Analysis. Compared to previous approaches that model each pathway expression trajectory independently, our model demonstrates better performance in recovering the shape of pathway expression trajectories, revealing the relationships between genes and pathways, and predicting gene expressions (closer point estimates and narrower predictive intervals), as demonstrated in the simulation study and real data analysis. To fit the model, we propose a Monte Carlo Expectation Maximization (MCEM) scheme that can be implemented conveniently by combining a standard Markov Chain Monte Carlo sampler and an R package GPFDA (Konzen and others, 2021), which returns the maximum likelihood estimates of DGP parameters. The modular structure of MCEM makes it generalizable to other complex models involving the DGP model component. An R package has been developed that implements the proposed approach.
Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims to select a given percentage of the original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The resulting subsampling designs provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method that differs from the D-optimal design but requires less computing time. We present a simulation study, comparing both subsampling schemes with the IBOSS method.
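To make the idea of an accept/reject subsampling rule concrete, here is a minimal sketch for simple linear regression with one covariate. The specific rule used (accept a point when its absolute covariate value exceeds an empirical quantile) is an assumption chosen for illustration and is not stated in the abstract; the function names `tail_subsample` and `info_det` are likewise hypothetical. The sketch compares the determinant of the information matrix, the D-criterion, for the accepted points and for an arbitrary subsample of the same size.

```python
import random

def tail_subsample(xs, alpha):
    """Accept roughly a fraction alpha of the points, those with the largest |x|.

    Illustrative rule only (an assumption, not the design derived in the paper).
    """
    k = max(1, int(alpha * len(xs)))
    # Threshold is the k-th largest absolute covariate value.
    threshold = sorted((abs(x) for x in xs), reverse=True)[k - 1]
    accepted = [x for x in xs if abs(x) >= threshold]
    return accepted, threshold

def info_det(xs):
    """Determinant of X^T X for the model y = b0 + b1 * x."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return n * sxx - sx * sx

rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
kept, cutoff = tail_subsample(data, alpha=0.1)
baseline = data[:len(kept)]  # arbitrary subsample of the same size
```

Comparing `info_det(kept)` with `info_det(baseline)` shows how much information about the regression coefficients each subsample carries under the D-criterion.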
We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment. Such problems frequently arise in applications such as healthcare and astronomy. We first study a setting where the data's inputs belong to one of $K$ discrete distributions and formalize this problem via a loss that captures the labeling cost and the prediction error. When the labeling cost is $B$, our algorithm, which chooses to label a point if the uncertainty is larger than a time- and cost-dependent threshold, achieves a worst-case upper bound of $\widetilde{O}(B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}})$ on the loss after $T$ rounds. We also provide a more nuanced upper bound which demonstrates that the algorithm can adapt to the arrival pattern, and achieves better performance when the arrival pattern is more favorable. We complement both upper bounds with matching lower bounds. We next study this problem when the inputs belong to a continuous domain and the output of the experiment is a smooth function with bounded RKHS norm. After $T$ rounds in $d$ dimensions, we show that the loss is bounded by $\widetilde{O}(B^{\frac{1}{d+3}} T^{\frac{d+2}{d+3}})$ in an RKHS with a squared exponential kernel and by $\widetilde{O}(B^{\frac{1}{2d+3}} T^{\frac{2d+2}{2d+3}})$ in an RKHS with a Mat\'ern kernel. Our empirical evaluation demonstrates that our method outperforms other baselines in several synthetic experiments and two real experiments in medicine and astronomy.
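The rule "label a point if its uncertainty exceeds a time- and cost-dependent threshold" can be sketched as follows for the discrete setting. Both the uncertainty proxy (one over the square root of the number of labels collected for that input type, plus one) and the default threshold schedule are hypothetical placeholders chosen for illustration; the abstract does not specify either, and the function name `run_stream` is also an assumption.

```python
import math
from collections import Counter

def run_stream(arrivals, label_cost, threshold=None):
    """Decide online which arriving points to label.

    arrivals: sequence of input types (e.g. integers in range(K)).
    label_cost: cost B of one label.
    threshold: function (t, cost) -> float; the default below is a
               hypothetical placeholder, NOT the schedule from the paper.
    """
    if threshold is None:
        threshold = lambda t, cost: (cost / t) ** (1.0 / 3.0)
    counts = Counter()      # number of labels collected per input type
    labeled_rounds = []
    for t, k in enumerate(arrivals, start=1):
        # Hypothetical uncertainty proxy: shrinks as more labels are seen.
        uncertainty = 1.0 / math.sqrt(counts[k] + 1)
        if uncertainty > threshold(t, label_cost):
            counts[k] += 1
            labeled_rounds.append(t)
    return labeled_rounds, counts
```

Passing the threshold as a function keeps the labeling rule separate from the particular schedule, so a schedule derived from an actual analysis could be substituted without changing the loop.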
Probability density estimation is a core problem of statistics and signal processing. Moment methods are an important means of density estimation, but they generally depend strongly on the choice of feasible functions, which severely affects their performance. In this paper, we propose a non-classical parametrization for density estimation using sample moments, which does not require the choice of such functions. The parametrization is induced by the squared Hellinger distance, and its solution, which is proved to exist and be unique subject to a simple prior that does not depend on data, can be obtained by convex optimization. Statistical properties of the density estimator are established, together with an asymptotic upper bound on the error of the estimator by power moments. Applications of the proposed density estimator in signal processing tasks are given. Simulation results validate the performance of the estimator by a comparison to several prevailing methods. To the best of our knowledge, the proposed estimator is the first one in the literature for which the power moments up to an arbitrary even order exactly match the sample moments, while the true density is not assumed to fall within specific function classes.
Supervised learning problems with side information in the form of a network arise frequently in applications in genomics, proteomics and neuroscience. For example, in genetic applications, the network side information can accurately capture background biological information on the intricate relations among the relevant genes. In this paper, we initiate a study of Bayes optimal learning in high-dimensional linear regression with network side information. To this end, we first introduce a simple generative model (called the Reg-Graph model) which posits a joint distribution for the supervised data and the observed network through a common set of latent parameters. Next, we introduce an iterative algorithm based on Approximate Message Passing (AMP) which is provably Bayes optimal under very general conditions. In addition, we characterize the limiting mutual information between the latent signal and the data observed, and thus precisely quantify the statistical impact of the network side information. Finally, supporting numerical experiments suggest that the introduced algorithm has excellent performance in finite samples.
Experimental and observational studies often lack validity due to untestable assumptions. We propose a double machine learning approach to combine experimental and observational studies, allowing practitioners to test for assumption violations and estimate treatment effects consistently. Our framework tests for violations of external validity and ignorability under milder assumptions. When only one assumption is violated, we provide semi-parametrically efficient treatment effect estimators. However, our no-free-lunch theorem highlights the necessity of accurately identifying the violated assumption for consistent treatment effect estimation. We demonstrate the applicability of our approach in three real-world case studies, highlighting its relevance for practical settings.