In an Markov decision process (MDP), unobservable confounders may exist and have impacts on the data generating process, so that the classic off-policy evaluation (OPE) estimators may fail to identify the true value function of the target policy. In this paper, we study the statistical properties of OPE in confounded MDPs with observable instrumental variables. Specifically, we propose a two-stage estimator based on the instrumental variables and establish its statistical properties in the confounded MDPs with a linear structure. For non-asymptotic analysis, we prove a $\mathcal{O}(n^{-1/2})$-error bound where $n$ is the number of samples. For asymptotic analysis, we prove that the two-stage estimator is asymptotically normal with a typical rate of $n^{1/2}$. To the best of our knowledge, we are the first to show such statistical results of the two-stage estimator for confounded linear MDPs via instrumental variables.
Modified Patankar-Runge-Kutta (MPRK) methods preserve the positivity as well as conservativity of a production-destruction system (PDS) of ordinary differential equations for all time step sizes. As a result, higher order MPRK schemes do not belong to the class of general linear methods, i.e. the iterates are generated by a nonlinear map $\mathbf g$ even when the PDS is linear. Moreover, due to the conservativity of the method, the map $\mathbf g$ possesses non-hyperbolic fixed points. Recently, a new theorem for the investigation of stability properties of non-hyperbolic fixed points of a nonlinear iteration map was developed. We apply this theorem to understand the stability properties of a family of second order MPRK methods when applied to a nonlinear PDS of ordinary differential equations. It is shown that the fixed points are stable for all time step sizes and members of the MPRK family. Finally, experiments are presented to numerically support the theoretical claims.
Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as $\widetilde{\mathcal{O}}(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K} + d^{3.5}H^3\log K )$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
In this paper, we propose a class of low-rank panel quantile regression models which allow for unobserved slope heterogeneity over both individuals and time. We estimate the heterogeneous intercept and slope matrices via nuclear norm regularization followed by sample splitting, row- and column-wise quantile regressions and debiasing. We show that the estimators of the factors and factor loadings associated with the intercept and slope matrices are asymptotically normally distributed. In addition, we develop two specification tests: one for the null hypothesis that the slope coefficient is a constant over time and/or individuals under the case that true rank of slope matrix equals one, and the other for the null hypothesis that the slope coefficient exhibits an additive structure under the case that the true rank of slope matrix equals two. We illustrate the finite sample performance of estimation and inference via Monte Carlo simulations and real datasets.
Bayesian variable selection methods are powerful techniques for fitting and inferring on sparse high-dimensional linear regression models. However, many are computationally intensive or require restrictive prior distributions on model parameters. Likelihood based penalization methods are more computationally friendly, but resource intensive refitting techniques are needed for inference. In this paper, we proposed an efficient and powerful Bayesian approach for sparse high-dimensional linear regression. Minimal prior assumptions on the parameters are required through the use of plug-in empirical Bayes estimates of hyperparameters. Efficient maximum a posteriori probability (MAP) estimation is completed through the use of a partitioned and extended expectation conditional maximization (ECM) algorithm. The result is a PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse high-dimensional linear regression. We propose methods to estimate credible and prediction intervals for predictions of future values. We compare the empirical properties of predictions and our predictive inference to comparable approaches with numerous simulation studies and an analysis of cancer cell lines drug response study. The proposed approach is implemented in the R package probe.
Observational studies have recently received significant attention from the machine learning community due to the increasingly available non-experimental observational data and the limitations of the experimental studies, such as considerable cost, impracticality, small and less representative sample sizes, etc. In observational studies, de-confounding is a fundamental problem of individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for the ITE estimation. The adversarial training of treatment policy selectively encourages treatment-agnostic balanced representations for the confounders and helps to estimate the ITE in the observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, prove that our proposed approach improves the state-of-the-art methods in achieving lower error in the ITE estimation.
High-dimensional matrix-variate time series data are becoming widely available in many scientific fields, such as economics, biology, and meteorology. To achieve significant dimension reduction while preserving the intrinsic matrix structure and temporal dynamics in such data, Wang et al. (2017) proposed a matrix factor model that is shown to provide effective analysis. In this paper, we establish a general framework for incorporating domain or prior knowledge in the matrix factor model through linear constraints. The proposed framework is shown to be useful in achieving parsimonious parameterization, facilitating interpretation of the latent matrix factor, and identifying specific factors of interest. Fully utilizing the prior-knowledge-induced constraints results in more efficient and accurate modeling, inference, dimension reduction as well as a clear and better interpretation of the results. In this paper, constrained, multi-term, and partially constrained factor models for matrix-variate time series are developed, with efficient estimation procedures and their asymptotic properties. We show that the convergence rates of the constrained factor loading matrices are much faster than those of the conventional matrix factor analysis under many situations. Simulation studies are carried out to demonstrate the finite-sample performance of the proposed method and its associated asymptotic properties. We illustrate the proposed model with three applications, where the constrained matrix-factor models outperform their unconstrained counterparts in the power of variance explanation under the out-of-sample 10-fold cross-validation setting.
This paper considers the estimation and inference of the low-rank components in high-dimensional matrix-variate factor models, where each dimension of the matrix-variates ($p \times q$) is comparable to or greater than the number of observations ($T$). We propose an estimation method called $\alpha$-PCA that preserves the matrix structure and aggregates mean and contemporary covariance through a hyper-parameter $\alpha$. We develop an inferential theory, establishing consistency, the rate of convergence, and the limiting distributions, under general conditions that allow for correlations across time, rows, or columns of the noise. We show both theoretical and empirical methods of choosing the best $\alpha$, depending on the use-case criteria. Simulation results demonstrate the adequacy of the asymptotic results in approximating the finite sample properties. The $\alpha$-PCA compares favorably with the existing ones. Finally, we illustrate its applications with a real numeric data set and two real image data sets. In all applications, the proposed estimation procedure outperforms previous methods in the power of variance explanation using out-of-sample 10-fold cross-validation.
In this paper we integrate the isotonic regression with Stone's cross-validation-based method to estimate discrete infinitely supported distribution. We prove that the estimator is strongly consistent, derive its rate of convergence for any underlying distribution, and for one-dimensional case we derive Marshal-type inequality for cumulative distribution function of the estimator. Also, we construct the asymptotically correct conservative global confidence band for the estimator. It is shown that, first, the estimator performs good even for small sized data sets, second, the estimator outperforms in the case of non-monotone underlying distribution, and, third, it performs almost as good as Grenander estimator when the true distribution is isotonic. Therefore, the new estimator provides a trade-off between goodness-of-fit, monotonicity and quality of probabilistic forecast. We apply the estimator to the time-to-onset data of visceral leishmaniasis in Brazil collected from 2007 to 2014.
We introduce a Fourier-based fast algorithm for Gaussian process regression. It approximates a translationally-invariant covariance kernel by complex exponentials on an equispaced Cartesian frequency grid of $M$ nodes. This results in a weight-space $M\times M$ system matrix with Toeplitz structure, which can thus be applied to a vector in ${\mathcal O}(M \log{M})$ operations via the fast Fourier transform (FFT), independent of the number of data points $N$. The linear system can be set up in ${\mathcal O}(N + M \log{M})$ operations using nonuniform FFTs. This enables efficient massive-scale regression via an iterative solver, even for kernels with fat-tailed spectral densities (large $M$). We include a rigorous error analysis of the kernel approximation, the resulting accuracy (relative to "exact" GP regression), and the condition number. Numerical experiments for squared-exponential and Mat\'ern kernels in one, two and three dimensions often show 1-2 orders of magnitude acceleration over state-of-the-art rank-structured solvers at comparable accuracy. Our method allows 2D Mat\'ern-${\small \frac{3}{2}}$ regression from $N=10^9$ data points to be performed in 2 minutes on a standard desktop, with posterior mean accuracy $10^{-3}$. This opens up spatial statistics applications 100 times larger than previously possible.
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.