A data set sampled from a certain population is biased if the subgroups of the population are sampled at proportions that are significantly different from their underlying proportions. Training machine learning models on biased data sets requires correction techniques to compensate for the bias. We consider two commonly-used techniques, resampling and reweighting, that rebalance the proportions of the subgroups to maintain the desired objective function. Though statistically equivalent, it has been observed that resampling outperforms reweighting when combined with stochastic gradient algorithms. By analyzing illustrative examples, we explain the reason behind this phenomenon using tools from dynamical stability and stochastic asymptotics. We also present experiments from regression, classification, and off-policy prediction to demonstrate that this is a general phenomenon. We argue that it is imperative to consider the objective function design and the optimization algorithm together while addressing the sampling bias.
We introduce a new algorithm for expected log-likelihood maximization in situations where the objective function is multi-modal and/or has saddle points, that we term G-PFSO. The key idea underpinning G-PFSO is to define a sequence of probability distributions which (a) is shown to concentrate on the target parameter value and (b) can be efficiently estimated by means of a standard particle filter algorithm. These distributions depends on a learning rate, where the faster the learning rate is the faster is the rate at which they concentrate on the desired parameter value but the lesser is the ability of G-PFSO to escape from a local optimum of the objective function. To conciliate ability to escape from a local optimum and fast convergence rate, the proposed estimator exploits the acceleration property of averaging, well-known in the stochastic gradient literature. Based on challenging estimation problems, our numerical experiments suggest that the estimator introduced in this paper converges at the optimal rate, and illustrate the practical usefulness of G-PFSO for parameter inference in large datasets. If the focus of this work is expected log-likelihood maximization the proposed approach and its theory apply more generally for optimizing a function defined through an expectation.
Monte Carlo estimation in plays a crucial role in stochastic reaction networks. However, reducing the statistical uncertainty of the corresponding estimators requires sampling a large number of trajectories. We propose control variates based on the statistical moments of the process to reduce the estimators' variances. We develop an algorithm that selects an efficient subset of infinitely many control variates. To this end, the algorithm uses resampling and a redundancy-aware greedy selection. We demonstrate the efficiency of our approach in several case studies.
We study a matrix recovery problem with unknown correspondence: given the observation matrix $M_o=[A,\tilde P B]$, where $\tilde P$ is an unknown permutation matrix, we aim to recover the underlying matrix $M=[A,B]$. Such problem commonly arises in many applications where heterogeneous data are utilized and the correspondence among them are unknown, e.g., due to privacy concerns. We show that it is possible to recover $M$ via solving a nuclear norm minimization problem under a proper low-rank condition on $M$, with provable non-asymptotic error bound for the recovery of $M$. We propose an algorithm, $\text{M}^3\text{O}$ (Matrix recovery via Min-Max Optimization) which recasts this combinatorial problem as a continuous minimax optimization problem and solves it by proximal gradient with a Max-Oracle. $\text{M}^3\text{O}$ can also be applied to a more general scenario where we have missing entries in $M_o$ and multiple groups of data with distinct unknown correspondence. Experiments on simulated data, the MovieLens 100K dataset and Yale B database show that $\text{M}^3\text{O}$ achieves state-of-the-art performance over several baselines and can recover the ground-truth correspondence with high accuracy.
In an influential critique of empirical practice, Freedman \cite{freedman2008A,freedman2008B} showed that the linear regression estimator was biased for the analysis of randomized controlled trials under the randomization model. Under Freedman's assumptions, we derive exact closed-form bias corrections for the linear regression estimator with and without treatment-by-covariate interactions. We show that the limiting distribution of the bias corrected estimator is identical to the uncorrected estimator, implying that the asymptotic gains from adjustment can be attained without introducing any risk of bias. Taken together with results from Lin \cite{lin2013agnostic}, our results show that Freedman's theoretical arguments against the use of regression adjustment can be completely resolved with minor modifications to practice.
In machine learning, stochastic gradient descent (SGD) is widely deployed to train models using highly non-convex objectives with equally complex noise models. Unfortunately, SGD theory often makes restrictive assumptions that fail to capture the non-convexity of real problems, and almost entirely ignore the complex noise models that exist in practice. In this work, we make substantial progress on this shortcoming. First, we establish that SGD's iterates will either globally converge to a stationary point or diverge under nearly arbitrary nonconvexity and noise models. Under a slightly more restrictive assumption on the joint behavior of the non-convexity and noise model that generalizes current assumptions in the literature, we show that the objective function cannot diverge, even if the iterates diverge. As a consequence of our results, SGD can be applied to a greater range of stochastic optimization problems with confidence about its global convergence behavior and stability.
Penalized logistic regression is extremely useful for binary classification with a large number of covariates (significantly higher than the sample size), having several real life applications, including genomic disease classification. However, the existing methods based on the likelihood based loss function are sensitive to data contamination and other noise and, hence, robust methods are needed for stable and more accurate inference. In this paper, we propose a family of robust estimators for sparse logistic models utilizing the popular density power divergence based loss function and the general adaptively weighted LASSO penalties. We study the local robustness of the proposed estimators through its influence function and also derive its oracle properties and asymptotic distribution. With extensive empirical illustrations, we clearly demonstrate the significantly improved performance of our proposed estimators over the existing ones with particular gain in robustness. Our proposal is finally applied to analyse four different real datasets for cancer classification, obtaining robust and accurate models, that simultaneously performs gene selection and patient classification.
A precision matrix is the inverse of a covariance matrix. In this paper, we study the problem of estimating the precision matrix with a known graphical structure under high-dimensional settings. We propose a simple estimator of the precision matrix based on the connection between the known graphical structure and the precision matrix. We obtain the rates of convergence of the proposed estimators and derive the asymptotic normality of the proposed estimator in the high-dimensional setting when the data dimension grows with the sample size. Numerical simulations are conducted to demonstrate the performance of the proposed method. We also show that the proposed method outperforms some existing methods that do not utilize the graphical structure information.
Proximal Policy Optimization (PPO) is a highly popular model-free reinforcement learning (RL) approach. However, in continuous state and actions spaces and a Gaussian policy -- common in computer animation and robotics -- PPO is prone to getting stuck in local optima. In this paper, we observe a tendency of PPO to prematurely shrink the exploration variance, which naturally leads to slow progress. Motivated by this, we borrow ideas from CMA-ES, a black-box optimization method designed for intelligent adaptive Gaussian exploration, to derive PPO-CMA, a novel proximal policy optimization approach that can expand the exploration variance on objective function slopes and shrink the variance when close to the optimum. This is implemented by using separate neural networks for policy mean and variance and training the mean and variance in separate passes. Our experiments demonstrate a clear improvement over vanilla PPO in many difficult OpenAI Gym MuJoCo tasks.
Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, such as the simplex, the time-discretisation error can dominate when we are near the boundary of the space. We demonstrate that while current SGMCMC methods for the simplex perform well in certain cases, they struggle with sparse simplex spaces; when many of the components are close to zero. However, most popular large-scale applications of Bayesian inference on simplex spaces, such as network or topic models, are sparse. We argue that this poor performance is due to the biases of SGMCMC caused by the discretization error. To get around this, we propose the stochastic CIR process, which removes all discretization error and we prove that samples from the stochastic CIR process are asymptotically unbiased. Use of the stochastic CIR process within a SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.