We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.
The article introduces a method to learn dynamical systems that are governed by Euler--Lagrange equations from data. The method is based on Gaussian process regression and identifies continuous or discrete Lagrangians and is, therefore, structure preserving by design. A rigorous proof of convergence as the distance between observation data points converges to zero is provided. Next to convergence guarantees, the method allows for quantification of model uncertainty, which can provide a basis of adaptive sampling techniques. We provide efficient uncertainty quantification of any observable that is linear in the Lagrangian, including of Hamiltonian functions (energy) and symplectic structures, which is of interest in the context of system identification. The article overcomes major practical and theoretical difficulties related to the ill-posedness of the identification task of (discrete) Lagrangians through a careful design of geometric regularisation strategies and through an exploit of a relation to convex minimisation problems in reproducing kernel Hilbert spaces.
Subdata selection is a study of methods that select a small representative sample of the big data, the analysis of which is fast and statistically efficient. The existing subdata selection methods assume that the big data can be reasonably modeled using an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the response and for both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini than a uniform subdata. Further, we demonstrate that the PED subdata has higher classification accuracy than other competing methods through extensive simulated and real datasets.
We consider the problem of approximating the regression function from noisy vector-valued data by an online learning algorithm using an appropriate reproducing kernel Hilbert space (RKHS) as prior. In an online algorithm, i.i.d. samples become available one by one by a random process and are successively processed to build approximations to the regression function. We are interested in the asymptotic performance of such online approximation algorithms and show that the expected squared error in the RKHS norm can be bounded by $C^2 (m+1)^{-s/(2+s)}$, where $m$ is the current number of processed data, the parameter $0<s\leq 1$ expresses an additional smoothness assumption on the regression function and the constant $C$ depends on the variance of the input noise, the smoothness of the regression function and further parameters of the algorithm.
Count time series data are frequently analyzed by modeling their conditional means and the conditional variance is often considered to be a deterministic function of the corresponding conditional mean and is not typically modeled independently. We propose a semiparametric mean and variance joint model, called random rounded count-valued generalized autoregressive conditional heteroskedastic (RRC-GARCH) model, to address this limitation. The RRC-GARCH model and its variations allow for the joint modeling of both the conditional mean and variance and offer a flexible framework for capturing various mean-variance structures (MVSs). One main feature of this model is its ability to accommodate negative values for regression coefficients and autocorrelation functions. The autocorrelation structure of the RRC-GARCH model using the proposed Laplace link functions with nonnegative regression coefficients is the same as that of an autoregressive moving-average (ARMA) process. For the new model, the stationarity and ergodicity are established and the consistency and asymptotic normality of the conditional least squares estimator are proved. Model selection criteria are proposed to evaluate the RRC-GARCH models. The performance of the RRC-GARCH model is assessed through analyses of both simulated and real data sets. The results indicate that the model can effectively capture the MVS of count time series data and generate accurate forecast means and variances.
We provide rigorous theoretical bounds for Anderson acceleration (AA) that allow for approximate calculations when applied to solve linear problems. We show that, when the approximate calculations satisfy the provided error bounds, the convergence of AA is maintained while the computational time could be reduced. We also provide computable heuristic quantities, guided by the theoretical error bounds, which can be used to automate the tuning of accuracy while performing approximate calculations. For linear problems, the use of heuristics to monitor the error introduced by approximate calculations, combined with the check on monotonicity of the residual, ensures the convergence of the numerical scheme within a prescribed residual tolerance. Motivated by the theoretical studies, we propose a reduced variant of AA, which consists in projecting the least-squares used to compute the Anderson mixing onto a subspace of reduced dimension. The dimensionality of this subspace adapts dynamically at each iteration as prescribed by the computable heuristic quantities. We numerically show and assess the performance of AA with approximate calculations on: (i) linear deterministic fixed-point iterations arising from the Richardson's scheme to solve linear systems with open-source benchmark matrices with various preconditioners and (ii) non-linear deterministic fixed-point iterations arising from non-linear time-dependent Boltzmann equations.
In cluster-randomized experiments, there is emerging interest in exploring the causal mechanism in which a cluster-level treatment affects the outcome through an intermediate outcome. Despite an extensive development of causal mediation methods in the past decade, only a few exceptions have been considered in assessing causal mediation in cluster-randomized studies, all of which depend on parametric model-based estimators. In this article, we develop the formal semiparametric efficiency theory to motivate several doubly-robust methods for addressing several mediation effect estimands corresponding to both the cluster-average and the individual-level treatment effects in cluster-randomized experiments--the natural indirect effect, natural direct effect, and spillover mediation effect. We derive the efficient influence function for each mediation effect, and carefully parameterize each efficient influence function to motivate practical strategies for operationalizing each estimator. We consider both parametric working models and data-adaptive machine learners to estimate the nuisance functions, and obtain semiparametric efficient causal mediation estimators in the latter case. Our methods are illustrated via extensive simulations and two completed cluster-randomized experiments.
We propose a quantum soft-covering problem for a given general quantum channel and one of its output states, which consists in finding the minimum rank of an input state needed to approximate the given channel output. We then prove a one-shot quantum covering lemma in terms of smooth min-entropies by leveraging decoupling techniques from quantum Shannon theory. This covering result is shown to be equivalent to a coding theorem for rate distortion under a posterior (reverse) channel distortion criterion by two of the present authors. Both one-shot results directly yield corollaries about the i.i.d. asymptotics, in terms of the coherent information of the channel. The power of our quantum covering lemma is demonstrated by two additional applications: first, we formulate a quantum channel resolvability problem, and provide one-shot as well as asymptotic upper and lower bounds. Secondly, we provide new upper bounds on the unrestricted and simultaneous identification capacities of quantum channels, in particular separating for the first time the simultaneous identification capacity from the unrestricted one, proving a long-standing conjecture of the last author.
We consider covariance parameter estimation for Gaussian processes with functional inputs. From an increasing-domain asymptotics perspective, we prove the asymptotic consistency and normality of the maximum likelihood estimator. We extend these theoretical guarantees to encompass scenarios accounting for approximation errors in the inputs, which allows robustness of practical implementations relying on conventional sampling methods or projections onto a functional basis. Loosely speaking, both consistency and normality hold when the approximation error becomes negligible, a condition that is often achieved as the number of samples or basis functions becomes large. These later asymptotic properties are illustrated through analytical examples, including one that covers the case of non-randomly perturbed grids, as well as several numerical illustrations.
Twin nodes in a static network capture the idea of being substitutes for each other for maintaining paths of the same length anywhere in the network. In dynamic networks, we model twin nodes over a time-bounded interval, noted $(\Delta,d)$-twins, as follows. A periodic undirected time-varying graph $\mathcal G=(G_t)_{t\in\mathbb N}$ of period $p$ is an infinite sequence of static graphs where $G_t=G_{t+p}$ for every $t\in\mathbb N$. For $\Delta$ and $d$ two integers, two distinct nodes $u$ and $v$ in $\mathcal G$ are $(\Delta,d)$-twins if, starting at some instant, the outside neighbourhoods of $u$ and $v$ has non-empty intersection and differ by at most $d$ elements for $\Delta$ consecutive instants. In particular when $d=0$, $u$ and $v$ can act during the $\Delta$ instants as substitutes for each other in order to maintain journeys of the same length in time-varying graph $\mathcal G$. We propose a distributed deterministic algorithm enabling each node to enumerate its $(\Delta,d)$-twins in $2p$ rounds, using messages of size $O(\delta_\mathcal G\log n)$, where $n$ is the total number of nodes and $\delta_\mathcal G$ is the maximum degree of the graphs $G_t$'s. Moreover, using randomized techniques borrowed from distributed hash function sampling, we reduce the message size down to $O(\log n)$ w.h.p.
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.