We study best-arm identification with a fixed budget and contextual (covariate) information in stochastic multi-armed bandit problems. In each round, after observing contextual information, we choose a treatment arm using past observations and current context. Our goal is to identify the best treatment arm, a treatment arm with the maximal expected reward marginalized over the contextual distribution, with a minimal probability of misidentification. First, we derive semiparametric lower bounds for this problem, where we regard the gaps between the expected rewards of the best and suboptimal treatment arms as parameters of interest, and all other parameters, such as the expected rewards conditioned on contexts, as the nuisance parameters. We then develop the "Contextual RS-AIPW strategy," which consists of the random sampling (RS) rule tracking a target allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed Contextual RS-AIPW strategy is optimal because the upper bound for the probability of misidentification matches the semiparametric lower bound when the budget goes to infinity, and the gaps converge to zero.
We propose a numerical algorithm for the computation of multi-marginal optimal transport (MMOT) problems involving general measures that are not necessarily discrete. By developing a relaxation scheme in which marginal constraints are replaced by finitely many linear constraints and by proving a specifically tailored duality result for this setting, we approximate the MMOT problem by a linear semi-infinite optimization problem. Moreover, we are able to recover a feasible and approximately optimal solution of the MMOT problem, and its sub-optimality can be controlled to be arbitrarily close to 0 under mild conditions. The developed relaxation scheme leads to a numerical algorithm which can compute a feasible approximate optimizer of the MMOT problem whose theoretical sub-optimality can be chosen to be arbitrarily small. Besides the approximate optimizer, the algorithm is also able to compute both an upper bound and a lower bound on the optimal value of the MMOT problem. The difference between the computed bounds provides an explicit upper bound on the sub-optimality of the computed approximate optimizer. Through a numerical example, we demonstrate that the proposed algorithm is capable of computing a high-quality solution of an MMOT problem involving as many as 50 marginals along with an explicit estimate of its sub-optimality that is much less conservative compared to the theoretical estimate.
We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small to the number of rounds. However, limited characterizations have been discussed on the rate (exponent) of this value. In this paper, we characterize the minimax optimal rate as a result of an optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the probability of misidentification, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To address this issue, we introduce the second rate $R^{\mathrm{go}}_\infty$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
Identification of nonlinear systems is a challenging problem. Physical knowledge of the system can be used in the identification process to significantly improve the predictive performance by restricting the space of possible mappings from the input to the output. Typically, the physical models contain unknown parameters that must be learned from data. Classical methods often restrict the possible models or have to resort to approximations of the model that introduce biases. Sequential Monte Carlo methods enable learning without introducing any bias for a more general class of models. In addition, they can also be used to approximate a posterior distribution of the model parameters in a Bayesian setting. This article provides a general introduction to sequential Monte Carlo and shows how it naturally fits in system identification by giving examples of specific algorithms. The methods are illustrated on two systems: a system with two cascaded water tanks with possible overflow in both tanks and a compartmental model for the spreading of a disease.
Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose \emph{conformal off-policy prediction} (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.
We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for the deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on the analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work takes a step further being capable of vector action spaces and metric optimization. We show that our estimator is consistent, and significantly reduces the MSE compared to baseline OPE methods through experiments on various domains.
Elimination algorithms for bandit identification, which prune the plausible correct answers sequentially until only one remains, are computationally convenient since they reduce the problem size over time. However, existing elimination strategies are often not fully adaptive (they update their sampling rule infrequently) and are not easy to extend to combinatorial settings, where the set of answers is exponentially large in the problem dimension. On the other hand, most existing fully-adaptive strategies to tackle general identification problems are computationally demanding since they repeatedly test the correctness of every answer, without ever reducing the problem size. We show that adaptive methods can be modified to use elimination in both their stopping and sampling rules, hence obtaining the best of these two worlds: the algorithms (1) remain fully adaptive, (2) suffer a sample complexity that is never worse of their non-elimination counterpart, and (3) provably eliminate certain wrong answers early. We confirm these benefits experimentally, where elimination improves significantly the computational complexity of adaptive methods on common tasks like best-arm identification in linear bandits.
Understanding the impact of the most effective policies or treatments on a response variable of interest is desirable in many empirical works in economics, statistics and other disciplines. Due to the widespread winner's curse phenomenon, conventional statistical inference assuming that the top policies are chosen independent of the random sample may lead to overly optimistic evaluations of the best policies. In recent years, given the increased availability of large datasets, such an issue can be further complicated when researchers include many covariates to estimate the policy or treatment effects in an attempt to control for potential confounders. In this manuscript, to simultaneously address the above-mentioned issues, we propose a resampling-based procedure that not only lifts the winner's curse in evaluating the best policies observed in a random sample, but also is robust to the presence of many covariates. The proposed inference procedure yields accurate point estimates and valid frequentist confidence intervals that achieve the exact nominal level as the sample size goes to infinity for multiple best policy effect sizes. We illustrate the finite-sample performance of our approach through Monte Carlo experiments and two empirical studies, evaluating the most effective policies in charitable giving and the most beneficial group of workers in the National Supported Work program.
One common approach to detecting change-points is minimizing a cost function over possible numbers and locations of change-points. The framework includes several well-established procedures, such as the penalized likelihood and minimum description length. Such an approach requires finding the cost value repeatedly over different segments of the data set, which can be time-consuming when (i) the data sequence is long and (ii) obtaining the cost value involves solving a non-trivial optimization problem. This paper introduces a new sequential method (SE) that can be coupled with gradient descent (SeGD) and quasi-Newton's method (SeN) to find the cost value effectively. The core idea is to update the cost value using the information from previous steps without re-optimizing the objective function. The new method is applied to change-point detection in generalized linear models and penalized regression. Numerical studies show that the new approach can be orders of magnitude faster than the Pruned Exact Linear Time (PELT) method without sacrificing estimation accuracy.
Simulation-based Bayesian inference (SBI) can be used to estimate the parameters of complex mechanistic models given observed model outputs without requiring access to explicit likelihood evaluations. A prime example for the application of SBI in neuroscience involves estimating the parameters governing the response dynamics of Hodgkin-Huxley (HH) models from electrophysiological measurements, by inferring a posterior over the parameters that is consistent with a set of observations. To this end, many SBI methods employ a set of summary statistics or scientifically interpretable features to estimate a surrogate likelihood or posterior. However, currently, there is no way to identify how much each summary statistic or feature contributes to reducing posterior uncertainty. To address this challenge, one could simply compare the posteriors with and without a given feature included in the inference process. However, for large or nested feature sets, this would necessitate repeatedly estimating the posterior, which is computationally expensive or even prohibitive. Here, we provide a more efficient approach based on the SBI method neural likelihood estimation (NLE): We show that one can marginalize the trained surrogate likelihood post-hoc before inferring the posterior to assess the contribution of a feature. We demonstrate the usefulness of our method by identifying the most important features for inferring parameters of an example HH neuron model. Beyond neuroscience, our method is generally applicable to SBI workflows that rely on data features for inference used in other scientific fields.
One of the most important features of financial time series data is volatility. There are often structural changes in volatility over time, and an accurate estimation of the volatility of financial time series requires careful identification of change-points. A common approach to modeling the volatility of time series data is the well-known GARCH model. Although the problem of change-point estimation of volatility dynamics derived from the GARCH model has been considered in the literature, these approaches rely on parametric assumptions of the conditional error distribution, which are often violated in financial time series. This may lead to inaccuracies in change-point detection resulting in unreliable GARCH volatility estimates. This paper introduces a novel change-point detection algorithm based on a semiparametric GARCH model. The proposed method retains the structural advantages of the GARCH process while incorporating the flexibility of nonparametric conditional error distribution. The approach utilizes a penalized likelihood derived from a semiparametric GARCH model and an efficient binary segmentation algorithm. The results show that in terms of change-point estimation and detection accuracy, the semiparametric method outperforms the commonly used Quasi-MLE (QMLE) and other variations of GARCH models in wide-ranging scenarios.