We study a sequential decision problem in which the learner faces a sequence of $K$-armed stochastic bandit tasks. An adversary may design the tasks, but it is constrained to choose the optimal arm of each task from a smaller (but unknown) subset of $M$ arms. The task boundaries may be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting). We design an algorithm based on a reduction to bandit submodular maximization and show that, in the regime of a large number of tasks and a small number of optimal arms, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M \tau}+N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2}\sqrt{M K \tau})$ regret.
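To see the improvement concretely, here is a quick check using the abstract's quantities, under the assumption that $T = N\tau$ is the total horizon in the fixed-task-length setting:
$$\tilde{O}(\sqrt{KNT}) = \tilde{O}(\sqrt{KN \cdot N\tau}) = \tilde{O}(N\sqrt{K\tau}),$$
so the improved bound $\tilde{O}(N\sqrt{M\tau}+N^{1/2}\sqrt{MK\tau})$ is smaller than the baseline whenever $M \ll \min\{K, N\}$.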
We give a simplified and improved lower bound for the simplex range reporting problem. We show that, given a set $P$ of $n$ points in $\mathbb{R}^d$, any data structure that uses $S(n)$ space to answer such queries must have $Q(n)=\Omega((n^2/S(n))^{(d-1)/d}+k)$ query time, where $k$ is the output size. For near-linear space data structures, i.e., $S(n)=O(n\log^{O(1)}n)$, this improves the previous lower bounds by Chazelle and Rosenberg [CR96] and Afshani [A12]; perhaps more importantly, it is the first tight lower bound for any variant of simplex range searching in $d\ge 3$ dimensions. We obtain our lower bound by making a simple connection to well-studied problems in incidence geometry, which allows us to use known constructions in the area. We observe that a small modification of a simple existing construction leads to our lower bound. We believe that our proof is accessible to a much wider audience, at least compared to the previous intricate probabilistic proofs based on measure arguments by Chazelle and Rosenberg [CR96] and Afshani [A12]. The lack of tight or almost-tight (up to polylogarithmic factors) lower bounds for near-linear space data structures is a major bottleneck in making progress on problems such as proving lower bounds for multilevel data structures. It is our hope that this new line of attack based on incidence geometry can lead to further progress in this area.
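Substituting near-linear space into the trade-off makes the strength of the bound explicit:
$$S(n)=O(n\log^{O(1)}n) \;\Longrightarrow\; Q(n)=\Omega\!\left(\left(\frac{n^2}{S(n)}\right)^{(d-1)/d}+k\right)=\Omega\!\left(\frac{n^{(d-1)/d}}{\log^{O(1)}n}+k\right).$$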
In practice, optimal screening designs for arbitrary run sizes are traditionally generated using the D-criterion with factor settings fixed at +/- 1, even when considering continuous factors with levels in [-1, 1]. This paper identifies cases of undesirable estimation variance properties for such D-optimal designs and argues that, in general, A-optimal designs tend to push variances closer to their minimum possible values. New insights about the behavior of the criteria are found through a study of their respective coordinate-exchange formulas. The study confirms the existence of D-optimal designs composed only of settings +/- 1 for both main effect and interaction models, for blocked and unblocked experiments. Scenarios are also identified in which arbitrary manipulation of a coordinate within [-1, 1] leads to infinitely many D-optimal designs, each having different variance properties. Under the same conditions, the A-criterion is shown to have a unique optimal coordinate value for improvement. We also compare Bayesian versions of the A- and D-criteria in how they balance minimization of estimation variance and bias. Multiple examples of screening designs are considered for various models under Bayesian and non-Bayesian versions of the A- and D-criteria.
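For readers unfamiliar with the coordinate-exchange idea underlying these formulas, the following is a minimal sketch of an A-criterion coordinate-exchange pass in Python. It assumes a main-effects model whose model matrix is an intercept column plus the factor settings themselves; the candidate grid, run size, and convergence handling are illustrative, not the paper's implementation.

```python
import numpy as np

def a_objective(X):
    """A-criterion value: trace of (X'X)^{-1}; smaller is better."""
    return np.trace(np.linalg.inv(X.T @ X))

def coordinate_exchange_A(X, candidates=np.linspace(-1, 1, 21), sweeps=20):
    """Sweep over the factor coordinates of the model matrix X (column 0 is
    the intercept and is left fixed), replacing each coordinate with the
    candidate level in [-1, 1] that most reduces the A-criterion."""
    X = X.copy()
    for _ in range(sweeps):
        improved = False
        for i in range(X.shape[0]):
            for j in range(1, X.shape[1]):   # skip the intercept column
                best_val, best_obj = X[i, j], a_objective(X)
                for c in candidates:
                    X[i, j] = c
                    try:
                        obj = a_objective(X)
                    except np.linalg.LinAlgError:
                        continue             # skip singular information matrices
                    if obj < best_obj - 1e-12:
                        best_val, best_obj, improved = c, obj, True
                X[i, j] = best_val
        if not improved:
            break
    return X

# Example: an 8-run design for 3 continuous factors.
rng = np.random.default_rng(0)
X0 = np.hstack([np.ones((8, 1)), rng.choice([-1.0, 1.0], size=(8, 3))])
X_opt = coordinate_exchange_A(X0)
```

Running the same sweep with the D-objective (maximizing $\det(X'X)$) and comparing where the optimal coordinate lands is one way to reproduce the qualitative contrast between the two criteria discussed above.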
Finding multiple solutions of non-convex optimization problems is a ubiquitous yet challenging task. Most past algorithms either apply single-solution optimization methods from multiple random initial guesses or search in the vicinity of found solutions using ad hoc heuristics. We present an end-to-end method to learn the proximal operator of a family of training problems so that multiple local minima can be quickly obtained from initial guesses by iterating the learned operator, emulating the proximal-point algorithm, which enjoys fast convergence. The learned proximal operator can be further generalized to recover multiple optima for unseen problems at test time, enabling applications such as object detection. The key ingredient in our formulation is a proximal regularization term, which elevates the convexity of our training loss: by applying recent theoretical results, we show that for weakly convex objectives with Lipschitz gradients, training of the proximal operator converges globally with a practical degree of over-parameterization. We further present an exhaustive benchmark for multi-solution optimization to demonstrate the effectiveness of our method.
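A minimal sketch of the inference-time recipe described above, in PyTorch: a network standing in for the learned proximal operator is iterated from many initial guesses, mimicking the proximal-point update $x_{k+1} = \mathrm{prox}_{\lambda f}(x_k)$. The architecture is hypothetical and the training loop is omitted; only the iteration pattern follows the abstract.

```python
import torch

class ProxNet(torch.nn.Module):
    """Hypothetical stand-in for the learned proximal operator."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)

def find_minima(prox, x0, iters=50):
    """Emulate the proximal-point method x_{k+1} = prox(x_k): iterate the
    learned operator from a batch of initial guesses; after training,
    distinct fixed points are candidate local minima."""
    x = x0
    for _ in range(iters):
        x = prox(x)
    return x

prox = ProxNet(dim=2)          # would be trained on a family of problems
x0 = torch.randn(128, 2)       # 128 random initial guesses
with torch.no_grad():
    candidates = find_minima(prox, x0)
```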
We study the problem of covering and learning sums $X = X_1 + \cdots + X_n$ of independent integer-valued random variables $X_i$ (SIIRVs) with unbounded, or even infinite, support. De et al. (FOCS 2018) showed that the maximum value of the collective support of the $X_i$'s necessarily appears in the sample complexity of learning $X$. In this work, we address two questions: (i) Are there general families of SIIRVs with unbounded support that can be learned with sample complexity independent of both $n$ and the maximal element of the support? (ii) Are there general families of SIIRVs with unbounded support that admit proper sparse covers in total variation distance? As for question (i), we provide a set of simple conditions that allow the unbounded SIIRV to be learned with sample complexity $\text{poly}(1/\epsilon)$, bypassing the aforementioned lower bound. We further address question (ii) in the general setting where each variable $X_i$ has a unimodal probability mass function and is a different member of some, possibly multi-parameter, exponential family $\mathcal{E}$ that satisfies certain structural properties. These properties allow $\mathcal{E}$ to contain heavy-tailed and non-log-concave distributions. Moreover, we show that for every $\epsilon > 0$ and every $k$-parameter family $\mathcal{E}$ that satisfies some structural assumptions, there exists an algorithm with $\tilde{O}(k) \cdot \text{poly}(1/\epsilon)$ samples that learns a sum of $n$ arbitrary members of $\mathcal{E}$ within $\epsilon$ in TV distance. The output of the learning algorithm is also a sum of random variables whose distribution lies in the family $\mathcal{E}$. En route, we prove that any discrete unimodal exponential family with bounded constant-degree central moments can be approximated by the family corresponding to a bounded subset of the initial (unbounded) parameter space.
Understanding the impact of the most effective policies or treatments on a response variable of interest is desirable in many empirical works in economics, statistics, and other disciplines. Due to the widespread winner's curse phenomenon, conventional statistical inference, which assumes that the top policies are chosen independently of the random sample, may lead to overly optimistic evaluations of the best policies. In recent years, given the increased availability of large datasets, this issue can be further complicated when researchers include many covariates to estimate the policy or treatment effects in an attempt to control for potential confounders. In this manuscript, to simultaneously address both of these issues, we propose a resampling-based procedure that not only lifts the winner's curse in evaluating the best policies observed in a random sample, but is also robust to the presence of many covariates. The proposed inference procedure yields accurate point estimates and valid frequentist confidence intervals that achieve the exact nominal level, as the sample size goes to infinity, for multiple best policy effect sizes. We illustrate the finite-sample performance of our approach through Monte Carlo experiments and two empirical studies, evaluating the most effective policies in charitable giving and the most beneficial group of workers in the National Supported Work program.
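The winner's curse, and the flavor of a resampling-based correction, can be seen in a small toy simulation (this sketch is illustrative only; it is not the paper's procedure, which additionally handles many covariates and yields valid confidence intervals):

```python
import numpy as np

rng = np.random.default_rng(1)

# All policies share the same true effect (0), yet the naive estimate of
# the empirically best policy is biased upward: the winner's curse.
n_policies, n_obs = 20, 50
data = rng.normal(0.0, 1.0, size=(n_policies, n_obs))
means = data.mean(axis=1)
best = means.argmax()
print("naive estimate of best policy:", means[best])  # systematically > 0

# A simple bootstrap estimate of the selection bias, in the spirit of
# resampling-based corrections: re-run the "pick the winner" step on
# resampled data and subtract the average over-shoot.
B, bias = 2000, 0.0
for _ in range(B):
    idx = rng.integers(0, n_obs, size=n_obs)
    boot_means = data[:, idx].mean(axis=1)
    bias += boot_means.max() - means[boot_means.argmax()]
bias /= B
print("bias-corrected estimate:", means[best] - bias)
```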
Constrained learning is prevalent in many statistical tasks. Recent work proposes distance-to-set penalties to derive estimators under general constraints that can be specified as sets, but it focuses on obtaining point estimates that do not come with corresponding measures of uncertainty. To remedy this, we approach distance-to-set regularization through a Bayesian lens. We consider a class of smooth distance-to-set priors, showing that they yield well-defined posteriors for quantifying uncertainty in constrained learning problems. We discuss relationships to, and advantages over, prior work on Bayesian constraint relaxation. Moreover, we prove that our approach is optimal in an information-geometric sense for finite penalty parameters $\rho$, and enjoys favorable statistical properties as $\rho\to\infty$. The method is designed to perform effectively within gradient-based MCMC samplers, as illustrated on a suite of simulated and real data applications.
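For intuition, one natural form such a prior can take (a sketch consistent with the distance-to-set penalty literature; the paper's exact construction may differ) multiplies a base prior $\pi_0$ by a smooth penalty on the Euclidean distance to the constraint set $\mathcal{C}$:
$$\pi_\rho(\theta) \;\propto\; \pi_0(\theta)\,\exp\!\big(-\rho\, d(\theta,\mathcal{C})^2\big), \qquad d(\theta,\mathcal{C}) = \inf_{c\in\mathcal{C}}\|\theta - c\|,$$
so that posterior mass concentrates on $\mathcal{C}$ as $\rho\to\infty$ while remaining amenable to gradient-based MCMC for finite $\rho$.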
Motivated by a recently established result saying that, within the class of bivariate Archimedean copulas, standard pointwise convergence implies weak convergence of almost all conditional distributions, this contribution studies the class $\mathcal{C}_{ar}^d$ of all $d$-dimensional Archimedean copulas with $d \geq 3$ and proves the aforementioned implication with respect to conditioning on the first $d-1$ coordinates. Several properties equivalent to pointwise convergence in $\mathcal{C}_{ar}^d$ are established and, as a by-product of working with conditional distributions (Markov kernels), alternative simple proofs for the well-known formulas for the level set masses $\mu_C(L_t)$ and the Kendall distribution function $F_K^d$, as well as a novel geometric interpretation of the latter, are provided. Viewing normalized generators $\psi$ of $d$-dimensional Archimedean copulas from the perspective of their so-called Williamson measures $\gamma$ on $(0,\infty)$ is then shown not only to yield surprisingly simple expressions for $\mu_C(L_t)$ and $F_K^d$ in terms of $\gamma$ and to characterize pointwise convergence in $\mathcal{C}_{ar}^d$ by weak convergence of the Williamson measures, but also to prove that regularity/singularity properties of $\gamma$ carry over directly to the corresponding copula $C_\gamma \in \mathcal{C}_{ar}^d$. These results are finally used to prove that the family of all absolutely continuous and the family of all singular $d$-dimensional Archimedean copulas are both dense in $\mathcal{C}_{ar}^d$, and to underline that, despite their simple algebraic structure, Archimedean copulas may exhibit surprisingly singular behavior in the sense of irregularity of their conditional distribution functions.
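For context, recall the objects involved (standard definitions; the normalization conventions in the paper may differ slightly): a $d$-dimensional Archimedean copula with generator $\psi$ has the form
$$C(u_1,\ldots,u_d)=\psi\big(\psi^{-1}(u_1)+\cdots+\psi^{-1}(u_d)\big),$$
and, by the McNeil-Nešlehová characterization, every such generator is the Williamson $d$-transform of a probability measure $\gamma$ on $(0,\infty)$:
$$\psi(x)=\int_{(0,\infty)}\Big(1-\frac{x}{t}\Big)_+^{d-1}\,d\gamma(t).$$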
We study the problem of online learning in two-sided non-stationary matching markets, where the objective is to converge to a stable match. In particular, we consider the setting where one side of the market, the arms, has a fixed and known set of preferences over the other side, the players. While this problem has been studied when the players have fixed but unknown preferences, in this work we study how to learn when the preferences of the players are time-varying. We propose the {\it Restart Competing Bandits (RCB)} algorithm, which combines a simple {\it restart strategy} to handle the non-stationarity with the {\it competing bandits} algorithm \citep{liu2020competing} designed for the stationary case. We show that, with the proposed algorithm, each player receives a uniform sub-linear regret of {$\widetilde{\mathcal{O}}(L^{1/2}_TT^{1/2})$}, where $L_T$ is the number of changes in the underlying preferences of the agents. We also discuss extensions of this algorithm to the case where the number of changes need not be known a priori.
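A minimal skeleton of the restart idea, assuming $L_T$ is known (the epoch schedule below is one plausible tuning consistent with the quoted rate, not necessarily the paper's): split the horizon into roughly $L_T$ epochs and re-run the stationary routine from scratch in each, discarding stale preference estimates.

```python
import math

def restart_competing_bandits(T, L_T, run_stationary):
    """Restart skeleton: run the stationary competing-bandits routine
    (passed in as a callback) over epochs of length about T / L_T.
    With L_T epochs and per-epoch regret ~sqrt(epoch length), the total
    is ~L_T * sqrt(T / L_T) = sqrt(L_T * T), matching the quoted rate
    when each epoch behaves stationarily."""
    epoch_len = max(1, math.ceil(T / max(L_T, 1)))
    t = 0
    while t < T:
        horizon = min(epoch_len, T - t)
        run_stationary(horizon)   # fresh run: no state carried across epochs
        t += horizon
```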
Derivatives are a key nonparametric functional in wide-ranging applications where the rate of change of an unknown function is of interest. In the Bayesian paradigm, Gaussian processes (GPs) are routinely used as a flexible prior for unknown functions and are arguably one of the most popular tools in many areas. However, little is known about the optimal modelling strategy and theoretical properties when using GPs for derivatives. In this article, we study a plug-in strategy that differentiates the posterior distribution under a GP prior to obtain derivatives of any order. This practically appealing plug-in GP method has previously been perceived as suboptimal and degraded, but this is not necessarily the case. We provide posterior contraction rates for plug-in GPs and establish that they remarkably adapt to derivative orders. We show that the posterior measure of the regression function and its derivatives, with the same choice of hyperparameter that does not depend on the order of derivatives, converges at the minimax optimal rate up to a logarithmic factor for functions in certain classes. To the best of our knowledge, this provides the first positive result for plug-in GPs in the context of inferring derivative functionals, and it leads to a practically simple nonparametric Bayesian method with guided hyperparameter tuning for simultaneously estimating the regression function and its derivatives. Simulations show competitive finite-sample performance of the plug-in GP method. A climate change application analyzing global sea-level rise is discussed.
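The plug-in strategy is easy to state concretely. In the standard GP regression formulas, the posterior mean is $k(x,X)(K+\sigma^2 I)^{-1}y$, so differentiating it only requires differentiating the kernel in its first argument. A minimal one-dimensional sketch with an RBF kernel and fixed, hand-picked hyperparameters (the paper also treats higher derivative orders and hyperparameter choice):

```python
import numpy as np

def rbf(x, z, ell):
    """Squared-exponential kernel k(x, z) = exp(-(x - z)^2 / (2 ell^2))."""
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * ell ** 2))

def drbf_dx(x, z, ell):
    """Derivative of the kernel in its first argument."""
    return -(x[:, None] - z[None, :]) / ell ** 2 * rbf(x, z, ell)

rng = np.random.default_rng(0)
sigma, ell = 0.1, 0.2
X = np.sort(rng.uniform(0, 1, 50))                       # design points
y = np.sin(2 * np.pi * X) + sigma * rng.normal(size=X.size)

alpha = np.linalg.solve(rbf(X, X, ell) + sigma ** 2 * np.eye(X.size), y)
xs = np.linspace(0, 1, 200)
f_hat = rbf(xs, X, ell) @ alpha        # posterior mean of f
df_hat = drbf_dx(xs, X, ell) @ alpha   # plug-in estimate of f'
# Compare df_hat against the truth 2 * pi * cos(2 * pi * xs).
```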
Motivated by applications in cloud computing spot markets and selling banner ads on popular websites, we study the online resource allocation problem with "costly buyback". To model this problem, we consider the classic edge-weighted fractional online matching problem with a tweak: the decision maker can recall (i.e., buy back) any fraction of an offline resource that is pre-allocated to an earlier online vertex; however, by doing so, the decision maker not only loses the previously allocated reward (which equals the edge-weight), but also has to pay a non-negative constant factor $f$ of this edge-weight as an extra penalty. Parameterizing the problem by the buyback factor $f$, our main result is obtaining optimal competitive algorithms for all possible values of $f$ through a novel primal-dual family of algorithms. We establish the optimality of our results by obtaining separate lower bounds for the small and large buyback factor regimes, and by showing how our primal-dual algorithm exactly matches this lower bound by appropriately tuning a parameter as a function of $f$. We further study lower and upper bounds on the competitive ratio in variants of this model, e.g., single-resource with different demand sizes, or matching with deterministic integral allocations. We show how algorithms in our family of primal-dual algorithms can obtain the exact optimal competitive ratio in all of these variants, which in turn demonstrates the power of our algorithmic framework for online resource allocation with costly buyback.
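In this accounting, buying back a fraction $x \in (0,1]$ of an edge of weight $w$ changes the collected objective by
$$-\,wx \;-\; f\,wx \;=\; -(1+f)\,wx,$$
the forfeited reward plus the penalty, which is why the achievable competitive ratio naturally degrades as the buyback factor $f$ grows.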