The greedy algorithm for monotone submodular function maximization subject to cardinality constraint is guaranteed to approximate the optimal solution to within a $1-1/e$ factor. Although it is well known that this guarantee is essentially tight in the worst case -- for greedy and in fact any efficient algorithm, experiments show that greedy performs better in practice. We observe that for many applications in practice, the empirical distribution of the budgets (i.e., cardinality constraints) is supported on a wide range, and moreover, all the existing hardness results in theory break under a large perturbation of the budget. To understand the effect of the budget from both algorithmic and hardness perspectives, we introduce a new notion of budget smoothed analysis. We prove that greedy is optimal for every budget distribution, and we give a characterization for the worst-case submodular functions. Based on these results, we show that on the algorithmic side, under realistic budget distributions, greedy and related algorithms enjoy provably better approximation guarantees, that hold even for worst-case functions, and on the hardness side, there exist hard functions that are fairly robust to all the budget distributions.
Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To fully complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances observed in practice of stochastic gradient descent over gradient descent.
Thompson sampling (TS) has attracted a lot of interest in the bandit area. It was introduced in the 1930s but has not been theoretically proven until recent years. All of its analysis in the combinatorial multi-armed bandit (CMAB) setting requires an exact oracle to provide optimal solutions with any input. However, such an oracle is usually not feasible since many combinatorial optimization problems are NP-hard and only approximation oracles are available. An example (Wang and Chen, 2018) has shown the failure of TS to learn with an approximation oracle. However, this oracle is uncommon and is designed only for a specific problem instance. It is still an open question whether the convergence analysis of TS can be extended beyond the exact oracle in CMAB. In this paper, we study this question under the greedy oracle, which is a common (approximation) oracle with theoretical guarantees to solve many (offline) combinatorial optimization problems. We provide a problem-dependent regret lower bound of order $\Omega(\log T/\Delta^2)$ to quantify the hardness of TS to solve CMAB problems with greedy oracle, where $T$ is the time horizon and $\Delta$ is some reward gap. We also provide an almost matching regret upper bound. These are the first theoretical results for TS to solve CMAB with a common approximation oracle and break the misconception that TS cannot work with approximation oracles.
The single-site dynamics are a canonical class of Markov chains for sampling from high-dimensional probability distributions, e.g. the ones represented by graphical models. We give a simple and generic parallel algorithm that can faithfully simulate single-site dynamics. When the chain asymptotically satisfies the $\ell_p$-Dobrushin's condition, specifically, when the Dobrushin's influence matrix has constantly bounded $\ell_p$-induced operator norm for an arbitrary $p\in[1,\infty]$, the parallel simulation of $N$ steps of single-site updates succeeds within $O\left({N}/{n}+\log n\right)$ depth of parallel computing using $\tilde{O}(m)$ processors, where $n$ is the number of sites and $m$ is the size of graphical model. Since the Dobrushin's condition is almost always satisfied asymptotically by mixing chains, this parallel simulation algorithm essentially transforms single-site dynamics with optimal $O(n\log n)$ mixing time to RNC algorithms for sampling. In particular we obtain RNC samplers, for the Ising models on general graphs in the uniqueness regime, and for satisfying solutions of CNF formulas in a local lemma regime. With non-adaptive simulated annealing, these RNC samplers can be transformed routinely to RNC algorithms for approximate counting. A key step in our parallel simulation algorithm, is a so-called "universal coupling" procedure, which tries to simultaneously couple all distributions over the same sample space. We construct such a universal coupling, that for every pair of distributions the coupled probability is at least their Jaccard similarity. We also prove this is optimal in the worst case. The universal coupling and its applications are of independent interests.
We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon $T$ with fixed and known cost matrices $Q,R$, but unknown and non-stationary dynamics $\{A_t, B_t\}$. The sequence of dynamics matrices can be arbitrary, but with a total variation, $V_T$, assumed to be $o(T)$ and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all $t$, we present an algorithm that achieves the optimal dynamic regret of $\tilde{\mathcal{O}}\left(V_T^{2/5}T^{3/5}\right)$. With piece-wise constant dynamics, our algorithm achieves the optimal regret of $\tilde{\mathcal{O}}(\sqrt{ST})$ where $S$ is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.
Consider the following social choice problem. Suppose we have a set of $n$ voters and $m$ candidates that lie in a metric space. The goal is to design a mechanism to choose a candidate whose average distance to the voters is as small as possible. However, the mechanism does not get direct access to the metric space. Instead, it gets each voter's ordinal ranking of the candidates by distance. Given only this partial information, what is the smallest worst-case approximation ratio (known as the distortion) that a mechanism can guarantee? A simple example shows that no deterministic mechanism can guarantee distortion better than $3$, and no randomized mechanism can guarantee distortion better than $2$. It has been conjectured that both of these lower bounds are optimal, and recently, Gkatzelis, Halpern, and Shah proved this conjecture for deterministic mechanisms. We disprove the conjecture for randomized mechanisms for $m \geq 3$ by constructing elections for which no randomized mechanism can guarantee distortion better than $2.0261$ for $m = 3$, $2.0496$ for $m = 4$, up to $2.1126$ as $m \to \infty$. We obtain our lower bounds by identifying a class of simple metrics that appear to capture much of the hardness of the problem, and we show that any randomized mechanism must have high distortion on one of these metrics. We provide a nearly matching upper bound for this restricted class of metrics as well. Finally, we conjecture that these bounds give the optimal distortion for every $m$, and provide a proof for $m = 3$, thereby resolving that case.
We study dynamic algorithms for the problem of maximizing a monotone submodular function over a stream of $n$ insertions and deletions. We show that any algorithm that maintains a $(0.5+\epsilon)$-approximate solution under a cardinality constraint, for any constant $\epsilon>0$, must have an amortized query complexity that is $\mathit{polynomial}$ in $n$. Moreover, a linear amortized query complexity is needed in order to maintain a $0.584$-approximate solution. This is in sharp contrast with recent dynamic algorithms of [LMNF+20, Mon20] that achieve $(0.5-\epsilon)$-approximation with a $\mathsf{poly}\log(n)$ amortized query complexity. On the positive side, when the stream is insertion-only, we present efficient algorithms for the problem under a cardinality constraint and under a matroid constraint with approximation guarantee $1-1/e-\epsilon$ and amortized query complexities $\smash{O(\log (k/\epsilon)/\epsilon^2)}$ and $\smash{k^{\tilde{O}(1/\epsilon^2)}\log n}$, respectively, where $k$ denotes the cardinality parameter or the rank of the matroid.
The common way to optimize auction and pricing systems is to set aside a small fraction of the traffic to run experiments. This leads to the question: how can we learn the most with the smallest amount of data? For truthful auctions, this is the \emph{sample complexity} problem. For posted price auctions, we no longer have access to samples. Instead, the algorithm is allowed to choose a price $p_t$; then for a fresh sample $v_t \sim \mathcal{D}$ we learn the sign $s_t = sign(p_t - v_t) \in \{-1,+1\}$. How many pricing queries are needed to estimate a given parameter of the underlying distribution? We give tight upper and lower bounds on the number of pricing queries required to find an approximately optimal reserve price for general, regular and MHR distributions. Interestingly, for regular distributions, the pricing query and sample complexities match. But for general and MHR distributions, we show a strict separation between them. All known results on sample complexity for revenue optimization follow from a variant of using the optimal reserve price of the empirical distribution. In the pricing query complexity setting, we show that learning the entire distribution within an error of $\epsilon$ in Levy distance requires strictly more pricing queries than to estimate the reserve. Instead, our algorithm uses a new property we identify called \emph{relative flatness} to quickly zoom into the right region of the distribution to get the optimal pricing query complexity.
This paper considers how to obtain MCMC quantitative convergence bounds which can be translated into tight complexity bounds in high-dimensional {settings}. We propose a modified drift-and-minorization approach, which establishes generalized drift conditions defined in subsets of the state space. The subsets are called the "large sets", and are chosen to rule out some "bad" states which have poor drift property when the dimension of the state space gets large. Using the "large sets" together with a "fitted family of drift functions", a quantitative bound can be obtained which can be translated into a tight complexity bound. As a demonstration, we analyze several Gibbs samplers and obtain complexity upper bounds for the mixing time. In particular, for one example of Gibbs sampler which is related to the James--Stein estimator, we show that the number of iterations required for the Gibbs sampler to converge is constant under certain conditions on the observed data and the initial state. It is our hope that this modified drift-and-minorization approach can be employed in many other specific examples to obtain complexity bounds for high-dimensional Markov chains.
The use of pessimism, when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in the hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
Implicit probabilistic models are models defined naturally in terms of a sampling procedure and often induces a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.