In this paper, we consider the problem of black-box optimization using Gaussian Process (GP) bandit optimization with a small number of batches. Assuming the unknown function has a low norm in the Reproducing Kernel Hilbert Space (RKHS), we introduce a batch algorithm inspired by batched finite-arm bandit algorithms, and show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$, where the $O^\ast(\cdot)$ notation hides dimension-independent logarithmic factors and $\gamma_T$ is the maximum information gain associated with the kernel. This bound is near-optimal for several kernels of interest and improves on the typical $O^\ast(\sqrt{T}\gamma_T)$ bound, and our approach is arguably the simplest among algorithms attaining this improvement. In addition, in the case of a constant number of batches (not depending on $T$), we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches, focusing on the squared exponential and Mat\'ern kernels. The algorithmic upper bounds are shown to be nearly minimax optimal via analogous algorithm-independent lower bounds.
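The $O(\log\log T)$ batch count can be illustrated with a batch-length schedule of the kind used in batched finite-arm bandits. The following is a minimal sketch, assuming the rule $N_i = \lceil \sqrt{T N_{i-1}} \rceil$ with $N_0 = 1$ and a final batch truncated at the horizon; the abstract does not state the schedule, so this rule and the function name `batch_lengths` are assumptions made for illustration.

```python
import math


def batch_lengths(T):
    """Batch lengths summing to T under the assumed rule N_i = ceil(sqrt(T * N_{i-1}))."""
    lengths = []
    prev = 1  # N_0 = 1 (assumed initial value)
    used = 0
    while used < T:
        # Exact integer ceil(sqrt(T * prev)), avoiding floating-point rounding.
        nominal = math.isqrt(T * prev - 1) + 1
        size = min(nominal, T - used)  # truncate the last batch at the horizon
        lengths.append(size)
        used += size
        prev = nominal  # the recursion uses the untruncated length
    return lengths
```

Under this rule the nominal lengths grow like $T^{1-2^{-i}}$, so the number of batches needed to cover the horizon grows doubly logarithmically in $T$.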
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the $\mathtt{VRMPO}$ algorithm, a sample-efficient policy gradient method with stochastic mirror descent. $\mathtt{VRMPO}$ uses a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best sample complexity for policy optimization. Extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms state-of-the-art policy gradient methods in various settings.
In the past few years, many interesting inapproximability results have been obtained from the parameterized perspective. This article surveys some of these results, with a focus on $k$-Clique, $k$-SetCover, and other related problems.
We introduce a procedure for conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for statistical learning. On standard examples, this bound scales as $d/n$ with $d$ the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches reducing to the sequential problem, our bounds remove suboptimal $\log n$ factors and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by leverage scores of covariates, nearly matching the optimal risk in the well-specified case without conditions on the noise variance or approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to calibration of probabilistic predictions relying on virtual samples, and can be computed by solving two logistic regressions. It achieves a non-asymptotic excess risk of $O((d + B^2R^2)/n)$, where $R$ bounds the norm of features and $B$ that of the comparison parameter; by contrast, no within-model estimator can achieve a better rate than $\min({B R}/{\sqrt{n}}, {d e^{BR}}/{n})$ in general. This provides a more practical alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly addressing a question raised by Foster et al. (2018).
We prove new bounds on the distributed fractional coloring problem in the LOCAL model. Fractional $c$-colorings can be understood as multicolorings as follows. For some natural numbers $p$ and $q$ such that $p/q\leq c$, each node $v$ is assigned a set of at least $q$ colors from $\{1,\dots,p\}$ such that adjacent nodes are assigned disjoint sets of colors. The minimum $c$ for which a fractional $c$-coloring of a graph $G$ exists is called the fractional chromatic number $\chi_f(G)$ of $G$. Recently, [Bousquet, Esperet, and Pirot; SIROCCO '21] showed that for any constant $\epsilon>0$, a fractional $(\Delta+\epsilon)$-coloring can be computed in $\Delta^{O(\Delta)} + O(\Delta\cdot\log^* n)$ rounds. We show that such a coloring can be computed in only $O(\log^2 \Delta)$ rounds, without any dependency on $n$. We further show that in $O\big(\frac{\log n}{\epsilon}\big)$ rounds, it is possible to compute a fractional $(1+\epsilon)\chi_f(G)$-coloring, even if the fractional chromatic number $\chi_f(G)$ is not known. That is, this problem can be approximated arbitrarily well by an efficient algorithm in the LOCAL model. For the standard coloring problem, it is only known that an $O\big(\frac{\log n}{\log\log n}\big)$-approximation can be computed in polylogarithmic time in the LOCAL model. We also show that our distributed fractional coloring approximation algorithm is best possible. We show that in trees, which have fractional chromatic number $2$, computing a fractional $(2+\epsilon)$-coloring requires at least $\Omega\big(\frac{\log n}{\epsilon}\big)$ rounds. We finally study fractional colorings of regular grids. In [Bousquet, Esperet, and Pirot; SIROCCO '21], it is shown that in regular grids of bounded dimension, a fractional $(2+\epsilon)$-coloring can be computed in time $O(\log^* n)$. We show that such a coloring can even be computed in $O(1)$ rounds in the LOCAL model.
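The multicoloring definition above can be checked mechanically. The following is a minimal sketch of such a check, followed by the standard example of the 5-cycle, which admits a fractional $5/2$-coloring with $p = 5$ and $q = 2$. The function and variable names are hypothetical, and the check assumes every endpoint of an edge appears in the assignment.

```python
def is_valid_multicoloring(edges, assignment, p, q):
    """Check that each node has at least q colors from {1..p} and neighbors share none."""
    for colors in assignment.values():
        if len(colors) < q or not all(1 <= c <= p for c in colors):
            return False
    return all(assignment[u].isdisjoint(assignment[v]) for u, v in edges)


# 5-cycle: node i receives two consecutive colors (cyclically) from {1, ..., 5}.
cycle_edges = [(i, (i + 1) % 5) for i in range(5)]
cycle_colors = {i: {(2 * i) % 5 + 1, (2 * i + 1) % 5 + 1} for i in range(5)}
```

Here `is_valid_multicoloring(cycle_edges, cycle_colors, 5, 2)` holds, which certifies $\chi_f(C_5) \leq 5/2$, whereas any proper coloring of the 5-cycle in the usual sense needs 3 colors.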
Stochastic differential equations projected onto manifolds occur widely in physics, chemistry, biology, engineering, nanotechnology and optimization theory. In some problems one can use an intrinsic coordinate system on the manifold, but this is often computationally impractical. Numerical projections are preferable in many cases. We derive an algorithm to solve these, using adiabatic elimination and a constraining potential. We also review earlier proposed algorithms. Our hybrid midpoint projection algorithm uses a midpoint projection on a tangent manifold, combined with a normal projection to satisfy the constraints. We show from numerical examples on spheroidal and hyperboloidal surfaces that this has greatly reduced errors compared to earlier methods using either a hybrid Euler with tangential and normal projections or purely tangential derivative methods. Our technique can handle multiple constraints. This allows, for example, the treatment of manifolds that embody several conserved quantities. The resulting algorithm is accurate, relatively simple to implement and efficient.
This paper is concerned with the asymptotic behavior in $\beta$-H\"older spaces and under $L^p$ losses of a Dirichlet kernel density estimator introduced by Aitchison & Lauder (1985) and studied theoretically by Ouimet & Tolosana-Delgado (2021). It is shown that the estimator is minimax when $p \in [1, 3)$ and $\beta \in (0, 2]$, and that it is never minimax when $p \in [4, \infty)$ or $\beta \in (2, \infty)$. These results rectify in a minor way and, more importantly, extend to all dimensions those already reported in the univariate case by Bertin & Klutchnikoff (2011).
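For intuition, in one dimension a Dirichlet kernel reduces to a beta kernel. The sketch below assumes the common parametrization $\hat f_{n,b}(s) = n^{-1} \sum_{i=1}^n K_{s/b+1,\,(1-s)/b+1}(X_i)$, where $K_{\alpha,\beta}$ is the $\mathrm{Beta}(\alpha,\beta)$ density and $b > 0$ is a bandwidth. The abstract does not state the estimator, so this form, the restriction to samples in the open interval $(0,1)$, and all function names are assumptions.

```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x in the open interval (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def beta_kernel_estimate(s, samples, bandwidth):
    """Assumed 1-D form: average of Beta(s/b + 1, (1 - s)/b + 1) densities at the samples."""
    a = s / bandwidth + 1
    b = (1 - s) / bandwidth + 1
    return sum(beta_pdf(x, a, b) for x in samples) / len(samples)
```

Note that the kernel's shape changes with the evaluation point $s$ rather than with the sample, which is what lets this type of estimator adapt near the boundary of the support.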
Multi-marginal optimal transport (MOT) is a generalization of optimal transport to multiple marginals. Optimal transport has evolved into an important tool in many machine learning applications, and its multi-marginal extension opens the door to addressing new challenges in the field. However, the use of MOT has been largely impeded by its computational complexity, which scales exponentially in the number of marginals. Fortunately, in many applications, such as barycenter or interpolation problems, the cost function has structure that has recently been exploited to develop efficient computational methods. In this work we derive computational bounds for these methods. With $m$ marginal distributions supported on $n$ points, we provide a $\tilde{\mathcal{O}}(d(G)m n^2\epsilon^{-2})$ bound for $\epsilon$-accuracy when the problem is associated with a tree with diameter $d(G)$. For the special case of the Wasserstein barycenter problem, which corresponds to a star-shaped tree, our bound agrees with the existing complexity bound.
Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails better generalization compared to existing methods.
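The idea of using gradient information to choose sampling probabilities can be illustrated in a scalar toy setting. The sketch below is not the paper's method: it assumes the per-node gradient contributions are known nonzero scalars and computes the exact mean and variance of a one-sample unbiased estimator of their average, so that uniform sampling can be compared with sampling proportional to gradient magnitude. All names are hypothetical.

```python
def estimator_moments(grads, probs):
    """Exact mean and variance of the estimator g_i / (n * p_i) with i drawn from probs."""
    n = len(grads)
    values = [g / (n * p) for g, p in zip(grads, probs)]
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, var


def magnitude_proportional(grads):
    """Sampling probabilities proportional to |g_i| (requires nonzero gradients)."""
    total = sum(abs(g) for g in grads)
    return [abs(g) / total for g in grads]
```

Both sampling schemes give the same mean, but when all contributions share a sign the magnitude-proportional scheme makes every estimator value identical, so its variance vanishes in this toy case. In practice the exact gradients are unavailable, which is why approximate gradient information is used.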
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
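The local smoothing idea behind a method like DRS can be sketched in a few lines. The example below assumes Gaussian smoothing, $f_\gamma(x) = \mathbb{E}[f(x + \gamma Z)]$ with $Z \sim \mathcal{N}(0, I)$, and estimates $\nabla f_\gamma(x)$ by averaging subgradients of $f$ at randomly perturbed points. The choice of $f$ (the $\ell_1$ norm), the smoothing distribution, and all names are assumptions for illustration; the sketch is single-machine and does not include any communication step.

```python
import random


def l1_subgradient(x):
    """A subgradient of f(x) = sum_i |x_i|: the sign vector, with 0 at kinks."""
    return [(v > 0) - (v < 0) for v in x]


def smoothed_gradient(subgrad, x, gamma, num_samples, rng):
    """Monte Carlo estimate of grad f_gamma(x), f_gamma(x) = E[f(x + gamma * Z)], Z ~ N(0, I)."""
    total = [0.0] * len(x)
    for _ in range(num_samples):
        perturbed = [v + gamma * rng.gauss(0.0, 1.0) for v in x]
        for i, g in enumerate(subgrad(perturbed)):
            total[i] += g
    return [t / num_samples for t in total]
```

Far from the kinks of $f$ the estimate coincides with the ordinary gradient, while near a kink it averages the subgradients on either side, which is what makes the smoothed objective differentiable.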
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.
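For context, the quantity that such a surrogate stands in for can be written as an empirical mean plus a variance penalty. The sketch below assumes the penalty form $\sqrt{2\rho\,\mathrm{Var}_n/n}$ with a hypothetical robustness radius `rho`; this form and its constants are assumptions for illustration and are not taken from the abstract. The code only evaluates the penalized objective on a fixed list of losses and does not implement the convex surrogate itself.

```python
import math
import statistics


def variance_penalized_risk(losses, rho):
    """Empirical mean plus an assumed penalty sqrt(2 * rho * Var_n / n)."""
    n = len(losses)
    mean = statistics.mean(losses)
    var = statistics.pvariance(losses)  # empirical (population-style) variance
    return mean + math.sqrt(2.0 * rho * var / n)
```

With zero empirical variance the penalty vanishes and the objective reduces to the empirical risk, so the penalty only changes the ranking of candidate models whose losses fluctuate across the sample.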