The accelerated weight histogram (AWH) algorithm is an iterative extended ensemble algorithm, developed for statistical physics and computational biology applications. It is used to estimate free energy differences and expectations with respect to Gibbs measures. The AWH algorithm is based on iterative updates of a design parameter, closely related to the free energy, obtained by matching a weight histogram with a specified target distribution. The weight histogram is constructed from samples of a Markov chain on the product of the state space and the parameter space. In this paper, almost sure convergence of the AWH algorithm is proved, both for estimating free energy differences and for estimating expectations with adaptive ergodic averages. The proof is based on identifying the AWH algorithm as a stochastic approximation and studying the properties of the associated limit ordinary differential equation.
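To make the histogram-matching idea concrete, the following toy sketch (our schematic illustration, not the paper's implementation) runs a Gibbs sampler on a small extended ensemble and applies a stochastic-approximation update that nudges the design parameter $f$ until the sampled parameter weights match the target distribution. The energies, the two-point parameter space, and the step size $1/n$ are all assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    E = np.array([[0.0, 1.0, 2.0],      # E[lam, x]: toy energies for lam = 0
                  [1.0, 1.0, 1.0]])     # and lam = 1 (illustrative values)
    target = np.array([0.5, 0.5])       # target distribution over lam
    f = np.zeros(2)                     # design parameter (free-energy estimate)
    lam = 0
    for n in range(1, 50001):
        # Gibbs step on the extended ensemble pi(x, lam) ~ exp(f[lam] - E[lam, x])
        px = np.exp(-E[lam]); px /= px.sum()
        x = rng.choice(3, p=px)
        w = np.exp(f - E[:, x]); w /= w.sum()  # conditional weights of lam given x
        lam = rng.choice(2, p=w)
        # stochastic-approximation update: push the accumulated weight
        # histogram toward the target distribution
        f -= (w - target) / n

    # at the fixed point, f[1] - f[0] estimates the free-energy difference
    print(f[1] - f[0])

At the fixed point of this update, the average sampled weights equal the target, which forces $f_\lambda - F_\lambda$ to be constant in $\lambda$ (for a uniform target), so differences of $f$ recover free energy differences.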
To gain a better theoretical understanding of how evolutionary algorithms (EAs) cope with plateaus of constant fitness, we propose the $n$-dimensional Plateau$_k$ function as a natural benchmark and analyze how different variants of the $(1 + 1)$ EA optimize it. The Plateau$_k$ function has a plateau of second-best fitness in a ball of radius $k$ around the optimum. As the evolutionary algorithm, we consider the $(1 + 1)$ EA with an arbitrary unbiased mutation operator. Denoting by $\alpha$ the random number of bits flipped in an application of this operator and assuming that $\Pr[\alpha = 1]$ is at least some small sub-constant value, we show the surprising result that for all constant $k \ge 2$, the runtime $T$ follows a distribution close to the geometric one with success probability equal to the probability to flip between $1$ and $k$ bits divided by the size of the plateau. Consequently, the expected runtime is the inverse of this probability, and thus depends only on the probability to flip between $1$ and $k$ bits, but not on other characteristics of the mutation operator. Our result also implies that the optimal mutation rate for standard bit mutation here is approximately $k/(en)$. Our main analysis tool is a combined analysis of the Markov chains on the search point space and on the Hamming level space, an approach that promises to be useful also for other plateau problems.
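For concreteness, one formalization matching this description (our reading of the abstract, taking the all-ones string as the optimum) assigns the OneMax value outside the radius-$k$ ball, a constant second-best value on the plateau, and the best value at the optimum:

    def plateau_k(x, k):
        # x: bit string as a sequence of 0s and 1s; optimum is the all-ones string
        n = len(x)
        ones = sum(x)
        if ones == n:
            return n        # unique optimum
        if ones >= n - k:
            return n - k    # plateau of second-best fitness (Hamming ball of radius k)
        return ones         # OneMax landscape outside the plateau

A $(1+1)$ EA accepting offspring of at least equal fitness can then move freely across the plateau, and it leaves the plateau only by a mutation that jumps directly to the optimum, which is what makes the runtime nearly geometric.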
The well-known stochastic SIS model in epidemiology, characterized by highly nonlinear coefficients, has a unique positive solution taking values in a bounded domain and exhibiting a range of dynamical behaviors. However, approximation methods that preserve the positivity and the long-time behavior of the stochastic SIS model, while very important, are lacking. In this paper, based on a logarithmic transformation, we propose a novel explicit numerical method for a stochastic SIS epidemic model whose coefficients violate the global monotonicity condition; the method preserves the positivity of the original stochastic SIS model. We prove the strong convergence of the numerical method and show that the rate of convergence is of order one. Moreover, the numerical method reproduces the extinction of the exact solution of the stochastic SIS model. Numerical experiments are given to illustrate the theoretical results and to demonstrate the efficiency of our algorithm.
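As an illustration of the transformation idea, consider the widely used stochastic SIS equation $dI = [\beta I(N-I) - (\mu+\gamma)I]\,dt + \sigma I(N-I)\,dB$ and the transform $Y = \log I$ (this SDE form and this particular transform are assumptions for the sketch, not necessarily the paper's exact construction). An explicit Euler step applied to $Y$ keeps the iterate strictly positive, since the scheme only ever exponentiates:

    import numpy as np

    def log_euler_sis(I0, N, beta, mu, gamma, sigma, T, n_steps, seed=0):
        # explicit Euler scheme on Y = log I; by Ito's formula,
        # dY = [beta(N-I) - (mu+gamma) - sigma^2 (N-I)^2 / 2] dt + sigma (N-I) dB
        rng = np.random.default_rng(seed)
        dt = T / n_steps
        Y = np.log(I0)
        for _ in range(n_steps):
            I = np.exp(Y)
            drift = beta * (N - I) - (mu + gamma) - 0.5 * sigma**2 * (N - I)**2
            Y += drift * dt + sigma * (N - I) * np.sqrt(dt) * rng.standard_normal()
        return np.exp(Y)  # the exponential keeps the iterate strictly positive

Keeping the iterate below the upper bound $N$ and obtaining the order-one strong convergence rate require the additional arguments developed in the paper.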
Solutions of certain partial differential equations (PDEs) are often represented by the steepest descent curves of corresponding functionals. The minimizing movement scheme was developed to study such curves in metric spaces. In particular, Jordan, Kinderlehrer, and Otto studied the Fokker-Planck equation in this way with respect to the Wasserstein metric. In this paper, we propose a deep learning-based minimizing movement scheme for approximating the solutions of PDEs. The proposed method is highly scalable for high-dimensional problems, as it is free of mesh generation. We demonstrate through various numerical examples that the proposed method accurately approximates the solutions of PDEs by finding the steepest descent direction of a functional, even in high dimensions.
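To fix ideas, a single minimizing-movement (implicit Euler) step for the simplest case, the potential energy functional $F(\rho)=\int V\,d\rho$ acting on particles, solves $\min_x \|x-x_0\|^2/(2\tau) + V(x)$ by an inner gradient descent. The quadratic $V$ and the particle (rather than neural network) parametrization are simplifications for illustration; the paper trains a network and handles more general functionals.

    import numpy as np

    def V(x):      return 0.5 * np.sum(x**2, axis=-1)   # toy potential (assumption)
    def grad_V(x): return x

    def mm_step(x0, tau, inner_steps=200, lr=0.05):
        # one minimizing-movement step: minimize ||x - x0||^2 / (2 tau) + V(x)
        x = x0.copy()
        for _ in range(inner_steps):
            x -= lr * ((x - x0) / tau + grad_V(x))
        return x

    # iterating the step traces the steepest descent curve (gradient flow) of V
    x = np.random.default_rng(0).standard_normal((1000, 2))
    for _ in range(50):
        x = mm_step(x, tau=0.1)

Each outer iteration is one implicit Euler step of time length $\tau$, so the particle cloud follows the gradient flow of $V$ as $\tau \to 0$.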
We investigate the rate of convergence of weighted kernel herding (WKH) and sequential Bayesian quadrature (SBQ), two kernel-based sampling algorithms for estimating integrals with respect to a target probability measure. Under verifiable conditions on the chosen kernel and target measure, we establish a near-geometric rate of convergence for target measures that are nearly atomic. Furthermore, we show that these algorithms perform comparably to the theoretically best possible sampling algorithm under the maximum mean discrepancy. An analysis is also conducted in a distributed setting. Our theoretical developments are supported by empirical observations on simulated data as well as a real-world application.
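As a point of reference, plain (unweighted) kernel herding greedily picks the candidate that best matches the kernel mean embedding, and Bayesian-quadrature weights can then be solved for on the selected set. The sketch below assumes a precomputed candidate kernel matrix K and embedding vector mu, and omits the reweighting schemes analyzed in the paper.

    import numpy as np

    def kernel_herding(K, mu, T):
        # K: (M, M) kernel matrix among candidates; mu[i] = E_{X~p}[k(x_i, X)]
        selected = []
        for t in range(T):
            # penalize candidates already well covered by the selected points
            penalty = K[:, selected].sum(axis=1) / (t + 1) if selected else 0.0
            selected.append(int(np.argmax(mu - penalty)))
        return selected

    def bq_weights(K, mu, selected):
        # Bayesian-quadrature-style weights on the selected points:
        # w = K_SS^{-1} mu_S (assumes K_SS is nonsingular)
        S = np.array(selected)
        return np.linalg.solve(K[np.ix_(S, S)], mu[S])

The estimated integral is then the weighted sum of function values at the selected points, and the maximum mean discrepancy measures how far this weighted empirical measure is from the target.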
This paper presents new \emph{variance-aware} confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). With the new confidence sets, we obtain the following regret bounds: For linear bandits, we obtain an $\tilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{k=1}^{K}\sigma_k^2})$ data-dependent regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_k^2$ is the \emph{unknown} variance of the reward at the $k$-th round. This is the first regret bound that scales only with the variance and the dimension, with \emph{no explicit polynomial dependency on $K$}. When variances are small, this bound can be significantly smaller than the $\tilde{\Theta}\left(d\sqrt{K}\right)$ worst-case regret bound. For linear mixture MDPs, we obtain an $\tilde{O}(\mathrm{poly}(d, \log H)\sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the number of episodes, and $H$ is the planning horizon. This is the first regret bound that scales only \emph{logarithmically} with $H$ in the reinforcement learning with linear function approximation setting, thus \emph{exponentially improving} existing results, and resolving an open problem in \citet{zhou2020nearly}. We develop three technical ideas that may be of independent interest: 1) applications of the peeling technique to both the input norm and the variance magnitude, 2) a recursion-based estimator for the variance, and 3) a new convex potential lemma that generalizes the seminal elliptical potential lemma.
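At the core of variance-aware confidence sets is a variance-weighted regularized least-squares estimator. The sketch below shows only this weighted regression, with the variances supplied directly; the paper's actual contributions (the recursion-based variance estimator and the peeling argument that handles unknown variances) are omitted.

    import numpy as np

    def variance_weighted_ridge(X, r, sigma2, lam=1.0):
        # theta_hat = (sum_k x_k x_k^T / sigma_k^2 + lam I)^{-1}
        #             * sum_k x_k r_k / sigma_k^2
        # X: (K, d) feature rows, r: (K,) rewards, sigma2: (K,) variances
        W = 1.0 / sigma2                                 # per-round precision weights
        A = (X * W[:, None]).T @ X + lam * np.eye(X.shape[1])
        b = (X * W[:, None]).T @ r
        return np.linalg.solve(A, b)

Down-weighting high-variance rounds is what lets the resulting confidence set, and hence the regret, scale with the total variance $\sum_k \sigma_k^2$ instead of the number of rounds $K$.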
We propose the first non-trivial generic decoding algorithm for codes in the sum-rank metric. The new method combines ideas of well-known generic decoders in the Hamming and rank metrics. For the same code parameters and number of errors, the new generic decoder has a larger expected complexity than the known generic decoders for the Hamming metric and a smaller one than the known rank-metric decoders. Furthermore, we give a formal hardness reduction, providing evidence that generic sum-rank decoding is computationally hard. As a by-product, we solve some fundamental coding problems in the sum-rank metric: we give an algorithm to compute the exact size of a sphere of a given sum-rank radius, together with a closed-form upper bound, and we study erasure decoding with respect to two different notions of support.
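The sum-rank weight itself is straightforward to compute once a vector is split into blocks and each block is written as a matrix over the base field: it is the sum of the ranks of the block matrices. The sketch below works over GF(2) for simplicity; the general case over $\mathbb{F}_{q^m}$ is analogous.

    import numpy as np

    def rank_gf2(M):
        # rank of a 0/1 matrix over GF(2) via Gaussian elimination (XOR rows)
        M = M.copy() % 2
        rank, rows, cols = 0, M.shape[0], M.shape[1]
        for c in range(cols):
            pivot = next((r for r in range(rank, rows) if M[r, c]), None)
            if pivot is None:
                continue
            M[[rank, pivot]] = M[[pivot, rank]]   # swap pivot row into place
            for r in range(rows):
                if r != rank and M[r, c]:
                    M[r] ^= M[rank]               # eliminate below and above
            rank += 1
            if rank == rows:
                break
        return rank

    def sum_rank_weight(blocks):
        # sum-rank weight: sum of the ranks of the matrix representations
        # of the blocks (here binary matrices, i.e., q = 2)
        return sum(rank_gf2(B) for B in blocks)

For example, two 2x2 blocks of ranks 2 and 1 give a vector of sum-rank weight 3; the Hamming and rank metrics are recovered as the extreme cases of block length 1 and a single block, respectively.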
An empirical measure supported on the nearest neighbors of a given point, the nearest neighbor measure, is introduced and studied as a central statistical quantity. First, the resulting empirical process is shown to satisfy a uniform central limit theorem under a (local) bracketing entropy condition on the underlying class of functions (reflecting the localizing nature of the nearest neighbor algorithm). Second, a uniform non-asymptotic bound is established under a well-known condition on the uniform entropy numbers, often referred to as the Vapnik-Chervonenkis condition.
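Concretely, the nearest neighbor measure at a query point is the uniform empirical measure on the $k$ sample points closest to that point, so integrating a function against it is a local average. A minimal sketch using scikit-learn (the function f is assumed to act row-wise on the sample points):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def nn_measure_mean(X, x0, k, f):
        # mean of f under the nearest neighbor measure at x0:
        # the uniform measure on the k points of X closest to x0
        nn = NearestNeighbors(n_neighbors=k).fit(X)
        _, idx = nn.kneighbors(x0.reshape(1, -1))
        return f(X[idx[0]]).mean()

The empirical process studied in the paper is this quantity viewed as a process indexed by the function f ranging over a class, which is why entropy conditions on that class drive both results.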
Asynchronous momentum stochastic gradient descent (Async-MSGD) is one of the most popular algorithms in distributed machine learning. However, its convergence properties for complicated nonconvex problems are still largely unknown because of current technical limitations. Therefore, in this paper, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem, streaming PCA, which helps us to understand Async-MSGD better even for more general problems. Specifically, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA by diffusion approximation. Our results indicate a fundamental tradeoff between asynchrony and momentum: to ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt at understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.
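For reference, the synchronous special case of MSGD for streaming PCA is Oja's update with a heavy-ball momentum term; the asynchronous delays analyzed in the paper are omitted, and the step size and momentum values here are placeholders.

    import numpy as np

    def oja_momentum(stream, d, eta=0.01, mu=0.5):
        # streaming PCA (top eigenvector) via Oja's rule with momentum;
        # a synchronous sketch of MSGD for this problem
        rng = np.random.default_rng(0)
        w = rng.standard_normal(d); w /= np.linalg.norm(w)
        v = np.zeros(d)
        for x in stream:
            g = x * (x @ w)              # stochastic gradient x x^T w
            v = mu * v + g               # heavy-ball momentum accumulation
            w = w + eta * v
            w /= np.linalg.norm(w)       # project back to the unit sphere
        return w

The paper's tradeoff says that when the gradients g arrive with asynchronous staleness, the momentum parameter mu must be reduced relative to this synchronous baseline to retain convergence.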
Policy gradient methods are widely used in reinforcement learning to search for better policies in a parameterized policy space. They perform gradient search in the policy space and are known to converge very slowly. Nesterov developed an accelerated gradient search algorithm for convex optimization problems, which has recently been extended to non-convex and stochastic optimization. We use Nesterov's acceleration for the policy gradient search in the well-known actor-critic algorithm and show its convergence using the ODE method. We test this algorithm on a scheduling problem, in which an incoming job is scheduled into one of four queues based on the queue lengths. Experimental results show that the algorithm using Nesterov's acceleration performs significantly better than the one without acceleration. To the best of our knowledge, this is the first time Nesterov's acceleration has been used with the actor-critic algorithm.
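The update being plugged into the actor is the standard Nesterov step, which evaluates the gradient at a look-ahead point. In the actor-critic algorithm, grad_fn below would be the critic-based policy gradient estimate; this sketch shows only the generic update rule, with placeholder step size and momentum.

    import numpy as np

    def nesterov_step(theta, v, grad_fn, eta=0.01, mu=0.9):
        # minimization convention; for policy gradient ascent, negate grad_fn
        g = grad_fn(theta + mu * v)   # gradient at the look-ahead point
        v = mu * v - eta * g          # velocity update
        return theta + v, v           # new parameters and velocity

In the two-timescale actor-critic setting, this accelerated update replaces the plain gradient step on the (slower) actor timescale while the critic continues its usual temporal-difference updates.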
In this paper, we study the frequentist convergence rate for Latent Dirichlet Allocation (Blei et al., 2003) topic models. We show that the maximum likelihood estimator converges to one of the finitely many equivalent parameters in the Wasserstein distance at a rate of $n^{-1/4}$, without assuming separability or non-degeneracy of the underlying topics or the existence of more than three words per document, thus generalizing the previous works of Anandkumar et al. (2012, 2014) from an information-theoretic perspective. We also show that the $n^{-1/4}$ convergence rate is optimal in the worst case.