We study the complexity of optimizing highly smooth convex functions. For a positive integer $p$, we want to find an $\epsilon$-approximate minimum of a convex function $f$, given oracle access to the function and its first $p$ derivatives, assuming that the $p$th derivative of $f$ is Lipschitz. Recently, three independent research groups (Jiang et al., PLMR 2019; Gasnikov et al., PLMR 2019; Bubeck et al., PLMR 2019) developed a new algorithm that solves this problem with $\tilde{O}(1/\epsilon^{\frac{2}{3p+1}})$ oracle calls for constant $p$. This is known to be optimal (up to log factors) for deterministic algorithms, but known lower bounds for randomized algorithms do not match this bound. We prove a new lower bound that matches this bound (up to log factors), and holds not only for randomized algorithms, but also for quantum algorithms.
This paper investigates why it is beneficial, when solving a problem, to search in the neighbourhood of a current solution. The paper identifies properties of problems and neighbourhoods that support two novel proofs that neighbourhood search is beneficial over blind search. These are: firstly a proof that search within the neighbourhood is more likely to find an improving solution in a single search step than blind search; and secondly a proof that a local improvement, using a sequence of neighbourhood search steps, is likely to achieve a greater improvement than a sequence of blind search steps. To explore the practical impact of these properties, a range of problem sets and neighbourhoods are generated, where these properties are satisfied to different degrees. Experiments reveal that the benefits of neighbourhood search vary dramatically in consequence. Random problems of a classical combinatorial optimisation problem are analysed, in order to demonstrate that the underlying theory is reflected in practice.
In this article, we construct and analyse an explicit numerical splitting method for a class of semi-linear stochastic differential equations (SDEs) with additive noise, where the drift is allowed to grow polynomially and satisfies a global one-sided Lipschitz condition. The method is proved to be mean-square convergent of order 1 and to preserve important structural properties of the SDE. First, it is hypoelliptic in every iteration step. Second, it is geometrically ergodic and has an asymptotically bounded second moment. Third, it preserves oscillatory dynamics, such as amplitudes, frequencies and phases of oscillations, even for large time steps. Our results are illustrated on the stochastic FitzHugh-Nagumo model and compared with known mean-square convergent tamed/truncated variants of the Euler-Maruyama method. The capability of the proposed splitting method to preserve the aforementioned properties may make it applicable within different statistical inference procedures. In contrast, known Euler-Maruyama type methods commonly fail in preserving such properties, yielding ill-conditioned likelihood-based estimation tools or computationally infeasible simulation-based inference algorithms.
Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug in extension to existing methods for GP posterior sampling and Bayesian optimization.
Recently a machine learning approach to Monte-Carlo simulations called Neural Markov Chain Monte-Carlo (NMCMC) is gaining traction. In its most popular form it uses the neural networks to construct normalizing flows which are then trained to approximate the desired target distribution. As this distribution is usually defined via a Hamiltonian or action, the standard learning algorithm requires estimation of the action gradient with respect to the fields. In this contribution we present another gradient estimator (and the corresponding [PyTorch implementation) that avoids this calculation, thus potentially speeding up training for models with more complicated actions. We also study the statistical properties of several gradient estimators and show that our formulation leads to better training results.
We present substantially generalized and improved quantum algorithms over prior work for inhomogeneous linear and nonlinear ordinary differential equations (ODE). In Berry et al., (2017), a quantum algorithm for a certain class of linear ODEs is given, where the matrix involved needs to be diagonalizable. The quantum algorithm for linear ODEs presented here extends to many classes of non-diagonalizable matrices. The algorithm here can also be exponentially faster for certain classes of diagonalizable matrices. Our linear ODE algorithm is then applied to nonlinear differential equations using Carleman linearization (an approach taken recently by us in Liu et al., (2021)). The improvement over that result is two-fold. First, we obtain an exponentially better dependence on error. This kind of logarithmic dependence on error has also been achieved by Xue et al., (2021), but only for homogeneous nonlinear equations. Second, the present algorithm can handle any sparse, invertible matrix (that models dissipation) if it has a negative log-norm (including non-diagonalizable matrices), whereas Liu et al., (2021) and Xue et al., (2021) additionally require normality.
This paper studies Quasi Maximum Likelihood estimation of Dynamic Factor Models for large panels of time series. Specifically, we consider the case in which the autocorrelation of the factors is explicitly accounted for, and therefore the model has a state-space form. Estimation of the factors and their loadings is implemented through the Expectation Maximization (EM) algorithm, jointly with the Kalman smoother.~We prove that as both the dimension of the panel $n$ and the sample size $T$ diverge to infinity, up to logarithmic terms: (i) the estimated loadings are $\sqrt T$-consistent and asymptotically normal if $\sqrt T/n\to 0$; (ii) the estimated factors are $\sqrt n$-consistent and asymptotically normal if $\sqrt n/T\to 0$; (iii) the estimated common component is $\min(\sqrt n,\sqrt T)$-consistent and asymptotically normal regardless of the relative rate of divergence of $n$ and $T$. Although the model is estimated as if the idiosyncratic terms were cross-sectionally and serially uncorrelated and normally distributed, we show that these mis-specifications do not affect consistency. Moreover, the estimated loadings are asymptotically as efficient as those obtained with the Principal Components estimator, while the estimated factors are more efficient if the idiosyncratic covariance is sparse enough.~We then propose robust estimators of the asymptotic covariances, which can be used to conduct inference on the loadings and to compute confidence intervals for the factors and common components. Finally, we study the performance of our estimators and we compare them with the traditional Principal Components approach through MonteCarlo simulations and analysis of US macroeconomic data.
Differential equations arising in many practical applications are characterized by multiple time scales. Multirate time integration seeks to solve them efficiently by discretizing each scale with a different, appropriate time step, while ensuring the overall accuracy and stability of the numerical solution. In a seminal paper Knoth and Wolke (APNUM, 1998) proposed a hybrid solution approach: discretize the slow component with an explicit Runge-Kutta method, and advance the fast component via a modified fast differential equation. The idea led to the development of multirate infinitesimal step (MIS) methods by Wensch et al. (BIT, 2009.)G\"{u}nther and Sandu (BIT, 2016) explained MIS schemes as a particular case of multirate General-structure Additive Runge-Kutta (MR-GARK) methods. The hybrid approach offers extreme flexibility in the choice of the numerical solution process for the fast component. This work constructs a family of multirate infinitesimal GARK schemes (MRI-GARK) that extends the hybrid dynamics approachin multiple ways. Order conditions theory and stability analyses are developed, and practical explicit and implicit methods of up to order four are constructed. Numerical results confirm the theoretical findings. We expect the new MRI-GARK family to be most useful for systems of equations with widely disparate time scales, where the fast process is dispersive, and where the influence of the fast component on the slow dynamics is weak.
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
This work considers the problem of provably optimal reinforcement learning for episodic finite horizon MDPs, i.e. how an agent learns to maximize his/her long term reward in an uncertain environment. The main contribution is in providing a novel algorithm --- Variance-reduced Upper Confidence Q-learning (vUCQ) --- which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where the $T$ is the number of time steps the agent acts in the MDP, $S$ is the number of states, $A$ is the number of actions, and $H$ is the (episodic) horizon time. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve $\epsilon$-average regret for any constant $\epsilon$ is $O(SA)$, which is a number of samples that is far less than that required to learn any non-trivial estimate of the transition model (the transition model is specified by $O(S^2A)$ parameters). The importance of sub-linear algorithms is largely the motivation for algorithms such as $Q$-learning and other "model free" approaches. vUCQ algorithm also enjoys minimax optimal regret in the long run, matching the $\Omega(\sqrt{HSAT})$ lower bound. Variance-reduced Upper Confidence Q-learning (vUCQ) is a successive refinement method in which the algorithm reduces the variance in $Q$-value estimates and couples this estimation scheme with an upper confidence based algorithm. Technically, the coupling of both of these techniques is what leads to the algorithm enjoying both the sub-linear regret property and the asymptotically optimal regret.
In this paper, we study the optimal convergence rate for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely: the function $F(\xb) \triangleq \sum_{i=1}^{m}f_i(\xb)$ is strongly convex and smooth, either strongly convex or smooth or just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors) with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions to the proposed setup such as proximal friendly functions, time-varying graphs, improvement of the condition numbers.