There has been substantial interest in developing Markov chain Monte Carlo algorithms based on piecewise-deterministic Markov processes. However existing algorithms can only be used if the target distribution of interest is differentiable everywhere. The key to adapting these algorithms so that they can sample from to densities with discontinuities is defining appropriate dynamics for the process when it hits a discontinuity. We present a simple condition for the transition of the process at a discontinuity which can be used to extend any existing sampler for smooth densities, and give specific choices for this transition which work with popular algorithms such as the Bouncy Particle Sampler, the Coordinate Sampler and the Zig-Zag Process. Our theoretical results extend and make rigorous arguments that have been presented previously, for instance constructing samplers for continuous densities restricted to a bounded domain, and we present a version of the Zig-Zag Process that can work in such a scenario. Our novel approach to deriving the invariant distribution of a piecewise-deterministic Markov process with boundaries may be of independent interest.
Hamilton and Moitra (2021) showed that, in certain regimes, it is not possible to accelerate Riemannian gradient descent in the hyperbolic plane if we restrict ourselves to algorithms which make queries in a (large) bounded domain and which receive gradients and function values corrupted by a (small) amount of noise. We show that acceleration remains unachievable for any deterministic algorithm which receives exact gradient and function-value information (unbounded queries, no noise). Our results hold for the classes of strongly and nonstrongly geodesically convex functions, and for a large class of Hadamard manifolds including hyperbolic spaces and the symmetric space $\mathrm{SL}(n) / \mathrm{SO}(n)$ of positive definite $n \times n$ matrices of determinant one. This cements a surprising gap between the complexity of convex optimization and geodesically convex optimization: for hyperbolic spaces, Riemannian gradient descent is optimal on the class of smooth and and strongly geodesically convex functions, in the regime where the condition number scales with the radius of the optimization domain. The key idea for proving the lower bound consists of perturbing the hard functions of Hamilton and Moitra (2021) with sums of bump functions chosen by a resisting oracle.
In the graphical calculus of planar string diagrams, equality is generated by exchange moves, which swap the heights of adjacent vertices. We show that left- and right-handed exchanges each give strongly normalizing rewrite strategies for connected string diagrams. We use this result to give a linear-time solution to the equivalence problem in the connected case, and a quadratic solution in the general case. We also give a stronger proof of the Joyal-Street coherence theorem, settling Selinger's conjecture on recumbent isotopy.
The present paper focuses on the problem of sampling from a given target distribution $\pi$ defined on some general state space. To this end, we introduce a novel class of non-reversible Markov chains, each chain being defined on an extended state space and having an invariant probability measure admitting $\pi$ as a marginal distribution. The proposed methodology is inspired by a new formulation of Kac's theorem and allows global and local dynamics to be smoothly combined. Under mild conditions, the corresponding Markov transition kernel can be shown to be irreducible and Harris recurrent. In addition, we establish that geometric ergodicity holds under appropriate conditions on the global and local dynamics. Finally, we illustrate numerically the use of the proposed method and its potential benefits in comparison to existing Markov chain Monte Carlo (MCMC) algorithms.
Due to the multi-linearity of tensors, most algorithms for tensor optimization problems are designed based on the block coordinate descent method. Such algorithms are widely employed by practitioners for their implementability and effectiveness. However, these algorithms usually suffer from the lack of theoretical guarantee of global convergence and analysis of convergence rate. In this paper, we propose a block coordinate descent type algorithm for the low rank partially orthogonal tensor approximation problem and analyse its convergence behaviour. To achieve this, we carefully investigate the variety of low rank partially orthogonal tensors and its geometric properties related to the parameter space, which enable us to locate KKT points of the concerned optimization problem. With the aid of these geometric properties, we prove without any assumption that: (1) Our algorithm converges globally to a KKT point; (2) For any given tensor, the algorithm exhibits an overall sublinear convergence with an explicit rate which is sharper than the usual $O(1/k)$ for first order methods in nonconvex optimization; {(3)} For a generic tensor, our algorithm converges $R$-linearly.
In this paper, we study the learning of safe policies in the setting of reinforcement learning problems. This is, we aim to control a Markov Decision Process (MDP) of which we do not know the transition probabilities, but we have access to sample trajectories through experience. We define safety as the agent remaining in a desired safe set with high probability during the operation time. We therefore consider a constrained MDP where the constraints are probabilistic. Since there is no straightforward way to optimize the policy with respect to the probabilistic constraint in a reinforcement learning framework, we propose an ergodic relaxation of the problem. The advantages of the proposed relaxation are threefold. (i) The safety guarantees are maintained in the case of episodic tasks and they are kept up to a given time horizon for continuing tasks. (ii) The constrained optimization problem despite its non-convexity has arbitrarily small duality gap if the parametrization of the policy is rich enough. (iii) The gradients of the Lagrangian associated with the safe-learning problem can be easily computed using standard policy gradient results and stochastic approximation tools. Leveraging these advantages, we establish that primal-dual algorithms are able to find policies that are safe and optimal. We test the proposed approach in a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and the required safety levels.
Graph neural networks (GNNs) are typically applied to static graphs that are assumed to be known upfront. This static input structure is often informed purely by insight of the machine learning practitioner, and might not be optimal for the actual task the GNN is solving. In absence of reliable domain expertise, one might resort to inferring the latent graph structure, which is often difficult due to the vast search space of possible graphs. Here we introduce Pointer Graph Networks (PGNs) which augment sets or graphs with additional inferred edges for improved model expressivity. PGNs allow each node to dynamically point to another node, followed by message passing over these pointers. The sparsity of this adaptable graph structure makes learning tractable while still being sufficiently expressive to simulate complex algorithms. Critically, the pointing mechanism is directly supervised to model long-term sequences of operations on classical data structures, incorporating useful structural inductive biases from theoretical computer science. Qualitatively, we demonstrate that PGNs can learn parallelisable variants of pointer-based data structures, namely disjoint set unions and link/cut trees. PGNs generalise out-of-distribution to 5x larger test inputs on dynamic graph connectivity tasks, outperforming unrestricted GNNs and Deep Sets.
Importance sampling is one of the most widely used variance reduction strategies in Monte Carlo rendering. In this paper, we propose a novel importance sampling technique that uses a neural network to learn how to sample from a desired density represented by a set of samples. Our approach considers an existing Monte Carlo rendering algorithm as a black box. During a scene-dependent training phase, we learn to generate samples with a desired density in the primary sample space of the rendering algorithm using maximum likelihood estimation. We leverage a recent neural network architecture that was designed to represent real-valued non-volume preserving ('Real NVP') transformations in high dimensional spaces. We use Real NVP to non-linearly warp primary sample space and obtain desired densities. In addition, Real NVP efficiently computes the determinant of the Jacobian of the warp, which is required to implement the change of integration variables implied by the warp. A main advantage of our approach is that it is agnostic of underlying light transport effects, and can be combined with many existing rendering techniques by treating them as a black box. We show that our approach leads to effective variance reduction in several practical scenarios.
Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, such as the simplex, the time-discretisation error can dominate when we are near the boundary of the space. We demonstrate that while current SGMCMC methods for the simplex perform well in certain cases, they struggle with sparse simplex spaces; when many of the components are close to zero. However, most popular large-scale applications of Bayesian inference on simplex spaces, such as network or topic models, are sparse. We argue that this poor performance is due to the biases of SGMCMC caused by the discretization error. To get around this, we propose the stochastic CIR process, which removes all discretization error and we prove that samples from the stochastic CIR process are asymptotically unbiased. Use of the stochastic CIR process within a SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.
We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component, we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy, and also improved computation complexity than existing moment based mixture model algorithms (e.g. tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms showing significant improvement in runtime and accuracy.
In this paper, we study the optimal convergence rate for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely: the function $F(\xb) \triangleq \sum_{i=1}^{m}f_i(\xb)$ is strongly convex and smooth, either strongly convex or smooth or just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors) with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions to the proposed setup such as proximal friendly functions, time-varying graphs, improvement of the condition numbers.