The Wasserstein barycenter has been widely studied in various fields, including natural language processing and computer vision. However, solving the Wasserstein barycenter problem is computationally expensive because computing the Wasserstein distance requires time quadratic in the number of supports. By contrast, the Wasserstein distance on a tree, called the tree-Wasserstein distance, can be computed in linear time and allows for the fast comparison of a large number of distributions. In this study, we propose a barycenter under the tree-Wasserstein distance, called the fixed support tree-Wasserstein barycenter (FS-TWB), and its extension, called the fixed support tree-sliced Wasserstein barycenter (FS-TSWB). More specifically, we first show that the FS-TWB and FS-TSWB problems are convex optimization problems and can be solved by projected subgradient descent. Moreover, we propose a more efficient algorithm to compute the subgradient and objective function value by using the properties of tree-Wasserstein barycenter problems. Through real-world experiments, we show that, by using the proposed algorithm, the FS-TWB and FS-TSWB can be solved two orders of magnitude faster than the original Wasserstein barycenter.
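
The linear-time claim for the tree-Wasserstein distance can be illustrated with a minimal sketch. It uses the standard closed form, in which the distance is the sum over edges of the edge length times the absolute difference in the mass that the two distributions place on the subtree below that edge. The function name `tree_wasserstein` and the array encoding of the tree (a `parent` list with `parent[v] < v` and root 0) are illustrative assumptions, not the authors' implementation, and the sketch covers only the distance, not the barycenter solver.

```python
def tree_wasserstein(parent, weight, mu, nu):
    """Tree-Wasserstein distance between mu and nu in O(number of nodes).

    Assumed encoding (hypothetical, for illustration only):
      parent[v] -- index of the parent of node v; node 0 is the root (parent -1)
                   and every other node satisfies parent[v] < v
      weight[v] -- length of the edge between v and parent[v] (weight[0] unused)
      mu, nu    -- mass placed on each node by the two distributions
    """
    n = len(parent)
    # diff[v] accumulates (mu - nu) over the subtree rooted at v.
    diff = [mu[v] - nu[v] for v in range(n)]
    total = 0.0
    # Visit children before parents so each subtree sum is complete when used.
    for v in range(n - 1, 0, -1):
        total += weight[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total


# Path 0 - 1 - 2 with edge lengths 1 and 2: moving all mass from node 0
# to node 2 crosses both edges.
path_distance = tree_wasserstein([-1, 0, 1], [0.0, 1.0, 2.0],
                                 [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

Each node is visited once, so the cost is linear in the tree size, in contrast to the quadratic cost the abstract attributes to the general Wasserstein distance.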
We study a new two-time-scale stochastic gradient method for solving optimization problems, where the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying Markov random processes parameterized by the underlying optimization variable. These time-varying samples make gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one scale is to estimate the true gradient from these samples, which is then used to update the estimate of the optimal solution. While these two iterates are implemented simultaneously, the former is updated "faster" (using bigger step sizes) than the latter (using smaller step sizes). Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, convexity, the Polyak-Lojasiewicz condition, and general non-convexity. We apply our framework to two problems in control and reinforcement learning. First, we look at the standard online actor-critic algorithm over finite state and action spaces and derive a convergence rate of O(k^(-2/5)), which recovers the best known rate derived specifically for this problem. Second, we study an online actor-critic algorithm for the linear-quadratic regulator and show that a convergence rate of O(k^(-2/3)) is achieved. This is the first time such a result is known in the literature. Finally, we support our theoretical analysis with numerical simulations where the convergence rates are visualized.
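
The two-time-scale idea described above can be sketched on a toy problem. This is a heavily simplified illustration under stated assumptions: the objective is the scalar quadratic 0.5 * (x - target)^2, the samples are i.i.d. Gaussian perturbations of the gradient rather than samples from a time-varying Markov process, and the step-size exponents are arbitrary choices, not the ones analyzed in the paper. The function name `two_time_scale_sgd` is hypothetical.

```python
import random


def two_time_scale_sgd(steps=5000, seed=0, target=3.0):
    """Toy two-time-scale scheme on f(x) = 0.5 * (x - target)**2.

    y is the fast iterate (a running estimate of the gradient), and
    x is the slow iterate (the estimate of the minimizer).
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for k in range(steps):
        alpha = 0.5 / (k + 1) ** 0.75   # slow step size (assumed schedule)
        beta = 1.0 / (k + 1) ** 0.5     # fast step size, decays more slowly
        # Noisy gradient sample; i.i.d. noise stands in for Markov samples.
        sample = (x - target) + rng.gauss(0.0, 1.0)
        # Both iterates are updated simultaneously from the old values.
        x, y = x - alpha * y, y + beta * (sample - y)
    return x


x_final = two_time_scale_sgd()
```

Because `beta` decays more slowly than `alpha`, the gradient estimate `y` tracks the current gradient faster than `x` moves, which is the separation of time scales the abstract refers to.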
Stein variational gradient descent (SVGD) is a general-purpose optimization-based sampling algorithm that has recently exploded in popularity, but is limited by two issues: it is known to produce biased samples, and it can be slow to converge on complicated distributions. A recently proposed stochastic variant of SVGD (sSVGD) addresses the first issue, producing unbiased samples by incorporating a special noise into the SVGD dynamics such that asymptotic convergence is guaranteed. Meanwhile, Stein variational Newton (SVN), a Newton-like extension of SVGD, dramatically accelerates the convergence of SVGD by incorporating Hessian information into the dynamics, but also produces biased samples. In this paper we derive, and provide a practical implementation of, a stochastic variant of SVN (sSVN) which is both asymptotically correct and converges rapidly. We demonstrate the effectiveness of our algorithm on a difficult class of test problems -- the Hybrid Rosenbrock density -- and show that sSVN converges using three orders of magnitude fewer gradient evaluations of the log likelihood than its stochastic SVGD counterpart. Our results show that sSVN is a promising approach to accelerating high-precision Bayesian inference tasks in modest dimensions, $d\sim\mathcal{O}(10)$.
We propose a decomposition method for the spectral peaks in an observed frequency spectrum, which is efficiently acquired by utilizing the Fast Fourier Transform. In contrast to the traditional methods of waveform fitting on the spectrum, we optimize the problem from a more robust perspective. We model the peaks in the spectrum as pseudo-symmetric functions, where the only constraint is a nonincreasing behavior around a central frequency as the distance from it increases. Our approach is more robust against arbitrary distortion, interference and noise on the spectrum that may be caused by an observation system. The time complexity of our method is linear, i.e., $O(N)$ per extracted spectral peak. Moreover, the decomposed spectral peaks show a pseudo-orthogonal behavior, where they conform to a power preserving equality.
Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for monotone loss functions used for classification, such as the logistic loss. We prove our result on a fixed dataset, for sampling both with and without replacement. Furthermore, for logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the $L_2$ max margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$, similarly to gradient descent. Lastly, we examine the case of a fixed learning rate proportional to the minibatch size. We prove that in this case, the asymptotic convergence rate of SGD (with replacement) does not depend on the minibatch size in terms of epochs, if the support vectors span the data. These results may suggest an explanation for similar behaviors observed in deep networks, when trained with SGD.
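
The setting of this result can be reproduced in miniature. The sketch below runs single-sample SGD (with replacement) at a fixed learning rate on a homogeneous linear classifier with logistic loss. The four-point dataset `DATA`, the learning rate, the step count, and the function names are all assumptions chosen for illustration; the dataset is linearly separable and symmetric about the diagonal, so its max-margin direction is (1, 1)/sqrt(2). The code only illustrates the behavior; it is not evidence for the stated rates.

```python
import math
import random

# Hypothetical linearly separable toy data: (features, label in {+1, -1}).
DATA = [((2.0, 1.0), 1), ((1.0, 2.0), 1), ((-2.0, -1.0), -1), ((-1.0, -2.0), -1)]


def logistic_loss(w, data):
    """Mean logistic loss log(1 + exp(-y * <w, x>)) over the dataset."""
    return sum(
        math.log1p(math.exp(-y * (w[0] * x[0] + w[1] * x[1]))) for x, y in data
    ) / len(data)


def sgd_fixed_lr(data, lr=0.1, steps=2000, seed=0):
    """Single-sample SGD with replacement and a constant learning rate."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(steps):
        x, y = data[rng.randrange(len(data))]
        margin = y * (w[0] * x[0] + w[1] * x[1])
        # Negative gradient of log(1 + exp(-margin)) is y * x / (1 + exp(margin)).
        scale = lr * y / (1.0 + math.exp(margin))
        w[0] += scale * x[0]
        w[1] += scale * x[1]
    return w


w_final = sgd_fixed_lr(DATA)
final_loss = logistic_loss(w_final, DATA)
# Cosine between w_final and the max-margin direction (1, 1) / sqrt(2).
cosine = (w_final[0] + w_final[1]) / (math.sqrt(2.0) * math.hypot(w_final[0], w_final[1]))
```

In this toy run the loss keeps shrinking without any learning-rate decay while the norm of `w_final` grows, and `cosine` stays close to 1, in line with the convergence in direction described above.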
We study the decentralized consensus and stochastic optimization problems with compressed communications over static directed graphs. We propose an iterative gradient-based algorithm that compresses messages according to a desired compression ratio. The proposed method provably reduces the communication overhead on the network at every communication round. Contrary to existing literature, we allow for arbitrary compression ratios in the communicated messages. We show a linear convergence rate for the proposed method on the consensus problem. Moreover, we provide explicit convergence rates for decentralized stochastic optimization problems on smooth functions that are either (i) strongly convex, (ii) convex, or (iii) non-convex. Finally, we provide numerical experiments to illustrate convergence under arbitrary compression ratios and the communication efficiency of our algorithm.
In this paper, we propose a PAC-Bayesian \textit{a posteriori} parameter selection scheme for adaptive regularized regression in Hilbert scales under general, unknown source conditions. We demonstrate that our approach is adaptive to misspecification, and achieves the optimal learning rate under subgaussian noise. Unlike existing parameter selection schemes, the computational complexity of our approach is independent of sample size. We derive minimax adaptive rates for a new, broad class of Tikhonov-regularized learning problems under general, misspecified source conditions, that notably do not require any conventional a priori assumptions on kernel eigendecay. Using the theory of interpolation, we demonstrate that the spectrum of the Mercer operator can be inferred in the presence of "tight" $L^{\infty}$ embeddings of suitable Hilbert scales. Finally, we prove that, under a $\Delta_2$ condition on the smoothness index functions, our PAC-Bayesian scheme can indeed achieve minimax rates. We discuss applications of our approach to statistical inverse problems and oracle-efficient contextual bandit algorithms.
We study the acceleration of the Local Polynomial Interpolation-based Gradient Descent method (LPI-GD) recently proposed for the approximate solution of empirical risk minimization problems (ERM). We focus on loss functions that are strongly convex and smooth with condition number $\sigma$. We additionally assume the loss function is $\eta$-H\"older continuous with respect to the data. The oracle complexity of LPI-GD is $\tilde{O}\left(\sigma m^d \log(1/\varepsilon)\right)$ for a desired accuracy $\varepsilon$, where $d$ is the dimension of the parameter space, and $m$ is the cardinality of an approximation grid. The factor $m^d$ can be shown to scale as $O((1/\varepsilon)^{d/2\eta})$. LPI-GD has been shown to have better oracle complexity than gradient descent (GD) and stochastic gradient descent (SGD) for certain parameter regimes. We propose two accelerated methods for the ERM problem based on LPI-GD and show an oracle complexity of $\tilde{O}\left(\sqrt{\sigma} m^d \log(1/\varepsilon)\right)$. Moreover, we provide the first empirical study on local polynomial interpolation-based gradient methods and corroborate that LPI-GD has better performance than GD and SGD in some scenarios, and the proposed methods achieve acceleration.
Vector Perturbation Precoding (VPP) can speed up downlink data transmissions in Large and Massive Multi-User MIMO systems but is known to be NP-hard. While there are several algorithms in the literature for VPP under total power constraint, they are not applicable for VPP under per-antenna power constraint. This paper proposes a novel, parallel tree search algorithm for VPP under per-antenna power constraint, called \emph{\textbf{TreeStep}}, to find good quality solutions to the VPP problem with practical computational complexity. We show that our method can provide a large performance gain over simple linear precoding schemes such as Regularised Zero Forcing. We evaluate TreeStep for several large MIMO~($16\times16$ and $24\times24$) and massive MIMO~($16\times32$ and $24\times 48$) systems and demonstrate that TreeStep outperforms the popular polynomial-time VPP algorithm, the Fixed Complexity Sphere Encoder, by achieving the extremely low BER of $10^{-6}$ at a much lower SNR.
Consider the problem of training robustly capable agents. One approach is to generate a diverse collection of agent policies. Training can then be viewed as a quality diversity (QD) optimization problem, where we search for a collection of performant policies that are diverse with respect to quantified behavior. Recent work shows that differentiable quality diversity (DQD) algorithms greatly accelerate QD optimization when exact gradients are available. However, agent policies typically assume that the environment is not differentiable. To apply DQD algorithms to training agent policies, we must approximate gradients for performance and behavior. We propose two variants of the current state-of-the-art DQD algorithm that compute gradients via approximation methods common in reinforcement learning (RL). We evaluate our approach on four simulated locomotion tasks. One variant achieves results comparable to the current state-of-the-art in combining QD and RL, while the other performs comparably in two locomotion tasks. These results provide insight into the limitations of current DQD algorithms in domains where gradients must be approximated. Source code is available at //github.com/icaros-usc/dqd-rl
We propose a simple yet powerful extension of Bayesian Additive Regression Trees which we name Hierarchical Embedded BART (HE-BART). The model allows for random effects to be included at the terminal node level of a set of regression trees, making HE-BART a non-parametric alternative to mixed effects models which avoids the need for the user to specify the structure of the random effects in the model, whilst maintaining the prediction and uncertainty calibration properties of standard BART. Using simulated and real-world examples, we demonstrate that this new extension yields superior predictions for many of the standard mixed effects models' example data sets, and yet still provides consistent estimates of the random effect variances. In a future version of this paper, we will outline its use in larger, more advanced data sets and structures.