Random forests remain among the most popular off-the-shelf supervised learning algorithms. Despite their well-documented empirical success, until recently few theoretical results were available to describe their performance and behavior. In this work we push beyond recent work on consistency and asymptotic normality by establishing rates of convergence for random forests and other supervised learning ensembles. We develop the notion of generalized U-statistics and show that within this framework, random forest predictions can remain asymptotically normal for larger subsample sizes than previously established. We also provide Berry-Esseen bounds to quantify the rate at which this convergence occurs, making explicit the roles of the subsample size and the number of trees in determining the distribution of random forest predictions.
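As a point of reference, the object analyzed here is the subsample-and-average form of a random forest prediction: each tree is built on a random subsample of size $k$ and the $B$ tree predictions are averaged. The minimal Python sketch below (using scikit-learn's DecisionTreeRegressor; the function name and parameters are illustrative, not taken from the paper) spells out this prediction rule, whose distribution the subsample size and number of trees govern.

```python
# Minimal sketch of a subsample-and-average tree ensemble (illustrative, not the paper's code).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def subsampled_forest_predict(X, y, X_test, k=100, B=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.choice(n, size=k, replace=False)   # subsample of size k, without replacement
        tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                        # ensemble prediction: average over B trees
```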
Markov Chain Monte Carlo (MCMC) is one of the most powerful methods for sampling from a given probability distribution; the Metropolis Adjusted Langevin Algorithm (MALA) is a variant in which the gradient of the log-density is exploited for faster convergence. However, being set up in the Euclidean framework, MALA may perform poorly in higher dimensional problems or in those involving anisotropic densities, as the underlying non-Euclidean aspects of the geometry of the sample space remain unaccounted for. We make use of concepts from differential geometry and stochastic calculus on Riemannian manifolds to geometrically adapt a stochastic differential equation with a non-trivial drift term. This adaptation is also referred to as a stochastic development. We apply this method specifically to the Langevin diffusion equation and arrive at a geometrically adapted Langevin algorithm (GALA). This new approach far outperforms MALA, certain manifold variants of MALA, and other approaches such as Hamiltonian Monte Carlo (HMC) and its adaptive variant, the no-U-turn sampler (NUTS) as implemented in Stan, especially as the dimension of the problem increases, where GALA is often the only successful method. This is evidenced through several numerical examples that include parameter estimation for a broad class of probability distributions and a logistic regression problem.
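For context, one step of standard (Euclidean) MALA, the baseline that the geometric adaptation improves upon, can be sketched as follows; `log_p` and `grad_log_p` are assumed to be user-supplied callables and the step size `tau` is illustrative.

```python
# One step of standard Euclidean MALA (baseline sketch; not the geometrically adapted variant).
import numpy as np

def mala_step(x, log_p, grad_log_p, tau, rng):
    # Langevin proposal: drift along the gradient of the log-density plus Gaussian noise.
    prop = x + tau * grad_log_p(x) + np.sqrt(2.0 * tau) * rng.standard_normal(x.shape)

    # Metropolis-Hastings correction with the asymmetric Gaussian proposal density.
    def log_q(a, b):  # log density (up to a constant) of proposing a when at b
        return -np.sum((a - b - tau * grad_log_p(b)) ** 2) / (4.0 * tau)

    log_alpha = log_p(prop) + log_q(x, prop) - log_p(x) - log_q(prop, x)
    return prop if np.log(rng.uniform()) < log_alpha else x
```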
This paper considers the strong error analysis of the Euler and fast Euler methods for nonlinear overdamped generalized Langevin equations driven by fractional noise. The main difficulty lies in handling the interaction between the fractional Brownian motion and the singular kernel, which is overcome by means of Malliavin calculus and fine estimates of several multiple singular integrals. Consequently, these two methods are proved to be strongly convergent with order nearly $\min\{2(H+\alpha-1), \alpha\}$, where $H \in (1/2,1)$ and $\alpha\in(1-H,1)$ respectively characterize the singularity levels of the fractional noise and the singular kernel in the underlying equation. This result improves the existing convergence order $H+\alpha-1$ of Euler methods for the nonlinear case, and gives a positive answer to the open problem raised in [4]. As an application of the theoretical findings, we further investigate the complexity of the multilevel Monte Carlo simulation based on the fast Euler method, which turns out to perform better than the standard Monte Carlo simulation when computing the expectation of functionals of the considered equation.
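Read concretely, "strongly convergent with order nearly $\min\{2(H+\alpha-1),\alpha\}$" amounts to a strong error bound of the schematic form below, where $X$ denotes the exact solution, $X_h$ the (fast) Euler approximation with step size $h$, and $\varepsilon>0$ can be taken arbitrarily small at the price of a larger constant; this is a paraphrase of the stated result, not the paper's precise theorem.

$$\sup_{0\le t\le T}\big(\mathbb{E}\,|X(t)-X_h(t)|^2\big)^{1/2} \;\le\; C_{\varepsilon}\, h^{\min\{2(H+\alpha-1),\,\alpha\}-\varepsilon},$$

to be compared with the previously available rate $h^{H+\alpha-1}$ for the nonlinear case.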
The XGBoost method has many advantages and is especially suitable for statistical analysis of big data, but its loss function is limited to convex functions. In many specific applications, a non-convex loss function would be preferable. In this paper, I propose a generalized XGBoost method, which requires a weaker constraint on the loss function and accommodates more general loss functions, including convex loss functions and some non-convex loss functions. Furthermore, this generalized XGBoost method is extended to multivariate loss functions to form a more generalized XGBoost method. This method is a multiobjective parameter regularized tree boosting method, which can model multiple parameters of most frequently-used parametric probability distributions as functions of predictor variables. The related algorithms and some examples in non-life insurance pricing are also given.
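For orientation, standard XGBoost fits each new tree $f_t$ by minimizing a second-order Taylor approximation of the loss, and this is where the convexity requirement enters: the second-order coefficients $h_i$ are nonnegative exactly when the loss is convex in the prediction. Schematically, in the usual XGBoost notation (shown here as background, not as the paper's generalized criterion),

$$\mathcal{L}^{(t)} \;\approx\; \sum_{i=1}^{n}\Big[g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2\Big] + \Omega(f_t), \qquad g_i = \partial_{\hat{y}_i^{(t-1)}}\,\ell\big(y_i, \hat{y}_i^{(t-1)}\big), \quad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\,\ell\big(y_i, \hat{y}_i^{(t-1)}\big),$$

where $\Omega(f_t)$ is the tree-complexity regularizer; the generalized method weakens the constraint placed on $\ell$ so that certain non-convex losses can also be boosted.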
We consider infinite-horizon discounted Markov decision problems with finite state and action spaces. We show that with direct parametrization in the policy space, the weighted value function, although non-convex in general, is both quasi-convex and quasi-concave. While quasi-convexity helps explain the convergence of policy gradient methods to global optima, quasi-concavity hints at their convergence guarantees using arbitrarily large step sizes that are not dictated by the Lipschitz constant characterizing the smoothness of the value function. In particular, we show that when using geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization. In addition, we develop a theory of weak gradient-mapping dominance and use it to prove a sharper sublinear convergence rate for the projected policy gradient method. Finally, we also analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model.
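As a reminder of the update being analyzed, a generic policy mirror descent step with step size $\eta_k$ solves, state by state, a divergence-regularized one-step improvement problem of the form below, where $Q^{\pi_k}$ is the state-action value function of the current policy, $\Delta(\mathcal{A})$ is the probability simplex over actions, and $D$ is a Bregman divergence (the KL divergence yields the natural policy gradient method, the squared Euclidean distance yields projected Q-descent); this is the standard formulation, recalled here for orientation rather than quoted from the paper.

$$\pi_{k+1}(\cdot \mid s) \;=\; \arg\max_{p\,\in\,\Delta(\mathcal{A})}\Big\{\eta_k\,\big\langle Q^{\pi_k}(s,\cdot),\, p\big\rangle \;-\; D\big(p,\ \pi_k(\cdot \mid s)\big)\Big\},$$

with the step sizes $\eta_k$ growing geometrically in $k$ in the linear-convergence regime described above.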
Mirror descent (MD) is a powerful first-order optimization technique that subsumes several optimization algorithms including gradient descent (GD). In this work, we develop a semi-definite programming (SDP) framework to analyze the convergence rate of MD in centralized and distributed settings under both strongly convex and non-strongly convex assumptions. We view MD through a dynamical systems lens and leverage quadratic constraints (QCs) to provide explicit convergence rates based on Lyapunov stability. For centralized MD under the strongly convex assumption, we develop an SDP that certifies exponential convergence rates. We prove that the SDP always has a feasible solution that recovers the optimal GD rate as a special case. We complement our analysis by providing the $O(1/k)$ convergence rate for convex problems. Next, we analyze the convergence of distributed MD and characterize the rate using the SDP. To the best of our knowledge, the numerical rate of distributed MD has not been previously reported in the literature. We further prove an $O(1/k)$ convergence rate for distributed MD in the convex setting. Our numerical experiments on strongly convex problems indicate that our framework certifies superior convergence rates compared to the existing rates for distributed GD.
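For reference, the mirror descent iteration whose rate the SDP certifies is the standard one: with a strictly convex mirror map $\psi$ and its Bregman divergence $D_\psi$, each step solves

$$x_{k+1} \;=\; \arg\min_{x\in\mathcal{X}}\Big\{\eta\,\langle \nabla f(x_k),\, x\rangle + D_\psi(x, x_k)\Big\}, \qquad D_\psi(x,y) \;=\; \psi(x) - \psi(y) - \langle \nabla\psi(y),\, x - y\rangle,$$

which reduces to (projected) gradient descent when $\psi(x) = \tfrac{1}{2}\|x\|_2^2$, consistent with the claim that the SDP recovers the optimal GD rate as a special case.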
Monads are commonplace in computer science, and can be composed using Beck's distributive laws. Unfortunately, finding distributive laws can be extremely difficult and error-prone. The literature contains some general principles for constructing distributive laws. However, until now there have been no such techniques for establishing when no distributive law exists. We present three families of theorems for showing when there can be no distributive law between two monads. The first widely generalizes a counterexample attributed to Plotkin. It covers all previously known no-go results for specific pairs of monads, and includes many new results. The second and third families are entirely novel, encompassing various new practical situations. For example, they negatively resolve the open question of whether the list monad distributes over itself, reveal a previously unobserved error in the literature, and confirm a conjecture made by Beck himself in his first paper on distributive laws. In addition, we establish conditions under which there can be at most one possible distributive law between two monads, proving various known distributive laws to be unique.
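To fix terminology: a distributive law in Beck's sense between monads $(S,\eta^S,\mu^S)$ and $(T,\eta^T,\mu^T)$ on the same category is a natural transformation

$$\lambda \;:\; T S \Longrightarrow S T$$

satisfying four compatibility axioms with the two units and the two multiplications; when such a $\lambda$ exists, the composite functor $S T$ carries a canonical monad structure. The no-go theorems above show that, for certain pairs of monads, no natural transformation of this type can satisfy all four axioms. (This recap follows the standard definition and is included only as background.)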
Positive-unlabeled learning (PU learning) is known as a special case of semi-supervised binary classification in which only a fraction of the positive examples are labeled. The challenge is then to find the correct classifier despite this lack of information. Recently, new methodologies have been introduced to address the case where the probability of being labeled may depend on the covariates. In this paper, we are interested in establishing risk bounds for PU learning under this general assumption. In addition, we quantify the impact of label noise on PU learning compared to the standard classification setting. Finally, we provide a lower bound on the minimax risk, proving that the upper bound is nearly optimal.
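Concretely, the general assumption can be phrased through a covariate-dependent labeling propensity: writing $S\in\{0,1\}$ for the labeling indicator, only positive examples can be labeled, and the probability of being labeled is allowed to vary with the covariates, in contrast to the classical selected-completely-at-random setting where it is constant. In symbols (notation introduced here for illustration),

$$\Pr(S=1 \mid X=x,\ Y=-1) \;=\; 0, \qquad \Pr(S=1 \mid X=x,\ Y=+1) \;=\; e(x),$$

with the classical setting corresponding to $e(x) \equiv c$ for some constant $c\in(0,1]$.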
Recently, random walks on dynamic graphs have been studied because of their adaptivity to the time-varying structure of real-world networks. In general, there is a tremendous gap between the static and dynamic graph settings for the lazy simple random walk: although an $O(n^3)$ cover time was shown for any static graph on $n$ vertices, there is an edge-changing dynamic graph with an exponential hitting time. On the other hand, previous works indicate that a random walk on a dynamic graph with a time-homogeneous stationary distribution behaves almost identically to one on a static graph. In this paper, we strengthen this insight by obtaining general and improved bounds. Specifically, we consider a random walk according to a sequence $(P_t)_{t\geq 1}$ of irreducible and reversible transition matrices such that all $P_t$ have the same stationary distribution. We bound the mixing, hitting, and cover times in terms of the hitting and relaxation times of the random walk according to the worst fixed $P_t$. Moreover, we obtain the first bounds on the hitting and cover times of multiple random walks and on the coalescing time on dynamic graphs. These bounds can be seen as an extension of the well-known bounds for random walks on static graphs. Our results generalize previous upper bounds for specific random walks on dynamic graphs, e.g., lazy simple random walks and $d_{\max}$-lazy walks, and give improved and tight upper bounds in various cases. As an interesting consequence of our generalization, we obtain tight bounds for the lazy Metropolis walk [Nonaka, Ono, Sadakane, and Yamashita, TCS10] on any dynamic graph: $O(n^2)$ mixing time, $O(n^2)$ hitting time, and $O(n^2\log n)$ cover time. Additionally, our coalescing time bound implies a bound on the consensus time of pull voting on dynamic graphs.
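As an illustration of the walk appearing in the tight bounds, the lazy Metropolis walk moves from $u$ to a neighbor $v$ with probability $1/(2\max\{\deg(u),\deg(v)\})$ and stays put with the remaining probability, which makes its stationary distribution uniform regardless of the degree sequence. The sketch below (illustrative Python, not taken from the cited work) samples one step on a single graph snapshot given as an adjacency list.

```python
# One step of a lazy Metropolis walk on a graph snapshot (illustrative sketch).
# Move u -> v with probability 1/(2*max(deg(u), deg(v))); stay at u with the leftover mass.
# The resulting stationary distribution is uniform over the vertices.
import random

def lazy_metropolis_step(adj, u, rng=random):
    du = len(adj[u])
    r = rng.random()
    acc = 0.0
    for v in adj[u]:
        acc += 1.0 / (2 * max(du, len(adj[v])))
        if r < acc:
            return v
    return u  # lazy self-loop with the remaining probability
```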
This paper introduces Stochastic Gradient Langevin Boosting (SGLB) - a powerful and efficient machine learning framework that can handle a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of the Langevin diffusion equation specifically designed for gradient boosting. This allows us to theoretically guarantee global convergence even for multimodal loss functions, while standard gradient boosting algorithms can guarantee only convergence to a local optimum. We also empirically show that SGLB outperforms classic gradient boosting when applied to classification tasks with the 0-1 loss function, which is known to be multimodal.
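To convey only the flavor of the idea (the exact SGLB update, its noise scale, and its shrinkage schedule are specified in the paper, not here), the sketch below runs a plain squared-loss gradient boosting loop in which Gaussian noise is injected into the pseudo-residuals before each tree is fit, the Langevin-style perturbation that underlies the global convergence argument.

```python
# Toy gradient boosting with Langevin-style noise on the pseudo-residuals.
# Illustrative only: noise scale, shrinkage, and loss differ from the actual SGLB method.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def noisy_boosting(X, y, n_trees=200, lr=0.1, noise_scale=0.05, seed=0):
    rng = np.random.default_rng(seed)
    F = np.zeros(len(y))                                       # current ensemble prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - F                                      # negative gradient of squared loss
        residuals += noise_scale * rng.standard_normal(len(y))  # Langevin-style perturbation
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        trees.append(tree)
        F += lr * tree.predict(X)
    return trees
```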
While Generative Adversarial Networks (GANs) have empirically produced impressive results on learning complex real-world distributions, recent work has shown that they suffer from a lack of diversity or mode collapse. The theoretical work of Arora et al.~\cite{AroraGeLiMaZh17} suggests a dilemma about GANs' statistical properties: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode collapse. In contrast, we show in this paper that GANs can in principle learn distributions in Wasserstein distance (or KL divergence in many cases) with polynomial sample complexity, if the discriminator class has strong distinguishing power against the particular generator class (instead of against all possible generators). For various generator classes such as mixtures of Gaussians, exponential families, and invertible neural network generators, we design corresponding discriminators (which are often neural nets of specific architectures) such that the Integral Probability Metric (IPM) induced by the discriminators can provably approximate the Wasserstein distance and/or KL divergence. This implies that if the training is successful, then the learned distribution is close to the true distribution in Wasserstein distance or KL divergence, and thus cannot drop modes. Our preliminary experiments show that on synthetic datasets the test IPM is well correlated with KL divergence, indicating that the lack of diversity may be caused by sub-optimality in optimization rather than statistical inefficiency.
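For completeness, the integral probability metric induced by a discriminator class $\mathcal{F}$ is

$$d_{\mathcal{F}}(p, q) \;=\; \sup_{f\in\mathcal{F}}\Big|\,\mathbb{E}_{x\sim p}[f(x)] - \mathbb{E}_{x\sim q}[f(x)]\,\Big|,$$

and taking $\mathcal{F}$ to be the class of 1-Lipschitz functions recovers the Wasserstein-1 distance. The point of the designed discriminators is that a much smaller, generator-specific class $\mathcal{F}$ already suffices for $d_{\mathcal{F}}$ to approximate the Wasserstein distance and/or KL divergence.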