In Bayesian density estimation, a question of interest is how the number of components in a finite mixture model grows with the number of observations. We provide a novel perspective on this question by using results from stochastic geometry to find that the growth rate of the expected number of components of a finite mixture model whose components belong to the unit simplex $\Delta^{J-1}$ of the Euclidean space $\mathbb{R}^J$ is $(\log n)^{J-1}$. We also provide a central limit theorem for the number of components. In addition, we relate our model to a classical non-parametric density estimator based on a P\'olya tree. Combining this latter with techniques from Choquet theory, we are able to retrieve mixture weights. We also give the rate of convergence of the P\'olya tree posterior to the Dirac measure on the weights. We further present an algorithm to correctly specify the number of components in a latent Dirichlet allocation (LDA) analysis.
We analyze the complexity of learning directed acyclic graphical models from observational data in general settings without specific distributional assumptions. Our approach is information-theoretic and uses a local Markov boundary search procedure in order to recursively construct ancestral sets in the underlying graphical model. Perhaps surprisingly, we show that for certain graph ensembles, a simple forward greedy search algorithm (i.e. without a backward pruning phase) suffices to learn the Markov boundary of each node. This substantially improves the sample complexity, which we show is at most polynomial in the number of nodes. This is then applied to learn the entire graph under a novel identifiability condition that generalizes existing conditions from the literature. As a matter of independent interest, we establish finite-sample guarantees for the problem of recovering Markov boundaries from data. Moreover, we apply our results to the special case of polytrees, for which the assumptions simplify, and provide explicit conditions under which polytrees are identifiable and learnable in polynomial time. We further illustrate the performance of the algorithm, which is easy to implement, in a simulation study. Our approach is general, works for discrete or continuous distributions without distributional assumptions, and as such sheds light on the minimal assumptions required to efficiently learn the structure of directed graphical models from data.
We study the frontier between learnable and unlearnable hidden Markov models (HMMs). HMMs are flexible tools for clustering dependent data coming from unknown populations. The model parameters are known to be fully identifiable (up to label-switching) without any modeling assumption on the distributions of the populations as soon as the clusters are distinct and the hidden chain is ergodic with a full rank transition matrix. In the limit as any one of these conditions fails, it becomes impossible in general to identify parameters. For a chain with two hidden states we prove nonasymptotic minimax upper and lower bounds, matching up to constants, which exhibit thresholds at which the parameters become learnable. We also provide an upper bound on the relative entropy rate for parameters in a neighbourhood of the unlearnable region which may have interest in itself.
Stochastic gradient Markov chain Monte Carlo (SGMCMC) is considered the gold standard for Bayesian inference in large-scale models, such as Bayesian neural networks. Since practitioners face speed versus accuracy tradeoffs in these models, variational inference (VI) is often the preferable option. Unfortunately, VI makes strong assumptions on both the factorization and functional form of the posterior. In this work, we propose a new non-parametric variational approximation that makes no assumptions about the approximate posterior's functional form and allows practitioners to specify the exact dependencies the algorithm should respect or break. The approach relies on a new Langevin-type algorithm that operates on a modified energy function, where parts of the latent variables are averaged over samples from earlier iterations of the Markov chain. This way, statistical dependencies can be broken in a controlled way, allowing the chain to mix faster. This scheme can be further modified in a "dropout" manner, leading to even more scalability. By implementing the scheme on a ResNet-20 architecture, we obtain better predictive likelihoods and larger effective sample sizes than full SGMCMC.
This article introduces a novel nonparametric methodology for Generalized Linear Models which combines the strengths of the binary regression and latent variable formulations for categorical data, while overcoming their disadvantages. Requiring minimal assumptions, it extends recently published parametric versions of the methodology and generalizes it. If the underlying data generating process is asymmetric, it gives uniformly better prediction and inference performance over the parametric formulation. Furthermore, it introduces a new classification statistic utilizing which I show that overall, it has better model fit, inference and classification performance than the parametric version, and the difference in performance is statistically significant especially if the data generating process is asymmetric. In addition, the methodology can be used to perform model diagnostics for any model specification. This is a highly useful result, and it extends existing work for categorical model diagnostics broadly across the sciences. The mathematical results also highlight important new findings regarding the interplay of statistical significance and scientific significance. Finally, the methodology is applied to various real-world datasets to show that it may outperform widely used existing models, including Random Forests and Deep Neural Networks with very few iterations.
Traditional methods for unsupervised learning of finite mixture models require to evaluate the likelihood of all components of the mixture. This becomes computationally prohibitive when the number of components is large, as it is, for example, in the sum-product (transform) networks. Therefore, we propose to apply a combination of the expectation maximization and the Metropolis-Hastings algorithm to evaluate only a small number of, stochastically sampled, components, thus substantially reducing the computational cost. The Markov chain of component assignments is sequentially generated across the algorithm's iterations, having a non-stationary target distribution whose parameters vary via a gradient-descent scheme. We put emphasis on generality of our method, equipping it with the ability to train both shallow and deep mixture models which involve complex, and possibly nonlinear, transformations. The performance of our method is illustrated in a variety of synthetic and real-data contexts, considering deep models, such as mixtures of normalizing flows and sum-product (transform) networks.
We consider a Johnson-N\'ed\'elec FEM-BEM coupling, which is a direct and non-symmetric coupling of finite and boundary element methods, in order to solve interface problems for the magnetostatic Maxwell's equations with the magnetic vector potential ansatz. In the FEM-domain, equations may be non-linear, whereas they are exclusively linear in the BEM-part to guarantee the existence of a fundamental solution. First, the weak problem is formulated in quotient spaces to avoid resolving to a saddle point problem. Second, we establish in this setting well-posedness of the arising problem using the framework of Lipschitz and strongly monotone operators as well as a stability result for a special type of non-linearity, which is typically considered in magnetostatic applications. Then, the discretization is performed in the isogeometric context, i.e., the same type of basis functions that are used for geometry design are considered as ansatz functions for the discrete setting. In particular, NURBS are employed for geometry considerations, and B-Splines, which can be understood as a special type of NURBS, for analysis purposes. In this context, we derive a priori estimates w.r.t. h-refinement, and point out to an interesting behavior of BEM, which consists in an amelioration of the convergence rates, when a functional of the solution is evaluated in the exterior BEM-domain. This improvement may lead to a doubling of the convergence rate under certain assumptions. Finally, we end the paper with a numerical example to illustrate the theoretical results, along with a conclusion and an outlook.
An interesting observation in artificial neural networks is their favorable generalization error despite typically being extremely overparameterized. It is well known that the classical statistical learning methods often result in vacuous generalization errors in the case of overparameterized neural networks. Adopting the recently developed Neural Tangent (NT) kernel theory, we prove uniform generalization bounds for overparameterized neural networks in kernel regimes, when the true data generating model belongs to the reproducing kernel Hilbert space (RKHS) corresponding to the NT kernel. Importantly, our bounds capture the exact error rates depending on the differentiability of the activation functions. In order to establish these bounds, we propose the information gain of the NT kernel as a measure of complexity of the learning problem. Our analysis uses a Mercer decomposition of the NT kernel in the basis of spherical harmonics and the decay rate of the corresponding eigenvalues. As a byproduct of our results, we show the equivalence between the RKHS corresponding to the NT kernel and its counterpart corresponding to the Mat\'ern family of kernels, showing the NT kernels induce a very general class of models. We further discuss the implications of our analysis for some recent results on the regret bounds for reinforcement learning and bandit algorithms, which use overparameterized neural networks.
We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activiation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization of deep learning, and pave the way to study the optimization dynamics of training modern deep neural networks.
We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component, we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy, and also improved computation complexity than existing moment based mixture model algorithms (e.g. tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms showing significant improvement in runtime and accuracy.
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.