We consider the analysis of probability distributions through their associated covariance operators from reproducing kernel Hilbert spaces. We show that the von Neumann entropy and relative entropy of these operators are intimately related to the usual notions of Shannon entropy and relative entropy, and share many of their properties. They come together with efficient estimation algorithms from various oracles on the probability distributions. We also consider product spaces and show that for tensor product kernels, we can define notions of mutual information and joint entropies, which can then characterize independence perfectly, but only partially conditional independence. We finally show how these new notions of relative entropy lead to new upper-bounds on log partition functions, that can be used together with convex optimization within variational inference methods, providing a new family of probabilistic inference methods.
In Statistical Relational Artificial Intelligence, a branch of AI and machine learning which combines the logical and statistical schools of AI, one uses the concept {\em para\-metrized probabilistic graphical model (PPGM)} to model (conditional) dependencies between random variables and to make probabilistic inferences about events on a space of "possible worlds". The set of possible worlds with underlying domain $D$ (a set of objects) can be represented by the set $\mathbf{W}_D$ of all first-order structures (for a suitable signature) with domain $D$. Using a formal logic we can describe events on $\mathbf{W}_D$. By combining a logic and a PPGM we can also define a probability distribution $\mathbb{P}_D$ on $\mathbf{W}_D$ and use it to compute the probability of an event. We consider a logic, denoted $PLA$, with truth values in the unit interval, which uses aggregation functions, such as arithmetic mean, geometric mean, maximum and minimum instead of quantifiers. However we face the problem of computational efficiency and this problem is an obstacle to the wider use of methods from Statistical Relational AI in practical applications. We address this problem by proving that the described probability will, under certain assumptions on the PPGM and the sentence $\varphi$, converge as the size of $D$ tends to infinity. The convergence result is obtained by showing that every formula $\varphi(x_1, \ldots, x_k)$ which contains only "admissible" aggregation functions (e.g. arithmetic and geometric mean, max and min) is asymptotically equivalent to a formula $\psi(x_1, \ldots, x_k)$ without aggregation functions.
We introduce and analyze various Regularized Combined Field Integral Equations (CFIER) formulations of time-harmonic Navier equations in media with piece-wise constant material properties. These formulations can be derived systematically starting from suitable coercive approximations of Dirichlet-to-Neumann operators (DtN), and we present a periodic pseudodifferential calculus framework within which the well posedness of CIER formulations can be established. We also use the DtN approximations to derive and analyze Optimized Schwarz (OS) methods for the solution of elastodynamics transmission problems. The pseudodifferential calculus we develop in this paper relies on careful singularity splittings of the kernels of Navier boundary integral operators which is also the basis of high-order Nystr\"om quadratures for their discretizations. Based on these high-order discretizations we investigate the rate of convergence of iterative solvers applied to CFIER and OS formulations of scattering and transmission problems. We present a variety of numerical results that illustrate that the CFIER methodology leads to important computational savings over the classical CFIE one, whenever iterative solvers are used for the solution of the ensuing discretized boundary integral equations. Finally, we show that the OS methods are competitive in the high-frequency high-contrast regime.
Non-probability sampling is prevailing in survey sampling, but ignoring its selection bias leads to erroneous inferences. We offer a unified nonparametric calibration method to estimate the sampling weights for a non-probability sample by calibrating functions of auxiliary variables in a reproducing kernel Hilbert space. The consistency and the limiting distribution of the proposed estimator are established, and the corresponding variance estimator is also investigated. Compared with existing works, the proposed method is more robust since no parametric assumption is made for the selection mechanism of the non-probability sample. Numerical results demonstrate that the proposed method outperforms its competitors, especially when the model is misspecified. The proposed method is applied to analyze the average total cholesterol of Korean citizens based on a non-probability sample from the National Health Insurance Sharing Service and a reference probability sample from the Korea National Health and Nutrition Examination Survey.
We employ kernel-based approaches that use samples from a probability distribution to approximate a Kolmogorov operator on a manifold. The self-tuning variable-bandwidth kernel method [Berry & Harlim, Appl. Comput. Harmon. Anal., 40(1):68--96, 2016] computes a large, sparse matrix that approximates the differential operator. Here, we use the eigendecomposition of the discretization to (i) invert the operator, solving a differential equation, and (ii) represent gradient vector fields on the manifold. These methods only require samples from the underlying distribution and, therefore, can be applied in high dimensions or on geometrically complex manifolds when spatial discretizations are not available. We also employ an efficient $k$-$d$ tree algorithm to compute the sparse kernel matrix, which is a computational bottleneck.
In this paper we get error bounds for fully discrete approximations of infinite horizon problems via the dynamic programming approach. It is well known that considering a time discretization with a positive step size $h$ an error bound of size $h$ can be proved for the difference between the value function (viscosity solution of the Hamilton-Jacobi-Bellman equation corresponding to the infinite horizon) and the value function of the discrete time problem. However, including also a spatial discretization based on elements of size $k$ an error bound of size $O(k/h)$ can be found in the literature for the error between the value functions of the continuous problem and the fully discrete problem. In this paper we revise the error bound of the fully discrete method and prove, under similar assumptions to those of the time discrete case, that the error of the fully discrete case is in fact $O(h+k)$ which gives first order in time and space for the method. This error bound matches the numerical experiments of many papers in the literature in which the behaviour $1/h$ from the bound $O(k/h)$ have not been observed.
We provide a decision theoretic analysis of bandit experiments. The setting corresponds to a dynamic programming problem, but solving this directly is typically infeasible. Working within the framework of diffusion asymptotics, we define suitable notions of asymptotic Bayes and minimax risk for bandit experiments. For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a nonlinear second-order partial differential equation (PDE). Using a limit of experiments approach, we show that this PDE characterization also holds asymptotically under both parametric and non-parametric distribution of the rewards. The approach further describes the state variables it is asymptotically sufficient to restrict attention to, and therefore suggests a practical strategy for dimension reduction. The upshot is that we can approximate the dynamic programming problem defining the bandit experiment with a PDE which can be efficiently solved using sparse matrix routines. We derive the optimal Bayes and minimax policies from the numerical solutions to these equations. The proposed policies substantially dominate existing methods such as Thompson sampling. The framework also allows for substantial generalizations to the bandit problem such as time discounting and pure exploration motives.
Binding operation is fundamental to many cognitive processes, such as cognitive map formation, relational reasoning, and language comprehension. In these processes, two different modalities, such as location and objects, events and their contextual cues, and words and their roles, need to be bound together, but little is known about the underlying neural mechanisms. Previous works introduced a binding model based on quadratic functions of bound pairs, followed by vector summation of multiple pairs. Based on this framework, we address following questions: Which classes of quadratic matrices are optimal for decoding relational structures? And what is the resultant accuracy? We introduce a new class of binding matrices based on a matrix representation of octonion algebra, an eight-dimensional extension of complex numbers. We show that these matrices enable a more accurate unbinding than previously known methods when a small number of pairs are present. Moreover, numerical optimization of a binding operator converges to this octonion binding. We also show that when there are a large number of bound pairs, however, a random quadratic binding performs as well as the octonion and previously-proposed binding methods. This study thus provides new insight into potential neural mechanisms of binding operations in the brain.
We recall some of the history of the information-theoretic approach to deriving core results in probability theory and indicate parts of the recent resurgence of interest in this area with current progress along several interesting directions. Then we give a new information-theoretic proof of a finite version of de Finetti's classical representation theorem for finite-valued random variables. We derive an upper bound on the relative entropy between the distribution of the first $k$ in a sequence of $n$ exchangeable random variables, and an appropriate mixture over product distributions. The mixing measure is characterised as the law of the empirical measure of the original sequence, and de Finetti's result is recovered as a corollary. The proof is nicely motivated by the Gibbs conditioning principle in connection with statistical mechanics, and it follows along an appealing sequence of steps. The technical estimates required for these steps are obtained via the use of a collection of combinatorial tools known within information theory as `the method of types.'
We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.