We investigate predictive densities for multivariate normal models with unknown mean vectors and known covariance matrices. Bayesian predictive densities based on shrinkage priors are effective in various problems, but they often have complex representations. We consider extended normal models, with both mean vectors and covariance matrices as parameters, that include the original normal model, and we adopt predictive densities belonging to the extended models that are optimal with respect to the posterior Bayes risk. The proposed predictive density based on a superharmonic shrinkage prior is shown to dominate the Bayesian predictive density based on the uniform prior under a loss function based on the Kullback-Leibler divergence. Our method provides an alternative to the empirical Bayes method, which is widely used to construct tractable predictive densities.
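For context, a hedged sketch of the standard setup for this problem (the notation $\Sigma_x$, $\Sigma_y$, $p_\pi$ is ours and may differ from the paper's): one observes $X \sim N_d(\theta, \Sigma_x)$ and predicts $Y \sim N_d(\theta, \Sigma_y)$, with $\theta$ unknown and both covariance matrices known. A predictive density $\hat{p}(y \mid x)$ is evaluated by the Kullback-Leibler loss
\[
L(\theta, \hat{p}) = \int p(y \mid \theta) \log \frac{p(y \mid \theta)}{\hat{p}(y \mid x)} \, \mathrm{d}y,
\]
with risk obtained by averaging over $X$, and the Bayesian predictive density under a prior $\pi$ is $p_\pi(y \mid x) = \int p(y \mid \theta)\, \pi(\theta \mid x)\, \mathrm{d}\theta$.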
We consider the setting of online convex optimization (OCO) with \textit{exp-concave} losses. The best regret bound known for this setting is $O(n\log{}T)$, where $n$ is the dimension and $T$ is the number of prediction rounds (treating all other quantities as constants and assuming $T$ is sufficiently large), and is attainable via the well-known Online Newton Step algorithm (ONS). However, ONS requires computing, on each iteration, a projection (according to some matrix-induced norm) onto the feasible convex set, which is often computationally prohibitive in high-dimensional settings and when the feasible set admits a non-trivial structure. In this work we consider projection-free online algorithms for exp-concave and smooth losses, where by projection-free we refer to algorithms that rely only on the availability of a linear optimization oracle (LOO) for the feasible set, which in many applications of interest admits much more efficient implementations than a projection oracle. We present an LOO-based ONS-style algorithm which, using $O(T)$ calls to the LOO overall, guarantees a worst-case regret bound of $\widetilde{O}(n^{2/3}T^{2/3})$ (ignoring all quantities except for $n,T$). However, our algorithm is most interesting in an important and plausible low-dimensional data scenario: if the gradients (approximately) span a subspace of dimension at most $\rho$ with $\rho \ll n$, the regret bound improves to $\widetilde{O}(\rho^{2/3}T^{2/3})$, and by applying standard deterministic sketching techniques, both the space and the average additional per-iteration runtime requirements are only $O(\rho{}n)$ (instead of $O(n^2)$). This improves upon recently proposed LOO-based algorithms for OCO which, while having the same state-of-the-art dependence on the horizon $T$, suffer from regret/oracle complexity that scales with $\sqrt{n}$ or worse.
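As a point of reference, hedged reminders of the standard definitions used in this setting (not the paper's exact statements): a loss $f$ is $\alpha$-exp-concave on the feasible set $\mathcal{K}$ if $x \mapsto \exp(-\alpha f(x))$ is concave on $\mathcal{K}$; a linear optimization oracle returns $\arg\min_{x \in \mathcal{K}} \langle c, x \rangle$ for a given direction $c$; and the regret of iterates $x_1,\dots,x_T$ against losses $f_1,\dots,f_T$ is
\[
\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).
\]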
With the proliferation of mobile devices, an increasing amount of population data is being collected, and there is growing demand to use such large-scale, multidimensional data in real-world situations. We introduce functional data analysis (FDA) into the problem of predicting the hourly population of different districts of Tokyo. FDA is a methodology that treats and analyzes longitudinal data as curves, which reduces the number of parameters and makes it easier to handle high-dimensional data. Specifically, by assuming a Gaussian process, we avoid the large covariance matrix parameters of the multivariate normal distribution. In addition, the data are dependent over time and spatially dependent between districts. To capture these characteristics, we introduce a Bayesian factor model, which models the time series of a small number of common factors and expresses the spatial structure through the factor loading matrices. Furthermore, the factor loading matrices are made identifiable and sparse to ensure the interpretability of the model. We also propose a method for selecting factors via Bayesian shrinkage. We study the forecast accuracy and interpretability of the proposed method through numerical experiments and data analysis. We find that the flexibility of the proposed method allows it to be extended to reflect additional time series features, which contributes to forecast accuracy.
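As an illustration of the general structure (generic notation, not necessarily the paper's formulation): a Bayesian factor model for district-level curves might take the form
\[
y_t = \Lambda f_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \Sigma),
\]
where $y_t$ stacks the district observations at time $t$, $f_t$ is a low-dimensional vector of common factors evolving as a time series, and the rows of the loading matrix $\Lambda$ (constrained for identifiability and sparsity) encode the spatial structure across districts.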
This paper is concerned with orthonormal systems in real intervals, given with zero Dirichlet boundary conditions. More specifically, our interest is in systems with a skew-symmetric differentiation matrix (this excludes orthonormal polynomials). We consider a simple construction of such systems and pursue its ramifications. In general, given any weight function $w \in \mathrm{C}^1(a,b)$ such that $w(a)=w(b)=0$, we can generate an orthonormal system with a skew-symmetric differentiation matrix. Except for the case $a=-\infty$, $b=+\infty$, only a limited number of powers of that matrix are bounded, and we establish a connection between properties of the weight function and boundedness. In particular, we examine in detail two weight functions: the Laguerre weight function $x^\alpha \mathrm{e}^{-x}$ for $x>0$ and $\alpha>0$, and the ultraspherical weight function $(1-x^2)^\alpha$, $x\in(-1,1)$, $\alpha>0$, and establish their properties. Both weights share the most welcome feature of {\em separability,\/} which allows for fast computation. The quality of approximation is highly sensitive to the choice of $\alpha$, and we discuss how to choose this parameter optimally, depending on the number of zero boundary conditions.
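For intuition, a hedged sketch of the underlying mechanism (standard integration by parts, in our notation): if the orthonormal system $\{\varphi_n\}$ has differentiation matrix $\mathcal{D}_{m,n} = \int_a^b \varphi_m(x)\,\varphi_n'(x)\,\mathrm{d}x$ and each $\varphi_n$ vanishes at the endpoints, then
\[
\mathcal{D}_{m,n} + \mathcal{D}_{n,m} = \int_a^b \bigl(\varphi_m \varphi_n\bigr)'(x)\,\mathrm{d}x = \varphi_m(b)\varphi_n(b) - \varphi_m(a)\varphi_n(a) = 0,
\]
so $\mathcal{D}$ is skew-symmetric.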
In this paper we give a completely new approach to the problem of covariate selection in linear regression. A covariate or a set of covariates is included only if it is better, in the sense of least squares, than the same number of Gaussian covariates consisting of i.i.d. $N(0,1)$ random variables. The Gaussian P-value is defined as the probability that the Gaussian covariates are better. It is given in terms of the Beta distribution; it is exact and holds for all data. The covariate selection procedures based on this P-value require only a cut-off value $\alpha$ for the Gaussian P-value: the default value in this paper is $\alpha=0.01$. The resulting procedures are very simple, very fast, do not overfit and require only least squares. In particular, there is no regularization parameter, no data splitting, no use of simulations, no shrinkage and no post-selection inference. The paper includes the results of simulations, applications to real data sets and theorems on the asymptotic behaviour under the standard linear model. Here the stepwise procedure performs overwhelmingly better than any other procedure we are aware of. An R package {\it gausscov} is available.
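To make the construction concrete, here is a minimal illustrative sketch in Python (not the {\it gausscov} implementation; the function name and the exact Beta parameters are our assumptions for the single-additional-covariate case). It relies on the standard geometric fact that the squared cosine of the angle between a fixed vector and an independent standard Gaussian vector in an $m$-dimensional space follows a Beta$(1/2,(m-1)/2)$ distribution, so the probability that a random Gaussian covariate reduces the residual sum of squares at least as much as the candidate can be read from a Beta survival function.

import numpy as np
from scipy.stats import beta


def gaussian_p_value_single(y, X0, x):
    """Illustrative Gaussian P-value for adding one candidate covariate.

    y  : (n,) response vector
    X0 : (n, k) array of covariates already in the model, or None
    x  : (n,) candidate covariate

    Hypothetical sketch (not the gausscov implementation): it compares the
    observed relative reduction in RSS with the Beta(1/2, (m-1)/2) law of the
    squared cosine between a fixed vector and an independent standard Gaussian
    direction in the m-dimensional residual space, m = n - k.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.size
    k = 0 if X0 is None else X0.shape[1]

    if k > 0:
        # Residualize y and the candidate on the covariates already included.
        Q, _ = np.linalg.qr(np.asarray(X0, dtype=float))
        r0 = y - Q @ (Q.T @ y)
        xt = x - Q @ (Q.T @ x)
    else:
        r0, xt = y, x

    rss0 = r0 @ r0
    # Relative RSS reduction = squared cosine of the angle between r0 and xt.
    ratio = (r0 @ xt) ** 2 / (rss0 * (xt @ xt))
    m = n - k  # dimension of the residual space
    # Probability that an i.i.d. N(0, 1) covariate does at least as well.
    return beta.sf(ratio, 0.5, (m - 1) / 2)

Note that the computation uses only least squares quantities and a Beta tail probability, in line with the claim that no regularization, simulation or shrinkage is involved.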
We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. We derive time-uniform generalizations of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds and, more importantly, general techniques for constructing them. Despite being anytime-valid, our extensions remain as tight as their fixed-time counterparts. Moreover, they enable us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.
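For the reader's convenience, hedged statements of two of the named tools in their standard forms (not necessarily the exact variants used in the paper): Ville's inequality says that for a nonnegative supermartingale $(M_t)_{t \ge 0}$ with $M_0 = 1$ and any $\delta \in (0,1)$,
\[
\Pr\bigl(\exists\, t \ge 0 : M_t \ge 1/\delta\bigr) \le \delta,
\]
which is what turns a fixed-time bound into a time-uniform one; and the Donsker-Varadhan formula states that for a fixed distribution $P$ and bounded measurable $f$,
\[
\log \mathbb{E}_{P}\bigl[\mathrm{e}^{f}\bigr] = \sup_{Q \ll P} \Bigl\{ \mathbb{E}_{Q}[f] - \mathrm{KL}(Q \,\|\, P) \Bigr\},
\]
which converts exponential moment bounds into statements about arbitrary posteriors $Q$.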
This article tackles the old problem of prediction via a nonparametric transformation model (NTM) in a new Bayesian way. Estimation of NTMs is known to be challenging due to model unidentifiability, although the model is appealing because of its robust prediction capability in survival analysis. Inspired by the uniqueness of the posterior predictive distribution, we achieve efficient prediction via the aforementioned NTM under the Bayesian paradigm. Our strategy is to assign weakly informative priors to the nonparametric components rather than to identify the model by adding the complicated constraints used in the existing literature. This Bayesian success rests on i) a subtle recasting of NTMs through an exponential transformation which, exploiting the non-negativity of the failure time, compresses the spaces of infinite-dimensional parameters to positive quadrants; and ii) a newly constructed weakly informative quantile-knots I-splines prior for the recast transformation function, together with a Dirichlet process mixture model assigned to the error distribution. In addition, we provide a convenient and precise estimator for the identified parameter component subject to the general unit-norm restriction through posterior modification, enabling effective estimation of relative risks. Simulations and applications to real datasets reveal that our method is robust and outperforms the competing methods. An R package BuLTM is available to predict survival curves, estimate relative risks, and facilitate posterior checking.
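For orientation, a hedged statement of a common form of the nonparametric transformation model in survival analysis (generic notation; the paper's recast version may differ): the failure time $T$ and covariates $Z$ are linked through
\[
H(T) = -\beta^{\top} Z + \varepsilon,
\]
where $H$ is an unknown monotone increasing transformation and the error $\varepsilon$ has an unspecified distribution; $H$ and the error distribution are the infinite-dimensional components to which weakly informative priors are assigned.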
Federated learning methods, that is, methods that perform model training using data situated across different sources without the data ever leaving their original source, are of increasing interest in a number of fields. However, despite this interest, the classes of models for which easily applicable and sufficiently general approaches are available are limited, excluding many structured probabilistic models. We present a general yet elegant resolution to this issue. The approach is based on adapting structured variational inference, a technique widely used in Bayesian machine learning, to the federated setting. Additionally, a communication-efficient variant analogous to the canonical FedAvg algorithm is explored. The effectiveness of the proposed algorithms is demonstrated, and their performance is compared on Bayesian multinomial regression, topic modelling, and mixed model examples.
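As a hedged sketch of the underlying structure (generic notation, not necessarily the paper's formulation): with data $\mathcal{D}_1,\dots,\mathcal{D}_K$ held at $K$ sources, assumed conditionally independent given the global parameters $\theta$, and a variational posterior $q(\theta)$, the variational objective decomposes across sources as
\[
\mathrm{ELBO}(q) = \sum_{k=1}^{K} \mathbb{E}_{q(\theta)}\bigl[\log p(\mathcal{D}_k \mid \theta)\bigr] - \mathrm{KL}\bigl(q(\theta)\,\|\,p(\theta)\bigr),
\]
so each source can compute its expected log-likelihood term locally; structured variational families additionally carry source-specific local latent variables.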
Bayesian dynamic modeling and forecasting is developed in the setting of sequential time series analysis for causal inference. Causal evaluation of sequentially observed time series data from control and treated units focuses on the impacts of interventions using synthetic control constructs. Methodological contributions include the development of multivariate dynamic models for time-varying effects across multiple treated units and explicit foci on sequential learning of effects of interventions. Analysis explores the utility of dimension reduction of multiple potential synthetic control variables. These methodological advances are evaluated in a detailed case study in commercial forecasting. This involves in-study evaluation of interventions in a supermarket promotions experiment, with coupled predictive analyses in selected regions of a large-scale commercial system. Generalization of causal predictive inferences from experimental settings to broader populations is a central concern, and one that can be impacted by cross-series dependencies.
Bayesian inverse problems are often computationally challenging when the forward model is governed by complex partial differential equations (PDEs). This is typically caused by expensive forward model evaluations and the high-dimensional parameterization of priors. This paper proposes a domain-decomposed variational auto-encoder Markov chain Monte Carlo (DD-VAE-MCMC) method to tackle these challenges simultaneously. By partitioning the global physical domain into small subdomains, the proposed method first constructs local deterministic generative models based on local historical data, which provide efficient local prior representations. Gaussian process models with active learning address the domain decomposition interface conditions. Inversions are then conducted on each subdomain independently, in parallel and in low-dimensional latent parameter spaces. The local inference solutions are post-processed through a Poisson image blending procedure to yield an efficient global inference result. Numerical examples are provided to demonstrate the performance of the proposed method.
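For context, a hedged sketch of the generic setting (our notation, not the paper's exact formulation): with observations $d$, a PDE-governed forward map $F$, and a noise model giving likelihood $L(d \mid F(x))$, the Bayesian inverse problem targets the posterior $\pi(x \mid d) \propto L(d \mid F(x))\,\pi_0(x)$; replacing the unknown field $x$ on a subdomain with a trained decoder $x = G(z)$ for a low-dimensional latent $z$ lets MCMC operate on the latent posterior $\pi(z \mid d) \propto L(d \mid F(G(z)))\,\pi_0(z)$ instead.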
In Bayesian analysis, the selection of a prior distribution is typically done by considering each parameter in the model. While this can be convenient, in many scenarios it may be desirable to place a prior on a summary measure of the model instead. In this work, we propose a prior on the model fit, as measured by a Bayesian coefficient of determination ($R^2$), which then induces a prior on the individual parameters. We achieve this by placing a beta prior on $R^2$ and then deriving the induced prior on the global variance parameter for generalized linear mixed models. We derive closed-form expressions in many scenarios and present several approximation strategies for cases where an analytic form is not available or where easier computation is desired. In these situations, we suggest approximating the prior with a generalized beta prime distribution and provide a simple default prior construction scheme. This approach is quite flexible and can be easily implemented in standard Bayesian software. Lastly, we demonstrate the performance of the method on simulated data, where it particularly shines in high-dimensional examples, as well as on real-world data, which shows its ability to model spatial correlation in the random effects.
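As a hedged illustration of the induced-prior mechanism (generic notation, not necessarily the paper's exact parameterization): if the coefficient of determination is written as $R^2 = W/(W + \sigma^2)$ for a global variance parameter $W$ and residual variance $\sigma^2$, then placing $R^2 \sim \mathrm{Beta}(a, b)$ induces
\[
\frac{W}{\sigma^2} = \frac{R^2}{1 - R^2} \sim \mathrm{BetaPrime}(a, b),
\]
since $X \sim \mathrm{Beta}(a,b)$ implies that $X/(1-X)$ follows a beta prime distribution with the same parameters.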