A Bayesian treatment can mitigate overconfidence in ReLU nets around the training data. But far away from them, ReLU Bayesian neural networks (BNNs) can still underestimate uncertainty and thus be asymptotically overconfident. This issue arises since the output variance of a BNN with finitely many features is quadratic in the distance from the data region. Meanwhile, Bayesian linear models with ReLU features converge, in the infinite-width limit, to a particular Gaussian process (GP) with a variance that grows cubically so that no asymptotic overconfidence can occur. While this may seem of mostly theoretical interest, in this work, we show that it can be used in practice to the benefit of BNNs. We extend finite ReLU BNNs with infinite ReLU features via the GP and show that the resulting model is asymptotically maximally uncertain far away from the data while the BNNs' predictive power is unaffected near the data. Although the resulting model approximates a full GP posterior, thanks to its structure, it can be applied \emph{post-hoc} to any pre-trained ReLU BNN at a low cost.
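The quadratic-versus-cubic growth can be illustrated with a one-dimensional sketch. It assumes features of the form max(0, x - c_i), an isotropic Gaussian prior on the weights, and looks only at the prior variance; the function names, kink locations, and `c_min` are hypothetical choices for illustration, not taken from the paper. With a posterior covariance in place of the identity, the finite-feature variance remains a quadratic form in the features.

```python
def finite_relu_prior_variance(x, centers, sigma2=1.0):
    """Prior variance of sum_i w_i * max(0, x - c_i) with w_i ~ N(0, sigma2)."""
    return sigma2 * sum(max(0.0, x - c) ** 2 for c in centers)


def infinite_relu_prior_variance(x, c_min=0.0, sigma2=1.0):
    """Continuum of kinks on [c_min, x]: sigma2 * integral of (x - c)^2 dc."""
    return sigma2 * max(0.0, x - c_min) ** 3 / 3.0


# Hypothetical setup: 11 kinks spread over the "data region" [0, 1].
centers = [i / 10 for i in range(11)]

# Doubling the distance from the data region shows the growth order.
ratio_finite = (finite_relu_prior_variance(2000.0, centers)
                / finite_relu_prior_variance(1000.0, centers))
ratio_infinite = (infinite_relu_prior_variance(2000.0)
                  / infinite_relu_prior_variance(1000.0))

print("finite features, variance ratio:", ratio_finite)
print("infinite features, variance ratio:", ratio_infinite)
```

Far from the kinks, doubling the distance multiplies the finite-feature variance by roughly 4 (quadratic growth) and the infinite-feature variance by 8 (cubic growth).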
We consider Bayesian optimization of the output of a network of functions, where each function takes as input the output of its parent nodes, and where the network takes significant time to evaluate. Such problems arise, for example, in reinforcement learning, engineering design, and manufacturing. While the standard Bayesian optimization approach observes only the final output, our approach delivers greater query efficiency by leveraging information that the former ignores: intermediate output within the network. This is achieved by modeling the nodes of the network using Gaussian processes and choosing the points to evaluate using, as our acquisition function, the expected improvement computed with respect to the implied posterior on the objective. Although the non-Gaussian nature of this posterior prevents computing our acquisition function in closed form, we show that it can be efficiently maximized via sample average approximation. In addition, we prove that our method is asymptotically consistent, meaning that it finds a globally optimal solution as the number of evaluations grows to infinity, thus generalizing previously known convergence results for the expected improvement. Notably, this holds even though our method might not evaluate the domain densely, instead leveraging problem structure to leave regions unexplored. Finally, we show that our approach dramatically outperforms standard Bayesian optimization methods in several synthetic and real-world problems.
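A minimal sketch of expected improvement under sample average approximation for a two-node chain may help fix ideas. The posterior mean and standard deviation functions below are hypothetical closed-form stand-ins for fitted Gaussian processes, and the grid search stands in for a proper optimizer; none of these names come from the paper. The key point is that the base normal samples are drawn once and then held fixed, so the approximate acquisition value is a deterministic function of x.

```python
import random


def node1_posterior(x):
    """Hypothetical posterior mean and std of the first node at input x."""
    return 2.0 * x - 1.0, 0.1 + 0.2 * abs(x - 0.5)


def node2_posterior(y1):
    """Hypothetical posterior mean and std of the second node at input y1."""
    return -(y1 - 0.3) ** 2, 0.05 + 0.1 * abs(y1)


def make_base_samples(n, seed=0):
    """Draw the standard normal pairs once; they are reused for every x."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]


def saa_expected_improvement(x, best, base_samples):
    """Average improvement over samples pushed through the chain of nodes."""
    total = 0.0
    for z1, z2 in base_samples:
        m1, s1 = node1_posterior(x)
        y1 = m1 + s1 * z1            # sample of the intermediate output
        m2, s2 = node2_posterior(y1)
        y2 = m2 + s2 * z2            # sample of the final objective
        total += max(y2 - best, 0.0)
    return total / len(base_samples)


base = make_base_samples(256)
best_observed = -0.2                 # hypothetical incumbent value
grid = [i / 100 for i in range(101)]
x_next = max(grid, key=lambda x: saa_expected_improvement(x, best_observed, base))
print("next point to evaluate:", x_next)
```

Because the final output depends on the first node through a nonlinear second node, the sampled objective is not Gaussian even though each node is, which is why the sketch averages samples instead of using the closed-form expected improvement.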
In this paper, we consider the estimation of a continuous treatment effect model in the presence of treatment spillovers through social networks. We assume that one's outcome is affected not only by his/her own treatment but also by the average of his/her neighbors' treatments, both of which are treated as endogenous variables. Using a control function approach with appropriate instrumental variables, in conjunction with some functional form restrictions, we show that the conditional mean potential outcome can be nonparametrically identified. We also consider a more empirically tractable semiparametric model and develop a three-step estimation procedure for this model. The consistency and asymptotic normality of the proposed estimator are established under certain regularity conditions. As an empirical illustration, we investigate the causal effect of the regional unemployment rate on the crime rate using Japanese city data.
This paper proposes a general two-directional simultaneous inference (TOSI) framework for high-dimensional models with a manifest variable or latent variable structure, for example, high-dimensional mean models, high-dimensional sparse regression models, and high-dimensional latent factor models. TOSI performs simultaneous inference on a set of parameters from two directions: one tests whether the parameters assumed to be zero are indeed zero, and the other tests whether there are zeros among the parameters assumed to be nonzero. As a result, we can exactly identify whether the parameters are zeros, thereby keeping the data structure fully and parsimoniously expressed. We theoretically prove that the proposed TOSI method asymptotically controls the Type I error at the prespecified significance level and that the testing power converges to one. Simulations are conducted to examine the performance of the proposed method in finite-sample situations, and two real datasets are analyzed. The results show that the TOSI method is more predictive and has more interpretable estimators than existing methods.
Variance estimation is important for statistical inference. It becomes non-trivial when observations are masked by serial dependence structures and time-varying mean structures. Existing methods either ignore or sub-optimally handle these nuisance structures. This paper develops a general framework for the estimation of the long-run variance for time series with non-constant means. The building blocks are difference statistics. The proposed class of estimators is general enough to cover many existing estimators. Necessary and sufficient conditions for consistency are investigated. The first asymptotically optimal estimator is derived. Our proposed estimator is theoretically proven to be invariant to arbitrary mean structures, which may include trends and a possibly divergent number of discontinuities.
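To illustrate why difference statistics are natural building blocks, the sketch below uses the classical first-difference variance estimator in the simplest setting: independent noise around a smooth trend. This is only the independent-noise special case, not the long-run variance estimator developed in the paper, and the simulated series and function names are hypothetical. Differencing adjacent observations nearly cancels the trend, whereas the ordinary sample variance absorbs it.

```python
import random


def difference_variance(series):
    """First-difference estimator: mean of squared lag-1 differences, halved."""
    n = len(series)
    squared = sum((series[i + 1] - series[i]) ** 2 for i in range(n - 1))
    return squared / (2 * (n - 1))


def sample_variance(series):
    """Ordinary sample variance, which treats the mean as constant."""
    n = len(series)
    mean = sum(series) / n
    return sum((v - mean) ** 2 for v in series) / (n - 1)


# Hypothetical data: linear trend from 5 to 15 plus unit-variance noise.
rng = random.Random(0)
n = 2000
series = [5.0 + 10.0 * i / n + rng.gauss(0.0, 1.0) for i in range(n)]

diff_est = difference_variance(series)
naive_est = sample_variance(series)
print("difference-based estimate:", diff_est)
print("naive sample variance:", naive_est)
```

The true noise variance here is 1. The difference-based estimate should land close to it, while the naive estimate is inflated by the trend. Under serial dependence, lag-1 differences alone are no longer sufficient, which is the setting the abstract addresses.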
We introduce a nonparametric graphical model for discrete node variables based on additive conditional independence. Additive conditional independence is a three-way statistical relation that shares similar properties with conditional independence by satisfying the semi-graphoid axioms. Based on this relation we build an additive graphical model for discrete variables that does not suffer from the restriction of a parametric model such as the Ising model. We develop an estimator of the new graphical model via the penalized estimation of the discrete version of the additive precision operator and establish the consistency of the estimator under the ultrahigh-dimensional setting. Along with these methodological developments, we also exploit the properties of discrete random variables to uncover a deeper relation between additive conditional independence and conditional independence than previously known. The new graphical model reduces to a conditional independence graphical model under certain sparsity conditions. We conduct simulation experiments and analysis of an HIV antiretroviral therapy data set to compare the new method with existing ones.
This study concerns estimation of the probability distribution of the sample maximum. The traditional approach is parametric fitting to the limiting distribution, the generalized extreme value distribution; however, the model in finite cases is misspecified to a certain extent. We propose a plug-in type of kernel distribution estimator which does not need model specification. It is proved that the asymptotic convergence rates of both approaches depend on the tail index and the second order parameter. As the tail becomes lighter, the degree of misspecification of the parametric fitting grows, which means the convergence rate becomes slower. In the Weibull cases, which can be seen as the limit of tail-lightness, only the nonparametric distribution estimator remains consistent. Finally, we report results of numerical experiments and two real case studies.
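One way to read the plug-in idea is sketched below, under the assumption that the distribution of the maximum of m independent draws is estimated by raising a kernel distribution estimate of F to the power m. The Gaussian kernel, the fixed bandwidth, the simulated data, and all function names are hypothetical choices for illustration; the abstract does not specify them.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function, used as the integrated kernel."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kernel_cdf(x, data, bandwidth):
    """Kernel distribution estimator: average of smoothed step functions."""
    return sum(normal_cdf((x - xi) / bandwidth) for xi in data) / len(data)


def max_cdf_plugin(x, data, bandwidth, m):
    """Plug-in estimate of P(max of m draws <= x), assuming independence."""
    return kernel_cdf(x, data, bandwidth) ** m


# Hypothetical data: 500 standard normal observations.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]
bandwidth = 0.3   # assumed fixed bandwidth, not a tuned value
m = 50            # block size whose maximum is of interest

grid = [i / 2 for i in range(9)]   # 0.0, 0.5, ..., 4.0
estimates = [max_cdf_plugin(x, data, bandwidth, m) for x in grid]
for x, p in zip(grid, estimates):
    print(x, p)
```

No extreme value model is fitted at any point: the only tuning choice is the bandwidth, which is the sense in which the estimator avoids model specification.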
This work studies an experimental design problem where $x$'s are to be selected with the goal of estimating a function $m(x)$, which is observed with noise. A linear model is fitted to $m(x)$ but it is not assumed that the model is correctly specified. It follows that the quantity of interest is the best linear approximation of $m(x)$, which is denoted by $\ell(x)$. It is shown that in this framework the ordinary least squares estimator typically leads to inconsistent estimation of $\ell(x)$, and that weighted least squares should be considered instead. An asymptotic minimax criterion is formulated for this estimator, and a design that minimizes the criterion is constructed. An important feature of this problem is that the $x$'s should be random, rather than fixed. Otherwise, the minimax risk is infinite. It is shown that the optimal random minimax design is different from its deterministic counterpart, which was studied previously, and a simulation study indicates that it generally performs better when $m(x)$ is a quadratic or a cubic function. Another finding is that when the variance of the noise goes to infinity, the random and deterministic minimax designs coincide. The results are illustrated for polynomial regression models and different generalizations are presented.
We consider the problem of static Bayesian inference for partially observed L\'{e}vy-process models. We develop a methodology which allows one to infer static parameters and some states of the process, without a bias from the time-discretization of the aforementioned L\'{e}vy process. The unbiased method is exceptionally amenable to parallel implementation and can be computationally efficient relative to competing approaches. We implement the method on S\&P 500 daily log-return data and compare it to a Markov chain Monte Carlo (MCMC) algorithm.
Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy to speed up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, so both types of variance must be mitigated to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and yields better generalization compared to the existing methods.
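The role of gradient information in sampling can be illustrated with a textbook fact: for a single-draw, importance-weighted gradient estimator, sampling each node with probability proportional to its gradient magnitude minimizes the variance. The sketch below computes the exact mean and variance for scalar per-node gradients. It covers only the stochastic gradient part of the decomposition, not the embedding approximation variance, and the gradient values and function names are hypothetical.

```python
def estimator_mean(grads, probs):
    """Expectation of g_i / (n * p_i) when node i is drawn with probability p_i."""
    n = len(grads)
    return sum(p * (g / (n * p)) for g, p in zip(grads, probs))


def estimator_variance(grads, probs):
    """Exact variance of the single-draw importance-weighted estimator."""
    n = len(grads)
    second_moment = sum(p * (g / (n * p)) ** 2 for g, p in zip(grads, probs))
    return second_moment - estimator_mean(grads, probs) ** 2


# Hypothetical per-node scalar gradients with very different magnitudes.
grads = [0.1, -2.0, 0.5, 3.0]
n = len(grads)

uniform = [1.0 / n] * n
total = sum(abs(g) for g in grads)
by_gradient = [abs(g) / total for g in grads]   # p_i proportional to |g_i|

mean_uniform = estimator_mean(grads, uniform)
mean_by_gradient = estimator_mean(grads, by_gradient)
var_uniform = estimator_variance(grads, uniform)
var_by_gradient = estimator_variance(grads, by_gradient)

print("means:", mean_uniform, mean_by_gradient)
print("variances:", var_uniform, var_by_gradient)
```

Both samplers are unbiased for the average gradient, so the comparison isolates the variance. In practice the exact per-node gradients are unavailable before sampling, which is why the abstract speaks of approximate gradient information.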
We propose a new method of estimation in topic models that is not a variation on the existing simplex finding algorithms, and that estimates the number of topics K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, even though we start out with the computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.