Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry-breaking phenomena. We argue that this leads to a form of instability that lies at the heart of their generative capabilities and that can be described by a set of mean-field critical exponents. We conclude by analyzing recent work connecting diffusion models and associative memory networks in light of this thermodynamic formulation.
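As a minimal illustrative sketch of such a symmetry-breaking instability (our own toy example, not a result quoted from the paper): take one-dimensional data concentrated on $\{\pm\mu\}$ with equal weights, so that after Gaussian noising with variance $\sigma_t^2$ the marginal is a symmetric two-component mixture with score
\[
\nabla_x \log p_t(x) \;=\; \frac{1}{\sigma_t^2}\Big(\mu\,\tanh\!\big(\mu x/\sigma_t^2\big) - x\Big),
\]
which linearizes at the origin to $(\mu^2/\sigma_t^2 - 1)\,x/\sigma_t^2$. The symmetric point $x=0$ therefore switches from attracting to repelling in the score-driven reverse dynamics once $\mu^2 > \sigma_t^2$, the analogue of a mean-field critical point at which a trajectory commits to one of the two modes.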
Influenced mixed moving average fields are a versatile modeling class for spatio-temporal data. However, their predictive distribution is not generally known. Under this modeling assumption, we define a novel spatio-temporal embedding and a theory-guided machine learning approach that employs a generalized Bayesian algorithm to make ensemble forecasts. We employ Lipschitz predictors and derive fixed-time and any-time PAC-Bayesian bounds in the batch learning setting. Performing causal forecasting is a highlight of our methodology, as is its potential application to data with spatial and temporal short- and long-range dependence. We then test the performance of our learning methodology using linear predictors and data sets simulated from a spatio-temporal Ornstein-Uhlenbeck process.
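As a minimal illustration of the kind of experiment described (not the authors' code; the grid of sites, the parameter values, and the ridge-regularized linear predictor below are our own assumptions), one can simulate a crude spatio-temporal Ornstein-Uhlenbeck-type field with spatially correlated noise and fit a linear one-step-ahead predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Spatial sites on a line; innovations correlated with exponential decay in distance.
sites = np.arange(5)
dist = np.abs(sites[:, None] - sites[None, :])
noise_cov = np.exp(-dist / 2.0)
L = np.linalg.cholesky(noise_cov)

theta, dt, T = 0.5, 0.1, 2000          # mean reversion, time step, number of steps
X = np.zeros((T, len(sites)))
for t in range(1, T):
    eps = L @ rng.standard_normal(len(sites))
    X[t] = X[t - 1] - theta * X[t - 1] * dt + np.sqrt(dt) * eps

# Linear one-step-ahead predictor for the middle site from all current values,
# fitted by ridge-regularized least squares.
target_site = 2
features, targets = X[:-1], X[1:, target_site]
lam = 1e-3
beta = np.linalg.solve(features.T @ features + lam * np.eye(features.shape[1]),
                       features.T @ targets)
pred = features @ beta
print("one-step-ahead RMSE:", np.sqrt(np.mean((pred - targets) ** 2)))
```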
Numerical analysis for the stochastic Stokes equations is still challenging, even though it is well developed for the corresponding deterministic equations. In particular, the pre-existing error estimates of finite element methods for the stochastic Stokes equations in the $L^\infty(0, T; L^2(\Omega; L^2))$ norm all suffer from order reduction with respect to the spatial discretization. The best convergence result obtained for these fully discrete schemes is only half-order in time and first-order in space, which is not optimal in space in the traditional sense. The objective of this article is to establish strong convergence of $O(\tau^{1/2}+ h^2)$ in the $L^\infty(0, T; L^2(\Omega; L^2))$ norm for approximating the velocity, and strong convergence of $O(\tau^{1/2}+ h)$ in the $L^{\infty}(0, T;L^2(\Omega;L^2))$ norm for approximating the time integral of the pressure, where $\tau$ and $h$ denote the temporal step size and spatial mesh size, respectively. The error estimates are of optimal order for the spatial discretization considered in this article (the MINI element) and are consistent with the numerical experiments. The analysis is based on the fully discrete Stokes semigroup technique and the corresponding new estimates.
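For orientation, a schematic form of the problem and of the error bounds claimed above is (standard notation of ours; the noise structure, domain, and discrete pressure average are indicated only loosely):
\[
\mathrm{d}u = (\Delta u - \nabla p + f)\,\mathrm{d}t + \mathrm{d}W(t),\qquad \nabla\cdot u = 0,
\]
\[
\max_{0\le n\le N}\big(\mathbb{E}\,\|u(t_n)-u_h^n\|_{L^2}^2\big)^{1/2}\le C\big(\tau^{1/2}+h^2\big),
\qquad
\max_{0\le n\le N}\Big(\mathbb{E}\,\big\|\int_0^{t_n} p\,\mathrm{d}s-\tau\sum_{m=1}^{n} p_h^m\big\|_{L^2}^2\Big)^{1/2}\le C\big(\tau^{1/2}+h\big).
\]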
We consider the estimation of generalized additive models using basis expansions coupled with Bayesian model selection. Although Bayesian model selection is an intuitively appealing tool for regression splines, its use has traditionally been limited to Gaussian additive regression because a tractable form of the marginal model likelihood is available only in that setting. We extend the method to the exponential family of distributions using a Laplace approximation to the likelihood. Although the approach works with any Gaussian-type prior distribution, there remains a lack of consensus regarding the best prior distribution for nonparametric regression through model selection. We observe that the classical unit information prior for variable selection may not be well suited to nonparametric regression using basis expansions. Instead, our investigation reveals that mixtures of g-priors are more suitable. We consider various mixtures of g-priors and evaluate their performance in estimating generalized additive models. Furthermore, we conduct a comparative analysis of several priors for the knots to identify the most practically effective strategy. Our extensive simulation studies demonstrate the superiority of model selection-based approaches over other Bayesian methods.
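As a reminder of the prior family under discussion (standard definitions written in our own generic notation, not the paper's exact specification), Zellner's g-prior on the basis coefficients and a mixture of g-priors of hyper-$g$ type take the form
\[
\beta \mid g \sim \mathcal{N}\!\big(0,\; g\,\sigma^2 (B^\top B)^{-1}\big),
\qquad
\pi(g) \propto (1+g)^{-a/2},\quad g>0,\ a>2,
\]
where $B$ is the basis (design) matrix. The unit information prior corresponds to fixing $g=n$, whereas mixing over $g$ lets the effective amount of shrinkage adapt to the data.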
The impact of outliers and anomalies on model estimation and data processing is of paramount importance, as evidenced by the extensive body of research spanning various fields over several decades: thousands of research papers have been published on the subject. As a consequence, numerous reviews, surveys, and textbooks have sought to summarize the existing literature, encompassing a wide range of methods from both the statistical and data mining communities. While these endeavors to organize and summarize the research are invaluable, they face inherent challenges due to the pervasive nature of outliers and anomalies in all data-intensive applications, irrespective of the specific application field or scientific discipline. As a result, the resulting collection of papers remains voluminous and somewhat heterogeneous. To address the need for knowledge organization in this domain, this paper presents the first systematic meta-survey of general surveys and reviews on outlier and anomaly detection. Employing a classical systematic survey approach, the study collects nearly 500 papers using two specialized scientific search engines. From this comprehensive collection, a subset of 56 papers that claim to be general surveys on outlier detection is selected using a snowball search technique to enhance field coverage. A meticulous quality assessment phase further refines the selection to a subset of 25 high-quality general surveys. Using this curated collection, the paper investigates the evolution of the outlier detection field over a 20-year period, revealing emerging themes and methods. Furthermore, an analysis of the surveys sheds light on the survey writing practices adopted by scholars from different communities who have contributed to this field. Finally, the paper delves into several topics where consensus has emerged from the literature. These include taxonomies of outlier types, challenges posed by high-dimensional data, the importance of anomaly scores, the impact of learning conditions, difficulties in benchmarking, and the significance of neural networks. Aspects on which no consensus has been reached are also discussed, particularly the distinction between local and global outliers and the challenges in organizing detection methods into meaningful taxonomies.
In survival analysis, complex machine learning algorithms have been increasingly used for predictive modeling. Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. In particular, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of infection over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to infection are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 study to inform enrollment strategies for future HIV vaccine trials.
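One common way to formalize such an algorithm-agnostic importance measure (our schematic notation, not necessarily the exact estimand of the paper) is as a difference in oracle predictiveness,
\[
\psi_s \;=\; V\!\big(f_0, P_0\big) \;-\; V\!\big(f_{0,-s}, P_0\big),
\]
where $V$ is a measure of predictiveness (for example a concordance index or a time-dependent AUC adapted to right censoring), $f_0$ is the best possible prediction function using all features, and $f_{0,-s}$ is the best possible prediction function excluding the feature subset $s$.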
Gaussian graphical models are useful tools for inferring the conditional independence structure of multivariate random variables. Unfortunately, Bayesian inference of latent graph structures is challenging due to the exponential growth of $\mathcal{G}_n$, the set of all graphs on $n$ vertices. One approach that has been proposed to tackle this problem is to restrict the search to subsets of $\mathcal{G}_n$. In this paper, we study subsets that are vector subspaces, with the cycle space $\mathcal{C}_n$ as the main example. We propose a novel prior on $\mathcal{C}_n$ based on linear combinations of cycle basis elements and present its theoretical properties. Using this prior, we implement a Markov chain Monte Carlo algorithm, and show that (i) posterior edge inclusion estimates computed with our technique are comparable to estimates from the standard technique despite searching a smaller graph space, and (ii) the vector space perspective enables straightforward implementation of MCMC algorithms.
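To make the cycle-space idea concrete (an illustrative sketch using networkx with a toy graph of our own choosing, not the authors' implementation): every element of the cycle space is a symmetric difference of cycle-basis elements, so a binary coefficient vector over a basis indexes one graph in the search space.

```python
import networkx as nx
import numpy as np

# Toy graph whose cycle space we parameterize.
G = nx.complete_graph(5)

# Each basis cycle is returned as a list of nodes; convert it to an edge set.
def cycle_edges(cycle):
    return {frozenset(e) for e in zip(cycle, cycle[1:] + cycle[:1])}

basis = [cycle_edges(c) for c in nx.cycle_basis(G)]

# A binary coefficient vector over the basis picks out one element of the
# cycle space via symmetric differences of the selected basis cycles.
rng = np.random.default_rng(1)
coeffs = rng.integers(0, 2, size=len(basis))
edges = set()
for c, cyc in zip(coeffs, basis):
    if c:
        edges ^= cyc   # symmetric difference, i.e. addition over GF(2)

graph = nx.Graph()
graph.add_edges_from(tuple(e) for e in edges)
print("basis size:", len(basis), " selected graph edges:", list(graph.edges()))
```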
This research note provides algebraic characterizations of the least model, subsumption, and uniform equivalence of propositional Krom logic programs.
We formalize an interpretational error that is common in statistical causal inference, termed identity slippage. This formalism is used to describe historically recognized fallacies and to analyze a fast-growing literature in statistics and applied fields. We conducted a systematic review of natural language claims in the literature on stochastic mediation parameters and documented extensive evidence of identity slippage in applications. This framework for error detection is applicable whenever policy decisions depend on the accurate interpretation of statistical results, which is nearly always the case. Therefore, broad awareness of identity slippage will aid statisticians in the successful translation of data into public good.
In Gaussian graphical models, the likelihood equations must typically be solved iteratively. We investigate two algorithms: a version of iterative proportional scaling which avoids inversion of large matrices, and an algorithm based on convex duality that operates on the covariance matrix by neighbourhood coordinate descent, corresponding to the graphical lasso with zero penalty. For large, sparse graphs, the iterative proportional scaling algorithm appears feasible and has simple convergence properties. The algorithm based on neighbourhood coordinate descent is extremely fast and less dependent on sparsity, but needs a positive definite starting value to converge. We give an algorithm for finding such a starting value for graphs with low colouring number. As a consequence, we also obtain a simplified proof of the existence of the maximum likelihood estimator in such cases.
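For concreteness, a textbook-style sketch of iterative proportional scaling on a toy example (this naive version inverts small clique blocks and the full concentration matrix directly and is our own; it is not the inversion-avoiding variant studied in the paper):

```python
import numpy as np

def ips(S, cliques, n_iter=100):
    """Fit the concentration matrix K of a Gaussian graphical model by
    iterative proportional scaling, given sample covariance S and a clique list."""
    p = S.shape[0]
    K = np.eye(p)
    for _ in range(n_iter):
        for C in cliques:
            C = np.ix_(C, C)
            Sigma = np.linalg.inv(K)          # model covariance at the current K
            # Match the model's clique marginal to the sample clique marginal.
            K[C] += np.linalg.inv(S[C]) - np.linalg.inv(Sigma[C])
    return K

# 4-cycle graph 1-2-3-4-1 with cliques given by its edges.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 4))
S = np.cov(A, rowvar=False)
cliques = [[0, 1], [1, 2], [2, 3], [3, 0]]
K = ips(S, cliques)
print(np.round(K, 3))
```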
This study examines the varying coefficient model in tail index regression. The varying coefficient model is an efficient semiparametric model that avoids the curse of dimensionality when a large number of covariates is included in the model. Indeed, the varying coefficient model is useful in mean regression, quantile regression, and other settings, and tail index regression is no exception. Although the varying coefficient model is flexible, leaner and simpler models are preferred in applications. It is therefore important to assess whether the estimated coefficient functions vary significantly with the covariates. If the nonlinear effect is weak, the varying coefficient structure can be reduced to a simpler model, such as a constant or zero. Accordingly, hypothesis tests for model assessment in varying coefficient models have been studied for mean and quantile regression, but no such results exist for tail index regression. In this study, we investigate the asymptotic properties of an estimator and provide a hypothesis testing method for varying coefficient models in tail index regression.
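To fix ideas (schematic notation of ours, following the usual tail index regression setup rather than the paper's exact formulation), the model of interest can be written as
\[
\Pr\big(Y > y \mid X = x,\ Z = z\big) \;\approx\; c(x,z)\, y^{-1/\gamma(x,z)},
\qquad
\gamma(x,z) \;=\; \exp\!\big\{x^\top \theta(z)\big\},
\]
where the coefficient functions $\theta(\cdot)$ vary with an index variable $z$. The model assessment problem is then the test of $H_0\colon \theta_j(\cdot)\equiv \theta_j$ (a constant, possibly zero) against a genuinely varying alternative.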