We propose a Small Area Estimation model based on Generalized Additive Models for Location, Scale and Shape (SAE-GAMLSS) for the estimation of household economic indicators. SAE-GAMLSS relaxes the exponential family distributional assumption and allows each distributional parameter to depend on covariates. A bootstrap approach to estimating the MSE is proposed. The SAE-GAMLSS estimator performs substantially better than the well-known EBLUP under various simulated scenarios. Using SAE-GAMLSS, we estimate the per-capita consumption of Italian and foreign households in Italian regions, in urban and rural areas. Results show that the well-known Italian North-South divide does not hold for foreigners.
Scientists continue to develop increasingly complex mechanistic models to reflect their knowledge more realistically. Statistical inference using these models can be challenging since the corresponding likelihood function is often intractable and model simulation may be computationally burdensome. Fortunately, in many of these situations, it is possible to adopt a surrogate model or approximate likelihood function. It may be convenient to conduct Bayesian inference directly with the surrogate, but this can result in bias and poor uncertainty quantification. In this paper we propose a new method for adjusting approximate posterior samples to reduce bias and produce more accurate uncertainty quantification. We do this by optimizing a transform of the approximate posterior that maximizes a scoring rule. Our approach requires only a (fixed) small number of complex model simulations and is numerically stable. We demonstrate good performance of the new method on several examples of increasing complexity.
Vision foundation models are a new frontier in Geospatial Artificial Intelligence (GeoAI), an interdisciplinary research area that applies and extends AI for geospatial problem solving and geographic knowledge discovery, because of their potential to enable powerful image analysis by learning and extracting important image features from vast amounts of geospatial data. This paper evaluates the performance of the first-of-its-kind geospatial foundation model, IBM-NASA's Prithvi, to support a crucial geospatial analysis task: flood inundation mapping. This model is compared with convolutional neural network and vision transformer-based architectures in terms of mapping accuracy for flooded areas. A benchmark dataset, Sen1Floods11, is used in the experiments, and the models' predictability, generalizability, and transferability are evaluated based on both a test dataset and a dataset that is completely unseen by the model. Results show the good transferability of the Prithvi model, highlighting its performance advantages in segmenting flooded areas in previously unseen regions. The findings also indicate areas for improvement for the Prithvi model in terms of adopting multi-scale representation learning, developing more end-to-end pipelines for high-level image analysis tasks, and offering more flexibility in terms of input data bands.
We construct a bipartite generalization of Alon and Szegedy's nearly orthogonal vectors, thereby obtaining strong bounds for several extremal problems involving the Lov\'asz theta function, vector chromatic number, minimum semidefinite rank, nonnegative rank, and extension complexity of polytopes. In particular, we derive some general lower bounds for the vector chromatic number which may be of independent interest.
Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We argue that this leads to a form of instability that lies at the heart of their generative capabilities and that can be described by a set of mean field critical exponents. We conclude by analyzing recent work connecting diffusion models and associative memory networks in view of the thermodynamic formulations.
We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. A zero-mean complex-valued Gaussian distribution is usually assumed for the generative model, where the speech information is encoded in the variance as a function of a latent variable. In contrast to this commonly used approach, we propose a weighted variance generative model, where the contribution of each spectrogram time-frame in parameter learning is weighted. We impose a Gamma prior distribution on the weights, which effectively leads to a Student's t-distribution instead of a Gaussian for speech generative modeling. We develop efficient training and speech enhancement algorithms based on the proposed generative model. Our experimental results on spectrogram auto-encoding and speech enhancement demonstrate the effectiveness and robustness of the proposed approach compared to the standard unweighted variance model.
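The step from a Gamma prior on the weights to a Student's t marginal can be made explicit with a short calculation. The parameterization below is an assumption for illustration and may differ from the one used in the paper: a single TF coefficient $s$ is zero-mean complex Gaussian with variance $\sigma^2/w$ (the weight $w$ acts as a precision scale), and $w$ has a Gamma prior with shape $\alpha$ and rate $\beta$.

```latex
\begin{align*}
p(s \mid w) &= \frac{w}{\pi\sigma^2}\exp\!\left(-\frac{w\,|s|^2}{\sigma^2}\right),
\qquad
p(w) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, w^{\alpha-1} e^{-\beta w}, \\
p(s) &= \int_0^\infty p(s \mid w)\, p(w)\, dw
      = \frac{\beta^{\alpha}}{\pi\sigma^2\,\Gamma(\alpha)}
        \int_0^\infty w^{\alpha}\, e^{-w\left(\beta + |s|^2/\sigma^2\right)}\, dw \\
     &= \frac{\beta^{\alpha}}{\pi\sigma^2\,\Gamma(\alpha)}\,
        \frac{\Gamma(\alpha+1)}{\left(\beta + |s|^2/\sigma^2\right)^{\alpha+1}}
      = \frac{\alpha}{\pi\sigma^2\beta}
        \left(1 + \frac{|s|^2}{\beta\sigma^2}\right)^{-(\alpha+1)}.
\end{align*}
```

Under the further assumption $\alpha=\beta=\nu/2$, this becomes $p(s)=\frac{1}{\pi\sigma^2}\left(1+\frac{2|s|^2}{\nu\sigma^2}\right)^{-(\nu/2+1)}$, a complex Student's t density with $\nu$ degrees of freedom whose polynomial tails are heavier than those of the Gaussian.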
Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.
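As a rough illustration of how an OWA-based linkage subsumes the classical ones, the sketch below computes an intercluster distance as a weighted sum of the sorted pairwise distances. The function names (`owa_linkage`, `single_weights`, and so on), the ascending sort order, and the use of Euclidean distances are assumptions made for this example, not the paper's definitions.

```python
import numpy as np

def owa_linkage(A, B, weights_fn):
    """OWA-style distance between clusters A and B (one point per row).

    The pairwise Euclidean distances are sorted in ascending order and
    combined with the weight vector returned by weights_fn(m), where m is
    the number of pairs. Ascending order is an assumption of this sketch.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    d = np.sort(pairwise.ravel())
    w = np.asarray(weights_fn(d.size), dtype=float)
    return float(np.dot(w, d))

def single_weights(m):
    # All weight on the smallest distance: single linkage.
    w = np.zeros(m)
    w[0] = 1.0
    return w

def complete_weights(m):
    # All weight on the largest distance: complete linkage.
    w = np.zeros(m)
    w[-1] = 1.0
    return w

def average_weights(m):
    # Uniform weights: average linkage.
    return np.full(m, 1.0 / m)

def k_nearest_weights(k):
    # Mean of the k smallest pairwise distances (k is clipped to m).
    def weights(m):
        kk = min(k, m)
        w = np.zeros(m)
        w[:kk] = 1.0 / kk
        return w
    return weights

A = [[0.0, 0.0], [1.0, 0.0]]
B = [[3.0, 0.0], [5.0, 0.0]]
print(owa_linkage(A, B, single_weights))        # 2.0
print(owa_linkage(A, B, complete_weights))      # 5.0
print(owa_linkage(A, B, average_weights))       # 3.5
print(owa_linkage(A, B, k_nearest_weights(2)))  # 2.5
```

Only the weight vector changes between the linkages, which is what makes it natural to study whole families of weight generators at once.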
Reinforcement learning is limited in practice primarily by the number of interactions it requires with virtual environments. As a result, for many learning methods it is implausible to obtain an optimal strategy within only a few attempts. We therefore design an improved reinforcement learning method based on model predictive control that models the environment through a data-driven approach. Based on the learned environment model, it performs multi-step prediction to estimate the value function and optimize the policy. The method demonstrates higher learning efficiency, faster convergence of strategies toward the optimal value, and a smaller sample capacity required by experience replay buffers. Experimental results, both in classic databases and in a dynamic obstacle avoidance scenario for an unmanned aerial vehicle, validate the proposed approach.
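The idea of estimating a value from multi-step predictions of a learned environment model can be sketched as follows. This is a minimal illustration under strong assumptions, not the method of the paper: the dynamics model is a linear least-squares fit, and the names `fit_linear_model`, `predict`, and `multi_step_value`, as well as the toy policy and reward, are hypothetical.

```python
import numpy as np

def fit_linear_model(S, A, S_next):
    """Least-squares fit of s' ~ [s, a, 1] @ W from logged transitions.

    S, A, S_next are 2-D arrays with one transition per row.
    """
    X = np.hstack([S, A, np.ones((len(S), 1))])
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return W

def predict(W, s, a):
    """One-step prediction of the next state under the fitted model."""
    return np.concatenate([s, a, [1.0]]) @ W

def multi_step_value(W, s0, policy, reward_fn, horizon,
                     gamma=0.99, terminal_value=lambda s: 0.0):
    """Discounted return of `policy` over `horizon` model-predicted steps,
    plus a discounted bootstrap value at the final predicted state."""
    s = np.asarray(s0, dtype=float)
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * reward_fn(s, a)
        s = predict(W, s, a)  # roll the learned model forward
        discount *= gamma
    return total + discount * terminal_value(s)

# Toy data from the (assumed) true dynamics s' = 0.5 * s + a.
rng = np.random.default_rng(0)
S = rng.normal(size=(50, 1))
A = rng.normal(size=(50, 1))
W = fit_linear_model(S, A, 0.5 * S + A)

zero_policy = lambda s: np.array([0.0])
quadratic_cost = lambda s, a: -float(s[0] ** 2)
print(multi_step_value(W, [1.0], zero_policy, quadratic_cost,
                       horizon=3, gamma=1.0))
```

In a model predictive control loop, such a value estimate would be recomputed for candidate action sequences at every step, so prediction errors in the learned model compound with the horizon and the horizon length becomes a tuning choice.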
Statistical models should accurately reflect analysts' domain knowledge about variables and their relationships. While recent tools let analysts express these assumptions and use them to produce a resulting statistical model, it remains unclear what analysts want to express and how externalization impacts statistical model quality. This paper addresses these gaps. We first conduct an exploratory study of analysts using a domain-specific language (DSL) to express conceptual models. We observe a preference for detailing how variables relate and a desire to allow, and then later resolve, ambiguity in their conceptual models. We leverage these findings to develop rTisane, a DSL for expressing conceptual models augmented with an interactive disambiguation process. In a controlled evaluation, we find that rTisane's DSL helps analysts engage more deeply with and accurately externalize their assumptions. rTisane also leads to statistical models that match analysts' assumptions, maintain analysis intent, and better fit the data.
Classical inequality curves and inequality measures are defined for distributions with finite mean value. Moreover, their empirical counterparts are not resistant to outliers. For these reasons, quantile versions of known inequality curves such as the Lorenz, Bonferroni, Zenga and $D$ curves, and quantile versions of inequality measures such as the Gini, Bonferroni, Zenga and $D$ indices have been proposed in the literature. We propose various nonparametric estimators of quantile versions of inequality curves and inequality measures, prove their consistency, and compare their accuracy in a simulation study. We also give examples of the use of quantile versions of inequality measures in real data analysis.
Let ${\mathcal P}$ be a family of probability measures on a measurable space $(S,{\mathcal A}).$ Given a Banach space $E,$ a functional $f:E\mapsto {\mathbb R}$ and a mapping $\theta: {\mathcal P}\mapsto E,$ our goal is to estimate $f(\theta(P))$ based on i.i.d. observations $X_1,\dots, X_n\sim P, P\in {\mathcal P}.$ In particular, if ${\mathcal P}=\{P_{\theta}: \theta\in \Theta\}$ is an identifiable statistical model with parameter set $\Theta\subset E,$ one can consider the mapping $\theta(P)=\theta$ for $P\in {\mathcal P}, P=P_{\theta},$ resulting in a problem of estimation of $f(\theta)$ based on i.i.d. observations $X_1,\dots, X_n\sim P_{\theta}, \theta\in \Theta.$ Given a smooth functional $f$ and estimators $\hat \theta_n(X_1,\dots, X_n), n\geq 1$ of $\theta(P),$ we use these estimators, the sample split and the Taylor expansion of $f(\theta(P))$ of a proper order to construct estimators $T_f(X_1,\dots, X_n)$ of $f(\theta(P)).$ For these estimators and for a functional $f$ of smoothness $s\geq 1,$ we prove upper bounds on the $L_p$-errors of estimator $T_f(X_1,\dots, X_n)$ under certain moment assumptions on the base estimators $\hat \theta_n.$ We study the performance of estimators $T_f(X_1,\dots, X_n)$ in several concrete problems, showing their minimax optimality and asymptotic efficiency. In particular, this includes functional estimation in high-dimensional models with many low dimensional components, functional estimation in high-dimensional exponential families and estimation of functionals of covariance operators in infinite-dimensional subgaussian models.