Regression models for compositional data are common in several areas of knowledge. As in other classes of regression models, it is desirable to perform diagnostic analysis in these models using residuals that are approximately standard normally distributed. However, for regression models for compositional data, no multivariate residual with this property has been available. In this work, we introduce a class of asymptotically standard normally distributed residuals for compositional data based on the bootstrap. Monte Carlo simulation studies indicate that the distributions of the residuals of this class are well approximated by the standard normal distribution in small samples. An application to simulated data also suggests that one of the residuals of the proposed class is better at identifying model misspecification than its competitors. Finally, the usefulness of the best residual of the proposed class is illustrated through an application on sleep stages. The class of residuals proposed here can also be used in other classes of multivariate regression models.
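As a loose illustration of the bootstrap idea behind such residuals, the sketch below maps a raw residual to an approximately standard normal scale by plugging its empirical bootstrap CDF into the standard normal quantile function. This is a generic construction under stated assumptions, not the specific residual class proposed in the abstract; `fit_model`, `simulate_response`, and `raw_residuals` are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_quantile_residuals(y, X, fit_model, simulate_response, B=500, rng=None):
    """Generic bootstrap residuals: transform raw residuals to an approximately
    N(0, 1) scale via their empirical bootstrap CDF.  fit_model(y, X) is a
    hypothetical fitting routine whose result exposes raw_residuals();
    simulate_response(model, X, rng) simulates a new response from the fit."""
    rng = np.random.default_rng(rng)
    model = fit_model(y, X)
    raw = model.raw_residuals()                     # observed raw residuals, shape (n,)
    boot = np.empty((B, raw.shape[0]))
    for b in range(B):
        y_star = simulate_response(model, X, rng)   # parametric bootstrap sample
        boot[b] = fit_model(y_star, X).raw_residuals()
    # empirical bootstrap CDF at the observed residuals, kept away from 0 and 1
    u = (np.sum(boot <= raw, axis=0) + 0.5) / (B + 1)
    return norm.ppf(u)                              # approximately standard normal
```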
As a generalization of graphs, hypergraphs are widely used to model higher-order relations in data. This paper explores the benefit of the hypergraph structure for the interpolation of point cloud data that contain no explicit structural information. We define the $\varepsilon_n$-ball hypergraph and the $k_n$-nearest neighbor hypergraph on a point cloud and study the $p$-Laplacian regularization on the hypergraphs. We prove the variational consistency between the hypergraph $p$-Laplacian regularization and the continuum $p$-Laplacian regularization in a semisupervised setting when the number of points $n$ goes to infinity while the number of labeled points remains fixed. A key improvement compared to the graph case is that the results rely on weaker assumptions on the upper bound of $\varepsilon_n$ and $k_n$. To solve the convex but non-differentiable large-scale optimization problem, we utilize the stochastic primal-dual hybrid gradient algorithm. Numerical experiments on data interpolation verify that the hypergraph $p$-Laplacian regularization outperforms the graph $p$-Laplacian regularization in preventing the development of spikes at the labeled points.
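The following sketch illustrates the graph (not hypergraph) counterpart of the regularization discussed above: semi-supervised interpolation on a $k$-nearest neighbor graph by plain subgradient descent on the $p$-Laplacian energy. The paper instead works with hypergraphs and a stochastic primal-dual hybrid gradient solver; this is only a minimal baseline under those simplifying assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_p_laplacian_interpolation(X, labeled_idx, labels, k=10, p=4,
                                  n_iter=2000, lr=1e-2):
    """Minimise sum_{i~j} |u_i - u_j|^p over a k-NN graph with u fixed to the
    given labels on labeled_idx.  Subgradient descent sketch; the step size lr
    may need tuning depending on p and the label scale."""
    n = X.shape[0]
    tree = cKDTree(X)
    _, nbrs = tree.query(X, k=k + 1)          # first neighbour of each point is itself
    rows = np.repeat(np.arange(n), k)
    cols = nbrs[:, 1:].ravel()
    u = np.zeros(n)
    u[labeled_idx] = labels
    free = np.setdiff1d(np.arange(n), labeled_idx)
    for _ in range(n_iter):
        diff = u[rows] - u[cols]
        g = p * np.sign(diff) * np.abs(diff) ** (p - 1)
        grad = np.zeros(n)
        np.add.at(grad, rows, g)
        np.add.at(grad, cols, -g)
        u[free] -= lr * grad[free]            # labeled values stay fixed
    return u
```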
Linkage methods are among the most popular algorithms for hierarchical clustering. Despite their relevance, the current knowledge regarding the quality of the clustering produced by these methods is limited. Here, we improve the currently available bounds on the maximum diameter of the clustering obtained by complete-link for metric spaces. One of our new bounds, in contrast to the existing ones, allows us to separate complete-link from single-link in terms of approximation for the diameter, which corroborates the common perception that the former is more suitable than the latter when the goal is producing compact clusters. We also show that our techniques can be employed to derive upper bounds on the cohesion of a class of linkage methods that includes the quite popular average-link.
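For concreteness, the quantity at the center of these bounds can be computed as follows: cut a complete-link dendrogram into $k$ clusters and report the largest intra-cluster diameter. This is a minimal sketch for points in Euclidean space; the bounds in the abstract concern general metric spaces.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def complete_link_max_diameter(X, k):
    """Run complete-link hierarchical clustering, cut the dendrogram into k
    clusters, and return the maximum intra-cluster diameter."""
    d = pdist(X)                                  # condensed pairwise distances
    D = squareform(d)                             # full distance matrix
    Z = linkage(d, method='complete')             # complete-link dendrogram
    labels = fcluster(Z, t=k, criterion='maxclust')
    diam = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if idx.size > 1:
            diam = max(diam, D[np.ix_(idx, idx)].max())
    return diam
```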
We present a novel approach for modeling bounded count time series data by deriving accurate upper and lower bounds for the variance of a bounded count random variable while maintaining a fixed mean. Leveraging these bounds, we propose semiparametric mean and variance joint (MVJ) models utilizing a clipped-Laplace link function. These models offer a flexible and feasible structure for both the mean and the variance, accommodating various scenarios of under-dispersion, equi-dispersion, or over-dispersion in bounded time series. The proposed MVJ models feature a linear mean structure with positive regression coefficients summing to one and allow for negative regression coefficients and autocorrelations. We demonstrate that the autocorrelation structure of MVJ models mirrors that of an autoregressive moving-average (ARMA) process, provided the proposed clipped-Laplace link functions with nonnegative regression coefficients summing to one are utilized. We establish conditions ensuring the stationarity and ergodicity properties of the MVJ process, along with demonstrating the consistency and asymptotic normality of the conditional least squares estimators. To aid model selection and diagnostics, we introduce two model selection criteria and apply two model diagnostic statistics. Finally, we conduct simulations and real data analyses to investigate the finite-sample properties of the proposed MVJ models, providing insights into their efficacy and applicability in practical scenarios.
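For context, the classical sharp bounds on the variance of a count variable $Y$ supported on $\{0, 1, \ldots, m\}$ with fixed mean $\mu$ are
\[
(\mu - \lfloor \mu \rfloor)\,(\lceil \mu \rceil - \mu) \;\le\; \operatorname{Var}(Y) \;\le\; \mu\,(m - \mu),
\]
where the lower bound is attained by a distribution supported on the two integers adjacent to $\mu$ and the upper bound by a two-point distribution on $\{0, m\}$. The accurate bounds derived in the work above may refine or differ from these classical ones.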
Capturing the extremal behaviour of data often requires bespoke marginal and dependence models which are grounded in rigorous asymptotic theory, and hence provide reliable extrapolation into the upper tails of the data-generating distribution. We present a toolbox of four methodological frameworks, motivated by modern extreme value theory, that can be used to accurately estimate extreme exceedance probabilities or the corresponding level in either a univariate or multivariate setting. Our frameworks were used to facilitate the winning contribution of Team Yalla to the EVA (2023) Conference Data Challenge, which was organised for the 13$^\text{th}$ International Conference on Extreme Value Analysis. This competition comprised seven teams competing across four separate sub-challenges, with each requiring the modelling of data simulated from known, yet highly complex, statistical distributions, and extrapolation far beyond the range of the available samples in order to predict probabilities of extreme events. Data were constructed to be representative of real environmental data, sampled from the fantasy country of "Utopia".
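One standard building block for estimating extreme exceedance probabilities in the univariate case is the peaks-over-threshold approach: fit a generalised Pareto distribution to exceedances of a high threshold and extrapolate beyond the observed sample. The sketch below shows this baseline only; the competition entry described above uses considerably richer marginal and dependence models.

```python
import numpy as np
from scipy.stats import genpareto

def pot_exceedance_prob(x, level, threshold_quantile=0.95):
    """Estimate P(X > level) far in the tail via peaks-over-threshold:
    P(X > level) ~= P(X > u) * GPD_survival(level - u)."""
    x = np.asarray(x)
    u = np.quantile(x, threshold_quantile)       # high threshold
    exc = x[x > u] - u                           # threshold exceedances
    zeta_u = exc.size / x.size                   # empirical P(X > u)
    xi, _, sigma = genpareto.fit(exc, floc=0.0)  # GPD fit with location fixed at 0
    return zeta_u * genpareto.sf(level - u, xi, loc=0.0, scale=sigma)
```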
Detecting multiple unknown objects in noisy data is a key problem in many scientific fields, such as electron microscopy imaging. A common model for the unknown objects is the linear subspace model, which assumes that the objects can be expanded in some known basis (such as the Fourier basis). In this paper, we develop an object detection algorithm that, under the linear subspace model, is asymptotically guaranteed to detect all objects while controlling the family-wise error rate or the false discovery rate. Numerical simulations show that the algorithm also controls the error rate while retaining high power in the non-asymptotic regime, even in highly challenging settings. We apply the proposed algorithm to an experimental electron microscopy data set and show that it outperforms existing standard software.
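The false discovery rate control mentioned above is typically achieved with a multiple-testing step such as the Benjamini-Hochberg procedure applied to per-candidate p-values. The sketch below shows only that generic step; how the per-candidate test statistics are built under the linear subspace model is specific to the paper's algorithm and not reproduced here.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: return a boolean mask of
    rejected (detected) candidates at nominal FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # BH thresholds q*k/m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                      # reject the k smallest p-values
    return reject
```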
We develop an inferential toolkit for analyzing object-valued responses, which correspond to data situated in general metric spaces, paired with Euclidean predictors within the conformal framework. To this end, we introduce conditional profile average transport costs, where we compare distance profiles, i.e., one-dimensional distributions of the probability mass falling into balls of increasing radius, through the optimal transport cost of moving from one distance profile to another. The average transport cost to transport a given distance profile to all others is crucial for statistical inference in metric spaces and underpins the proposed conditional profile scores. A key feature of the proposed approach is to utilize the distribution of conditional profile average transport costs as a conformity score for general metric space-valued responses, which facilitates the construction of prediction sets by the split conformal algorithm. We derive the uniform convergence rate of the proposed conformity score estimators and establish asymptotic conditional validity for the prediction sets. The finite-sample performance for synthetic data in various metric spaces demonstrates that the proposed conditional profile score outperforms existing methods in terms of both the coverage level and the size of the resulting prediction sets, even in the special case of scalar and thus Euclidean responses. We also demonstrate the practical utility of conditional profile scores for network data from New York taxi trips and for compositional data reflecting the energy sourcing of U.S. states.
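The split conformal mechanics referred to above can be written independently of the particular conformity score. The sketch below assumes a score function already fitted on a separate training split and a finite set of candidate responses to screen; the paper plugs in conditional profile average transport costs, whereas here the score is left abstract.

```python
import numpy as np

def split_conformal_set(score_fn, X_calib, Y_calib, x_new, candidates, alpha=0.1):
    """Split conformal prediction set for x_new.  score_fn(x, y) is any
    conformity score fitted on a separate training split; candidates is an
    iterable of candidate responses to keep or discard."""
    scores = np.array([score_fn(x, y) for x, y in zip(X_calib, Y_calib)])
    n = scores.size
    # conformal quantile with the usual finite-sample correction
    level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(scores, min(level, 1.0), method='higher')
    return [y for y in candidates if score_fn(x_new, y) <= q]
```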
Linear structural vector autoregressive models can be identified statistically without imposing restrictions on the model if the shocks are mutually independent and at most one of them is Gaussian. We show that this result extends to structural threshold and smooth transition vector autoregressive models incorporating a time-varying impact matrix defined as a weighted sum of the impact matrices of the regimes. Our empirical application studies the effects of the climate policy uncertainty shock on the U.S. macroeconomy. In a structural logistic smooth transition vector autoregressive model consisting of two regimes, we find that a positive climate policy uncertainty shock decreases production in times of low economic policy uncertainty but slightly increases it in times of high economic policy uncertainty. The introduced methods are implemented in the accompanying R package sstvars.
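To make the time-varying impact matrix concrete, the sketch below forms it as a weighted sum of two regime impact matrices with logistic transition weights, as described in the abstract. Parameter names are illustrative only; the actual implementation and parameterization live in the R package sstvars.

```python
import numpy as np

def logistic_transition_weight(s_t, gamma, c):
    """Logistic transition function G(s_t) in (0, 1): slope gamma, location c,
    driven by the switching variable s_t (illustrative parameterization)."""
    return 1.0 / (1.0 + np.exp(-gamma * (np.asarray(s_t) - c)))

def time_varying_impact_matrix(B1, B2, s_t, gamma, c):
    """Impact matrix of a two-regime logistic smooth transition SVAR at time t,
    formed as a weighted sum of the regime impact matrices B1 and B2."""
    w = logistic_transition_weight(s_t, gamma, c)
    return (1.0 - w) * B1 + w * B2
```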
Risk prediction models are increasingly used in healthcare to aid in clinical decision making. In most clinical contexts, model calibration (i.e., assessing the reliability of risk estimates) is critical. Data available for model development are often not perfectly balanced with respect to the modeled outcome (i.e., individuals with vs. without the event of interest are not equally represented in the data). It is common for researchers to correct this class imbalance, yet the effect of such imbalance corrections on the calibration of machine learning models is largely unknown. We studied the effect of imbalance corrections on model calibration for a variety of machine learning algorithms. Using extensive Monte Carlo simulations, we compared the out-of-sample predictive performance of models developed with an imbalance correction to those developed without a correction for class imbalance across different data-generating scenarios (varying the sample size, the number of predictors, and the event fraction). Our findings were illustrated in a case study using MIMIC-III data. In all simulation scenarios, prediction models developed without a correction for class imbalance consistently had equal or better calibration performance than prediction models developed with a correction for class imbalance. The miscalibration introduced by correcting for class imbalance was characterized by an over-estimation of risk and could not always be corrected by re-calibration. Correcting for class imbalance is not always necessary and may even be harmful for clinical prediction models that aim to produce reliable risk estimates on an individual basis.
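A minimal sketch of the kind of comparison run in such simulations is shown below: fit the same learner with and without naive random oversampling of the minority class and compare calibration curves on held-out data. It assumes NumPy arrays, a binary outcome with class 1 as the minority event, and uses only one learner and one correction, far less than the study described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split

def compare_calibration(X, y, seed=0):
    """Calibration curves for a model fit on the original (imbalanced) data
    versus one fit after random oversampling of the minority class (class 1)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # naive random oversampling of class 1 up to a 1:1 ratio (assumes class 1 is minority)
    rng = np.random.default_rng(seed)
    minority = np.where(y_tr == 1)[0]
    majority = np.where(y_tr == 0)[0]
    extra = rng.choice(minority, size=majority.size - minority.size, replace=True)
    idx = np.concatenate([majority, minority, extra])
    over = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    curves = {}
    for name, model in [("no correction", base), ("oversampled", over)]:
        frac_pos, mean_pred = calibration_curve(y_te, model.predict_proba(X_te)[:, 1],
                                                n_bins=10)
        curves[name] = (mean_pred, frac_pos)   # perfectly calibrated => diagonal
    return curves
```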
Subdata selection is the study of methods that select a small, representative sample of a big data set, the analysis of which is fast and statistically efficient. Existing subdata selection methods assume that the big data can be reasonably modeled using an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the response and for both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini impurity than a uniform subdata. Further, we demonstrate that the PED subdata has higher classification accuracy than other competing methods through extensive studies on simulated and real datasets.
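The partition-then-sample idea can be sketched as follows: a decision tree partitions the big data into leaves, a sample is drawn from each leaf, and a random forest is fit on the selected subdata. This is only an illustration under simple assumptions (equal leaf-wise sample sizes, default tree-growing rules); the actual PED construction may differ in these choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def tree_partition_subdata(X, y, n_per_leaf=50, max_leaf_nodes=64, seed=0):
    """Partition the data with a decision tree, draw up to n_per_leaf points
    from each leaf, and fit a random forest on the selected subdata."""
    rng = np.random.default_rng(seed)
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes,
                                  random_state=seed).fit(X, y)
    leaves = tree.apply(X)                      # leaf id of every observation
    keep = []
    for leaf in np.unique(leaves):
        idx = np.where(leaves == leaf)[0]
        take = min(n_per_leaf, idx.size)
        keep.append(rng.choice(idx, size=take, replace=False))
    keep = np.concatenate(keep)
    forest = RandomForestClassifier(random_state=seed).fit(X[keep], y[keep])
    return keep, forest
```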
In real life, data are often of poor quality as a result, for instance, of uncertainty, mismeasurements, missing values or bad inputs. This issue hampers an implicit yet crucial operation of every database management system: equality testing. Indeed, equality is, in the end, a context-dependent operation with a plethora of interpretations. In practice, the treatment of different types of equality is left to programmers, who have to struggle with those interpretations in their code. We propose a new lattice-based declarative framework to address this problem. It allows specification of numerous semantics for equality at a high level of abstraction. To go beyond tuple equality, we study functional dependencies (FDs) in the light of our framework. First, we define abstract FDs, generalizing classical FDs. These lead to the consideration of particular interpretations of equality: realities. Building upon realities and possible/certain answers, we introduce possible/certain FDs and give some related complexity results.
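As a toy illustration of equality being context dependent, the sketch below checks whether a functional dependency holds on a small relation under a caller-supplied equality predicate (here, numeric equality up to a measurement tolerance). The lattice-based framework described above is far more general, covering realities and possible/certain FDs; the data and tolerance here are purely hypothetical.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs, eq):
    """Check whether the functional dependency lhs -> rhs holds on a list of
    dict-tuples under the equality predicate eq(a, b)."""
    for r1, r2 in combinations(rows, 2):
        if all(eq(r1[a], r2[a]) for a in lhs) and not all(eq(r1[a], r2[a]) for a in rhs):
            return False
    return True

# hypothetical relation: weights measured with noise, prices exact
rows = [{'id': 1, 'weight': 10.00, 'price': 5},
        {'id': 2, 'weight': 10.01, 'price': 5},
        {'id': 3, 'weight': 12.00, 'price': 7}]
approx = lambda a, b: abs(a - b) <= 0.05     # equality up to a tolerance
print(fd_holds(rows, ['weight'], ['price'], approx))   # True under this interpretation
```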