We consider the estimation of measures of model performance in a target population when covariate and outcome data are available on a sample from some source population and covariate data, but not outcome data, are available on a simple random sample from the target population. When outcome data are not available from the target population, identification of measures of model performance is possible under an untestable assumption that the outcome and population (source or target population) are independent conditional on covariates. In practice, this assumption is uncertain and, in some cases, controversial. Therefore, sensitivity analysis may be useful for examining the impact of assumption violations on inferences about model performance. Here, we propose an exponential tilt sensitivity analysis model and develop statistical methods to determine how sensitive measures of model performance are to violations of the assumption of conditional independence between outcome and population. We provide identification results and estimators for the risk in the target population, examine the large-sample properties of the estimators, and apply the estimators to data on individuals with stable ischemic heart disease.
"The rich are getting richer" implies that population income distributions are becoming more right-skewed and heavy-tailed. For such distributions, the mean is not the best measure of the center, but the classical indices of income inequality, including the celebrated Gini index, are all mean-based. In view of this, Professor Gastwirth sounded an alarm back in 2014 by suggesting that the median be incorporated into the definition of the Gini index, although he noted a few shortcomings of his proposed index. In the present paper we take a further step in the modification of classical indices and, to acknowledge the possibility of differing viewpoints, arrive at three median-based indices of inequality. They avoid the shortcomings of the previous indices and can be used even when populations are ultra heavy-tailed, that is, when their first moments are infinite. The new indices are illustrated both analytically and numerically using parametric families of income distributions, and further illustrated using capital incomes coming from 2001 and 2018 surveys of fifteen European countries. We also discuss the performance of the indices from the perspective of income transfers.
Partially linear additive models generalize linear ones: they model the relation between a response variable and covariates by assuming that some covariates have a linear relation with the response while each of the others enters through an unknown univariate smooth function. The harmful effect of outliers, either in the residuals or in the covariates involved in the linear component, has been described for partially linear models, that is, when only one nonparametric component is involved in the model. When dealing with additive components, providing reliable estimators when atypical data arise is of practical importance, which motivates the need for robust procedures. Hence, we propose a family of robust estimators for partially linear additive models by combining $B$-splines with robust linear regression estimators. We obtain consistency results, rates of convergence and asymptotic normality for the linear components, under mild assumptions. A Monte Carlo study is carried out to compare the performance of the robust proposal with its classical counterpart under different models and contamination schemes. The numerical experiments show the advantage of the proposed methodology for finite samples. We also illustrate the usefulness of the proposed approach on a real data set.
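For orientation, the model class can be written as follows. The notation is assumed for illustration and is not taken from the abstract: $\mathbf{z}_i$ collects the covariates that enter linearly, $x_{ij}$ are the covariates that enter through smooth functions $\eta_j$, and $B_{js}$ are $B$-spline basis functions with coefficients $\lambda_{js}$.

```latex
y_i = \mu + \boldsymbol{\beta}^{\top} \mathbf{z}_i + \sum_{j=1}^{q} \eta_j(x_{ij}) + \varepsilon_i,
\qquad
\eta_j(x) \approx \sum_{s=1}^{k_j} \lambda_{js} B_{js}(x)
```

Once each $\eta_j$ is replaced by its spline expansion, the model is linear in $(\mu, \boldsymbol{\beta}, \lambda_{js})$, which is what makes it possible to apply a robust linear regression estimator to the expanded design.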
A population-averaged additive subdistribution hazards model is proposed to assess the marginal effects of covariates on the cumulative incidence function and to analyze correlated failure time data subject to competing risks. This approach extends the population-averaged additive hazards model by accommodating potentially dependent censoring due to competing events other than the event of interest. Assuming an independent working correlation structure, an estimating equations approach is outlined to estimate the regression coefficients and a new sandwich variance estimator is proposed. The proposed sandwich variance estimator accounts for both the correlations between failure times and between the censoring times, and is robust to misspecification of the unknown dependency structure within each cluster. We further develop goodness-of-fit tests to assess the adequacy of the additive structure of the subdistribution hazards for the overall model and each covariate. Simulation studies are conducted to investigate the performance of the proposed methods in finite samples. We illustrate our methods using data from the STrategies to Reduce Injuries and Develop confidence in Elders (STRIDE) trial.
Occupancy models are frequently used by ecologists to quantify spatial variation in species distributions while accounting for observational biases in the collection of detection-nondetection data. However, the common assumption that a single set of regression coefficients can adequately explain species-environment relationships is often unrealistic, especially across large spatial domains. Here we develop single-species (i.e., univariate) and multi-species (i.e., multivariate) spatially-varying coefficient (SVC) occupancy models to account for spatially-varying species-environment relationships. We employ Nearest Neighbor Gaussian Processes and Polya-Gamma data augmentation in a hierarchical Bayesian framework to yield computationally efficient Gibbs samplers, which we implement in the spOccupancy R package. For multi-species models, we use spatial factor dimension reduction to efficiently model datasets with large numbers of species (e.g., > 10). The hierarchical Bayesian framework readily enables generation of posterior predictive maps of the SVCs, with fully propagated uncertainty. We apply our SVC models to quantify spatial variability in the relationships between maximum breeding season temperature and occurrence probability of 21 grassland bird species across the U.S. Jointly modeling species generally outperformed single-species models, which all revealed substantial spatial variability in species occurrence relationships with maximum temperatures. Our models are particularly relevant for quantifying species-environment relationships using detection-nondetection data from large-scale monitoring programs, which are becoming increasingly prevalent for answering macroscale ecological questions regarding wildlife responses to global change.
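The data-generating structure behind such a model can be sketched with a small simulation. This is an illustration under loud assumptions and is not the spOccupancy implementation. The spatially varying slope is a fixed linear trend in a hypothetical coordinate `lon`, not a Gaussian process. Detection probability is constant. All names are invented for the example.

```python
import math
import random


def inv_logit(v):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def simulate_svc_occupancy(n_sites=200, n_visits=4, p_detect=0.5, seed=1):
    """Simulate detection-nondetection data with a spatially varying slope."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        lon = rng.random()            # hypothetical site coordinate in [0, 1]
        temp = rng.gauss(0.0, 1.0)    # standardised temperature covariate
        # Assumed spatial trend: the temperature effect runs from -1 to +1.
        beta_temp = -1.0 + 2.0 * lon
        psi = inv_logit(0.0 + beta_temp * temp)   # occurrence probability
        z = 1 if rng.random() < psi else 0        # latent true occupancy
        # A species can only be detected at sites where it is present.
        y = [1 if (z == 1 and rng.random() < p_detect) else 0
             for _ in range(n_visits)]
        sites.append({"lon": lon, "temp": temp, "beta_temp": beta_temp,
                      "psi": psi, "z": z, "y": y})
    return sites


sites = simulate_svc_occupancy()
naive = sum(1 for s in sites if any(s["y"])) / len(sites)
truth = sum(s["z"] for s in sites) / len(sites)
print(naive, truth)
```

The naive proportion of sites with at least one detection can never exceed the true occupied proportion. This is the observational bias that occupancy models correct for by modelling detection separately from occurrence.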
Cooperative Intelligent Transport Systems (C-ITS) create, share and process massive amounts of data, which need to be managed in real time to enable new cooperative and autonomous driving applications. Vehicle-to-Everything (V2X) communications facilitate information exchange among vehicles and infrastructures using various protocols. By providing computing power, data storage, and low-latency capabilities, Multi-access Edge Computing (MEC) has become a key enabling technology in the transport industry. The Local Dynamic Map (LDM) concept has consequently been extended for use in MEC, yielding an efficient, collaborative, and centralised Edge Dynamic Map (EDM) for C-ITS applications. This research presents an EDM architecture for V2X communications and implements a real-time proof-of-concept using a Time-Series Database (TSDB) engine to store vehicular message information. The performance evaluation covers data insertion and querying, assessing the system's capacity and scalability for low-latency Cooperative Awareness Message (CAM) applications. Traffic simulations using SUMO have been employed to generate virtual routes for thousands of vehicles, demonstrating the transmission of virtual CAM messages to the EDM.
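The insert-and-query workload evaluated above can be illustrated with a toy in-memory stand-in for a time-series store. The abstract does not name the TSDB engine or its schema, so everything here is assumed: the class names, the simplified record fields, and the sorted-list storage are hypothetical and only show the access pattern (append by timestamp, read back a time window).

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class CamRecord:
    """Simplified, hypothetical subset of the fields a CAM might carry."""
    timestamp_ms: int
    station_id: int
    latitude: float
    longitude: float
    speed_mps: float


class InMemoryTimeSeriesStore:
    """Toy store that keeps records sorted by timestamp for window queries."""

    def __init__(self):
        self._timestamps = []
        self._records = []

    def insert(self, record):
        # Keep both lists aligned and ordered, even for late-arriving messages.
        idx = bisect.bisect_right(self._timestamps, record.timestamp_ms)
        self._timestamps.insert(idx, record.timestamp_ms)
        self._records.insert(idx, record)

    def query_window(self, start_ms, end_ms):
        """Return records with start_ms <= timestamp_ms < end_ms."""
        lo = bisect.bisect_left(self._timestamps, start_ms)
        hi = bisect.bisect_left(self._timestamps, end_ms)
        return self._records[lo:hi]


store = InMemoryTimeSeriesStore()
store.insert(CamRecord(2000, 7, 43.30, -2.00, 12.5))
store.insert(CamRecord(1000, 7, 43.29, -2.01, 11.0))  # arrives out of order
store.insert(CamRecord(1500, 9, 43.31, -1.99, 8.2))
print([r.station_id for r in store.query_window(1000, 2000)])
```

A real TSDB adds persistence, indexing by tags such as station identifier, and retention policies. The sketch only shows why time-ordered storage makes recent-window queries cheap.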
Linear regression and classification models with repeated functional data are considered. For each statistical unit in the sample, a real-valued parameter is observed over time under different conditions. Two regression models based on fusion penalties are presented. The first one is a generalization of the variable fusion model based on the 1-nearest neighbor. The second one, called group fusion lasso, assumes some grouping structure of conditions and allows for homogeneity among the regression coefficient functions within groups. A finite sample numerical simulation and an application on EEG data are presented.
Interest in the network analysis of bibliographic data has increased significantly in recent years. Yet, appropriate statistical models for examining the full dynamics of scientific citation networks, connecting authors to the papers they write and papers to other papers they cite, are not available. Very few studies exist that have examined how the social network between co-authors and the citation network among the papers shape one another and co-evolve. In consequence, our understanding of scientific citation networks remains incomplete. In this paper we extend recently derived relational hyperevent models (RHEM) to the analysis of scientific networks, providing a general framework to model the multiple dependencies involved in the relation linking multiple authors to the papers they write, and papers to the multiple references they cite. We demonstrate the empirical value of our model in an analysis of publicly available data on a scientific network comprising millions of authors and papers and assess the relative strength of various effects explaining scientific production. We outline the implications of the model for the evaluation of scientific research.
We consider the degree-Rips construction from topological data analysis, which provides a density-sensitive, multiparameter hierarchical clustering algorithm. We analyze its stability to perturbations of the input data using the correspondence-interleaving distance, a metric for hierarchical clusterings that we introduce. Taking certain one-parameter slices of degree-Rips recovers well-known methods for density-based clustering, but we show that these methods are unstable. However, we prove that degree-Rips, as a multiparameter object, is stable, and we propose an alternative approach for taking slices of degree-Rips, which yields a one-parameter hierarchical clustering algorithm with better stability properties. We prove that this algorithm is consistent, using the correspondence-interleaving distance. We provide an algorithm for extracting a single clustering from one-parameter hierarchical clusterings, which is stable with respect to the correspondence-interleaving distance. We integrate these methods into a pipeline for density-based clustering, which we call Persistable. Adapting tools from multiparameter persistent homology, we propose visualization tools that guide the selection of all parameters of the pipeline. We demonstrate Persistable on benchmark datasets, showing that it identifies multi-scale cluster structure in data.
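As a rough illustration of the two-parameter idea, the sketch below computes the clusters of a degree-Rips graph at a single pair of parameters. It assumes one common convention: connect points at distance at most `s`, keep only vertices with at least `k` neighbours, and take connected components. Normalisations differ between references, and this is not the Persistable implementation. The function name is hypothetical.

```python
import math
from itertools import combinations


def degree_rips_clusters(points, s, k):
    """Connected components of the Rips graph at scale s, restricted to
    vertices that have at least k neighbours (assumed convention)."""
    n = len(points)
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= s:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Density filter: degrees are computed in the full Rips graph.
    kept = {i for i in range(n) if len(neighbors[i]) >= k}

    clusters, seen = [], set()
    for start in sorted(kept):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search within the kept vertices
            v = stack.pop()
            component.append(v)
            for w in neighbors[v]:
                if w in kept and w not in seen:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(component))
    return clusters


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
for k in (0, 1, 2):
    print(k, degree_rips_clusters(points, s=1.5, k=k))
```

Increasing `k` at fixed `s` first removes the isolated point and then the sparsely connected endpoints. Varying both parameters together gives the multiparameter hierarchy whose slices are discussed above.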
In the context of robotics, accurate ground truth positioning is essential for the development of Simultaneous Localization and Mapping (SLAM) and control algorithms. Robotic Total Stations (RTSs) provide accurate and precise reference positions in different types of outdoor environments, especially when compared to the limited accuracy of Global Navigation Satellite System (GNSS) in cluttered areas. Three RTSs make it possible to obtain the six-Degrees Of Freedom (DOF) reference pose of a robotic platform. However, the uncertainty of every pose is rarely computed for trajectory evaluation. As evaluation algorithms are getting increasingly precise, it becomes crucial to take this uncertainty into account. We propose a method to compute this six-DOF uncertainty from the fusion of three RTSs based on Monte Carlo (MC) methods. This solution relies on point-to-point minimization to propagate the noise of the RTSs to the pose of the robotic platform. Five main noise sources are identified to model this uncertainty: noise inherent to the instrument, tilt noise, atmospheric factors, time synchronization noise, and extrinsic calibration noise. Based on extensive experimental work, we compare the impact of each noise source on the prism uncertainty and the final estimated pose. Tested on more than 50 km of trajectories, our comparison highlights the importance of the calibration noise and the measurement distance, which should ideally be under 75 m. Moreover, we note that the uncertainty of the robot's pose is not prominently affected by any one noise source compared to the others.
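The Monte Carlo propagation idea can be sketched in a deliberately simplified planar setting. This is an assumption-laden illustration, not the paper's method. It estimates a 2D pose (heading and translation) and not a six-DOF pose. It uses a single isotropic Gaussian noise term in place of the five noise sources listed above. The prism layout and noise level are hypothetical.

```python
import math
import random
import statistics


def fit_rigid_2d(body_pts, world_pts):
    """Closed-form point-to-point least squares: angle and translation
    that best map body_pts onto world_pts."""
    n = len(body_pts)
    bcx = sum(p[0] for p in body_pts) / n
    bcy = sum(p[1] for p in body_pts) / n
    wcx = sum(p[0] for p in world_pts) / n
    wcy = sum(p[1] for p in world_pts) / n
    dot = cross = 0.0
    for (bx, by), (wx, wy) in zip(body_pts, world_pts):
        bx, by, wx, wy = bx - bcx, by - bcy, wx - wcx, wy - wcy
        dot += bx * wx + by * wy
        cross += bx * wy - by * wx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, wcx - (c * bcx - s * bcy), wcy - (s * bcx + c * bcy)


def monte_carlo_pose_std(body_pts, true_pose, sigma, n_samples=500, seed=0):
    """Perturb the measured prism positions, refit the pose each time, and
    return the standard deviation of (theta, tx, ty) across samples.
    Assumes theta is far from +/- pi, so angle wrap-around is ignored."""
    rng = random.Random(seed)
    theta, tx, ty = true_pose
    c, s = math.cos(theta), math.sin(theta)
    clean = [(c * x - s * y + tx, s * x + c * y + ty) for x, y in body_pts]
    samples = []
    for _ in range(n_samples):
        noisy = [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
                 for x, y in clean]
        samples.append(fit_rigid_2d(body_pts, noisy))
    return tuple(statistics.pstdev(column) for column in zip(*samples))


# Hypothetical prism positions in the robot frame (metres).
prisms = [(0.3, 0.0), (-0.2, 0.25), (-0.2, -0.25)]
print(monte_carlo_pose_std(prisms, true_pose=(0.3, 5.0, -2.0), sigma=0.005))
```

The spread of the refitted poses serves as the uncertainty estimate. In a fuller treatment each noise source would have its own sampling model and the fit would be done in 3D.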
The development of technologies for causal inference with the privacy preservation of distributed data has attracted considerable attention in recent years. To address this issue, we propose a data collaboration quasi-experiment (DC-QE) that enables causal inference from distributed data with privacy preservation. In our method, first, local parties construct dimensionality-reduced intermediate representations from the private data. Second, they share the intermediate representations instead of the private data to preserve privacy. Third, propensity scores are estimated from the shared intermediate representations. Finally, the treatment effects are estimated from the propensity scores. Our method can reduce both random errors and biases, whereas existing methods can only reduce random errors in the estimation of treatment effects. Through numerical experiments on both artificial and real-world data, we confirmed that our method can lead to better estimation results than individual analyses. Dimensionality reduction loses some of the information in the private data and causes performance degradation. However, we observed in the experiments that, by sharing intermediate representations among many parties to resolve the lack of subjects and covariates, our method improved performance enough to overcome the degradation caused by dimensionality reduction. With the spread of our method, intermediate representations can be published as open data to help researchers find causalities and can be accumulated as a knowledge base.
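The final step, estimating a treatment effect from propensity scores, can be illustrated with a standard inverse-probability-weighting (IPW) estimator of the average treatment effect. The abstract does not say which propensity-score estimator DC-QE uses, so IPW is an assumption here. The function name and toy data are hypothetical, and the earlier steps (building and sharing intermediate representations) are not shown.

```python
def ipw_ate(treatment, outcome, propensity):
    """Inverse-probability-weighted estimate of the average treatment effect.

    treatment  : 0/1 indicators
    outcome    : observed outcomes
    propensity : estimated P(treatment = 1 | covariates), strictly in (0, 1)
    """
    if not (len(treatment) == len(outcome) == len(propensity)):
        raise ValueError("inputs must have the same length")
    if any(e <= 0.0 or e >= 1.0 for e in propensity):
        raise ValueError("propensity scores must lie strictly between 0 and 1")
    n = len(outcome)
    # Weight treated outcomes by 1/e and control outcomes by 1/(1 - e).
    treated_mean = sum(t * y / e
                       for t, y, e in zip(treatment, outcome, propensity)) / n
    control_mean = sum((1 - t) * y / (1.0 - e)
                       for t, y, e in zip(treatment, outcome, propensity)) / n
    return treated_mean - control_mean


# Toy data: with constant propensity 0.5 this reduces to a difference in means.
print(ipw_ate([1, 0, 1, 0], [3.0, 1.0, 3.0, 1.0], [0.5, 0.5, 0.5, 0.5]))
```

In the setting described above, the propensity scores passed to such a function would come from a model fitted on the shared intermediate representations and not on the raw private covariates.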