The Horvitz-Thompson (HT), the Rao-Hartley-Cochran (RHC) and the generalized regression (GREG) estimators of the finite population mean are considered, when the observations are from an infinite-dimensional space. We compare these estimators based on their asymptotic distributions under some commonly used sampling designs and some superpopulations satisfying linear regression models. We show that the GREG estimator is asymptotically at least as efficient as either of the other two estimators under the different sampling designs considered in this paper. Further, we show that the use of some well-known sampling designs utilizing auxiliary information may have an adverse effect on the performance of the GREG estimator, when the degree of heteroscedasticity present in linear regression models is not very large. On the other hand, the use of those sampling designs improves the performance of this estimator, when the degree of heteroscedasticity present in linear regression models is large. We develop methods for determining the degree of heteroscedasticity, which in turn determines the choice of appropriate sampling design to be used with the GREG estimator. We also investigate the consistency of the covariance operators of the above estimators. We carry out some numerical studies using real and synthetic data, and our theoretical results are supported by the results obtained from those numerical studies.
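As a point of reference, the HT estimator of a finite population mean weights each sampled observation by the inverse of its inclusion probability and divides the weighted total by the population size. The sketch below is illustrative only: it treats each infinite-dimensional observation as a curve discretized on a common grid (an assumption made here, not a construction taken from the paper), and the function name `ht_mean` and the toy data are hypothetical.

```python
from typing import List, Sequence


def ht_mean(sample_curves: Sequence[Sequence[float]],
            inclusion_probs: Sequence[float],
            population_size: int) -> List[float]:
    """Horvitz-Thompson estimate of the population mean curve.

    Each sampled curve is weighted by 1 / pi_i, and the weighted sum is
    divided by the population size N, pointwise on the grid.
    """
    if len(sample_curves) != len(inclusion_probs):
        raise ValueError("one inclusion probability is needed per sampled curve")
    grid_len = len(sample_curves[0])
    total = [0.0] * grid_len
    for curve, pi in zip(sample_curves, inclusion_probs):
        if not 0.0 < pi <= 1.0:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        for j, value in enumerate(curve):
            total[j] += value / pi
    return [t / population_size for t in total]


# Hypothetical toy data: N = 4 units, n = 2 sampled by simple random
# sampling without replacement, so every pi_i equals n / N = 0.5.
curves = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
estimate = ht_mean(curves, [0.5, 0.5], population_size=4)
```

With equal inclusion probabilities n/N, as in this toy case, the HT estimate coincides with the pointwise sample mean; unequal probabilities would reweight the curves accordingly.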
Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and noise level $\sigma^2$, i.e., to quantify how much variation in the response variable can be explained by the covariates, versus how much of the variation is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given with no responses $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, while linear regression is used, no sparsity assumptions on the coefficients are made. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data, whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate the optimal improvement. In the second framework, such optimality no longer holds and we suggest two zero-estimators that improve the naive estimator, although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
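The variance-reduction idea behind zero-estimators can be illustrated with a generic control-variate adjustment. The sketch below is not the estimator proposed in the thesis: it targets the simpler quantity $E[Y]$ rather than $\tau^2$ or $\sigma^2$, assumes a toy one-covariate model, treats the mean of $X$ as known (as if learned from a very large unlabeled sample), and uses hypothetical names throughout. A naive estimator $T$ is adjusted by a statistic $Z$ with expected value zero, using the coefficient $c = \mathrm{Cov}(T, Z)/\mathrm{Var}(Z)$.

```python
import random
import statistics

random.seed(0)
MU_X = 1.0  # assumption: the mean of X is treated as known from unlabeled data


def one_replicate(n):
    """Draw one labeled sample and return (naive estimator, zero-estimator)."""
    xs = [random.gauss(MU_X, 1.0) for _ in range(n)]
    ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]  # toy linear model
    t = statistics.fmean(ys)          # naive estimator of E[Y]
    z = statistics.fmean(xs) - MU_X   # zero-estimator: E[Z] = 0
    return t, z


# Repeat the experiment to study the sampling variability of T and Z.
reps = [one_replicate(50) for _ in range(2000)]
ts = [t for t, _ in reps]
zs = [z for _, z in reps]

# Coefficient c = Cov(T, Z) / Var(Z), estimated here from the replicates.
t_bar, z_bar = statistics.fmean(ts), statistics.fmean(zs)
cov_tz = sum((t - t_bar) * (z - z_bar) for t, z in reps) / (len(reps) - 1)
c = cov_tz / statistics.variance(zs)

# Adjusted estimator T - c * Z keeps the same expectation as T.
adjusted = [t - c * z for t, z in reps]
var_naive = statistics.variance(ts)
var_adjusted = statistics.variance(adjusted)
```

Because `c` is computed from the same replicates, the in-sample variance of the adjusted estimator cannot exceed that of the naive one. In practice the coefficient has to be estimated from the data at hand, which is one reason an optimal improvement may only be approximated.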
Bayesian network (BN) structure discovery algorithms typically either make assumptions about the sparsity of the true underlying network, or are limited by computational constraints to networks with a small number of variables. While these sparsity assumptions can take various forms, frequently the assumptions focus on an upper bound for the maximum in-degree $\nabla_G$ of the underlying graph. Theorem 2 in Duttweiler et al. (2023) demonstrates that the largest eigenvalue of the normalized inverse covariance matrix ($\Omega$) of a linear BN is a lower bound for $\nabla_G$. Building on this result, this paper provides the asymptotic properties of, and a debiasing procedure for, the sample eigenvalues of $\Omega$, leading to a hypothesis test that may be used to determine if the BN has max in-degree greater than 1. A linear BN structure discovery workflow is suggested in which the investigator uses this hypothesis test to aid in selecting an appropriate structure discovery algorithm. The hypothesis test performance is evaluated through simulations and the workflow is demonstrated on data from a human psoriasis study.
Despite the progress in medical data collection, the actual burden of SARS-CoV-2 remains unknown due to under-ascertainment of cases. This was apparent in the acute phase of the pandemic, and reported deaths have been pointed out as a more reliable source of information, likely less prone to under-reporting. Since daily deaths occur from past infections weighted by their probability of death, one may infer the total number of infections, accounting for their age distribution, using the data on reported deaths. We adopt this framework and assume that the dynamics generating the total number of infections can be described by a continuous-time transmission model expressed through a system of non-linear ordinary differential equations, where the transmission rate is modelled as a diffusion process that reveals both the effect of control strategies and the changes in individuals' behavior. We develop this flexible Bayesian tool in Stan and study 3 pairs of European countries, estimating the time-varying reproduction number ($R_t$) as well as the true cumulative number of infected individuals. As we estimate the true number of infections, we offer a more accurate estimate of $R_t$. We also provide an estimate of the daily reporting ratio and discuss the effects of changes in mobility and testing on the inferred quantities.
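The link between past infections and daily deaths described in this abstract can be written as a discrete convolution. The following is a minimal sketch under stated assumptions, not the Stan model from the paper: it uses a single infection fatality ratio instead of age-specific ones, and the names `expected_deaths`, `ifr`, and `delay_pmf`, as well as all numbers, are hypothetical.

```python
from typing import List, Sequence


def expected_deaths(infections: Sequence[float],
                    ifr: float,
                    delay_pmf: Sequence[float]) -> List[float]:
    """Expected daily deaths implied by a daily infection series.

    deaths[t] = ifr * sum_s infections[t - s] * delay_pmf[s],
    where delay_pmf[s] is the probability that a fatal infection
    leads to death s days after infection.
    """
    deaths = []
    for t in range(len(infections)):
        acc = 0.0
        for s, p in enumerate(delay_pmf):
            if t - s >= 0:  # only infections that have already occurred
                acc += infections[t - s] * p
        deaths.append(ifr * acc)
    return deaths


# Hypothetical inputs: infections doubling daily, a 1% fatality ratio, and
# deaths occurring one or two days after infection with equal probability.
infections = [100.0, 200.0, 400.0, 800.0]
delay_pmf = [0.0, 0.5, 0.5]
deaths = expected_deaths(infections, ifr=0.01, delay_pmf=delay_pmf)
```

Inference runs in the opposite direction: given observed deaths, a Bayesian model places a prior on the infection dynamics and inverts this relationship, which is why assumptions about the fatality ratio and the delay distribution directly affect the inferred number of infections.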
Population-based structural health monitoring (PBSHM) aims to share valuable information among members of a population, such as normal- and damage-condition data, to improve inferences regarding the health states of the members. Even when the population is comprised of nominally-identical structures, benign variations among the members will exist as a result of slight differences in material properties, geometry, boundary conditions, or environmental effects (e.g., temperature changes). These discrepancies can affect modal properties and present as changes in the characteristics of the resonance peaks of the frequency response function (FRF). Many SHM strategies depend on monitoring the dynamic properties of structures, so benign variations can be challenging for the practical implementation of these systems. Another common challenge with vibration-based SHM is data loss, which may result from transmission issues, sensor failure, a sample-rate mismatch between sensors, and other causes. Missing data in the time domain will result in decreased resolution in the frequency domain, which can impair dynamic characterisation. The hierarchical Bayesian approach provides a useful modelling structure for PBSHM, because statistical distributions at the population and individual (or domain) level are learnt simultaneously to bolster statistical strength among the parameters. As a result, variance is reduced among the parameter estimates, particularly when data are limited. In this paper, combined probabilistic FRF models are developed for a small population of nominally-identical helicopter blades under varying temperature conditions, using a hierarchical Bayesian structure. These models address critical challenges in SHM by accommodating benign variations that present as differences in the underlying dynamics, while also considering (and utilising) the similarities among the blades.
Gaussian Process Networks (GPNs) are a class of directed graphical models which employ Gaussian processes as priors for the conditional expectation of each variable given its parents in the network. The model allows describing continuous joint distributions in a compact but flexible manner with minimal parametric assumptions on the dependencies between variables. Bayesian structure learning of GPNs requires computing the posterior over graphs of the network and is computationally infeasible even in low dimensions. This work implements Monte Carlo and Markov Chain Monte Carlo methods to sample from the posterior distribution of network structures. As such, the approach follows the Bayesian paradigm, comparing models via their marginal likelihood and computing the posterior probability of the GPN features. Simulation studies show that our method outperforms state-of-the-art algorithms in recovering the graphical structure of the network and provides an accurate approximation of its posterior distribution.
We consider the estimation of a factor model-based variance-covariance matrix when the factor loading matrix is assumed to be sparse. To do so, we rely on a system of penalized estimating functions to account for the identification issue of the factor loading matrix while fostering sparsity in potentially all its entries. We prove the oracle property of the penalized estimator for the factor model when the dimension is fixed. That is, the penalization procedure can recover the true sparse support, and the estimator is asymptotically normally distributed. Consistency and recovery of the true zero entries are established when the number of parameters is diverging. These theoretical results are supported by simulation experiments, and the relevance of the proposed method is illustrated by an application to portfolio allocation.
The heterogeneous, geographically distributed infrastructure of fog computing poses challenges in data replication, data distribution, and data mobility for fog applications. Fog computing is still missing the necessary abstractions to manage application data, and fog application developers need to re-implement data management for every new piece of software. Proposed solutions are limited to certain application domains, such as the IoT, are not flexible in regard to network topology, or do not provide the means for applications to control the movement of their data. In this paper, we present FReD, a data replication middleware for the fog. FReD serves as a building block for configurable fog data distribution and enables low-latency, high-bandwidth, and privacy-sensitive applications. FReD is a common data access interface across heterogeneous infrastructure and network topologies, provides transparent and controllable data distribution, and can be integrated with applications from different domains. To evaluate our approach, we present a prototype implementation of FReD and show the benefits of developing with FReD using three case studies of fog computing applications.
Bayesian Optimization (BO) is a class of black-box, surrogate-based heuristics that can efficiently optimize problems that are expensive to evaluate, and hence admit only small evaluation budgets. BO is particularly popular for solving numerical optimization problems in industry, where the evaluation of objective functions often relies on time-consuming simulations or physical experiments. However, many industrial problems depend on a large number of parameters. This poses a challenge for BO algorithms, whose performance is often reported to suffer when the dimension grows beyond 15 variables. Although many new algorithms have been proposed to address this problem, it is not well understood which one is the best for which optimization scenario. In this work, we compare five state-of-the-art high-dimensional BO algorithms with vanilla BO and CMA-ES on the 24 BBOB functions of the COCO environment at increasing dimensionality, ranging from 10 to 60 variables. Our results confirm the superiority of BO over CMA-ES for limited evaluation budgets and suggest that the most promising approach to improve BO is the use of trust regions. However, we also observe significant performance differences for different function landscapes and budget exploitation phases, indicating improvement potential, e.g., through hybridization of algorithmic components.
The state of the art related to parameter correlation in two-parameter models has been reviewed in this paper. The apparent contradictions between different authors regarding the ability of D-optimality to simultaneously reduce the correlation and the area of the confidence ellipse in two-parameter models were analyzed. Two main approaches were found: 1) those who consider that the optimality criteria simultaneously control the precision and correlation of the parameter estimators; and 2) those who consider a combination of criteria to achieve the same objective. An analytical criterion is provided that combines in its structure both the optimality of the precision of the parameter estimators and the reduction of the correlation between them. To show its performance, the criterion was tested both in a simple linear regression model, considering all possible design spaces, and in a non-linear model with strong correlation between the parameter estimators (Michaelis--Menten). This criterion outperformed the other strategies and criteria at simultaneously controlling precision and correlation.
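To make the two quantities under discussion concrete, the sketch below computes, for simple linear regression with an intercept and a slope, the determinant of the information matrix (the quantity that D-optimality maximizes, and which is inversely related to the area of the confidence ellipse) and the correlation between the intercept and slope estimators. It illustrates standard formulas only, not the combined criterion proposed in the paper; the function name `design_summary` and the two example designs are hypothetical.

```python
import math
from typing import Sequence, Tuple


def design_summary(xs: Sequence[float]) -> Tuple[float, float]:
    """Return (det of X'X, correlation of intercept and slope estimators).

    For y = b0 + b1 * x + error, the information matrix is
    X'X = [[n, sum x], [sum x, sum x^2]], and the covariance of the
    least-squares estimators is proportional to its inverse.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx
    if det <= 0.0:
        raise ValueError("singular design: at least two distinct x values needed")
    # From the inverse of X'X: corr(b0_hat, b1_hat) = -sum x / sqrt(n * sum x^2)
    corr = -sx / math.sqrt(n * sxx)
    return det, corr


# Two hypothetical four-point designs: one centered at zero, one shifted by +1.
det_sym, corr_sym = design_summary([-1.0, -1.0, 1.0, 1.0])
det_shift, corr_shift = design_summary([0.0, 0.0, 2.0, 2.0])
```

In this model the determinant is unchanged by shifting the design, while the correlation is not: the two designs above share the same determinant, yet only the centered one gives uncorrelated estimators. This is one simple way to see why precision and correlation may need to be handled by separate or combined criteria.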
The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points. Yet, they can be inaccurate under high-dimensional noise, especially if the noise magnitude varies considerably across the data, e.g., under heteroskedasticity or outliers. In this work, we investigate a more robust alternative -- the doubly stochastic normalization of the Gaussian kernel. We consider a setting where points are sampled from an unknown density on a low-dimensional manifold embedded in high-dimensional space and corrupted by possibly strong, non-identically distributed, sub-Gaussian noise. We establish that the doubly stochastic affinity matrix and its scaling factors concentrate around certain population forms, and provide corresponding finite-sample probabilistic error bounds. We then utilize these results to develop several tools for robust inference under general high-dimensional noise. First, we derive a robust density estimator that reliably infers the underlying sampling density and can substantially outperform the standard kernel density estimator under heteroskedasticity and outliers. Second, we obtain estimators for the pointwise noise magnitudes, the pointwise signal magnitudes, and the pairwise Euclidean distances between clean data points. Lastly, we derive robust graph Laplacian normalizations that accurately approximate various manifold Laplacians, including the Laplace-Beltrami operator, improving over traditional normalizations in noisy settings. We exemplify our results in simulations and on real single-cell RNA-sequencing data. For the latter, we show that in contrast to traditional methods, our approach is robust to variability in technical noise levels across cell types.
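A doubly stochastic normalization of a symmetric kernel matrix $K$ seeks positive scaling factors $d$ such that $W = \mathrm{diag}(d)\, K\, \mathrm{diag}(d)$ has all row and column sums equal to one. The sketch below is one simple way to compute such factors, using a damped Sinkhorn-type fixed-point iteration; it is not claimed to be the procedure used in the paper. The kernel parametrization, the treatment of the diagonal, the bandwidth, and the function names are all assumptions made for illustration.

```python
import math
import random
from typing import List


def gaussian_kernel(points: List[List[float]], bandwidth: float) -> List[List[float]]:
    """K[i][j] = exp(-||x_i - x_j||^2 / bandwidth) (assumed parametrization)."""
    n = len(points)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                      / bandwidth)
             for j in range(n)]
            for i in range(n)]


def doubly_stochastic_scaling(K: List[List[float]], iters: int = 500) -> List[float]:
    """Find d > 0 such that d[i] * (K d)[i] = 1 for every i.

    Uses the damped update d <- sqrt(d / (K d)), whose fixed points
    satisfy exactly this condition.
    """
    n = len(K)
    d = [1.0] * n
    for _ in range(iters):
        Kd = [sum(K[i][j] * d[j] for j in range(n)) for i in range(n)]
        d = [math.sqrt(d[i] / Kd[i]) for i in range(n)]
    return d


# Hypothetical toy data: 20 random points in three dimensions.
random.seed(0)
points = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20)]
K = gaussian_kernel(points, bandwidth=2.0)
d = doubly_stochastic_scaling(K)

# Scaled affinity matrix; it is symmetric, so row sums equal column sums.
W = [[d[i] * K[i][j] * d[j] for j in range(20)] for i in range(20)]
row_sums = [sum(row) for row in W]
```

Unlike a row-stochastic normalization, which divides each row by its own sum and generally breaks symmetry, this scaling applies the same factor to a point's row and column, so the normalized affinity matrix stays symmetric.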