This paper presents a general framework for the estimation of regression models with circular covariates, where the conditional distribution of the response given the covariate can be specified through a parametric model. The estimation of a conditional characteristic is carried out nonparametrically, by maximizing the circular local likelihood, and the estimator is shown to be asymptotically normal. The problem of selecting the smoothing parameter is also addressed, as well as bias and variance computation. The performance of the estimation method in practice is studied through an extensive simulation study, where we cover the cases of Gaussian, Bernoulli, Poisson and Gamma distributed responses. The generality of our approach is illustrated with several real-data examples from different fields.
Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and noise level $\sigma^2$, i.e., to quantify how much variation in the response variable can be explained by the covariates, versus how much of the variation is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given with no responses $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, while linear regression is used, no sparsity assumptions on the coefficients are made. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data, whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate the optimal improvement. In the second framework, such optimality does no longer hold and we suggest two zero-estimators that improve the naive estimator although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
Collective Adaptive Systems often consist of many heterogeneous components typically organised in groups. These entities interact with each other by adapting their behaviour to pursue individual or collective goals. In these systems, the distribution of these entities determines a space that can be either physical or logical. The former is defined in terms of a physical relation among components. The latter depends on logical relations, such as being part of the same group. In this context, specification and verification of spatial properties play a fundamental role in supporting the design of systems and predicting their behaviour. For this reason, different tools and techniques have been proposed to specify and verify the properties of space, mainly described as graphs. Therefore, the approaches generally use model spatial relations to describe a form of proximity among pairs of entities. Unfortunately, these graph-based models do not permit considering relations among more than two entities that may arise when one is interested in describing aspects of space by involving interactions among groups of entities. In this work, we propose a spatial logic interpreted on simplicial complexes. These are topological objects, able to represent surfaces and volumes efficiently that generalise graphs with higher-order edges. We discuss how the satisfaction of logical formulas can be verified by a correct and complete model checking algorithm, which is linear to the dimension of the simplicial complex and logical formula. The expressiveness of the proposed logic is studied in terms of the spatial variants of classical bisimulation and branching bisimulation relations defined over simplicial complexes.
Multistate current status (CS) data presents a more severe form of censoring due to the single observation of study participants transitioning through a sequence of well-defined disease states at random inspection times. Moreover, these data may be clustered within specified groups, and informativeness of the cluster sizes may arise due to the existing latent relationship between the transition outcomes and the cluster sizes. Failure to adjust for this informativeness may lead to a biased inference. Motivated by a clinical study of periodontal disease (PD), we propose an extension of the pseudo-value approach to estimate covariate effects on the state occupation probabilities (SOP) for these clustered multistate CS data with informative cluster or intra-cluster group sizes. In our approach, the proposed pseudo-value technique initially computes marginal estimators of the SOP utilizing nonparametric regression. Next, the estimating equations based on the corresponding pseudo-values are reweighted by functions of the cluster sizes to adjust for informativeness. We perform a variety of simulation studies to study the properties of our pseudo-value regression based on the nonparametric marginal estimators under different scenarios of informativeness. For illustration, the method is applied to the motivating PD dataset, which encapsulates the complex data-generation mechanism.
The aim of this paper is to develop estimation and inference methods for the drift parameters of multivariate L\'evy-driven continuous-time autoregressive processes of order $p\in\mathbb{N}$. Starting from a continuous-time observation of the process, we develop consistent and asymptotically normal maximum likelihood estimators. We then relax the unrealistic assumption of continuous-time observation by considering natural discretizations based on a combination of Riemann-sum, finite difference, and thresholding approximations. The resulting estimators are also proven to be consistent and asymptotically normal under a general set of conditions, allowing for both finite and infinite jump activity in the driving L\'evy process. When discretizing the estimators, allowing for irregularly spaced observations is of great practical importance. In this respect, CAR($p$) models are not just relevant for "true" continuous-time processes: a CAR($p$) specification provides a natural continuous-time interpolation for modeling irregularly spaced data - even if the observed process is inherently discrete. As a practically relevant application, we consider the setting where the multivariate observation is known to possess a graphical structure. We refer to such a process as GrCAR and discuss the corresponding drift estimators and their properties. The finite sample behavior of all theoretical asymptotic results is empirically assessed by extensive simulation experiments.
We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
This paper considers the specification of covariance structures with tail estimates. We focus on two aspects: (i) the estimation of the VaR-CoVaR risk matrix in the case of larger number of time series observations than assets in a portfolio using quantile predictive regression models without assuming the presence of nonstationary regressors and; (ii) the construction of a novel variable selection algorithm, so-called, Feature Ordering by Centrality Exclusion (FOCE), which is based on an assumption-lean regression framework, has no tuning parameters and is proved to be consistent under general sparsity assumptions. We illustrate the usefulness of our proposed methodology with numerical studies of real and simulated datasets when modelling systemic risk in a network.
We introduce a multifidelity estimator of covariance matrices formulated as the solution to a regression problem on the manifold of symmetric positive definite matrices. The estimator is positive definite by construction, and the Mahalanobis distance minimized to obtain it possesses properties which enable practical computation. We show that our manifold regression multifidelity (MRMF) covariance estimator is a maximum likelihood estimator under a certain error model on manifold tangent space. More broadly, we show that our Riemannian regression framework encompasses existing multifidelity covariance estimators constructed from control variates. We demonstrate via numerical examples that our estimator can provide significant decreases, up to one order of magnitude, in squared estimation error relative to both single-fidelity and other multifidelity covariance estimators. Furthermore, preservation of positive definiteness ensures that our estimator is compatible with downstream tasks, such as data assimilation and metric learning, in which this property is essential.
Sample reweighting is one of the most widely used methods for correcting the error of least squares learning algorithms in reproducing kernel Hilbert spaces (RKHS), that is caused by future data distributions that are different from the training data distribution. In practical situations, the sample weights are determined by values of the estimated Radon-Nikod\'ym derivative, of the future data distribution w.r.t.~the training data distribution. In this work, we review known error bounds for reweighted kernel regression in RKHS and obtain, by combination, novel results. We show under weak smoothness conditions, that the amount of samples, needed to achieve the same order of accuracy as in the standard supervised learning without differences in data distributions, is smaller than proven by state-of-the-art analyses.
Pre-trained language models (PLMs) serve as backbones for various real-world systems. For high-stake applications, it's equally essential to have reasonable confidence estimations in predictions. While the vanilla confidence scores of PLMs can already be effectively utilized, PLMs consistently become overconfident in their wrong predictions, which is not desirable in practice. Previous work shows that introducing an extra calibration task can mitigate this issue. The basic idea involves acquiring additional data to train models in predicting the confidence of their initial predictions. However, it only demonstrates the feasibility of this kind of method, assuming that there are abundant extra available samples for the introduced calibration task. In this work, we consider the practical scenario that we need to effectively utilize training samples to make PLMs both task-solvers and self-calibrators. Three challenges are presented, including limited training samples, data imbalance, and distribution shifts. We first conduct pilot experiments to quantify various decisive factors in the calibration task. Based on the empirical analysis results, we propose a training algorithm LM-TOAST to tackle the challenges. Experimental results show that LM-TOAST can effectively utilize the training data to make PLMs have reasonable confidence estimations while maintaining the original task performance. Further, we consider three downstream applications, namely selective classification, adversarial defense, and model cascading, to show the practical usefulness of LM-TOAST. The code will be made public at \url{//github.com/Yangyi-Chen/LM-TOAST}.
We consider the problem of discovering $K$ related Gaussian directed acyclic graphs (DAGs), where the involved graph structures share a consistent causal order and sparse unions of supports. Under the multi-task learning setting, we propose a $l_1/l_2$-regularized maximum likelihood estimator (MLE) for learning $K$ linear structural equation models. We theoretically show that the joint estimator, by leveraging data across related tasks, can achieve a better sample complexity for recovering the causal order (or topological order) than separate estimations. Moreover, the joint estimator is able to recover non-identifiable DAGs, by estimating them together with some identifiable DAGs. Lastly, our analysis also shows the consistency of union support recovery of the structures. To allow practical implementation, we design a continuous optimization problem whose optimizer is the same as the joint estimator and can be approximated efficiently by an iterative algorithm. We validate the theoretical analysis and the effectiveness of the joint estimator in experiments.