In practice, functional data are sampled on a discrete set of observation points and are often susceptible to noise. We consider in this paper the setting where such data are used as explanatory variables in a regression problem. If the primary goal is prediction, we show that the gain from embedding the problem into a scalar-on-function regression is limited. Instead we impose a factor model on the predictors and suggest regressing the response on an appropriate number of factor scores. This approach is shown to be consistent under mild technical assumptions, is numerically efficient, and gives good practical performance in both simulations and real data settings.
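As a rough illustration of the factor-score regression described above, the following sketch simulates noisy, discretely sampled curves and regresses the response on a few estimated scores. Estimating the scores by principal components, the function name `fit_factor_regression`, and all simulation settings are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def fit_factor_regression(X, y, k):
    """Regress y on the first k principal-component scores of X.

    Using principal components as the score estimator is an assumption
    of this sketch, not necessarily the estimator used in the paper.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    # Rows of vt are the estimated loading directions.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = vt[:k].T                     # shape (p, k)
    scores = Xc @ loadings                  # shape (n, k)
    coef, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)

    def predict(X_new):
        return y_mean + (X_new - x_mean) @ loadings @ coef

    return predict

# Hypothetical simulation: three smooth factors observed with noise on a grid.
rng = np.random.default_rng(0)
n, p, k = 200, 50, 3
grid = np.linspace(0, 1, p)
basis = np.stack([np.sin((j + 1) * np.pi * grid) for j in range(k)])  # (k, p)
factors = rng.normal(size=(n, k)) * np.array([3.0, 2.0, 1.0])
X = factors @ basis + 0.3 * rng.normal(size=(n, p))
y = factors @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=n)

predict = fit_factor_regression(X[:150], y[:150], k)
mse = np.mean((predict(X[150:]) - y[150:]) ** 2)   # held-out squared error
baseline = np.var(y[150:])                         # error of predicting the mean
```

In this toy setting the number of scores `k` is known; in practice it would have to be chosen from the data.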
We develop methodology for testing hypotheses regarding the slope function in functional linear regression for time series via a reproducing kernel Hilbert space approach. In contrast to most of the literature, which considers tests for the exact nullity of the slope function, we are interested in the null hypothesis that the slope function vanishes only approximately, where deviations are measured with respect to the $L^2$-norm. An asymptotically pivotal test is proposed, which does not require the estimation of nuisance parameters and long-run covariances. The key technical tools to prove the validity of our approach include a uniform Bahadur representation and a weak invariance principle for a sequential process of estimates of the slope function. Both scalar-on-function and function-on-function linear regression are considered and finite-sample methods for implementing our methodology are provided. We also illustrate the potential of our methods by means of a small simulation study and a data example.
In the context of principal components analysis (PCA), the bootstrap is commonly applied to solve a variety of inference problems, such as constructing confidence intervals for the eigenvalues of the population covariance matrix $\Sigma$. However, when the data are high-dimensional, there are relatively few theoretical guarantees that quantify the performance of the bootstrap. Our aim in this paper is to analyze how well the bootstrap can approximate the joint distribution of the leading eigenvalues of the sample covariance matrix $\hat\Sigma$, and we establish non-asymptotic rates of approximation with respect to the multivariate Kolmogorov metric. Under certain assumptions, we show that the bootstrap can achieve the dimension-free rate of ${\tt{r}}(\Sigma)/\sqrt n$ up to logarithmic factors, where ${\tt{r}}(\Sigma)$ is the effective rank of $\Sigma$, and $n$ is the sample size. From a methodological standpoint, our work also illustrates that applying a transformation to the eigenvalues of $\hat\Sigma$ before bootstrapping is an important consideration in high-dimensional settings.
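To make the idea of bootstrapping transformed eigenvalues concrete, here is a minimal sketch that resamples rows, recomputes the leading eigenvalues of the sample covariance matrix, and forms intervals on a transformed scale. The log transformation, the per-eigenvalue (rather than joint) intervals, and the names `top_eigs` and `bootstrap_eig_intervals` are assumptions made for this example; the abstract does not specify which transformation is used.

```python
import numpy as np

def top_eigs(X, k):
    """Leading k eigenvalues of the sample covariance matrix, largest first."""
    S = np.cov(X, rowvar=False)
    return np.linalg.eigvalsh(S)[::-1][:k]

def bootstrap_eig_intervals(X, k, n_boot=500, alpha=0.1, seed=0):
    """Basic bootstrap intervals for the top-k eigenvalues on the log scale."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    log_hat = np.log(top_eigs(X, k))
    diffs = np.empty((n_boot, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        diffs[b] = np.log(top_eigs(X[idx], k)) - log_hat
    lo_q, hi_q = np.quantile(diffs, [alpha / 2, 1 - alpha / 2], axis=0)
    # Invert the bootstrap quantiles on the log scale, then map back.
    return np.exp(log_hat - hi_q), np.exp(log_hat - lo_q)

# Hypothetical data with a decaying spectrum.
rng = np.random.default_rng(1)
spectrum = 1.0 / np.arange(1, 11) ** 2
X = rng.normal(size=(300, 10)) * np.sqrt(spectrum)
lower, upper = bootstrap_eig_intervals(X, k=3, n_boot=200)
```

Working on the log scale keeps the interval endpoints positive, which is one practical reason to transform before bootstrapping.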
When evaluating and comparing models using leave-one-out cross-validation (LOO-CV), the uncertainty of the estimate is typically assessed using the variance of the sampling distribution. Considering the uncertainty is important, as the variability of the estimate can be high in some cases. An important result by Bengio and Grandvalet (2004) states that no general unbiased variance estimator can be constructed that would apply to any utility or loss measure and any model. We show that it is possible to construct an unbiased estimator for a specific predictive performance measure and model. We demonstrate an unbiased sampling distribution variance estimator for the Bayesian normal model with fixed model variance using the expected log pointwise predictive density (elpd) utility score. This example demonstrates that it is possible to obtain improved, problem-specific, unbiased estimators for assessing the uncertainty in LOO-CV estimation.
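For orientation, the LOO-CV elpd estimate for a normal model with fixed variance can be computed in closed form. The sketch below assumes a flat prior on the mean, so that the leave-one-out predictive distribution for observation $i$ is normal with mean $\bar{y}_{-i}$ and variance $\sigma^2 (1 + 1/(n-1))$; the prior and the function name `loo_elpd_normal` are assumptions of this example. The standard error it returns is the commonly used naive estimate, not the unbiased estimator developed in the paper.

```python
import numpy as np

def loo_elpd_normal(y, sigma):
    """Exact LOO-CV elpd for a normal model with known sigma and a flat prior on the mean."""
    n = y.size
    loo_mean = (y.sum() - y) / (n - 1)            # mean of the other n - 1 points
    pred_var = sigma**2 * (1.0 + 1.0 / (n - 1))   # predictive variance given n - 1 points
    pointwise = (-0.5 * np.log(2 * np.pi * pred_var)
                 - 0.5 * (y - loo_mean) ** 2 / pred_var)
    elpd = pointwise.sum()
    # Naive uncertainty estimate that treats the pointwise terms as independent.
    naive_se = np.sqrt(n * pointwise.var(ddof=1))
    return elpd, naive_se, pointwise

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=40)
elpd, naive_se, pointwise = loo_elpd_normal(y, sigma=2.0)
```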
Pixel Value Ordering (PVO) holds an impressive property for high fidelity Reversible Data Hiding (RDH). In this paper, we introduce a dual-PVO (dPVO) for Prediction Error Expansion (PEE), and thereby develop a new RDH scheme to offer a better rate-distortion performance. In particular, we propose to embed in two phases: forward and backward. In the forward phase, PVO with classic PEE is applied to every non-overlapping image block of size 1x3. In the backward phase, a minimum set and a maximum set of pixels are determined from the pixels predicted in the forward phase. The minimum set contains only the lowest predicted pixels and the maximum set contains the largest predicted pixels of each image block. The proposed dPVO with PEE is then applied to both sets, so that the pixel values of the minimum set are increased and those of the maximum set are decreased by a unit value. Thereby, the pixels predicted in the forward embedding can partially be restored to their original values, resulting in both better embedded-image quality and a higher embedding rate. Experimental results show a promising rate-distortion performance for our scheme, with a significant improvement in embedded image quality at higher embedding rates compared to popular and state-of-the-art PVO-based RDH schemes.
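The forward phase builds on classic PVO-based PEE, which can be sketched for a single 1x3 block as follows. This is a minimal illustration of the classic scheme only: it does not implement the backward dPVO phase, it ignores overflow and underflow at pixel values 0 and 255, and the function names are hypothetical.

```python
import numpy as np

def pvo_embed_block(block, bits):
    """Embed up to two bits into one block; returns (marked block, bits consumed).

    Assumes `bits` holds at least two entries.
    """
    x = np.asarray(block, dtype=np.int64).copy()
    order = np.argsort(x, kind="stable")   # ties are broken by pixel index
    used = 0
    # Maximum side: predict the largest pixel by the second largest.
    i_max, i_ref = order[-1], order[-2]
    e_max = x[i_max] - x[i_ref]
    if e_max == 1:                         # expandable error: carries one bit
        x[i_max] += bits[used]
        used += 1
    elif e_max > 1:                        # shifted to keep the scheme invertible
        x[i_max] += 1
    # Minimum side: predict the smallest pixel by the second smallest.
    i_min, i_ref = order[0], order[1]
    e_min = x[i_min] - x[i_ref]
    if e_min == -1:
        x[i_min] -= bits[used]
        used += 1
    elif e_min < -1:
        x[i_min] -= 1
    return x, used

def pvo_extract_block(marked):
    """Recover the embedded bits and the original block."""
    x = np.asarray(marked, dtype=np.int64).copy()
    order = np.argsort(x, kind="stable")   # embedding preserves this ordering
    bits = []
    i_max, i_ref = order[-1], order[-2]
    e_max = x[i_max] - x[i_ref]
    if e_max in (1, 2):
        bits.append(int(e_max - 1))
        x[i_max] -= e_max - 1
    elif e_max > 2:
        x[i_max] -= 1
    i_min, i_ref = order[0], order[1]
    e_min = x[i_min] - x[i_ref]
    if e_min in (-1, -2):
        bits.append(int(-e_min - 1))
        x[i_min] += -e_min - 1
    elif e_min < -2:
        x[i_min] += 1
    return x, bits

marked, used = pvo_embed_block([10, 12, 11], [1, 0])
restored, recovered = pvo_extract_block(marked)
```

Only the largest and smallest pixels of a block are ever modified, and each by at most one unit, which is why the ordering survives embedding and the block can be restored exactly.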
Two important considerations in clinical research studies are proper evaluations of internal and external validity. While randomized clinical trials can overcome several threats to internal validity, they may be prone to poor external validity. Conversely, large prospective observational studies sampled from a broadly generalizable population may be externally valid, yet susceptible to threats to internal validity, particularly confounding. Thus, methods that address confounding and enhance transportability of study results across populations are essential for internally and externally valid causal inference, respectively. These issues persist for another problem closely related to transportability known as data-fusion. We develop a calibration method to generate balancing weights that address confounding and sampling bias, thereby enabling valid estimation of the target population average treatment effect. We compare the calibration approach to two additional doubly-robust methods that estimate the effect of an intervention on an outcome within a second, possibly unrelated target population. The proposed methodologies can be extended to resolve data-fusion problems that seek to evaluate the effects of an intervention using data from two related studies sampled from different populations. A simulation study is conducted to demonstrate the advantages and similarities of the different techniques. We also test the performance of the calibration approach in a motivating real data example examining whether the effect of biguanides versus sulfonylureas - the two most common oral diabetes medication classes for initial treatment - on all-cause mortality described in a historical cohort applies to a contemporary cohort of US Veterans with diabetes.
We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE) with linear regression models. With the development of machine learning for causal inference, a wide range of large-scale models for causality are gaining attention. One problem is the suspicion that large-scale models are prone to overfitting to observations with sample selection, and hence may not be suitable for causal prediction. In this study, to resolve this suspicion, we investigate the validity of causal inference methods for overparameterized models by applying the recent theory of benign overfitting (Bartlett et al., 2020). Specifically, we consider samples whose distribution switches depending on an assignment rule, and study the prediction of CATE with linear models whose dimension diverges to infinity. We focus on two methods: the T-learner, which is based on the difference between estimators constructed separately for each treatment group, and the inverse probability weight (IPW)-learner, which solves another regression problem approximated by a propensity score. In both methods, the estimator consists of interpolators that fit the samples perfectly. As a result, we show that the T-learner fails to achieve consistency except under random assignment, while the risk of the IPW-learner converges to zero if the propensity score is known. This difference stems from the fact that the T-learner is unable to preserve the eigenspaces of the covariances, which is necessary for benign overfitting in the overparameterized setting. Our result provides new insights into the usage of causal inference methods in the overparameterized setting, in particular, doubly robust estimators.
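The two estimators compared above can be sketched with minimum-norm interpolators in an overparameterized linear model. The sketch assumes the standard transformed-outcome form of the IPW regression, $Y \left( D / e(X) - (1 - D) / (1 - e(X)) \right)$, a known constant propensity score, and hypothetical simulation settings; it shows how the estimators are constructed and does not reproduce the paper's theoretical comparison.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum-norm least-squares solution; interpolates when X has full row rank."""
    return np.linalg.pinv(X) @ y

# Hypothetical overparameterized setting: more features than samples.
rng = np.random.default_rng(1)
n, p = 60, 300
X = rng.normal(size=(n, p))
beta1 = np.zeros(p)
beta1[:3] = [1.0, 0.5, -0.5]      # outcome coefficients under treatment
beta0 = np.zeros(p)
beta0[:3] = [0.2, 0.5, 0.0]       # outcome coefficients under control
propensity = np.full(n, 0.5)      # known propensity score (assumed constant here)
d = rng.binomial(1, propensity)
y = np.where(d == 1, X @ beta1, X @ beta0) + 0.1 * rng.normal(size=n)

# T-learner: difference of interpolators fitted separately on each group.
b1 = min_norm_interpolator(X[d == 1], y[d == 1])
b0 = min_norm_interpolator(X[d == 0], y[d == 0])
tau_t = b1 - b0

# IPW-learner: one interpolator fitted to the propensity-weighted outcome.
y_ipw = y * (d / propensity - (1 - d) / (1 - propensity))
tau_ipw = min_norm_interpolator(X, y_ipw)

# Predicted CATE at a new covariate vector under each method.
x_new = rng.normal(size=p)
cate_t, cate_ipw = x_new @ tau_t, x_new @ tau_ipw
```

The T-learner interpolators are each fitted on a different subsample, whereas the IPW-learner uses all rows of `X` in a single fit.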
Compared to the nominal scale, the ordinal scale for a categorical outcome variable has the property of making a monotonicity assumption for the covariate effects meaningful. This assumption is encoded in the commonly used proportional odds model, but there it is combined with other parametric assumptions such as linearity and additivity. Herein, the considered models are non-parametric and the only condition imposed is that the effects of the covariates on the outcome categories are stochastically monotone according to the ordinal scale. We are not aware of the existence of other comparable multivariable models that would be suitable for inference purposes. We generalize our previously proposed Bayesian monotonic multivariable regression model to ordinal outcomes, and propose an estimation procedure based on reversible jump Markov chain Monte Carlo. The model is based on a marked point process construction, which allows it to approximate arbitrary monotonic regression function shapes, and has a built-in covariate selection property. We study the performance of the proposed approach through extensive simulation studies, and demonstrate its practical application in two real data examples.
This article focuses on the problem of predicting a response variable based on a network-valued predictor. Our motivation is the development of interpretable and accurate predictive models for cognitive traits and neuro-psychiatric disorders based on an individual's brain connection network (connectome). Current methods reduce the complex, high-dimensional brain network into low-dimensional pre-specified features prior to applying standard predictive algorithms. These methods are sensitive to feature choice and inevitably discard important information. Instead, we propose a nonparametric Bayes class of models that utilize the entire adjacency matrix defining brain region connections to adaptively detect predictive algorithms, while maintaining interpretability. The Bayesian Connectomics (BaCon) model class utilizes Poisson-Dirichlet processes to find a lower-dimensional, bidirectional (covariate, subject) pattern in the adjacency matrix. The "small n, large p" problem is transformed into a "small n, small q" problem, facilitating an effective stochastic search of the predictors. A spike-and-slab prior for the cluster predictors strikes a balance between regression model parsimony and flexibility, resulting in improved inferences and test case predictions. We describe basic properties of the BaCon model and develop efficient algorithms for posterior computation. The resulting methods are found to outperform existing approaches and are applied to a creative reasoning data set.
Conventional supervised learning methods, especially deep ones, are found to be sensitive to out-of-distribution (OOD) examples, largely because the learned representation mixes the semantic factor with the variation factor due to their domain-specific correlation, while only the semantic factor causes the output. To address the problem, we propose a Causal Semantic Generative model (CSG) based on causal reasoning so that the two factors are modeled separately, and develop methods for OOD prediction from a single training domain, which is common and challenging. The methods are based on the causal invariance principle, with a novel design for both efficient learning and easy prediction. Theoretically, we prove that under certain conditions, CSG can identify the semantic factor by fitting training data, and this semantic identification guarantees the boundedness of OOD generalization error and the success of adaptation. An empirical study shows improved OOD performance over prevailing baselines.
In this paper we introduce a covariance framework for the analysis of EEG and MEG data that takes into account observed temporal stationarity on small time scales and trial-to-trial variations. We formulate a model for the covariance matrix, which is a Kronecker product of three components that correspond to space, time and epochs/trials, and consider maximum likelihood estimation of the unknown parameter values. An iterative algorithm that finds approximations of the maximum likelihood estimates is proposed. We perform a simulation study to assess the performance of the estimator and investigate the influence of different assumptions about the covariance factors on the estimated covariance matrix and on its components. In addition, we illustrate our method on real EEG and MEG data sets. The proposed covariance model is applicable in a variety of cases where spontaneous EEG or MEG acts as a source of noise and realistic noise covariance estimates are needed for accurate dipole localization, such as in evoked activity studies, or where the properties of spontaneous EEG or MEG are themselves the topic of interest, such as in combined EEG/fMRI experiments in which the correlation between EEG and fMRI signals is investigated.
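The three-way Kronecker structure described above can be illustrated with a small numerical sketch. The factor sizes, the AR(1)-type Toeplitz matrix standing in for a stationary temporal covariance, the compound-symmetric epoch factor, and the ordering epochs ⊗ time ⊗ space are all assumptions made for this example; the iterative maximum likelihood algorithm itself is not shown.

```python
import numpy as np

def ar1_toeplitz(size, rho):
    """Stationary (Toeplitz) correlation matrix with entries rho**|i - j|."""
    idx = np.arange(size)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def kron3(a, b, c):
    """Kronecker product of three matrices."""
    return np.kron(np.kron(a, b), c)

rng = np.random.default_rng(0)
n_space, n_time, n_epoch = 4, 5, 3

A = rng.normal(size=(n_space, n_space))
sigma_space = A @ A.T + n_space * np.eye(n_space)       # generic positive definite factor
sigma_time = ar1_toeplitz(n_time, 0.6)                  # stationary in time
sigma_epoch = 0.3 * np.ones((n_epoch, n_epoch)) + 0.7 * np.eye(n_epoch)

# Full covariance of the vectorized data (60 x 60 in this toy example).
sigma = kron3(sigma_epoch, sigma_time, sigma_space)

# The inverse factorizes too, so only the small factors ever need inverting.
precision = kron3(*(np.linalg.inv(m) for m in (sigma_epoch, sigma_time, sigma_space)))

# Free parameters: unstructured factors versus an unstructured full matrix,
# before accounting for the scale indeterminacy between the factors.
n_free_structured = sum(m * (m + 1) // 2 for m in (n_space, n_time, n_epoch))
n_free_full = sigma.shape[0] * (sigma.shape[0] + 1) // 2
```

The drop in parameter count is what makes estimation feasible from a limited number of trials, and the factorized inverse keeps likelihood evaluations cheap.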