We propose a method of sufficient dimension reduction for functional data using distance covariance. We consider the case where the response variable is a scalar but the predictor is a random function. Our method has several advantages. It requires very mild conditions on the predictor, unlike existing methods, which require the restrictive linear conditional mean and constant covariance assumptions. It also avoids the inverse of the covariance operator, which is not bounded. The link function between the response and the predictor can be arbitrary, and our method retains the model-free advantage of not requiring estimation of the link function. Moreover, our method is naturally applicable to sparse longitudinal data. We use functional principal component analysis with truncation as the regularization mechanism in the development. The validity of the proposed method is justified, and under some regularity conditions the statistical consistency of our estimator is established. Simulation studies and a real data analysis are also provided to demonstrate the performance of our method.
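For illustration only (not the authors' implementation), the sample distance covariance at the core of the method can be sketched in a few lines of Python; the truncated FPCA scores below are simulated stand-ins, and the optimization over projection directions used in the actual proposal is omitted.

```python
import numpy as np

def dist_cov_sq(x, y):
    """Squared sample distance covariance between samples x (n x p) and y (n x q)."""
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)          # pairwise distances in x
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)          # pairwise distances in y
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()   # double centering
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(0)
n, d = 200, 5
scores = rng.normal(size=(n, d))                        # stand-in for truncated FPCA scores
y = np.sin(scores[:, 0]) + 0.1 * rng.normal(size=n)     # response depends on one direction only
print(dist_cov_sq(scores[:, [0]], y[:, None]))          # informative direction: clearly positive
print(dist_cov_sq(scores[:, [4]], y[:, None]))          # uninformative direction: near zero
```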
We extend Robins' theory of causal inference for complex longitudinal data to the case of continuously varying as opposed to discrete covariates and treatments. In particular we establish versions of the key results of the discrete theory: the g-computation formula and a collection of powerful characterizations of the g-null hypothesis of no treatment effect. This is accomplished under natural continuity hypotheses concerning the conditional distributions of the outcome variable and of the covariates given the past. We also show that our assumptions concerning counterfactual variables place no restriction on the joint distribution of the observed variables: thus in a precise sense, these assumptions are "for free," or if you prefer, harmless.
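For orientation, the g-computation formula of the discrete theory being extended can be stated as follows (notation chosen here for exposition; for a deterministic treatment plan $g$ with covariates $L_0,\dots,L_K$, treatments $A_0,\dots,A_K$, and outcome $Y$):
$$
\Pr\{Y \le y \mid g\} \;=\; \sum_{\bar l_K} \Pr\{Y \le y \mid \bar L_K = \bar l_K,\ \bar A_K = \bar g_K(\bar l_K)\}\ \prod_{k=0}^{K} \Pr\{L_k = l_k \mid \bar L_{k-1} = \bar l_{k-1},\ \bar A_{k-1} = \bar g_{k-1}(\bar l_{k-1})\},
$$
where bars denote histories; the continuous-covariate version developed in the paper replaces the sum with integrals against the conditional distributions of the covariates given the past.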
Motivated by a recent literature on the double-descent phenomenon in machine learning, we consider highly over-parametrized models in causal inference, including synthetic control with many control units. In such models, there may be so many free parameters that the model fits the training data perfectly. As a motivating example, we first investigate high-dimensional linear regression for imputing wage data, where we find that models with many more covariates than sample size can outperform simple ones. As our main contribution, we document the performance of high-dimensional synthetic control estimators with many control units. We find that adding control units can help improve imputation performance even beyond the point where the pre-treatment fit is perfect. We then provide a unified theoretical perspective on the performance of these high-dimensional models. Specifically, we show that more complex models can be interpreted as model-averaging estimators over simpler ones, which we link to an improvement in average performance. This perspective yields concrete insights into the use of synthetic control when control units are many relative to the number of pre-treatment periods.
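As a toy illustration of over-parametrized interpolation (ours, not the paper's experiments), the minimum-norm least-squares fit interpolates the training data once the number of covariates $p$ exceeds the sample size $n$, yet its test error can improve again as $p$ grows; the data-generating process below is hypothetical and the exact shape of the error curve depends on it.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test = 50, 500

def simulate(p, n):
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p) / np.sqrt(p)        # dense, weak signal
    return X, X @ beta + rng.normal(size=n)

for p in [10, 40, 50, 200, 1000]:                 # from under- to over-parametrized
    X, y = simulate(p, n_train + n_test)
    Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    beta_hat = np.linalg.pinv(Xtr) @ ytr          # minimum-norm least squares
    train_mse = np.mean((Xtr @ beta_hat - ytr) ** 2)   # zero once p >= n_train
    test_mse = np.mean((Xte @ beta_hat - yte) ** 2)    # typically spikes near p = n_train
    print(f"p={p:5d}  train MSE={train_mse:7.3f}  test MSE={test_mse:7.3f}")
```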
In non-life insurance, it is essential to understand the serial dynamics and dependence structure of longitudinal insurance data before using them. The existing actuarial literature primarily focuses on modeling, typically assuming an absence of serial dynamics and a pre-specified dependence structure of claims across multiple years. To fill this research gap, we develop two diagnostic tests, namely a serial dynamic test and a correlation test, to assess the appropriateness of these assumptions and provide justifiable modeling directions. The tests involve the following ingredients: i) computing the change in the cross-sectional parameter estimates under a logistic regression model and the empirical residual correlations of the claim occurrence indicators across time, which serve as indicators of serial dynamics; ii) quantifying estimation uncertainty using the randomly weighted bootstrap approach; iii) developing asymptotic theory to construct proper test statistics. The proposed tests are examined on simulated data and applied to two non-life insurance datasets, revealing that the two datasets behave differently.
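For illustration only (not the authors' code), ingredient ii) can be sketched as follows: refit a cross-sectional logistic regression for claim occurrence under independent Exponential(1) weights and use the spread of the refitted coefficients to quantify estimation uncertainty. The data below are simulated placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p, B = 1000, 3, 200

# Placeholder cross-section: covariates and claim-occurrence indicators for one policy year.
X = rng.normal(size=(n, p))
probs = 1 / (1 + np.exp(-(0.5 + X @ np.array([0.8, -0.4, 0.0]))))
y = rng.binomial(1, probs)

coefs = np.empty((B, p + 1))
for b in range(B):
    w = rng.exponential(scale=1.0, size=n)                       # random weights with mean one
    fit = LogisticRegression(C=1e6).fit(X, y, sample_weight=w)   # large C: essentially no penalty
    coefs[b] = np.concatenate([fit.intercept_, fit.coef_.ravel()])

print(coefs.std(axis=0, ddof=1))   # bootstrap standard errors for (intercept, beta_1, ..., beta_p)
```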
This article focuses on the study of lactating sows, where the main interest is the influence of temperature, measured throughout the day, on the lower quantiles of daily feed intake. We outline a model framework and estimation methodology for quantile regression in scenarios with longitudinal data and functional covariates. The quantile regression model uses a time-varying regression coefficient function to quantify the association between covariates and the quantile level of interest, and it includes subject-specific intercepts to incorporate within-subject dependence. Estimation relies on spline representations of the unknown coefficient functions and can be carried out with existing software. We introduce bootstrap procedures for bias adjustment and computation of standard errors. Analysis of the lactation data indicates, among other things, that the influence of temperature increases during the lactation period.
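A minimal sketch (simulated data, no subject-specific intercepts, and an artificially simple covariate, so not the paper's model) of how a time-varying coefficient can be represented with a B-spline basis and fitted with existing quantile-regression software:

```python
import numpy as np
import patsy
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(3)
n_days, n_sows = 21, 30

# Hypothetical longitudinal data: lactation day, daily temperature, and feed intake per sow.
day = np.tile(np.arange(1, n_days + 1), n_sows)
temp = rng.normal(22, 4, size=day.size)
intake = 6 - 0.05 * (day / n_days) * (temp - 22) + rng.gamma(2, 0.3, size=day.size)

# B-spline basis for the time-varying coefficient beta(day) of temperature.
basis = np.asarray(patsy.dmatrix("bs(day, df=4, degree=3) - 1", {"day": day}))
design = np.column_stack([np.ones(day.size), basis * temp[:, None]])

fit = QuantReg(intake, design).fit(q=0.1)     # 10th percentile of daily feed intake
beta_hat = basis @ fit.params[1:]             # estimated beta(day) at the observed days
print(beta_hat[day == 1][0], beta_hat[day == n_days][0])   # effect early vs. late in lactation
```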
Optimal estimation and inference for both the minimizer and minimum of a convex regression function under the white noise and nonparametric regression models are studied in a non-asymptotic local minimax framework, where the performance of a procedure is evaluated at individual functions. Fully adaptive and computationally efficient algorithms are proposed and sharp minimax lower bounds are given for both the estimation accuracy and expected length of confidence intervals for the minimizer and minimum. The non-asymptotic local minimax framework brings out new phenomena in simultaneous estimation and inference for the minimizer and minimum. We establish a novel Uncertainty Principle that provides a fundamental limit on how well the minimizer and minimum can be estimated simultaneously for any convex regression function. A similar result holds for the expected length of the confidence intervals for the minimizer and minimum.
Information design in an incomplete information game involves a designer who aims to influence players' actions through signals generated from a designed probability distribution so that its objective function is optimized. We consider a setting in which the designer has only partial knowledge of the agents' utilities. We address the uncertainty about players' preferences by formulating a robust information design problem against worst-case payoffs. When the players have quadratic payoffs that depend on the players' actions and an unknown payoff-relevant state, and the signals about the state follow a Gaussian distribution conditional on the state realization, the information design problem under quadratic design objectives is a semidefinite program (SDP). Specifically, we consider ellipsoid perturbations of the payoff coefficients in linear-quadratic-Gaussian (LQG) games. We show that this leads to a tractable robust SDP formulation. Numerical studies are carried out to identify the relation between the perturbation levels and the optimal information structures.
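A minimal single-agent illustration (a toy of our own, not the paper's robust formulation) of why LQG information design with a quadratic objective becomes an SDP: with agent payoff $-\tfrac{1}{2}a^2 + a\theta$, the obedient equilibrium action is $a = E[\theta \mid \text{signal}]$, so the design variable is the covariance matrix of $(a, \theta)$ and both the objective and the constraints are linear in it apart from positive semidefiniteness. The objective weights below are arbitrary placeholders.

```python
import cvxpy as cp

sigma2 = 1.0   # prior variance of the payoff-relevant state (placeholder)
lam = 0.5      # designer's penalty on action dispersion (placeholder)

# X = Cov((a, theta)) induced by the information structure.
X = cp.Variable((2, 2), symmetric=True)
constraints = [
    X >> 0,                 # X must be a valid covariance matrix
    X[1, 1] == sigma2,      # the state's variance is pinned down by the prior
    X[0, 0] == X[0, 1],     # obedience: a = E[theta | signal] implies Var(a) = Cov(a, theta)
]
objective = cp.Maximize(2 * X[0, 1] - lam * X[0, 0])   # quadratic designer objective, linear in X
cp.Problem(objective, constraints).solve()
print(X.value)   # here full disclosure (Var(a) = Cov(a, theta) = sigma2) is optimal
```

The robust problem in the paper additionally takes the worst case over ellipsoid perturbations of the payoff coefficients, which the authors show still admits a tractable SDP formulation.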
Assessing causal effects in the presence of unmeasured confounding is a challenging problem. Although auxiliary variables, such as instrumental variables, are commonly used to identify causal effects, they are often unavailable in practice due to stringent and untestable conditions. To address this issue, previous studies have utilized linear structural equation models to show that the causal effect can be identified when the noise variables of the treatment and outcome are both non-Gaussian. In this paper, we investigate the problem of identifying the causal effect using auxiliary covariates and non-Gaussianity from the treatment. Our key idea is to characterize the impact of unmeasured confounders using an observed covariate, assuming the confounders are all Gaussian. The auxiliary covariate can be an invalid instrument or an invalid proxy variable. We demonstrate that the causal effect can be identified using this measured covariate, even when the only source of non-Gaussianity comes from the treatment. We then extend the identification results to the multi-treatment setting and provide sufficient conditions for identification. Based on our identification results, we propose a simple and efficient procedure for calculating causal effects and show the $\sqrt{n}$-consistency of the proposed estimator. Finally, we evaluate the performance of our estimator through simulation studies and an application.
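For concreteness (a hypothetical data-generating process of our own, not the paper's estimator), the setting can be pictured as follows: the unmeasured confounder and the outcome noise are Gaussian, the auxiliary covariate is an invalid proxy of the confounder, and the only non-Gaussian noise enters the treatment; naive regression adjustment is then biased.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

u = rng.normal(size=n)                        # unmeasured Gaussian confounder
w = 0.7 * u + rng.normal(scale=0.5, size=n)   # observed auxiliary covariate (invalid proxy)
eps_t = rng.uniform(-1, 1, size=n)            # non-Gaussian treatment noise (the key ingredient)
t = 0.8 * u + 0.3 * w + eps_t                 # treatment
y = 1.5 * t + 1.0 * u + rng.normal(size=n)    # outcome with Gaussian noise; true effect is 1.5

# Naive OLS of y on (1, t, w) remains biased because u is unmeasured.
X = np.column_stack([np.ones(n), t, w])
print(np.linalg.lstsq(X, y, rcond=None)[0])   # coefficient on t is typically away from 1.5
```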
Current methods for pattern analysis in time series mainly rely on statistical features or probabilistic learning and inference methods to identify patterns and trends in the data. Such methods do not generalize well when applied to multivariate, multi-source, state-varying, and noisy time-series data. To address these issues, we propose a highly generalizable method that uses information theory-based features to identify and learn from patterns in multivariate time-series data. To demonstrate the proposed approach, we analyze pattern changes in human activity data. For applications with stochastic state transitions, features are developed based on the Shannon entropy, entropy rate, entropy production, and von Neumann entropy of Markov chains. For applications where state modeling is not applicable, we utilize five entropy variants: approximate entropy, increment entropy, dispersion entropy, phase entropy, and slope entropy. The results show that the proposed information theory-based features improve the recall rate, F1 score, and accuracy by up to 23.01% on average compared with the baseline models, while using a simpler model structure with an average 18.75-fold reduction in the number of model parameters.
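For illustration only (not the authors' feature set or data), two of the Markov-chain features can be computed from a discrete activity-state sequence as follows; the state sequence below is a simulated placeholder.

```python
import numpy as np

def markov_entropy_features(states, n_states):
    """Shannon entropy of the empirical state distribution and entropy rate of an
    empirical first-order Markov chain fitted to a discrete state sequence (in bits)."""
    states = np.asarray(states)
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)   # row-normalized transitions
    pi = np.bincount(states, minlength=n_states) / len(states)      # empirical state distribution
    H_pi = -np.nansum(pi * np.log2(np.where(pi > 0, pi, np.nan)))
    H_rate = -np.nansum(pi[:, None] * P * np.log2(np.where(P > 0, P, np.nan)))
    return H_pi, H_rate

rng = np.random.default_rng(5)
seq = rng.choice(3, size=1000, p=[0.5, 0.3, 0.2])   # stand-in for discretized activity labels
print(markov_entropy_features(seq, n_states=3))
```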
Understanding epistasis (genetic interaction) may shed some light on the genomic basis of common diseases, including disorders of major interest due to their high socioeconomic burden, such as schizophrenia. Distance correlation is an association measure that characterises general statistical independence between random variables, not merely the absence of linear association. Here, we propose distance correlation as a novel tool for the detection of epistasis from case-control data of single-nucleotide polymorphisms (SNPs). On the methodological side, we highlight the derivation of the explicit asymptotic distribution of the test statistic. We show that this is the only way to obtain enough computational speed for the method to be used in practice, in a scenario where the resampling techniques found in the literature are impractical. Our simulations show satisfactory calibration of significance, as well as comparable or better power than existing methodology. We conclude with the application of our technique to a schizophrenia genetics dataset, obtaining biologically sound insights.
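For illustration only (and not the paper's test), the statistic itself is simple to compute; the permutation p-value below is exactly the kind of resampling whose cost the derived asymptotic null distribution is meant to avoid. The genotype coding and the toy SNP pair are placeholders.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two one-dimensional samples."""
    a = np.abs(np.subtract.outer(x, x)).astype(float)
    b = np.abs(np.subtract.outer(y, y)).astype(float)
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()   # double centering
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = max((A * B).mean(), 0.0)
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

rng = np.random.default_rng(6)
n = 500
snp1 = rng.choice(3, size=n, p=[0.49, 0.42, 0.09])                 # genotypes coded 0/1/2
snp2 = (snp1 + rng.choice(3, size=n, p=[0.6, 0.3, 0.1])) % 3       # weakly dependent second SNP

obs = distance_correlation(snp1, snp2)
perm = np.array([distance_correlation(snp1, rng.permutation(snp2)) for _ in range(500)])
print(obs, (1 + np.sum(perm >= obs)) / (1 + len(perm)))            # permutation p-value
```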
We introduce a new computational framework for estimating parameters in generalized generalized linear models (GGLM), a class of models that extends the popular generalized linear models (GLM) to account for dependencies among observations in spatio-temporal data. The proposed approach uses a monotone operator-based variational inequality method to overcome non-convexity in parameter estimation and to provide guarantees for parameter recovery. The results apply to both GLM and GGLM, with a focus on spatio-temporal models. We also present online instance-based bounds using martingale concentration inequalities. Finally, we demonstrate the performance of the algorithm using numerical simulations and a real data example on wildfire incidents.
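As a schematic sketch only (not the paper's algorithm or its guarantees), the variational-inequality viewpoint can be illustrated for a plain Poisson GLM: parameter recovery is posed in terms of a monotone moment operator over a convex constraint set, and a projected fixed-point iteration solves it without working with the likelihood surface directly.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 2000, 4
X = rng.normal(scale=0.3, size=(n, p))
theta_true = np.array([0.5, -0.3, 0.2, 0.0])
y = rng.poisson(np.exp(X @ theta_true))

def F(theta):
    """Monotone operator: empirical moment condition of a Poisson GLM with log link."""
    return X.T @ (np.exp(X @ theta) - y) / n

# Projected iterations for the VI on the ball {||theta||_2 <= R}.
R, step, theta = 2.0, 0.5, np.zeros(p)
for _ in range(2000):
    theta = theta - step * F(theta)
    norm = np.linalg.norm(theta)
    if norm > R:
        theta *= R / norm          # projection back onto the constraint set
print(theta_true)
print(theta)                       # close to theta_true in this toy example
```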