The model-X knockoffs framework provides a flexible tool for achieving finite-sample false discovery rate (FDR) control in variable selection in arbitrary dimensions without assuming any dependence structure of the response on the covariates. It also completely bypasses the use of conventional p-values, making it especially appealing in high-dimensional nonlinear models. Existing works have focused on the setting of independent and identically distributed observations, yet time series data are prevalent in practical applications in fields such as economics and the social sciences. This motivates the study of model-X knockoffs inference for time series data. In this paper, we make an initial attempt to establish the theoretical and methodological foundations of model-X knockoffs inference for time series data. We propose the method of time series knockoffs inference (TSKI), which exploits the ideas of subsampling and e-values to address the difficulty caused by serial dependence. We also generalize robust knockoffs inference to the time series setting, relaxing the assumption of a known covariate distribution required by model-X knockoffs, since such an assumption is overly stringent for time series data. We establish sufficient conditions under which TSKI achieves asymptotic FDR control. Our technical analysis reveals the effects of serial dependence and of the unknown covariate distribution on FDR control. We conduct a power analysis of TSKI using the Lasso coefficient difference knockoff statistic under linear time series models. The finite-sample performance of TSKI is illustrated with several simulation examples and an economic inflation study.
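To make the subsample-and-aggregate idea concrete, the following is a minimal Python sketch of how knockoff selections from several subsamples could be combined through e-values and the e-BH procedure. The construction of the subsamples and of the knockoff statistics for serially dependent data, which is the core of TSKI, is not reproduced here, and the e-value formula follows the standard derandomized-knockoffs construction rather than the authors' exact algorithm.

```python
# A minimal sketch (not the exact TSKI algorithm) of the subsample-then-aggregate
# idea: run knockoffs on each subsample, convert selections to e-values,
# average them, and apply the e-BH procedure.
import numpy as np

def knockoff_threshold(W, fdr):
    """Knockoffs+ threshold for statistics W at nominal level fdr."""
    ts = np.sort(np.abs(W[W != 0]))
    for t in ts:
        if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= fdr:
            return t
    return np.inf

def knockoff_evalues(W, fdr):
    """e-values from one knockoff run (standard derandomized-knockoffs form)."""
    p = len(W)
    t = knockoff_threshold(W, fdr)
    return p * (W >= t) / (1 + np.sum(W <= -t))

def ebh(e, alpha):
    """e-BH: reject the k largest e-values with e_[k] >= p / (k * alpha)."""
    p = len(e)
    order = np.argsort(-e)
    k = 0
    for i, j in enumerate(order, start=1):
        if e[j] >= p / (i * alpha):
            k = i
    return set(order[:k])

# W_subsamples: list of knockoff statistic vectors, one per subsample
# (how subsamples are formed under serial dependence is not shown here).
def tski_like_selection(W_subsamples, alpha=0.2):
    E = np.mean([knockoff_evalues(W, alpha) for W in W_subsamples], axis=0)
    return ebh(E, alpha)
```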
We investigate the robustness of the model-X knockoffs framework with respect to a misspecified or estimated feature distribution. We do so by theoretically studying the feature selection performance of a practically implemented knockoffs algorithm, which we name the approximate knockoffs (ARK) procedure, under the false discovery rate (FDR) and familywise error rate (FWER) measures. The approximate knockoffs procedure differs from the model-X knockoffs procedure only in that the former uses the misspecified or estimated feature distribution. A key technique in our theoretical analyses is to couple the approximate knockoffs procedure with the model-X knockoffs procedure so that the random variables in the two procedures are close in realizations. We prove that if such a coupled model-X knockoffs procedure exists, the approximate knockoffs procedure achieves asymptotic FDR or FWER control at the target level. We showcase three specific constructions of such coupled model-X knockoff variables, verifying their existence and justifying the robustness of the model-X knockoffs framework.
We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign flips and coordinate-wise permutations of the input data stream and include the $L_p$ norms, $k$-support norms, top-$k$ norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a set of sufficient statistics from which approximations of all symmetric norms can be computed simultaneously. Our framework partitions the coordinates of the underlying frequency vector into levels based on their magnitude, releasing approximate frequencies for the "heavy" coordinates and approximate level sizes for the "light" coordinates in the important levels. Surprisingly, our mechanism allows an arbitrary number of symmetric norm approximations to be released without any overhead or additional loss in privacy. Moreover, our mechanism provides a $(1+\alpha)$-approximation to each symmetric norm and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters.
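As a simplified, offline illustration of the level-set idea (not the paper's streaming mechanism, and with a purely illustrative noise calibration), the sketch below buckets the coordinates of a frequency vector by magnitude, releases noisy values for the heavy coordinates and noisy bucket sizes for the light ones, and then evaluates arbitrary symmetric norms on a surrogate vector built from that single released summary.

```python
# Simplified, non-streaming illustration of the level-set decomposition;
# the heavy threshold, level boundaries, and Laplace noise scales are
# assumptions for the sketch, not the paper's calibration.
import numpy as np

def release_level_sets(freq, heavy_threshold, eps, rng):
    """Return (noisy heavy entries, noisy level sizes) computed from |freq|."""
    mag = np.abs(freq)
    heavy = {i: mag[i] + rng.laplace(scale=2.0 / eps)
             for i in np.flatnonzero(mag >= heavy_threshold)}
    light = mag[mag < heavy_threshold]
    levels = np.floor(np.log2(np.maximum(light, 1))).astype(int)  # buckets [2^k, 2^(k+1))
    sizes = {k: np.sum(levels == k) + rng.laplace(scale=2.0 / eps)
             for k in np.unique(levels)}
    return heavy, sizes

def approximate_norm(heavy, sizes, norm):
    """Evaluate a symmetric norm on a surrogate vector built from the summary."""
    surrogate = list(heavy.values())
    for k, n_k in sizes.items():
        surrogate += [2.0 ** k] * max(int(round(n_k)), 0)  # one representative value per level
    return norm(np.abs(np.array(surrogate)))

rng = np.random.default_rng(0)
freq = rng.poisson(3.0, size=1000).astype(float)
heavy, sizes = release_level_sets(freq, heavy_threshold=8, eps=1.0, rng=rng)
# the same released summary serves many norms, e.g. L2 and top-10:
l2 = approximate_norm(heavy, sizes, lambda v: np.sqrt(np.sum(v ** 2)))
top10 = approximate_norm(heavy, sizes, lambda v: np.sum(np.sort(v)[-10:]))
```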
In many applications in biology, engineering, and economics, identifying similarities and differences between distributions of data from complex processes requires comparing finite categorical samples of discrete counts. Statistical divergences quantify the difference between two distributions, but their estimation is very difficult, and empirical methods often fail, especially when the samples are small. We develop a Bayesian estimator of the Kullback-Leibler divergence between two probability distributions that uses a mixture of Dirichlet priors on the distributions being compared. We study the properties of the estimator on two examples: probabilities drawn from Dirichlet distributions, and random strings of letters drawn from Markov chains. We extend the approach to the squared Hellinger divergence. Both estimators outperform other estimation techniques, with better results for data with a large number of categories and for larger divergences.
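The Bayesian idea can be conveyed by a much simpler sketch that uses a single symmetric Dirichlet prior (with concentration parameter a) rather than the paper's mixture of Dirichlet priors: draw the two distributions from their Dirichlet posteriors and average the divergence over the posterior draws.

```python
# Minimal Bayesian sketch with a single symmetric Dirichlet prior
# (a simplification; the paper uses a mixture of Dirichlet priors).
import numpy as np

def bayes_kl(counts_p, counts_q, a=0.5, n_draws=2000, rng=None):
    """Posterior mean and spread of KL(P || Q) from two count vectors."""
    rng = rng or np.random.default_rng(0)
    post_p = rng.dirichlet(np.asarray(counts_p) + a, size=n_draws)
    post_q = rng.dirichlet(np.asarray(counts_q) + a, size=n_draws)
    kl = np.sum(post_p * (np.log(post_p) - np.log(post_q)), axis=1)
    return kl.mean(), kl.std()

# example: two small categorical samples over 6 categories
est, sd = bayes_kl([5, 0, 2, 1, 0, 3], [2, 3, 1, 0, 4, 1])
```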
Approximating significance scans of searches for new particles in high-energy physics experiments as Gaussian fields is a well-established way to estimate the trials factors required to quantify global significances. We propose a novel, highly efficient method to estimate the covariance matrix of such a Gaussian field. The method is based on a linear approximation of the statistical fluctuations of the signal amplitude. For one-dimensional searches, an upper bound on the trials factor can then be calculated directly from the covariance matrix. For higher dimensions, the Gaussian process described by this covariance matrix may be sampled to calculate the trials factor directly. The method also serves as the theoretical basis for a recent study of the trials factor based on an empirically constructed set of Asimov-like background datasets. We illustrate the method with studies of an $H \rightarrow \gamma\gamma$ inspired model that was used in that empirical study.
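The sampling step can be illustrated with a short sketch: given an already estimated covariance matrix of the Gaussian significance field over the scan grid, draw realizations of the field and estimate the trials factor as the ratio of the global exceedance probability to the local one at a chosen significance threshold. The covariance model and threshold below are illustrative assumptions, not outputs of the proposed estimation method.

```python
# Hedged sketch of the sampling step, assuming the covariance matrix C of the
# Gaussian significance field has already been estimated.
import numpy as np
from scipy.stats import norm

def trials_factor(C, z_local=3.0, n_draws=100_000, rng=None):
    """Monte Carlo trials factor: P(global max > z) / P(local > z)."""
    rng = rng or np.random.default_rng(0)
    L = np.linalg.cholesky(C + 1e-10 * np.eye(len(C)))   # jitter for numerical stability
    fields = rng.standard_normal((n_draws, len(C))) @ L.T
    p_global = np.mean(fields.max(axis=1) > z_local)
    p_local = norm.sf(z_local)
    return p_global / p_local

# toy example: exponentially decaying correlations along a 1D scan
x = np.linspace(0, 1, 50)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.1)
tf = trials_factor(C)
```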
Variable screening is a useful tool for analyzing ultrahigh-dimensional data. When the predictors include variables that are marginally dependent on the response as well as variables that are only jointly dependent on it, existing methods such as conditional screening and iterative screening suffer, respectively, from instability with respect to the choice of the conditioning set and from a heavy computational burden. In this article, we propose a new independence measure, named conditional martingale difference divergence (CMDH), that can be treated as either a conditional or a marginal independence measure. Under regularity conditions, we show that the sure screening property of CMDH holds for both marginally and jointly active variables. Based on this measure, we propose a kernel-based, model-free variable screening method that is efficient, flexible, and stable against high correlation among the predictors and heterogeneity of the response. In addition, we provide a data-driven method for selecting the conditioning set. Simulations and real data applications demonstrate the superior performance of the proposed method.
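As a rough illustration of the screening step only, the sketch below ranks predictors by the sample (unconditional) martingale difference divergence and keeps the top d variables; the conditional measure CMDH, its kernel construction, and the data-driven choice of the conditioning set are not reproduced here.

```python
# Rough sure-independence-style screening sketch using the sample
# (unconditional) martingale difference divergence as a stand-in ranking
# measure; not the paper's CMDH-based procedure.
import numpy as np

def mdd_sq(y, x):
    """Sample martingale difference divergence MDD(y | x)^2 (V-statistic form)."""
    yc = y - y.mean()
    d = np.abs(x[:, None] - x[None, :])          # pairwise |x_i - x_j|
    return -np.mean(np.outer(yc, yc) * d)

def screen(X, y, d):
    """Keep the indices of the d predictors with the largest MDD^2 with y."""
    scores = np.array([mdd_sq(y, X[:, j]) for j in range(X.shape[1])])
    return np.argsort(-scores)[:d]

rng = np.random.default_rng(1)
n, p = 200, 2000
X = rng.standard_normal((n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
kept = screen(X, y, d=int(n / np.log(n)))
```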
We introduce a methodology for high-dimensional causal inference in a time series context. It is assumed that there is a monotonic transformation of the data such that the dynamics of the transformed variables are described by a Gaussian vector autoregressive process; this is tantamount to assuming that the dynamics are captured by a Gaussian copula. No knowledge or estimation of the marginal distributions of the data is required. Under sparsity conditions, the procedure consistently identifies the parameters that describe the dynamics of the process and the conditional causal relations among the possibly high-dimensional variables. The methodology allows us to identify such causal relations in the form of a directed acyclic graph. As illustrative applications, we consider the impact of supply-side oil shocks on the economy and the causal relations between aggregated variables constructed from the limit order book for four stock constituents of the S&P 500.
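A heavily simplified version of this pipeline can be sketched as follows: map each series to normal scores through its empirical ranks (the monotone transformation implied by the Gaussian copula assumption) and fit a sparse first-order VAR with the lasso. The lasso and the first-order specification are stand-ins for illustration; the paper's estimator and its graphical causal identification are not reproduced here.

```python
# Simplified sketch: rank-based Gaussianization followed by a sparse VAR(1)
# fitted with the lasso; nonzero coefficients give only a rough, Granger-type
# proxy for the directed relations identified in the paper.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def normal_scores(X):
    """Map each column to Gaussian quantiles of its empirical ranks."""
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    return norm.ppf(ranks / (n + 1))

def sparse_var1(Z, alpha=0.05):
    """Fit Z_t ~ A Z_{t-1} column by column with the lasso; return A."""
    Y, X = Z[1:], Z[:-1]
    return np.vstack([Lasso(alpha=alpha, fit_intercept=False).fit(X, Y[:, j]).coef_
                      for j in range(Z.shape[1])])

rng = np.random.default_rng(0)
T, p = 500, 20
Z = np.zeros((T, p))
for t in range(1, T):                      # toy data: variable 0 drives variable 1
    Z[t] = 0.3 * Z[t - 1] + rng.standard_normal(p)
    Z[t, 1] += 0.5 * Z[t - 1, 0]
X_obs = np.exp(Z)                          # unknown monotone marginal transformation
A_hat = sparse_var1(normal_scores(X_obs))  # A_hat[1, 0] should be clearly nonzero
```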
We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structures from high-dimensional and noisy observations, where the data are assumed to be sampled from an intrinsically low-dimensional manifold and corrupted by high-dimensional noise. The algorithm employs an adaptive bandwidth selection procedure that does not rely on prior knowledge of the underlying manifold. The resulting low-dimensional embeddings can be further used for downstream purposes such as data visualization, clustering, and prediction. Our method is theoretically justified and practically interpretable. Specifically, we establish the convergence of the final embeddings to their noiseless counterparts when the dimension and the sample size are comparably large, and we characterize the effect of the signal-to-noise ratio on the rate of convergence and the phase transition. We also prove convergence of the embeddings to the eigenfunctions of an integral operator defined by the kernel map of a reproducing kernel Hilbert space capturing the underlying nonlinear structures. Numerical simulations and analyses of three real datasets show the superior empirical performance of the proposed method, compared with many existing methods, in learning various manifolds across diverse applications.
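A bare-bones kernel-spectral embedding can be sketched as follows; the k-nearest-neighbor bandwidth rule below is only a placeholder for the paper's adaptive bandwidth selection, and the toy data are a noisy circle embedded in high dimension.

```python
# Minimal kernel-spectral embedding sketch: Gaussian kernel with a simple
# data-driven bandwidth (kNN-distance rule, standing in for the paper's
# adaptive selection), then the leading eigenvectors of the kernel matrix.
import numpy as np

def kernel_embedding(X, dim=2, k=15):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    h = np.median(np.sort(D, axis=1)[:, k])                      # kNN-based bandwidth
    K = np.exp(-(D / h) ** 2)
    vals, vecs = np.linalg.eigh(K)
    idx = np.argsort(-vals)[1:dim + 1]                           # skip the leading eigenvector
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# noisy samples from a circle embedded in 50 dimensions
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 400)
signal = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ rng.standard_normal((2, 50))
Y = kernel_embedding(signal + 0.1 * rng.standard_normal((400, 50)))
```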
Numerous benchmarks for few-shot learning have been proposed in the last decade. However, all of these benchmarks focus on performance averaged over many tasks, and the question of how to reliably evaluate and tune models trained for individual tasks in this regime has not been addressed. This paper presents the first investigation into task-level evaluation, a fundamental step when deploying a model. We measure the accuracy of performance estimators in the few-shot setting, consider strategies for model selection, and examine the reasons why evaluators usually thought of as robust fail. We conclude that cross-validation with a low number of folds is the best choice for directly estimating the performance of a model, whereas bootstrapping or cross-validation with a large number of folds is better for model selection. Overall, we find that existing few-shot learning benchmarks are not designed in a way that allows a reliable picture of how effectively methods can be used on individual tasks.
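The task-level setting can be made concrete with a small sketch: estimate the accuracy of a few-shot classifier on a single task from its support set alone, using cross-validation with a low number of folds (the strategy the paper finds best for direct performance estimation). The nearest-centroid classifier below is just a placeholder model.

```python
# Task-level evaluation sketch: low-fold cross-validation on the support set
# of a single few-shot task, with a placeholder nearest-centroid classifier.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def nearest_centroid_predict(X_train, y_train, X_test):
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[np.argmin(dists, axis=1)]

def task_level_cv_accuracy(X_support, y_support, n_folds=2, seed=0):
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    accs = []
    for train_idx, val_idx in skf.split(X_support, y_support):
        preds = nearest_centroid_predict(X_support[train_idx], y_support[train_idx],
                                         X_support[val_idx])
        accs.append(np.mean(preds == y_support[val_idx]))
    return float(np.mean(accs))

# a toy 5-way, 10-shot task with 64-dimensional features
rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), 10)
X = rng.standard_normal((50, 64)) + y[:, None]
acc_estimate = task_level_cv_accuracy(X, y, n_folds=2)
```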
Analyzing observational data from multiple sources can increase statistical power to detect a treatment effect; however, practical constraints such as privacy considerations may restrict individual-level information sharing across data sets. This paper develops federated methods that utilize only summary-level information from heterogeneous data sets. Our federated methods provide doubly robust point estimates of treatment effects as well as variance estimates. We derive the asymptotic distributions of our federated estimators, which are shown to be asymptotically equivalent to the corresponding estimators computed from the combined, individual-level data. We show that, to achieve these properties, federated methods should be adjusted based on conditions such as whether the models are correctly specified and stable across the heterogeneous data sets.
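A generic, heavily simplified illustration of the summary-level idea is sketched below: each site computes a doubly robust (AIPW) estimate of the average treatment effect and its variance locally, shares only those two numbers, and the coordinating site combines them by inverse-variance weighting. This is a standard meta-analytic combination serving as a stand-in for the paper's federated estimators and the adjustments they require.

```python
# Generic sketch (not the paper's estimators): local AIPW summaries combined
# across sites by inverse-variance weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def site_aipw_summary(X, a, y):
    """Return the local AIPW point estimate and its estimated variance."""
    ps = LogisticRegression(max_iter=1000).fit(X, a).predict_proba(X)[:, 1]
    mu1 = LinearRegression().fit(X[a == 1], y[a == 1]).predict(X)
    mu0 = LinearRegression().fit(X[a == 0], y[a == 0]).predict(X)
    phi = mu1 - mu0 + a * (y - mu1) / ps - (1 - a) * (y - mu0) / (1 - ps)
    return phi.mean(), phi.var(ddof=1) / len(y)

def federated_estimate(summaries):
    """Inverse-variance-weighted combination of (estimate, variance) pairs."""
    est = np.array([s[0] for s in summaries])
    var = np.array([s[1] for s in summaries])
    w = (1 / var) / np.sum(1 / var)
    return float(np.sum(w * est)), float(1 / np.sum(1 / var))

# toy two-site example with a true treatment effect of 2.0
rng = np.random.default_rng(0)
def make_site(n):
    X = rng.standard_normal((n, 3))
    a = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
    y = 2.0 * a + X @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(n)
    return X, a, y
summaries = [site_aipw_summary(*make_site(n)) for n in (500, 800)]
ate_hat, var_hat = federated_estimate(summaries)
```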
This paper focuses on the expected difference in borrowers' repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and the resulting estimation error can be substantial. We therefore propose an alternative approach to constructing the estimators so that this error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of the classical and proposed estimators in estimating the causal quantities. The comparison is conducted across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under simulated datasets that exhibit different levels of causality, degrees of nonlinearity, and distributional properties. Most importantly, we apply our approach to a large observational dataset provided by a global technology firm that operates in both the e-commerce and lending businesses. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.