In this paper, new results in random matrix theory are derived that allow us to construct a shrinkage estimator of the global minimum variance (GMV) portfolio when the shrinkage target is a random object. More specifically, the shrinkage target is the holding portfolio estimated from previous data. The theoretical findings are applied to develop a theory for dynamic estimation of the GMV portfolio, where the new estimator of its weights is shrunk toward the holding portfolio at each reconstruction time. Both cases with and without overlapping samples are considered. The non-overlapping case corresponds to the situation where disjoint samples of asset returns are used to construct the traditional estimator of the GMV portfolio weights and to determine the target portfolio, while the overlapping case allows the samples to intersect. The theoretical results are derived under weak assumptions on the data-generating process: no specific distribution is assumed for the asset returns apart from finite moments of order $4+\varepsilon$ for some $\varepsilon>0$, and population covariance matrices with unbounded spectra are allowed. The performance of the new trading strategies is investigated in an extensive simulation study. Finally, the theoretical findings are illustrated empirically using the returns on stocks included in the S\&P 500 index.
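For context, the sketch below shows the classical GMV weight formula $\hat w = \hat\Sigma^{-1}\mathbf{1}/(\mathbf{1}^\top\hat\Sigma^{-1}\mathbf{1})$ and a generic linear shrinkage toward a holding portfolio. The fixed intensity psi is a placeholder for illustration only; the paper derives a data-driven optimal intensity from random matrix theory.

    import numpy as np

    def gmv_weights(returns):
        """Traditional GMV weights w = S^{-1} 1 / (1' S^{-1} 1), S the sample covariance."""
        S = np.cov(returns, rowvar=False)
        ones = np.ones(S.shape[0])
        w = np.linalg.solve(S, ones)
        return w / (ones @ w)

    def shrunk_gmv(returns, w_hold, psi):
        """Shrink the freshly estimated GMV weights toward the holding portfolio."""
        return psi * gmv_weights(returns) + (1.0 - psi) * w_hold

    rng = np.random.default_rng(0)
    X_prev = rng.standard_normal((250, 10))   # earlier sample -> target (holding) portfolio
    X_curr = rng.standard_normal((250, 10))   # current sample (non-overlapping case)
    w = shrunk_gmv(X_curr, gmv_weights(X_prev), psi=0.7)
    print(w.sum())                            # convex combination: weights still sum to one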
We present a new finite-sample analysis of M-estimators of location in $\mathbb{R}^d$ using the tool of the influence function. In particular, we show that the deviations of an M-estimator can be controlled via its influence function (or its score function), and we then use concentration inequalities on M-estimators to investigate robust mean estimation in high dimension under adversarial corruption, for both bounded and unbounded score functions. For a sample of size $n$ with covariance matrix $\Sigma$, we attain the minimax rate $\sqrt{Tr(\Sigma)/n}+\sqrt{\|\Sigma\|_{op}\log(1/\delta)/n}$ with probability at least $1-\delta$ in a heavy-tailed setting. A major advantage of our approach over recently proposed alternatives is that our estimator is tractable and fast to compute even in very high dimension, with complexity $O(nd\log(Tr(\Sigma)))$, where $\Sigma$ is the covariance matrix of the inliers. In practice, the code that we make available for this article proves to be very fast.
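For illustration, here is a minimal multivariate M-estimator of location with a bounded (Huber-type) score, computed by iteratively reweighted averaging; the threshold c and the iteration count are illustrative choices, not the tuned constants of the paper.

    import numpy as np

    def huber_location(X, c=1.0, n_iter=50):
        mu = np.median(X, axis=0)                     # robust initialization
        for _ in range(n_iter):
            r = np.linalg.norm(X - mu, axis=1)        # distances to the current center
            w = np.where(r <= c, 1.0, c / np.maximum(r, 1e-12))   # bounded Huber weights
            mu = (w[:, None] * X).sum(axis=0) / w.sum()
        return mu

    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 5))
    X[:20] += 100.0                                   # adversarial corruption of 2% of the points
    print(huber_location(X, c=2.0))                   # stays close to the origin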
A connected dominating set is a widely adopted model for the virtual backbone of a wireless sensor network. In this paper, we design an evolutionary algorithm for the minimum connected dominating set problem (MinCDS) whose performance is theoretically guaranteed in terms of both computation time and approximation ratio. Given a connected graph $G=(V,E)$, a connected dominating set (CDS) is a subset $C\subseteq V$ such that every vertex in $V\setminus C$ has a neighbor in $C$ and the subgraph of $G$ induced by $C$ is connected. The goal of MinCDS is to find a CDS of $G$ of minimum cardinality. We show that our evolutionary algorithm finds a CDS in expected $O(n^3)$ time that approximates the optimal value within a factor of $2+\ln\Delta$, where $n$ and $\Delta$ are the number of vertices and the maximum degree of $G$, respectively.
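For concreteness, the sketch below implements the feasibility test underlying any search procedure for MinCDS, evolutionary or otherwise: a candidate $C$ is a CDS iff it dominates $V$ and induces a connected subgraph. This is a generic check, not the paper's algorithm.

    def is_cds(adj, C):
        """adj: dict vertex -> set of neighbors; C: candidate set of vertices."""
        V = set(adj)
        if not C or any(v not in C and not (adj[v] & C) for v in V):
            return False                     # some vertex is neither in C nor dominated by C
        seen, stack = set(), [next(iter(C))] # DFS inside the induced subgraph G[C]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend((adj[u] & C) - seen)
        return seen == C                     # connected iff the search reaches all of C

    path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    print(is_cds(path, {2, 3}))              # True: {2, 3} dominates the path and is connected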
This paper deals with robust inference for parametric copula models. Estimation using Canonical Maximum Likelihood can be unstable, especially in the presence of outliers. We propose a procedure based on the Maximum Mean Discrepancy (MMD) principle. We derive non-asymptotic oracle inequalities, consistency and asymptotic normality of this new estimator. In particular, the oracle inequality holds without any assumption on the copula family, and can be applied in the presence of outliers or under misspecification. Moreover, in our MMD framework, statistical inference becomes feasible for copula models that admit no density with respect to the Lebesgue measure on $[0,1]^d$, such as the Marshall-Olkin copula. A simulation study shows the robustness of our new procedures, especially compared to pseudo-maximum likelihood estimation. An R package implementing the MMD estimator for copula models is available.
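As a toy illustration of the MMD principle for copulas (a minimal sketch, not the accompanying R package), the code below fits the correlation parameter of a bivariate Gaussian copula by minimizing a Gaussian-kernel MMD between pseudo-observations and simulated copula samples over a grid; the kernel bandwidth, grid and sample sizes are arbitrary illustrative choices.

    import numpy as np
    from scipy import stats

    def pseudo_obs(X):
        """Columnwise ranks rescaled to (0,1): the canonical nonparametric margins."""
        n = X.shape[0]
        ranks = X.argsort(axis=0).argsort(axis=0) + 1.0
        return ranks / (n + 1.0)

    def sample_gauss_copula(rho, n, rng):
        L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
        return stats.norm.cdf(rng.standard_normal((n, 2)) @ L.T)

    def mmd2(U, V, gamma=2.0):
        k = lambda A, B: np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
        return k(U, U).mean() + k(V, V).mean() - 2.0 * k(U, V).mean()

    rng = np.random.default_rng(2)
    X = stats.norm.ppf(sample_gauss_copula(0.6, 500, rng))   # synthetic data, normal margins
    U_obs = pseudo_obs(X)
    grid = np.linspace(-0.9, 0.9, 19)
    rho_hat = min(grid, key=lambda r: mmd2(U_obs, sample_gauss_copula(r, 500, rng)))
    print(rho_hat)                                           # should land near 0.6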
We develop an \textit{a posteriori} error analysis for the time of the first occurrence of an event, specifically, the time at which a functional of the solution to a partial differential equation (PDE) first achieves a threshold value on a given time interval. This novel quantity of interest (QoI) differs from classical QoIs, which are modeled as bounded linear (or nonlinear) functionals. Taylor's theorem and an adjoint-based \textit{a posteriori} analysis are used to derive computable and accurate error estimates for semi-linear parabolic and hyperbolic PDEs. The accuracy of the error estimates is demonstrated through numerical solutions of the one-dimensional heat equation and the linearized shallow water equations (SWE), representing parabolic and hyperbolic cases, respectively.
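As a schematic of the Taylor-based step (with the adjoint-computed functional error replaced by a known exact solution, purely for illustration): if the numerical functional $q_h$ crosses the threshold at $t_h$ and the functional error there is $e = q(t_h) - q_h(t_h)$, then the event-time error satisfies $t - t_h \approx -e/q'(t_h)$ to first order.

    import numpy as np

    def first_crossing(t, q, thresh):
        """Linearly interpolated first up-crossing of `thresh` (assumed to exist)."""
        i = np.argmax(q >= thresh)
        s = (thresh - q[i - 1]) / (q[i] - q[i - 1])
        return t[i - 1] + s * (t[i] - t[i - 1])

    t = np.linspace(0.0, 2.0, 41)
    q_exact = lambda s: s**2                   # stand-in for the true functional q(t)
    q_h = q_exact(t) - 0.01                    # "numerical" functional with a known error
    t_h = first_crossing(t, q_h, 1.0)          # computed event time
    e = q_exact(t_h) - np.interp(t_h, t, q_h)  # functional error at the computed event time
    t_corr = t_h - e / (2.0 * t_h)             # first-order correction, using q'(t_h) = 2 t_h
    print(t_h, t_corr, 1.0)                    # exact event time is sqrt(1.0) = 1.0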
Hawkes processes are point processes that model data where events occur in clusters through the self-exciting property of the intensity function. We consider a multivariate setting in which the dimensions can influence each other, with an intensity function that allows for both excitation and inhibition, within and across dimensions. We discuss how such a model can be implemented and highlight challenges in the estimation procedure induced by a potentially negative intensity function. Furthermore, we introduce a new, stronger condition for stability that encompasses current approaches established in the literature. Finally, we use the total number of offspring to reparametrise the model and subsequently employ Normal and sparsity-inducing priors in a Bayesian estimation procedure on simulated data.
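To make the inhibition issue concrete, the following sketch (illustrative parameter choices, not the paper's model specification) evaluates a bivariate Hawkes-type conditional intensity with exponential kernels whose interaction weights may be negative; truncating at zero keeps the intensity nonnegative, which is precisely what complicates likelihood evaluation.

    import numpy as np

    mu = np.array([0.5, 0.3])                     # background rates
    alpha = np.array([[0.8, -0.4],                # alpha[i, j]: effect of dimension j on i
                      [0.2,  0.6]])               # a negative entry encodes inhibition
    beta = 1.5                                    # exponential decay rate

    def intensity(t, events):
        """events: list of (time, dim) pairs with time < t."""
        lam = mu.copy()
        for s, j in events:
            lam += alpha[:, j] * np.exp(-beta * (t - s))
        return np.maximum(lam, 0.0)               # rectification at zero

    print(intensity(1.0, [(0.2, 0), (0.7, 1)]))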
The popular LSPE($\lambda$) algorithm for policy evaluation is revisited to derive a concentration bound that yields high-probability performance guarantees from some finite time onward.
The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational biology and allied sciences. While the dimensionality of such datasets continues to grow, so too does the complexity of identifying biomarkers from exposure patterns in health studies that measure baseline confounders; doing so while avoiding model misspecification remains an issue only partially addressed. Efficient estimators that incorporate flexible, data-adaptive regression techniques for estimating relevant components of the data-generating distribution provide an avenue for avoiding model misspecification. However, in high-dimensional problems that require the simultaneous estimation of numerous parameters, standard variance estimators have proven unstable, resulting in unreliable Type-I error control even under standard multiple-testing corrections. We present a general approach for applying empirical Bayes shrinkage to the variance estimators of a family of efficient, asymptotically linear estimators of population intervention causal effects. Our generalization of shrinkage-based variance estimators increases inferential stability in high-dimensional settings, facilitating the use of these estimators to derive nonparametric variable importance measures in high-dimensional biological datasets with modest sample sizes. The result is a data-adaptive approach for robustly uncovering stable causal associations in high-dimensional data from studies with limited samples. Our generalized variance estimator is evaluated against alternative variance estimators in numerical experiments, and biomarker identification with the proposed methodology is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study of the epigenetic effects of tobacco smoking.
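A minimal sketch of the shrinkage idea (a simple moment-based stand-in, not the paper's estimator): per-parameter variance estimates are shrunk toward an empirical prior mean, stabilizing test statistics when many parameters are estimated from a modest sample; the prior weight d0 is a hypothetical tuning choice.

    import numpy as np

    def eb_shrink_variances(s2, d, d0=4.0):
        """Shrink raw variances s2 (each on d degrees of freedom) toward their
        empirical prior mean, with hypothetical prior weight d0."""
        s0_sq = np.mean(s2)                        # crude empirical prior location
        return (d0 * s0_sq + d * s2) / (d0 + d)

    rng = np.random.default_rng(3)
    n, p = 30, 2000                                # modest sample, many parameters
    raw = rng.chisquare(n - 1, size=p) / (n - 1)   # unstable raw variance estimates
    shrunk = eb_shrink_variances(raw, d=n - 1)
    print(raw.std(), shrunk.std())                 # shrinkage reduces the spread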
This paper studies efficient estimation of causal effects when treatment is (quasi-)randomly rolled out to units at different points in time. We solve for the most efficient estimator in a class of estimators that nests two-way fixed effects models and other popular generalized difference-in-differences methods. A feasible plug-in version of the efficient estimator is asymptotically unbiased, with efficiency (weakly) dominating that of existing approaches. We provide both $t$-based and permutation-test-based methods for inference. We illustrate the performance of the plug-in efficient estimator in simulations and in an application to \citet{wood_procedural_2020}'s study of the staggered rollout of a procedural justice training program for police officers. We find that confidence intervals based on the plug-in efficient estimator have good coverage and can be as much as eight times shorter than confidence intervals based on existing state-of-the-art methods. As an empirical contribution of independent interest, our application provides the most precise estimates to date on the effectiveness of procedural justice training programs for police officers.
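A minimal sketch of the permutation-inference idea under (quasi-)random rollout timing: re-randomize which unit receives which adoption date, recompute the statistic, and compare the observed value to the permutation distribution. The estimate function here is a toy difference in means across adoption cohorts, a placeholder for the paper's plug-in efficient estimator.

    import numpy as np

    def permutation_pvalue(Y, adopt, estimate, n_perm=999, rng=None):
        rng = rng or np.random.default_rng()
        obs = estimate(Y, adopt)
        draws = [estimate(Y, rng.permutation(adopt)) for _ in range(n_perm)]
        return (1 + sum(abs(d) >= abs(obs) for d in draws)) / (n_perm + 1)

    def estimate(Y, adopt):
        """Toy statistic: early-minus-late cohort difference at the final period."""
        late = adopt == adopt.max()
        return Y[~late, -1].mean() - Y[late, -1].mean()

    rng = np.random.default_rng(4)
    Y = rng.standard_normal((50, 6))                 # 50 units, 6 periods, no true effect
    adopt = rng.permutation(np.repeat([2, 4], 25))   # two staggered adoption cohorts
    print(permutation_pvalue(Y, adopt, estimate, rng=rng))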
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it requires no prior knowledge of $B_{\star}$, nor of $T_{\star}$, which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret contains only a logarithmic dependence on $T_{\star}$, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.
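A generic sketch of the optimism mechanism described above, not EB-SSP itself: empirical costs are lowered by a count-based exploration bonus (a standard Hoeffding-style placeholder, not the paper's bonus) and value iteration is run on the resulting optimistic SSP model.

    import numpy as np

    def optimistic_vi(P_hat, c_hat, counts, goal, B=1.0, iters=2000):
        """P_hat: (S, A, S) empirical transitions; c_hat: (S, A) empirical costs;
        counts: (S, A) visit counts; `goal` is absorbing and cost-free."""
        bonus = B * np.sqrt(np.log(2.0) / np.maximum(counts, 1))  # placeholder bonus
        V = np.zeros(c_hat.shape[0])
        for _ in range(iters):
            Q = np.maximum(c_hat - bonus, 0.0) + P_hat @ V        # optimistic backup
            V_new = Q.min(axis=1)
            V_new[goal] = 0.0
            if np.max(np.abs(V_new - V)) < 1e-8:
                break
            V = V_new
        return V, Q.argmin(axis=1)

    # two-state toy: state 0 acts, state 1 is the absorbing goal
    P = np.zeros((2, 1, 2)); P[0, 0, 1] = 1.0; P[1, 0, 1] = 1.0
    c = np.array([[1.0], [0.0]]); n = np.full((2, 1), 10)
    print(optimistic_vi(P, c, n, goal=1)[0])          # optimistic cost-to-go per state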
We propose a new method of estimation in topic models that is not a variation on existing simplex-finding algorithms and that estimates the number of topics $K$ from the observed data. We derive new finite-sample minimax lower bounds for the estimation of the word-topic matrix $A$, as well as new upper bounds for our proposed estimator. We describe the scenarios in which our estimator is minimax adaptive. Our finite-sample analysis is valid for any number of documents $n$, individual document length $N_i$, dictionary size $p$ and number of topics $K$; both $p$ and $K$ are allowed to increase with $n$, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study, illustrating that the new algorithm is faster and more accurate than current alternatives, even though it starts with a computational and theoretical disadvantage: it does not know the correct number of topics $K$, while the competing methods are provided with the correct value in our simulations.