The central space of a joint distribution $(\boldsymbol{X},Y)$ is the minimal subspace $\mathcal S$ such that $Y\perp\hspace{-2mm}\perp \boldsymbol{X} \mid P_{\mathcal S}\boldsymbol{X}$, where $P_{\mathcal S}$ is the projection onto $\mathcal S$. Sliced inverse regression (SIR), one of the most popular methods for estimating the central space, often performs poorly when the structural dimension $d=\operatorname{dim}\left( \mathcal S \right)$ is large (e.g., $\geq 5$). In this paper, we demonstrate that the generalized signal-noise-ratio (gSNR) tends to be extremely small for a general multiple-index model when $d$ is large. Then we determine the minimax rate for estimating the central space over a large class of high dimensional distributions with a large structural dimension $d$ (i.e., there is no constant upper bound on $d$) in the low gSNR regime. This result not only extends the existing minimax rate results for estimating the central space from distributions with fixed $d$ to those with large $d$, but also clarifies that the degradation in SIR performance is caused by the decay of signal strength. The technical tools developed here might be of independent interest for studying other central space estimation methods.
Many multivariate data sets exhibit a form of positive dependence, which can either appear globally between all variables or only locally within particular subgroups. A popular notion of positive dependence that allows for localized positivity is positive association. In this work we introduce the notion of extremal positive association for multivariate extremes from threshold exceedances. Via a sufficient condition for extremal association, we show that extremal association generalizes extremal tree models. For H\"usler--Reiss distributions the sufficient condition permits a parametric description that we call the metric property. As the parameter of a H\"usler--Reiss distribution is a Euclidean distance matrix, the metric property relates to research in electrical network theory and Euclidean geometry. We show that the metric property can be localized with respect to a graph and study surrogate likelihood inference. This gives rise to a two-step estimation procedure for locally metrical H\"usler--Reiss graphical models. The second step allows for a simple dual problem, which is implemented via a gradient descent algorithm. Finally, we demonstrate our results on simulated and real data.
In recent years, promising statistical modeling approaches to tensor data analysis have been rapidly developed. Traditional multivariate analysis tools, such as multivariate regression and discriminant analysis, are generalized from modeling random vectors and matrices to higher-order random tensors. One of the biggest challenges to statistical tensor models is the non-Gaussian nature of many real-world data. Unfortunately, existing approaches either are restricted to normality or implicitly use least squares type objective functions that are computationally efficient but sensitive to data contamination. Motivated by this, we adopt a simple tensor t-distribution that is, unlike the commonly used matrix t-distributions, compatible with tensor operators and reshaping of the data. We study the tensor response regression with tensor t-error, and develop penalized likelihood-based estimation and a novel one-step estimation. We study the asymptotic relative efficiency of various estimators and establish the one-step estimator's oracle properties and near-optimal asymptotic efficiency. We further propose a high-dimensional modification to the one-step estimation procedure and show that it attains the minimax optimal rate in estimation. Numerical studies show the excellent performance of the one-step estimator.
We consider the optimization of a smooth and strongly convex objective using constant step-size stochastic gradient descent (SGD) and study its properties through the prism of Markov chains. We show that, for unbiased gradient estimates with mildly controlled variance, the iteration converges to an invariant distribution in total variation distance. We also establish this convergence in Wasserstein-2 distance under a relaxed assumption on the gradient noise distribution compared to previous work. Thanks to the invariance property of the limit distribution, our analysis shows that the latter inherits sub-Gaussian or sub-exponential concentration properties when these hold true for the gradient. This allows the derivation of high-confidence bounds for the final estimate. Finally, under such conditions in the linear case, we obtain a dimension-free deviation bound for the Polyak-Ruppert average of a tail sequence. All our results are non-asymptotic and their consequences are discussed through a few applications.
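As a rough illustration of the Markov-chain viewpoint, the sketch below runs constant step-size SGD on a one-dimensional quadratic $f(x) = \tfrac{a}{2}(x - x^\star)^2$ with additive Gaussian gradient noise. The toy objective, the function name `sgd_tail_average`, and all parameter values are assumptions made for illustration and do not come from the paper. In this setting the iterates do not converge to $x^\star$ but fluctuate around it with stationary variance $\gamma\sigma^2 / (a(2 - \gamma a))$, while the Polyak-Ruppert average of the tail sequence concentrates much more tightly.

```python
import random


def sgd_tail_average(a=1.0, x_star=2.0, step=0.1, sigma=1.0,
                     n_iter=30_000, burn_in=10_000, seed=0):
    """Constant step-size SGD on f(x) = 0.5 * a * (x - x_star)**2.

    Returns the last iterate, the average of the tail iterates
    (Polyak-Ruppert averaging after burn-in), and their empirical variance.
    """
    rng = random.Random(seed)
    x = 0.0
    tail = []
    for k in range(n_iter):
        # Unbiased gradient estimate: exact gradient plus Gaussian noise.
        grad = a * (x - x_star) + rng.gauss(0.0, sigma)
        x -= step * grad
        if k >= burn_in:
            tail.append(x)
    mean = sum(tail) / len(tail)
    var = sum((t - mean) ** 2 for t in tail) / len(tail)
    return x, mean, var


if __name__ == "__main__":
    last, avg, var = sgd_tail_average()
    print("last iterate:", last)
    print("tail average:", avg)
    print("empirical variance:", var, "theory:", 0.1 / 1.9)
```

The last iterate is a draw from (approximately) the invariant distribution, so its error stays at the scale of the stationary standard deviation regardless of how long the chain runs; only the averaged estimate improves with more iterations.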
In a task where many similar inverse problems must be solved, evaluating costly simulations is impractical. Therefore, replacing the model $y$ with a surrogate model $y_s$ that can be evaluated quickly leads to a significant speedup. The approximation quality of the surrogate model depends strongly on the number, position, and accuracy of the sample points. With an additional finite computational budget, this leads to a problem of (computer) experimental design. In contrast to the selection of sample points, the trade-off between accuracy and effort has hardly been studied systematically. We therefore propose an adaptive algorithm to find an optimal design in terms of position and accuracy. Pursuing a sequential design by incrementally appending the computational budget leads to a convex and constrained optimization problem. As a surrogate, we construct a Gaussian process regression model. We measure the global approximation error in terms of its impact on the accuracy of the identified parameter and aim for a uniform absolute tolerance, assuming that $y_s$ is computed by finite element calculations. A priori error estimates and a coarse estimate of computational effort relate the expected improvement of the surrogate model error to computational effort, resulting in the most efficient combination of sample point and evaluation tolerance. We also allow for improving the accuracy of already existing sample points by continuing previously truncated finite element solution procedures.
Learning the graphical structure of Bayesian networks is key to describing data-generating mechanisms in many complex applications but poses considerable computational challenges. Observational data can only identify the equivalence class of the directed acyclic graph underlying a Bayesian network model, and a variety of methods exist to tackle the problem. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by reverse-engineering the conditional independence (CI) relationships holding in the variable distribution. The dual PC algorithm is a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. By exploiting block matrix inversions we can also perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies show that the dual PC algorithm outperforms the classic PC algorithm both in terms of run time and in recovering the underlying network structure, even in the presence of deviations from Gaussianity. Additionally, we show that the dual PC algorithm applies to Gaussian copula models, and demonstrate its performance in that setting.
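The inverse relationship between covariance and precision matrices can be made concrete with the standard identity $\rho_{ij \mid \text{rest}} = -\Omega_{ij} / \sqrt{\Omega_{ii}\Omega_{jj}}$, where $\Omega = \Sigma^{-1}$. The sketch below is not the dual PC algorithm itself; it only contrasts a marginal correlation (read off $\Sigma$) with a full-order partial correlation (read off $\Omega$) for an assumed Gaussian chain $X_1 \to X_2 \to X_3$ with unit-variance noise. The helper names are hypothetical.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def marginal_corr(cov, i, j):
    """Correlation of variables i and j with an empty conditioning set."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])


def full_order_partial_corr(cov, i, j):
    """Partial correlation of i and j given all remaining variables."""
    omega = invert(cov)
    return -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])


# Covariance of the chain X1 -> X2 -> X3 with X2 = X1 + e2, X3 = X2 + e3.
CHAIN_COV = [[1.0, 1.0, 1.0],
             [1.0, 2.0, 2.0],
             [1.0, 2.0, 3.0]]

if __name__ == "__main__":
    print("corr(X1, X3)      =", marginal_corr(CHAIN_COV, 0, 2))
    print("corr(X1, X3 | X2) =", full_order_partial_corr(CHAIN_COV, 0, 2))
```

Here $X_1$ and $X_3$ are marginally correlated but conditionally independent given $X_2$, which shows up as a zero entry in the precision matrix.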
While statistical modeling of distributional data has gained increased attention, the case of multivariate distributions has been somewhat neglected despite its relevance in various applications. This is because the Wasserstein distance that is commonly used in distributional data analysis poses challenges for multivariate distributions. A promising alternative is the sliced Wasserstein distance, which offers a computationally simpler solution. We propose distributional regression models with multivariate distributions as responses paired with Euclidean vector predictors, working with the sliced Wasserstein distance, which is based on a slicing transform from the multivariate distribution space to the sliced distribution space. We introduce two regression approaches, one based on utilizing the sliced Wasserstein distance directly in the multivariate distribution space, and a second approach that employs a univariate distribution regression for each slice. We develop both global and local Fr\'echet regression methods for these approaches and establish asymptotic convergence for sample-based estimators. The proposed regression methods are illustrated in simulations and by studying joint distributions of systolic and diastolic blood pressure as a function of age and joint distributions of excess winter death rates and winter temperature anomalies in European countries as a function of a country's base winter temperature.
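To make the slicing idea concrete, the following sketch computes a sliced Wasserstein-2 distance between two empirical distributions in the plane. It assumes equal sample sizes (so the one-dimensional distance reduces to matching sorted projections) and uses evenly spaced directions on the half circle instead of random ones. Both are simplifications chosen for illustration, the function names are hypothetical, and this is not the estimator studied in the paper.

```python
import math
import random


def wasserstein2_1d_sq(u, v):
    """Squared Wasserstein-2 distance between two equal-size 1D samples."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum((a - b) ** 2 for a, b in zip(sorted(u), sorted(v))) / len(u)


def sliced_wasserstein2(xs, ys, n_slices=64):
    """Sliced W2 distance between two 2D point clouds.

    Each slice projects both clouds onto a unit direction and compares the
    resulting 1D samples; squared distances are averaged over directions.
    """
    total = 0.0
    for k in range(n_slices):
        angle = math.pi * k / n_slices  # directions on the half circle
        c, s = math.cos(angle), math.sin(angle)
        proj_x = [c * p[0] + s * p[1] for p in xs]
        proj_y = [c * p[0] + s * p[1] for p in ys]
        total += wasserstein2_1d_sq(proj_x, proj_y)
    return math.sqrt(total / n_slices)


if __name__ == "__main__":
    rng = random.Random(0)
    cloud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    shifted = [(x + 3.0, y + 4.0) for x, y in cloud]
    print(sliced_wasserstein2(cloud, shifted))
```

For a pure translation by a vector $t$ in the plane, each slice contributes $(\theta^\top t)^2$, so with this normalization the sliced distance equals $\lVert t \rVert / \sqrt{2}$, which gives a simple sanity check.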
Prediction, in regression and classification, is one of the main aims in modern data science. When the number of predictors is large, a common first step is to reduce the dimension of the data. Sufficient dimension reduction (SDR) is a well-established paradigm of reduction that keeps all the relevant information in the covariates X that is necessary for the prediction of Y. In practice, SDR has been successfully used as an exploratory tool for modelling after estimation of the sufficient reduction. Nevertheless, even if the estimated reduction is a consistent estimator of its population counterpart, there is no theory that supports this step when non-parametric regression is used in the imputed estimator. In this paper, we show that the asymptotic distribution of the non-parametric regression estimator is the same regardless of whether the true SDR or its estimator is used. This result allows making inferences, for example, computing confidence intervals for the regression function while avoiding the curse of dimensionality.
We investigate the statistical behavior of gradient descent iterates with dropout in the linear regression model. In particular, non-asymptotic bounds for expectations and covariance matrices of the iterates are derived. In contrast with the widely cited connection between dropout and $\ell_2$-regularization in expectation, the results indicate a much more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout. We also study a simplified variant of dropout which does not have a regularizing effect and converges to the least squares estimator.
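A small simulation can illustrate the "in expectation" connection. The sketch assumes one common formulation: at each step a fresh diagonal mask $D$ with independent Bernoulli($p$) entries is drawn and a gradient step is taken on $\tfrac12\lVert Y - XD\beta\rVert^2$. Under this assumption, averaging over $D$ gives the minimizer $\tilde\beta = (pG + (1-p)\operatorname{diag}(G))^{-1}X^\top Y$ with $G = X^\top X$, so that $p\tilde\beta$ is a ridge-type estimator with penalty weighted by $\operatorname{diag}(G)$. The two-dimensional Gram matrix, the step size, and the function names are hypothetical, and the sketch does not reproduce the paper's bounds.

```python
import random


def solve_2x2(m, b):
    """Solve the 2x2 linear system m @ x = b by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(b[0] * m[1][1] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det]


def marginalized_minimizer(gram, xty, p):
    """Minimizer of E_D ||Y - X D beta||^2 for Bernoulli(p) keep-masks D."""
    # p * G + (1 - p) * diag(G): diagonal unchanged, off-diagonals scaled by p.
    m = [[gram[i][j] * (1.0 if i == j else p) for j in range(2)]
         for i in range(2)]
    return solve_2x2(m, xty)


def dropout_gd_average(gram, xty, p, step=0.05, n_iter=60_000,
                       burn_in=10_000, seed=0):
    """Run gradient descent with a fresh dropout mask per step and
    return the average of the iterates after burn-in."""
    rng = random.Random(seed)
    beta = [0.0, 0.0]
    total = [0.0, 0.0]
    for k in range(n_iter):
        keep = [1.0 if rng.random() < p else 0.0 for _ in range(2)]
        masked = [keep[j] * beta[j] for j in range(2)]
        # Gradient of 0.5 * ||Y - X D beta||^2 in beta: D (G D beta - X^T Y).
        grad = [keep[i] * (sum(gram[i][j] * masked[j] for j in range(2)) - xty[i])
                for i in range(2)]
        beta = [beta[i] - step * grad[i] for i in range(2)]
        if k >= burn_in:
            total = [total[i] + beta[i] for i in range(2)]
    n = n_iter - burn_in
    return [t / n for t in total]


if __name__ == "__main__":
    gram, xty, p = [[1.0, 0.5], [0.5, 1.0]], [1.0, 1.0], 0.5
    print("least squares:         ", solve_2x2(gram, xty))
    print("marginalized minimizer:", marginalized_minimizer(gram, xty, p))
    print("averaged dropout GD:   ", dropout_gd_average(gram, xty, p))
```

In this toy setting the averaged iterates settle near the marginalized minimizer instead of the least squares solution, while individual iterates keep fluctuating because the mask is redrawn at every step.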
Automatic structures are structures whose universe and relations can be represented as regular languages. It follows from the standard closure properties of regular languages that the first-order theory of an automatic structure is decidable. While existential quantifiers can be eliminated in linear time by application of a homomorphism, universal quantifiers are commonly eliminated via the identity $\forall\,x\,.\,\Phi \equiv \neg (\exists\,x\,.\,\neg \Phi)$. If $\Phi$ is represented in the standard way as an NFA, a priori this approach results in a doubly exponential blow-up. However, the recent literature has shown that there are classes of automatic structures for which universal quantifiers can be eliminated by different means without this blow-up by treating them as first-class citizens and not resorting to double complementation. While existing lower bounds for some classes of automatic structures show that a singly exponential blow-up is unavoidable when eliminating a universal quantifier, it is not known whether there may be better approaches that avoid the na\"ive doubly exponential blow-up, perhaps at least in restricted settings. In this paper, we answer this question negatively and show that there is a family of NFA representing automatic relations for which the minimal NFA recognising the language after eliminating a single universal quantifier is doubly exponential, and deciding whether this language is empty is ExpSpace-complete.
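The cost of eliminating a universal quantifier through $\neg\exists\neg$ comes from complementing NFAs, which in general requires determinization. The sketch below shows a single exponential step using a textbook witness: the language of words over $\{a, b\}$ whose $k$-th symbol from the end is $a$. It has an NFA with $k+1$ states, while the subset construction reaches $2^k$ states. This is not the family constructed in the paper, and the function names are hypothetical.

```python
def kth_from_end_nfa(k):
    """NFA with states 0..k accepting words whose k-th symbol from the end is 'a'.

    Returns (transitions, initial states, accepting states).
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, k):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, {0}, {k}


def determinize(delta, initial, alphabet="ab"):
    """Subset construction restricted to reachable subsets."""
    start = frozenset(initial)
    seen = {start}
    frontier = [start]
    dfa = {}
    while frontier:
        subset = frontier.pop()
        for symbol in alphabet:
            target = frozenset(q for s in subset
                               for q in delta.get((s, symbol), ()))
            dfa[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return start, seen, dfa


def accepts_complement(word, start, dfa, accepting):
    """Run the DFA and flip acceptance, i.e. decide the complement language."""
    state = start
    for symbol in word:
        state = dfa[(state, symbol)]
    return not (state & accepting)


if __name__ == "__main__":
    for k in range(1, 9):
        delta, initial, accepting = kth_from_end_nfa(k)
        _, states, _ = determinize(delta, initial)
        print(k + 1, "NFA states ->", len(states), "DFA states")
```

Complementing once already costs an exponential factor here; after projecting away a variable the automaton is nondeterministic again, so the second complementation may require another determinization, which is where the a priori doubly exponential bound comes from.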
We introduce variational sequential Optimal Experimental Design (vsOED), a new method for optimally designing a finite sequence of experiments under a Bayesian framework and with information-gain utilities. Specifically, we adopt a lower bound estimator for the expected utility through variational approximation to the Bayesian posteriors. The optimal design policy is solved numerically by simultaneously maximizing the variational lower bound and performing policy gradient updates. We demonstrate this general methodology for a range of OED problems targeting parameter inference, model discrimination, and goal-oriented prediction. These cases encompass explicit and implicit likelihoods, nuisance parameters, and physics-based partial differential equation models. Our vsOED results indicate substantially improved sample efficiency and reduced number of forward model simulations compared to previous sequential design algorithms.