We have developed a statistical inference method applicable to a broad range of generalized linear models (GLMs) in high-dimensional settings, where the number of unknown coefficients scales proportionally with the sample size. Although a pioneering method has been developed for logistic regression, which is a specific instance of GLMs, its direct applicability to other GLMs remains limited. In this study, we address this limitation by developing a new inference method designed for a class of GLMs with asymmetric link functions. More precisely, we first introduce a novel convex loss-based estimator and its associated system, which are essential components of the inference. We then devise a methodology for identifying the parameters of the system that the method requires. Consequently, we construct confidence intervals for GLMs in the high-dimensional regime. We prove that our proposal has desirable theoretical properties, such as strong consistency and exact coverage probability. Finally, we confirm its validity through experiments.
In this paper, we focus on the high-dimensional double sparse structure, where the parameter of interest simultaneously exhibits group-wise sparsity and element-wise sparsity within each group. By combining the Gilbert-Varshamov bound and its variants, we develop a novel lower bound technique for the metric entropy of the parameter space, specifically tailored to the double sparse structure over $\ell_u(\ell_q)$-balls with $u,q \in [0,1]$. We prove lower bounds on the estimation error using an information-theoretic approach, leveraging our proposed lower bound technique and Fano's inequality. To complement the lower bounds, we establish matching upper bounds through a direct analysis of constrained least-squares estimators, utilizing results from the theory of empirical processes. A significant finding of our study is the discovery of a phase transition phenomenon in the minimax rates for $u,q \in (0, 1]$. Furthermore, we extend the theoretical results to the double sparse regression model and determine its minimax rate for estimation error. To solve double sparse linear regression, we develop the DSIHT (Double Sparse Iterative Hard Thresholding) algorithm and demonstrate its optimality in the minimax sense. Finally, we show the superiority of our method through numerical experiments.
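To make the double sparse structure concrete, the following is a minimal, illustrative Python/NumPy sketch of an iterative hard-thresholding scheme with a double sparse projection (element-wise within groups, then group-wise). It is not the DSIHT algorithm of the paper: the projection rule, the step size, and the parameters `s_groups` and `s_in` (assumed to be at least 1) are choices made here purely for illustration.

```python
import numpy as np

def double_sparse_project(beta, groups, s_groups, s_in):
    """Keep the s_in largest entries (in magnitude) within each group, then keep
    only the s_groups groups with the largest resulting Euclidean norms."""
    out = np.zeros_like(beta)
    norms = np.zeros(len(groups))
    for g, idx in enumerate(groups):
        coefs = beta[idx].copy()
        coefs[np.argsort(np.abs(coefs))[:max(len(idx) - s_in, 0)]] = 0.0  # element-wise step
        out[idx] = coefs
        norms[g] = np.linalg.norm(coefs)
    for g in np.argsort(norms)[:-s_groups]:                                # group-wise step
        out[groups[g]] = 0.0
    return out

def iht_double_sparse(X, y, groups, s_groups, s_in, n_iter=200):
    """Projected gradient descent for least squares with the projection above."""
    n, p = X.shape
    step = 1.0 / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = double_sparse_project(beta - step * grad, groups, s_groups, s_in)
    return beta
```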
We study the problem of exact support recovery for high-dimensional sparse linear regression under independent Gaussian design when the signals are weak, rare, and possibly heterogeneous. Under a suitable scaling of the sample size and signal sparsity, we fix the minimum signal magnitude at the information-theoretically optimal rate and investigate the asymptotic selection accuracy of best subset selection (BSS) and marginal screening (MS) procedures. We show that, despite the ideal setup, somewhat surprisingly, marginal screening can fail to achieve exact recovery with probability converging to one in the presence of heterogeneous signals, whereas BSS enjoys model consistency whenever the minimum signal strength is above the information-theoretic threshold. To mitigate the computational intractability of BSS, we also propose an efficient two-stage algorithmic framework called ETS (Estimate Then Screen), comprising an estimation step and a gradient coordinate screening step, and under the same scaling assumption on the sample size and sparsity, we show that ETS achieves model consistency under the same information-theoretically optimal requirement on the minimum signal strength as BSS. Finally, we present a simulation study comparing ETS with LASSO and marginal screening. The numerical results agree with our asymptotic theory even for realistic values of the sample size, dimension, and sparsity.
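For reference, here is a minimal Python/NumPy sketch of the two selection procedures being compared: marginal screening, which ranks covariates by absolute marginal correlation with the response, and best subset selection implemented by exhaustive search (feasible only for very small dimensions). It illustrates the objects under study, not the ETS algorithm proposed in the paper.

```python
import numpy as np
from itertools import combinations

def marginal_screening(X, y, s):
    """Return the indices of the s covariates with the largest |X_j' y|."""
    scores = np.abs(X.T @ y)
    return set(np.argsort(scores)[-s:].tolist())

def best_subset_selection(X, y, s):
    """Exhaustive best subset selection: the size-s support minimizing the
    residual sum of squares (exponential in the dimension, illustration only)."""
    best_rss, best_support = np.inf, None
    for support in combinations(range(X.shape[1]), s):
        Xs = X[:, support]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_support = rss, set(support)
    return best_support
```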
We propose a general-purpose confidence interval procedure (CIP) for statistical functionals constructed using data from a stationary time series. The procedures we propose are based on distribution-free analogues of the $\chi^2$ and Student's $t$ random variables that we derive for the statistical functional context, and hence apply in a wide variety of settings including quantile estimation, gradient estimation, M-estimation, CVaR estimation, and arrival process rate estimation, in addition to more traditional statistical settings. Like the method of subsampling, we use overlapping batches of time series data to estimate the underlying variance parameter; unlike subsampling and the bootstrap, however, we assume that the implied point estimator of the statistical functional obeys a central limit theorem (CLT) to help identify the weak asymptotics (called OB-x limits, x=I,II,III) of batched Studentized statistics. The OB-x limits, certain functionals of the Wiener process parameterized by the size of the batches and the extent of their overlap, form the essential machinery for characterizing dependence and, consequently, the correctness of the proposed CIPs. The message from extensive numerical experimentation is that, in settings where a functional CLT on the point estimator is in effect, using \emph{large overlapping batches} alongside OB-x critical values yields confidence intervals that are often of significantly higher quality than those obtained from more generic methods like subsampling or the bootstrap. We illustrate using examples from CVaR estimation, ARMA parameter estimation, and NHPP rate estimation; R and MATLAB code for OB-x critical values is available at~\texttt{web.ics.purdue.edu/~pasupath/}.
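As a point of reference, the following Python sketch computes a confidence interval for a plug-in statistical functional from overlapping batches of a stationary series. It uses Student's $t$ critical values and a crude degrees-of-freedom choice as stand-ins for the OB-x critical values derived in the paper, so it conveys the mechanics of overlapping batching rather than the proposed procedure itself.

```python
import numpy as np
from scipy import stats

def overlapping_batch_ci(x, batch_size, functional=np.mean, alpha=0.05):
    """CI for a functional of a stationary series from overlapping batches.
    The t quantile below is a rough stand-in for an OB-x critical value."""
    x = np.asarray(x)
    n, m = len(x), batch_size
    point = functional(x)                                      # full-sample point estimate
    batch = np.array([functional(x[i:i + m]) for i in range(n - m + 1)])
    # Overlapping-batch estimate of the variance parameter sigma^2 in
    # sqrt(n) * (point - theta) => N(0, sigma^2).
    sigma2_hat = n * m / ((n - m + 1) * (n - m)) * np.sum((batch - point) ** 2)
    crit = stats.t.ppf(1 - alpha / 2, df=max(n // m - 1, 1))   # crude df choice
    half = crit * np.sqrt(sigma2_hat / n)
    return point - half, point + half
```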
We consider hypergraph network design problems where the goal is to construct a hypergraph satisfying certain properties. In graph network design problems, the number of edges in an arbitrary solution is at most the square of the number of vertices. In contrast, in hypergraph network design problems, the number of hyperedges in an arbitrary solution could be exponential in the number of vertices; hence, additional care is necessary to design polynomial-time algorithms. The central theme of this work is to show that certain hypergraph network design problems admit solutions with a polynomial number of hyperedges and, moreover, can be solved in strongly polynomial time. Our work improves on the previous fastest pseudo-polynomial run-time for these problems. In addition, we develop algorithms that return (near-)uniform hypergraphs as solutions. The hypergraph network design problems that we focus upon are the splitting-off operation in hypergraphs, connectivity augmentation using hyperedges, and covering skew-supermodular functions using hyperedges. Our definition of the splitting-off operation in hypergraphs and our constructive proof of its existence, via a strongly polynomial-time algorithm that computes it, are likely to be of independent graph-theoretical interest.
The classical latent factor model for linear regression is extended by assuming that, up to an unknown orthogonal transformation, the features consist of subsets that are relevant and irrelevant for the response. Furthermore, a joint low-dimensionality is imposed only on the vector of relevant features and the response variable. This framework allows for a comprehensive study of the partial least squares (PLS) algorithm under random design. In particular, a novel perturbation bound for PLS solutions is proven and the high-probability $L^2$-estimation rate for the PLS estimator is obtained. This novel framework also sheds light on the performance of other regularisation methods for ill-posed linear regression that exploit sparsity or unsupervised projection. The theoretical findings are confirmed by numerical studies on both real and simulated data.
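For readers who want the algorithm under study in concrete form, here is a minimal NIPALS-style sketch of PLS1 regression in Python/NumPy. It is the textbook algorithm, not the paper's latent factor framework, and details such as centring and scaling conventions are choices made here for illustration.

```python
import numpy as np

def pls1(X, y, n_components):
    """NIPALS-style PLS1: extract components that maximize covariance with the
    response, then map the fit back to regression coefficients."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)          # weight direction
        t = Xk @ w                         # component scores
        p = Xk.T @ t / (t @ t)             # X loadings
        c = (yk @ t) / (t @ t)             # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)           # coefficients in the original space
    intercept = y.mean() - X.mean(axis=0) @ beta
    return beta, intercept
```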
Algorithms for online learning typically require one or more boundedness assumptions: that the domain is bounded, that the losses are Lipschitz, or both. In this paper, we develop a new setting for online learning with unbounded domains and non-Lipschitz losses. For this setting we provide an algorithm which guarantees $R_{T}(u)\le \tilde O(G\|u\|\sqrt{T}+L\|u\|^{2}\sqrt{T})$ regret on any problem where the subgradients satisfy $\|g_{t}\|\le G+L\|w_{t}\|$, and show that this bound is unimprovable without further assumptions. We leverage this algorithm to develop new saddle-point optimization algorithms that converge in duality gap in unbounded domains, even in the absence of meaningful curvature. Finally, we provide the first algorithm achieving non-trivial dynamic regret in an unbounded domain for non-Lipschitz losses, as well as a matching lower bound. The regret of our dynamic regret algorithm automatically improves to a novel $L^{*}$ bound when the losses are smooth.
Statistical inference for high-dimensional regression coefficients is challenging because the uncertainty introduced by the model selection procedure is hard to account for. A critical question remains unsettled: is it possible, and if so how, to embed the inference of the model into the simultaneous inference of the coefficients? To this end, we propose a notion of simultaneous confidence intervals called sparsified simultaneous confidence intervals. Our intervals are sparse in the sense that some of the intervals' upper and lower bounds are shrunk to zero (i.e., $[0,0]$), indicating the unimportance of the corresponding covariates; these covariates should be excluded from the final model. The remaining intervals, either containing zero (e.g., $[-1,1]$ or $[0,1]$) or not containing zero (e.g., $[2,3]$), indicate the plausible and the significant covariates, respectively. The proposed method can be coupled with various selection procedures, making it ideal for comparing their uncertainty. For the proposed method, we establish desirable asymptotic properties, develop intuitive graphical tools for visualization, and justify its superior performance through simulation and real data analysis.
The widespread use of maximum Jeffreys'-prior penalized likelihood in binomial-response generalized linear models, and in logistic regression in particular, is supported by the results of Kosmidis and Firth (2021, Biometrika), who show that the resulting estimates are always finite-valued, even in cases where the maximum likelihood estimates are not, a practical issue regardless of the size of the data set. In logistic regression, the implied adjusted score equations are formally bias-reducing in asymptotic frameworks with a fixed number of parameters, and appear to deliver a substantial reduction in the persistent bias of the maximum likelihood estimator in high-dimensional settings where the number of parameters grows asymptotically linearly with, and slower than, the number of observations. In this work, we develop and present two new variants of iteratively reweighted least squares (IWLS) for estimating generalized linear models with adjusted score equations for mean bias reduction, and for maximizing the likelihood penalized by a positive power of the Jeffreys-prior penalty. Both variants eliminate the requirement of storing $O(n)$ quantities in memory and can operate with data sets that exceed computer memory or even hard-drive capacity. We achieve this through incremental QR decompositions, which allow the IWLS iterations to access only data chunks of a predetermined size. We assess the procedures through a real-data application with millions of observations, and in high-dimensional logistic regression, where a large-scale simulation experiment produces concrete evidence for the existence of a simple adjustment to the maximum Jeffreys'-penalized likelihood estimates that delivers high accuracy, in terms of signal recovery, even in cases where estimates from ML and other recently proposed corrective methods do not exist.
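To illustrate the kind of bounded-memory computation involved, below is a minimal Python/NumPy sketch of a single weighted least-squares solve (one IWLS update) carried out with incremental QR decompositions, so that only one chunk of rows is held in memory at a time. It is a sketch of the general technique, not the implementation in the paper: the chunk interface and the handling of working weights and responses are assumptions, and a full fit would recompute them from the current coefficients over repeated passes through the data.

```python
import numpy as np

def chunked_wls_update(chunks, p):
    """One IWLS update via incremental QR. `chunks` yields (X, z, w): a block of
    design rows, working responses, and working weights; only the p x p factor R
    and a length-p vector are retained between blocks."""
    R = np.zeros((p, p))       # zero rows contribute nothing to the solution
    qtz = np.zeros(p)
    for X, z, w in chunks:
        sw = np.sqrt(w)
        A = np.vstack([R, sw[:, None] * X])       # stack previous factor and new rows
        b = np.concatenate([qtz, sw * z])
        Q, R = np.linalg.qr(A)                    # reduced QR: R is p x p
        qtz = Q.T @ b
    return np.linalg.solve(R, qtz)                # updated coefficient vector
```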
We introduce a physics-driven deep latent variable model (PDDLVM) to learn simultaneously parameter-to-solution (forward) and solution-to-parameter (inverse) maps of parametric partial differential equations (PDEs). Our formulation leverages conventional PDE discretization techniques, deep neural networks, probabilistic modelling, and variational inference to assemble a coherent, fully probabilistic framework. In the posited probabilistic model, both the forward and inverse maps are approximated as Gaussian distributions with a mean and covariance parameterized by deep neural networks. The PDE residual is assumed to be an observed random vector of value zero; hence we model it as a random vector with a zero mean and a user-prescribed covariance. The model is trained by maximizing the probability of observing a zero residual, that is, the evidence or marginal likelihood, via the evidence lower bound (ELBO). Consequently, the proposed methodology does not require any independent PDE solves and is physics-informed at training time, allowing the real-time solution of PDE forward and inverse problems after training. The proposed framework can be easily extended to seamlessly integrate observed data to solve inverse problems and to build generative models. We demonstrate the efficiency and robustness of our method on finite element discretized parametric PDE problems such as linear and nonlinear Poisson problems, elastic shells with complex 3D geometries, and time-dependent nonlinear and inhomogeneous PDEs using a physics-informed neural network (PINN) discretization. We achieve up to three orders of magnitude speed-up after training compared to the traditional finite element method (FEM), while outputting coherent uncertainty estimates.
Residual bootstrap is a classical method for statistical inference in regression settings. With massive data sets becoming increasingly common, there is a demand for computationally efficient alternatives to residual bootstrap. We propose a simple, versatile, and scalable algorithm called subsampled residual bootstrap (SRB) for generalized linear models (GLMs), a large class of regression models that includes the classical linear regression model as well as other widely used models such as logistic, Poisson and probit regression. We prove consistency and distributional results establishing that SRB has the same theoretical guarantees under the GLM framework as the classical residual bootstrap, while being computationally much faster. We demonstrate the empirical performance of SRB via simulation studies and a real data analysis of the Forest Covertype data from the UCI Machine Learning Repository.
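As an illustration of the idea, here is a minimal Python/NumPy sketch of a residual bootstrap in which each replicate refits on a random subsample of rows, stated for the linear model. It is one plausible reading of a subsampled residual bootstrap, not necessarily the SRB algorithm of the paper; in particular the subsample size `m` is a placeholder, and any rescaling needed to calibrate intervals to the full sample size is omitted.

```python
import numpy as np

def subsampled_residual_bootstrap(X, y, B=500, m=None, seed=0):
    """Residual bootstrap for linear regression in which each of the B replicates
    refits on a random subsample of m rows with resampled residuals."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = m if m is not None else max(10 * p, int(np.sqrt(n)))   # placeholder choice
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    resid = resid - resid.mean()                               # centre residuals
    draws = np.empty((B, p))
    for b in range(B):
        rows = rng.choice(n, size=m, replace=False)
        y_star = X[rows] @ beta_hat + rng.choice(resid, size=m, replace=True)
        draws[b], *_ = np.linalg.lstsq(X[rows], y_star, rcond=None)
    return beta_hat, draws     # draws approximate the variability of beta_hat
```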