We develop and analyze data subsampling techniques for Poisson regression, the standard model for count data $y\in\mathbb{N}$. In particular, we consider the Poisson generalized linear model with ID- (identity) and square root-link functions. We consider the method of coresets, which are small weighted subsets that approximate the loss function of Poisson regression up to a factor of $1\pm\varepsilon$. We show $\Omega(n)$ lower bounds against coresets for Poisson regression that continue to hold against arbitrary data reduction techniques up to logarithmic factors. By introducing a novel complexity parameter and a domain shifting approach, we show that sublinear coresets with a $1\pm\varepsilon$ approximation guarantee exist when the complexity parameter is small. In particular, the dependence on the number of input points can be reduced to polylogarithmic. We show that the dependence on other input parameters can also be bounded sublinearly, though not always logarithmically. In particular, we show that the square root-link admits an $O(\log(y_{\max}))$ dependence, where $y_{\max}$ denotes the largest count present in the data, while the ID-link requires a $\Theta(\sqrt{y_{\max}/\log(y_{\max})})$ dependence. As an auxiliary result for proving the tightness of the bound with respect to $y_{\max}$ in the case of the ID-link, we show an improved bound on the principal branch of the Lambert $W_0$ function, which may be of independent interest. We further show the limitations of our analysis when $p$th degree root-link functions for $p\geq 3$ are considered, indicating that other analytical or computational methods would be required if such a generalization is possible at all.
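To fix ideas, the following minimal Python sketch writes down the per-point Poisson losses under the two link functions and the coreset approximation guarantee stated above; the synthetic data, the function names, and the uniform subsample are illustrative only and are not the sensitivity- or complexity-based construction of the paper.

```python
import numpy as np

def poisson_loss(beta, X, y, link="id"):
    """Per-point Poisson negative log-likelihood, up to the log(y!) constant.

    link="id":   mean mu_i = x_i^T beta        (identity link)
    link="sqrt": mean mu_i = (x_i^T beta)^2    (square-root link)
    """
    eta = X @ beta
    mu = eta if link == "id" else eta ** 2
    mu = np.maximum(mu, 1e-12)        # guard against non-positive means
    return mu - y * np.log(mu)

# A coreset is a weighted subset (S, w) whose weighted loss approximates the
# full loss for every beta:
#   |sum_{i in S} w_i L_i(beta) - sum_i L_i(beta)| <= eps * sum_i L_i(beta).
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.uniform(0, 1, size=(n, 2))])  # intercept + 2 covariates
beta_true = np.array([1.0, 0.5, 0.3])                              # keeps x_i^T beta > 0
y = rng.poisson(X @ beta_true)

# Crude uniform subsample (NOT the paper's construction), just to illustrate the guarantee.
S = rng.choice(n, size=100, replace=False)
w = np.full(100, n / 100.0)
beta = np.array([0.9, 0.6, 0.2])
print(poisson_loss(beta, X, y).sum(), (w * poisson_loss(beta, X[S], y[S])).sum())
```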
Gradient descent is one of the most widely used iterative algorithms in modern statistical learning. However, its precise algorithmic dynamics in high-dimensional settings remain only partially understood, which has limited its broader potential for statistical inference applications. This paper provides a precise, non-asymptotic distributional characterization of gradient descent iterates in a broad class of empirical risk minimization problems, in the so-called mean-field regime where the sample size is proportional to the signal dimension. Our non-asymptotic state evolution theory holds for both general non-convex loss functions and non-Gaussian data, and reveals the central role of two Onsager correction matrices that precisely characterize the non-trivial dependence among all gradient descent iterates in the mean-field regime. Although the Onsager correction matrices are typically analytically intractable, our state evolution theory facilitates a generic gradient descent inference algorithm that consistently estimates these matrices across a broad class of models. Leveraging this algorithm, we show that the state evolution can be inverted to construct (i) data-driven estimators for the generalization error of gradient descent iterates and (ii) debiased gradient descent iterates for inference of the unknown signal. Detailed applications to two canonical models--linear regression and (generalized) logistic regression--are worked out to illustrate model-specific features of our general theory and inference methods.
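For concreteness, the following minimal sketch generates the plain gradient descent iterates for logistic-regression empirical risk minimization in the proportional regime described above; it does not implement the state evolution, the Onsager correction matrices, or the debiasing procedure, which are the paper's contributions. All names and parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 500                                  # proportional (mean-field) regime: n comparable to p
X = rng.normal(size=(n, p))
theta_star = rng.normal(size=p) / np.sqrt(p)      # unknown signal, norm roughly 1
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star)))

def grad(theta):
    """Gradient of the logistic-regression empirical risk (averaged over samples)."""
    return X.T @ (1.0 / (1.0 + np.exp(-X @ theta)) - y) / n

eta, T = 0.5, 50
theta = np.zeros(p)
iterates = [theta.copy()]                         # theta_0, theta_1, ..., theta_T
for _ in range(T):
    theta = theta - eta * grad(theta)
    iterates.append(theta.copy())
```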
Runge-Kutta methods hold an irreplaceable position among numerical methods designed to solve ordinary differential equations. Implicit methods, in particular, are suitable for approximating solutions of stiff initial value problems. We propose a new way of deriving the coefficients of implicit Runge-Kutta methods. This approach, based on repeated integrals, yields both new and well-known Butcher tableaux. We discuss the properties of the newly derived methods and compare them with standard collocation implicit Runge-Kutta methods in a series of numerical experiments, including the Prothero-Robinson problem.
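As a point of reference, here is a minimal sketch of the simplest implicit Runge-Kutta method (the one-stage Gauss / implicit midpoint rule) applied to the Prothero-Robinson problem mentioned above; this is a standard method, not one of the newly derived tableaux, and the parameter choices are illustrative.

```python
import numpy as np

# Prothero-Robinson problem:  y' = lam*(y - phi(t)) + phi'(t),  y(0) = phi(0),
# exact solution y(t) = phi(t);  lam << 0 makes the problem stiff.
lam = -1e4
phi, dphi = np.cos, lambda t: -np.sin(t)

def implicit_midpoint(t0, t1, N):
    """One-stage Gauss implicit RK method (implicit midpoint; Butcher tableau
    c = 1/2, a = 1/2, b = 1).  The stage equation is solved in closed form
    because the right-hand side is affine in y; a general implementation
    would use Newton's method."""
    h = (t1 - t0) / N
    t, y = t0, phi(t0)
    for _ in range(N):
        tm = t + h / 2
        # y_{n+1} = y_n + h*f(tm, (y_n + y_{n+1})/2), solved for y_{n+1}:
        y = ((1 + h * lam / 2) * y + h * (dphi(tm) - lam * phi(tm))) / (1 - h * lam / 2)
        t += h
    return y

print(implicit_midpoint(0.0, 2.0, 40), phi(2.0))   # stays close to the exact solution despite stiffness
```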
We consider the problem of model selection when a grouping structure is inherent within the regressors. Using a Bayesian approach, we model the mean vector by a one-group global-local shrinkage prior belonging to a broad class of such priors that includes the horseshoe prior. In the context of variable selection, this class of priors was studied by Tang et al. (2018). We propose a modified form of the usual class of global-local shrinkage priors with polynomial tails on the group regression coefficients. The resulting thresholding rule selects a group as active if, within that group, the ratio of the $L_2$ norm of the posterior mean of its group coefficient to that of the corresponding ordinary least squares group estimate exceeds one half. In the theoretical part of this article, we use the global shrinkage parameter either as a tuning parameter or replace it by an empirical Bayes estimate, depending on the knowledge available about the underlying sparsity of the model. When the proportion of active groups is known, using the global shrinkage parameter $\tau$ as a tuning parameter, we prove that our method enjoys the oracle property. When this proportion is unknown, we propose an empirical Bayes estimate of $\tau$. Even when this empirical Bayes estimate is used, our half-thresholding rule captures the truly important groups and simultaneously attains the optimal estimation rate for the group coefficients. Although our theoretical results rely on a special form of the design matrix, our simulation results show that, for general design matrices as well, the half-thresholding rule yields results similar to those of Yang and Narisetty (2020). As a consequence, in a high-dimensional sparse group selection problem, instead of using the so-called `gold standard' spike-and-slab prior, one can use one-group global-local shrinkage priors with polynomial tails to obtain similar results.
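A minimal sketch of the half-thresholding selection rule described above, assuming the posterior means under the group global-local shrinkage prior and the ordinary least squares group estimates are already available (the fitting step is not shown); all names and numbers are illustrative.

```python
import numpy as np

def half_thresholding(posterior_means, ols_estimates):
    """Select group g if ||E[beta_g | data]||_2 / ||beta_g^OLS||_2 > 1/2.

    posterior_means, ols_estimates: dicts mapping group name -> coefficient vector.
    The posterior means would come from a fit under a one-group global-local
    shrinkage prior (e.g. a group horseshoe); here they are simply inputs.
    """
    selected = []
    for g in posterior_means:
        ratio = np.linalg.norm(posterior_means[g]) / np.linalg.norm(ols_estimates[g])
        if ratio > 0.5:
            selected.append(g)
    return selected

# illustrative inputs only
post = {"g1": np.array([0.9, 1.1]), "g2": np.array([0.01, -0.02])}
ols = {"g1": np.array([1.0, 1.2]), "g2": np.array([0.4, -0.5])}
print(half_thresholding(post, ols))   # -> ['g1']
```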
To determine a sparse approximation function that has a direct metric relationship with the $\ell_{0}$ quasi-norm, we introduce in this paper a triangle whose sides are composed of $\Vert \mathbf{x} \Vert_{0}$, $\Vert \mathbf{x} \Vert_{1}$ and $\Vert \mathbf{x} \Vert_{\infty}$ for any non-zero vector $\mathbf{x} \in \mathbb{R}^{n}$, obtained by delving into the iterative soft-thresholding operator. Based on this triangle, we derive the ratio of the $\ell_{1}$ and $\ell_{\infty}$ norms as a sparsity-promoting objective function for sparse signal reconstruction and also provide an interval for the sparsity of the signal. Considering the $\ell_{1}/\ell_{\infty}$ minimization from the angle $\beta$ of the triangle corresponding to the side whose length is $\Vert \mathbf{x} \Vert_{\infty} - \Vert \mathbf{x} \Vert_{1}/\Vert \mathbf{x} \Vert_{0}$, we finally demonstrate the performance of an existing $\ell_{1}/\ell_{\infty}$ algorithm by comparing it with an $\ell_{1}/\ell_{2}$ algorithm.
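For intuition, a minimal numerical sketch of the $\ell_{1}/\ell_{\infty}$ ratio as a sparsity proxy; it relies only on the standard inequalities $\Vert \mathbf{x} \Vert_{\infty} \le \Vert \mathbf{x} \Vert_{1} \le \Vert \mathbf{x} \Vert_{0}\,\Vert \mathbf{x} \Vert_{\infty}$ and is not the paper's triangle construction or reconstruction algorithm.

```python
import numpy as np

def l1_over_linf(x):
    """Sparsity-promoting ratio ||x||_1 / ||x||_inf.

    Since ||x||_inf <= ||x||_1 <= ||x||_0 * ||x||_inf for any nonzero x,
    this ratio lies in [1, ||x||_0] and lower-bounds the sparsity level ||x||_0.
    """
    x = np.asarray(x, dtype=float)
    return np.abs(x).sum() / np.abs(x).max()

x_sparse = np.array([0.0, 3.0, 0.0, 0.0, -1.0, 0.0])
x_dense = np.array([1.0, -1.2, 0.9, 1.1, -0.8, 1.0])
print(l1_over_linf(x_sparse), np.count_nonzero(x_sparse))   # ratio ~1.33 lower-bounds ||x||_0 = 2
print(l1_over_linf(x_dense), np.count_nonzero(x_dense))     # ratio 5.0 lower-bounds ||x||_0 = 6
```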
An extremely schematic model of the forces acting on a sailing yacht equipped with a system of foils is presented and discussed here. The role of the foils is to raise the hull from the water in order to reduce the total resistance and thus increase the speed. CFD simulations provide the total resistance of the bare hull at several values of speed and displacement, as well as the characteristics (drag and lift coefficients) of the 2D foil sections used for the appendages. A parametric study has been performed to characterize a foil of finite dimensions. The equilibrium of the vertical forces and longitudinal moments, together with a reduced displacement, is obtained by controlling the pitch angle of the foils. The total resistance of the yacht with foils is then compared with the case without foils, highlighting the speed regime, if any, in which an advantage is obtained.
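A schematic statement of the equilibrium conditions referred to above, assuming for illustration a two-foil arrangement and generic notation (not the paper's): the residual buoyancy of the partially raised hull plus the foil lift balances the weight, and the longitudinal moments about the centre of gravity cancel, with the lift depending on the pitch angle $\alpha$.

```latex
% W = weight, \Delta(h) = residual buoyancy of the partially raised hull,
% L_f, L_r = lift of the front and rear foils, x_* = longitudinal lever arms
% about the centre of gravity, \alpha = pitch angle of the foils.
\begin{aligned}
  \Delta(h) + L_f(\alpha) + L_r(\alpha) &= W, \\
  x_\Delta\,\Delta(h) + x_f\,L_f(\alpha) + x_r\,L_r(\alpha) &= 0 .
\end{aligned}
```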
We present a theory and an objective function for similarity-based hierarchical clustering of probabilistic partial orders and directed acyclic graphs (DAGs). Specifically, given elements $x \le y$ in the partial order, and their respective clusters $[x]$ and $[y]$, the theory yields an order relation $\le'$ on the clusters such that $[x]\le'[y]$. The theory provides a concise definition of order-preserving hierarchical clustering, and offers a classification theorem identifying the order-preserving trees (dendrograms). To determine the optimal order-preserving trees, we develop an objective function that frames the problem as a bi-objective optimisation, aiming to satisfy both the order relation and the similarity measure. We prove that the optimal trees under the objective are both order-preserving and exhibit high-quality hierarchical clustering. Since finding an optimal solution is NP-hard, we introduce a polynomial-time approximation algorithm and demonstrate that the method outperforms existing methods for order-preserving hierarchical clustering by a significant margin.
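As a minimal illustration of the order-theoretic requirement (not the paper's hierarchical algorithm or objective), the sketch below builds the induced relation on clusters from a DAG, so that $x \le y$ forces $[x] \le' [y]$, and checks whether the resulting quotient is still acyclic; the cluster names and edges are made up.

```python
def quotient_relation(edges, cluster_of):
    """Induced relation on clusters: [x] <=' [y] whenever x <= y in the DAG."""
    rel = set()
    for x, y in edges:                      # edge x -> y encodes x <= y
        cx, cy = cluster_of[x], cluster_of[y]
        if cx != cy:
            rel.add((cx, cy))
    return rel

def is_acyclic(nodes, rel):
    """DFS check that the quotient relation has no directed cycle,
    i.e. the clustering respects the partial order."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in nodes}
    adj = {v: [w for (u, w) in rel if u == v] for v in nodes}

    def dfs(v):
        colour[v] = GREY
        for w in adj[v]:
            if colour[w] == GREY or (colour[w] == WHITE and not dfs(w)):
                return False
        colour[v] = BLACK
        return True

    return all(dfs(v) for v in nodes if colour[v] == WHITE)

edges = [("a", "b"), ("b", "c"), ("c", "d")]
good = {"a": "C1", "b": "C1", "c": "C2", "d": "C2"}   # order-preserving clustering
bad = {"a": "C1", "b": "C2", "c": "C1", "d": "C2"}    # induces a cycle C1 <-> C2
print(is_acyclic({"C1", "C2"}, quotient_relation(edges, good)),
      is_acyclic({"C1", "C2"}, quotient_relation(edges, bad)))   # True False
```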
Stochastic gradient descent (SGD) is a workhorse algorithm for solving large-scale optimization problems in data science and machine learning. Understanding the convergence of SGD is hence of fundamental importance. In this work we examine the convergence of SGD (with various step sizes) when applied to unconstrained convex quadratic programming (essentially least-squares (LS) problems), and in particular analyze the error components with respect to the eigenvectors of the Hessian. The main message is that the convergence depends largely on the corresponding eigenvalues (singular values of the coefficient matrix in the LS context): the components associated with the large singular values converge faster in the initial phase. We then show that there is a phase transition in the convergence, after which the convergence speed of the components, especially those corresponding to the larger singular values, decreases. Finally, we show that the convergence of the overall error (in the solution) tends to slow down as more iterations are run; that is, the initial convergence is faster than the asymptotic rate.
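A minimal sketch of the kind of experiment suggested by the abstract, with illustrative sizes and step size: run single-sample SGD on a synthetic least-squares problem and track the error components in the right-singular-vector basis of the coefficient matrix. It is not the paper's analysis, only a way to observe the per-component behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
A = rng.normal(size=(n, d)) @ np.diag(np.logspace(0, -2, d))   # spread of singular values
x_star = rng.normal(size=d)
b = A @ x_star                                                  # noiseless LS problem

U, s, Vt = np.linalg.svd(A, full_matrices=False)

x = np.zeros(d)
eta = 0.01                          # fixed step size, small enough for stability here
history = []
for t in range(2000):
    i = rng.integers(n)                         # single-sample stochastic gradient
    x -= eta * (A[i] @ x - b[i]) * A[i]
    if t % 100 == 0:
        history.append(np.abs(Vt @ (x - x_star)))   # error components per singular direction

history = np.array(history)
print(history[:5, 0])    # component for the largest singular value: shrinks quickly early on
print(history[:5, -1])   # component for the smallest singular value: barely moves
```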
In-vitro dissolution testing is a critical component of quality control for manufactured drug products. The $\mathrm{f}_2$ statistic is the standard for assessing similarity between two dissolution profiles. However, the $\mathrm{f}_2$ statistic has known limitations: it lacks an uncertainty estimate, is restricted to discrete time points, and is a biased measure, since it computes the differences between profiles only at those discrete time points. To address these limitations, we propose a Gaussian Process (GP) with a dissolution spline kernel for dissolution profile comparison. The dissolution spline kernel is a new spline kernel that uses logistic functions as its basis functions, enabling the GP to capture the expected monotonic increase of dissolution curves and thereby yielding better predictions of dissolution curves. The new GP model reduces bias in the $\mathrm{f}_2$ calculation by allowing predictions to be interpolated in time between observed values, and it provides uncertainty quantification. We assess the model's performance through simulations and real datasets, demonstrating its improvement over a previous GP-based model introduced for dissolution testing. We also show that the new model can be adapted to include dissolution-specific covariates. Applying the model to real ibuprofen dissolution data under various conditions, we demonstrate its ability to extrapolate curve shapes across different experimental settings.
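For reference, the standard $\mathrm{f}_2$ similarity factor computed from two profiles sampled at the same time points (the quantity whose limitations are discussed above); the numbers are illustrative, and the GP model itself is not sketched here.

```python
import numpy as np

def f2(reference, test):
    """Standard f2 similarity factor between two dissolution profiles sampled
    at the same n time points (values in percent dissolved).
    f2 >= 50 is conventionally read as 'similar'."""
    R, T = np.asarray(reference, float), np.asarray(test, float)
    msd = np.mean((R - T) ** 2)                  # mean squared difference over time points
    return 50.0 * np.log10(100.0 / np.sqrt(1.0 + msd))

ref = [18, 40, 62, 79, 91, 97]    # illustrative % dissolved at 6 time points
tst = [15, 36, 57, 75, 88, 95]
print(f2(ref, tst))               # ~71, so the profiles would be declared similar
```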
We propose to quantify dependence between two systems $X$ and $Y$ in a dataset $D$ based on the Bayesian comparison of two models: one, $H_0$, of statistical independence and the other, $H_1$, of dependence. In this framework, dependence between $X$ and $Y$ in $D$, denoted $B(X,Y|D)$, is quantified as $P(H_1|D)$, the posterior probability for the model of dependence given $D$, or any strictly increasing function thereof. It is therefore a measure of the evidence for dependence between $X$ and $Y$ as modeled by $H_1$ and observed in $D$. We review several statistical models and reconsider standard results in the light of $B(X,Y|D)$ as a measure of dependence. Using simulations, we focus on two specific issues: the effect of noise and the behavior of $B(X,Y|D)$ when $H_1$ has a parameter coding for the intensity of dependence. We then derive some general properties of $B(X,Y|D)$, showing that it quantifies the information contained in $D$ in favor of $H_1$ versus $H_0$. While some of these properties are typical of what is expected from a valid measure of dependence, others are novel and naturally appear as desired features for specific measures of dependence, which we call inferential. We finally put these results in perspective; in particular, we discuss the consequences of using the Bayesian framework as well as the similarities and differences between $B(X,Y|D)$ and mutual information.
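A minimal sketch of $B(X,Y|D) = P(H_1|D)$ for the simplest case of two binary variables, using conjugate Dirichlet priors and equal prior model probabilities; this particular choice of $H_0$ and $H_1$ is an illustration only, not the general framework developed in the paper.

```python
import numpy as np
from math import lgamma

def log_dirichlet_beta(alpha):
    """log of the multivariate beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def posterior_prob_dependence(counts, prior_h1=0.5):
    """B(X,Y|D) = P(H1|D) for two binary variables observed jointly.

    counts: 2x2 array of cell counts.
    H0: X and Y independent, with uniform (Dirichlet(1,1)) priors on both marginals.
    H1: unrestricted joint distribution with a Dirichlet(1,1,1,1) prior.
    This specific model pair is an illustrative choice only.
    """
    n = np.asarray(counts, float)
    joint, rows, cols = n.ravel(), n.sum(axis=1), n.sum(axis=0)
    log_m1 = log_dirichlet_beta(joint + 1) - log_dirichlet_beta(np.ones(4))
    log_m0 = (log_dirichlet_beta(rows + 1) - log_dirichlet_beta(np.ones(2))
              + log_dirichlet_beta(cols + 1) - log_dirichlet_beta(np.ones(2)))
    log_odds = np.log(prior_h1) - np.log(1 - prior_h1) + log_m1 - log_m0
    return 1.0 / (1.0 + np.exp(-log_odds))

print(posterior_prob_dependence([[30, 5], [5, 30]]))    # strong evidence for dependence
print(posterior_prob_dependence([[18, 17], [17, 18]]))  # little evidence for dependence
```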
We extend several celebrated methods in classical analysis for summing series of complex numbers to series of complex matrices. These include the summation methods of Abel, Borel, Ces\'aro, Euler, Lambert, N\"orlund, and Mittag-Leffler, which are frequently used to sum scalar series that are divergent in the conventional sense. One feature of our matrix extensions is that they are fully noncommutative generalizations of their scalar counterparts -- not only is the scalar series replaced by a matrix series, positive weights are replaced by positive definite matrix weights, order on $\mathbb{R}$ replaced by Loewner order, exponential function replaced by matrix exponential function, etc. We will establish the regularity of our matrix summation methods, i.e., when applied to a matrix series convergent in the conventional sense, we obtain the same value for the sum. Our second goal is to provide numerical algorithms that work in conjunction with these summation methods. We discuss how the block and mixed-block summation algorithms, as well as the Kahan compensated summation algorithm, may be applied to matrix sums with similar roundoff error bounds. These summation methods and algorithms apply not only to power or Taylor series of matrices but also to general matrix series, including matrix Fourier and Dirichlet series. We will demonstrate the utility of these summation methods: establishing a Fej\'{e}r-type theorem and alleviating the Gibbs phenomenon for matrix Fourier series; extending the domains of matrix functions and accurately evaluating them; enhancing the matrix Pad\'e approximation and Schur--Parlett algorithms; and more.
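A minimal sketch of Kahan compensated summation applied entrywise to a matrix series (here, the partial sums of a matrix exponential series); this is a generic illustration, not the paper's block or mixed-block algorithms.

```python
import numpy as np

def kahan_matrix_sum(terms):
    """Compensated (Kahan) summation of a sequence of equally sized matrices.

    The compensation matrix c carries, entrywise, the low-order bits lost
    when each term is added to the running sum."""
    terms = iter(terms)
    s = next(terms).astype(np.float64).copy()
    c = np.zeros_like(s)
    for a in terms:
        y = a - c                # correct the next term by the stored error
        t = s + y                # low-order bits of y may be lost here
        c = (t - s) - y          # recover what was lost ...
        s = t                    # ... and carry it into the next step
    return s

# terms of the matrix exponential series exp(A) = sum_k A^k / k! for a small matrix
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
terms, term = [], np.eye(2)
for k in range(1, 30):
    terms.append(term.copy())
    term = term @ A / k
print(kahan_matrix_sum(terms))   # close to exp(A) = [[cos 1, sin 1], [-sin 1, cos 1]]
```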