Kronecker regression is a highly structured least squares problem $\min_{\mathbf{x}} \lVert \mathbf{K}\mathbf{x} - \mathbf{b} \rVert_{2}^2$, where the design matrix $\mathbf{K} = \mathbf{A}^{(1)} \otimes \cdots \otimes \mathbf{A}^{(N)}$ is a Kronecker product of factor matrices. This regression problem arises in each step of the widely used alternating least squares (ALS) algorithm for computing the Tucker decomposition of a tensor. We present the first subquadratic-time algorithm for solving Kronecker regression to a $(1+\varepsilon)$-approximation that avoids the exponential term $O(\varepsilon^{-N})$ in the running time. Our techniques combine leverage score sampling and iterative methods. By extending our approach to block-design matrices where one block is a Kronecker product, we also achieve subquadratic-time algorithms for (1) Kronecker ridge regression and (2) updating the factor matrices of a Tucker decomposition in ALS, which is not a pure Kronecker regression problem, thereby improving the running time of all steps of Tucker ALS. We demonstrate the speed and accuracy of this Kronecker regression algorithm on synthetic data and real-world image tensors.
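For context, the structure that makes fast algorithms possible is that $\mathbf{K}$ never needs to be formed explicitly: since the pseudoinverse of a Kronecker product is the Kronecker product of the pseudoinverses, the exact least-squares solution is $\mathbf{x} = (\mathbf{A}^{(1)\dagger} \otimes \cdots \otimes \mathbf{A}^{(N)\dagger})\,\mathbf{b}$, applied one factor at a time. A minimal NumPy sketch of this classical factor-wise solve (the dense baseline that sampling-based methods accelerate, not the paper's algorithm; `kron_lstsq` is our own illustrative name):

```python
import numpy as np

def kron_lstsq(factors, b):
    """Solve min_x ||(A1 kron ... kron AN) x - b||_2 without forming the
    Kronecker product, via x = (A1^+ kron ... kron AN^+) b.
    factors: list of (m_i x n_i) arrays; b: vector of length prod(m_i)."""
    T = b.reshape([A.shape[0] for A in factors])  # row-major matches kron order
    for i, A in enumerate(factors):
        # Apply A_i^+ along mode i of the tensorized right-hand side.
        T = np.moveaxis(np.tensordot(np.linalg.pinv(A), T, axes=(1, i)), 0, i)
    return T.ravel()
```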
This paper presents the application of a new semi-analytical method of linear regression for Poisson count data to COVID-19 events. The regression is based on the Bonamente and Spence (2022) maximum-likelihood solution for the best-fit parameters, and this paper introduces a simple analytical solution for the covariance matrix that completes the problem of linear regression with Poisson data. The analytical nature of both the parameter estimates and their covariance matrix is made possible by a convenient factorization of the linear model proposed by J. Scargle (2013). The method makes use of the asymptotic properties of the Fisher information matrix, whose inverse provides the covariance matrix. The combination of simple analytical methods to obtain both the maximum-likelihood estimates of the parameters and their covariance matrix constitutes a new and convenient method for the linear regression of Poisson-distributed count data, which are of common occurrence across a variety of fields. A comparison between this new linear regression method and two alternative methods often used for the regression of count data -- ordinary least-squares regression and $\chi^2$ regression -- is provided through the application of these methods to the analysis of recent COVID-19 count data. The paper also discusses the relative advantages and disadvantages among these methods for the linear regression of Poisson count data.
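As an illustration of the covariance step, consider a generic Poisson linear model with identity link, $\mu_i = a + b\,t_i$ (our parametrization for this sketch; the paper works with Scargle's factorized form). The asymptotic covariance of $(a, b)$ is the inverse of the Fisher information matrix $I_{jk} = \sum_i \mu_i^{-1} (\partial \mu_i/\partial \theta_j)(\partial \mu_i/\partial \theta_k)$:

```python
import numpy as np

def poisson_linear_covariance(a, b, t):
    """Asymptotic covariance of the best-fit (a, b) in the Poisson linear
    model mu_i = a + b*t_i, via the inverse Fisher information matrix."""
    mu = a + b * t                               # model means (must be > 0)
    J = np.column_stack([np.ones_like(t), t])    # d mu_i / d (a, b)
    I = J.T @ (J / mu[:, None])                  # 2x2 Fisher information
    return np.linalg.inv(I)                      # covariance of (a, b)
```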
In this paper, we investigate the computational power of threshold circuits and other theoretical models of neural networks in terms of the following four complexity measures: size (the number of gates), depth, weight and energy. Here the energy complexity of a circuit measures the sparsity of its computation, and is defined as the maximum number of gates outputting non-zero values taken over all the input assignments. As our main result, we prove that any threshold circuit $C$ of size $s$, depth $d$, energy $e$ and weight $w$ satisfies $\log (rk(M_C)) \le ed (\log s + \log w + \log n)$, where $rk(M_C)$ is the rank of the communication matrix $M_C$ of a $2n$-variable Boolean function that $C$ computes. Thus, such a threshold circuit $C$ is able to compute only a Boolean function whose communication matrix has rank bounded exponentially in a product of logarithmic factors of $s,w$ and linear factors of $d,e$. This implies an exponential lower bound on the size of even a sublinear-depth threshold circuit if energy and weight are sufficiently small. For other models of neural networks such as discretized ReLU circuits and discretized sigmoid circuits, we prove that a similar inequality also holds for a discretized circuit $C$: $\log rk(M_C) = O(ed(\log s + \log w + \log n)^3)$.
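To spell out the implication (a sketch, assuming a target function whose communication matrix has rank $2^{\Omega(n)}$, such as set disjointness): rearranging the main inequality gives
\[
\log s \;\ge\; \frac{\log rk(M_C)}{ed} - \log w - \log n,
\]
so with $\log rk(M_C) = \Omega(n)$, weight $w = \mathrm{poly}(n)$, energy $e = O(1)$, and sublinear depth $d = O(n^{1-\delta})$ for a constant $\delta > 0$, any such circuit needs size $s = 2^{\Omega(n^{\delta})}$.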
We study the design of embeddings into Euclidean space with outliers. Given a metric space $(X,d)$ and an integer $k$, the goal is to embed all but $k$ points in $X$ (called the "outliers") into $\ell_2$ with the smallest possible distortion $c$. Both finding the optimal distortion $c$ for a given outlier set size $k$ and finding the smallest $k$ for a given target distortion $c$ are NP-hard problems. In fact, it is UGC-hard to approximate $k$ to within a factor smaller than $2$ even when the metric sans outliers is isometrically embeddable into $\ell_2$. We consider bi-criteria approximations. Our main result is a polynomial time algorithm that approximates the outlier set size to within an $O(\log^4 k)$ factor and the distortion to within a constant factor. The main technical component in our result is an approach for constructing a composition of two given embeddings from subsets of $X$ into $\ell_2$ which inherits the distortions of each to within small multiplicative factors. Specifically, given a low $c_S$ distortion embedding from $S\subset X$ into $\ell_2$ and a high(er) $c_X$ distortion embedding from the entire set $X$ into $\ell_2$, we construct a single embedding that achieves the same distortion $c_S$ over pairs of points in $S$ and an expansion of at most $O(\log k)\cdot c_X$ over the remaining pairs of points, where $k=|X\setminus S|$. Our composition theorem extends to embeddings into arbitrary $\ell_p$ metrics for $p\ge 1$, and may be of independent interest. While unions of embeddings over disjoint sets have been studied previously, to our knowledge, this is the first work to consider compositions of nested embeddings.
We consider the problem of low-rank rectangular matrix completion in the regime where the matrix $M$ of size $n\times m$ is ``long'', i.e., the aspect ratio $m/n$ diverges to infinity. Such matrices are of particular interest in the study of tensor completion, where they arise from the unfolding of a low-rank tensor. In the case where the sampling probability is $\frac{d}{\sqrt{mn}}$, we propose a new spectral algorithm for recovering the singular values and left singular vectors of the original matrix $M$ based on a variant of the standard non-backtracking operator of a suitably defined bipartite weighted random graph, which we call a \textit{non-backtracking wedge operator}. When $d$ is above a Kesten-Stigum-type sampling threshold, our algorithm recovers a correlated version of the singular value decomposition of $M$ with quantifiable error bounds. This is the first result in the regime of bounded $d$ for weak recovery and the first for weak consistency when $d\to\infty$ arbitrarily slowly without any polylog factors. As an application, for low-rank orthogonal $k$-tensor completion, we efficiently achieve weak recovery with sample size $O(n^{k/2})$, and weak consistency with sample size $\omega(n^{k/2})$.
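For orientation, the simplest spectral approach that the non-backtracking construction refines is the rescaled truncated SVD of the observed matrix; in the sparse regime $p = \frac{d}{\sqrt{mn}}$ with bounded $d$, this baseline is dominated by high-degree rows and columns, which is what the wedge operator avoids. A hedged NumPy sketch of that baseline only (not the paper's algorithm):

```python
import numpy as np

def rescaled_svd_baseline(M_obs, mask, r):
    """Rank-r spectral estimate of M from entries observed where mask == 1:
    rescale by the empirical sampling rate, then truncate the SVD."""
    p_hat = mask.mean()                       # empirical sampling probability
    U, s, Vt = np.linalg.svd(M_obs / p_hat, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]        # rank-r reconstruction
```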
For many applications involving a sequence of linear systems with slowly changing system matrices, subspace recycling, which exploits relationships among systems and reuses search space information, can achieve huge gains in iterations across the total number of linear system solves in the sequence. However, for general (i.e., non-identity) shifted systems with the shift value varying over a wide range, the properties of the linear systems vary widely as well, which makes recycling less effective. If such a sequence of systems is embedded in a nonlinear iteration, the problem is compounded, and special approaches are needed to use recycling effectively. In this paper, we develop new, more efficient, Krylov subspace recycling approaches for large-scale image reconstruction and restoration techniques that employ a nonlinear iteration to compute a suitable regularization matrix. For each new regularization matrix, we need to solve regularized linear systems with coefficient matrices ${\bf A} + \gamma_\ell {\bf E}_k$ for a sequence of regularization parameters $\gamma_\ell$, to find the optimally regularized solution that, in turn, will be used to update the regularization matrix. We analyze system and solution characteristics to choose appropriate techniques to solve each system rapidly. Specifically, we use an inner-outer recycling approach with a larger, principal recycle space for each nonlinear step and smaller recycle spaces for each shift. We propose an efficient way to obtain good initial guesses from the principal recycle space and the smaller shift-specific recycle spaces that lead to fast convergence. Our method substantially reduces the total number of matrix-vector products that would arise in a naive approach. Our approach is more generally applicable to sequences of shifted systems where the matrices in the sum are positive semi-definite.
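A minimal sketch of the projected (Galerkin) initial guess drawn from a recycle space, assuming an $n \times k$ orthonormal basis `U` is already available; the paper's inner-outer scheme with shift-specific spaces elaborates on this idea:

```python
import numpy as np
from scipy.sparse.linalg import cg

def recycled_shifted_solves(A, E, gammas, b, U):
    """For each shift gamma, form a Galerkin initial guess from the recycle
    space U (n x k, orthonormal columns): solve the small projected system
    U^T (A + gamma E) U y = U^T b, set x0 = U y, then refine with CG."""
    AU, EU = A @ U, E @ U
    Ub = U.T @ b
    solutions = []
    for g in gammas:
        H = U.T @ (AU + g * EU)              # k x k projected matrix
        x0 = U @ np.linalg.solve(H, Ub)      # initial guess from recycling
        x, info = cg(A + g * E, b, x0=x0)    # refine on the full system
        solutions.append(x)
    return solutions
```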
We consider the classical problem of heteroscedastic linear regression, where we are given $n$ samples $(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ obtained from $y_i = \langle \mathbf{w}^{*}, \mathbf{x}_i \rangle + \epsilon_i \cdot \langle \mathbf{f}^{*}, \mathbf{x}_i \rangle$, where $\mathbf{x}_i \sim N(0,\mathbf{I})$, $\epsilon_i \sim N(0,1)$, and our task is to estimate $\mathbf{w}^{*}$. In addition to the classical applications of heteroscedastic models in fields such as statistics, econometrics, and time series analysis, heteroscedastic regression is also particularly relevant in machine learning when data is collected from multiple sources of varying but a priori unknown quality, e.g., large model training. Our work shows that we can estimate $\mathbf{w}^{*}$ in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^{*}\|^2 \cdot \left(\frac{1}{n} + \left(\frac{d}{n}\right)^2\right)\right)$ and prove a matching lower bound (up to logarithmic factors). Our result substantially improves upon the previous best known upper bound of $\tilde{O}\left(\|\mathbf{f}^{*}\|^2\cdot \frac{d}{n}\right)$. Our upper bound result is based on a novel analysis of a simple, classical heuristic going back to at least Davidian and Carroll (1987) and constitutes the first non-asymptotic convergence guarantee for this approach. As a byproduct, our analysis also provides improved rates of estimation for both linear regression and phase retrieval with multiplicative noise, which may be of independent interest. The lower bound result relies on a careful application of Le Cam's two-point method, adapted to work with heavy tailed random variables where the relevant mutual information quantities are infinite (precluding a direct application of Le Cam's method), and could also be of broader interest.
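To make the flavor of such a plug-in heuristic concrete, here is an illustrative NumPy sketch (our simplification, not necessarily the exact estimator analyzed in the paper): OLS for $\mathbf{w}$, a spectral estimate of $\mathbf{f}$ from squared residuals via the identity $\mathbb{E}[\langle \mathbf{f}, \mathbf{x}\rangle^2 \mathbf{x}\mathbf{x}^\top] = \|\mathbf{f}\|^2 \mathbf{I} + 2\mathbf{f}\mathbf{f}^\top$ for Gaussian designs, then a weighted least squares refit:

```python
import numpy as np

def heteroscedastic_fit(X, y, iters=2):
    """Alternating plug-in heuristic for y_i = <w,x_i> + eps_i * <f,x_i>."""
    n, d = X.shape
    w = np.linalg.lstsq(X, y, rcond=None)[0]           # (1) OLS initializer
    for _ in range(iters):
        r2 = (y - X @ w) ** 2                          # squared residuals
        M = (X * r2[:, None]).T @ X / n                # ~ ||f||^2 I + 2 f f^T
        evals, evecs = np.linalg.eigh(M)
        f = np.sqrt(r2.mean()) * evecs[:, -1]          # (2) direction + scale
        s = np.maximum((X @ f) ** 2, 1e-6 * r2.mean()) # per-sample variances
        Xs = X / s[:, None]
        w = np.linalg.solve(Xs.T @ X, Xs.T @ y)        # (3) WLS refit
    return w
```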
Discretization of the uniform norm of functions from a given finite-dimensional subspace of continuous functions is studied. Previously known results show that for any $N$-dimensional subspace of the space of continuous functions it is sufficient to use $e^{CN}$ sample points to obtain an accurate upper bound on the uniform norm by the discrete norm, and that one cannot improve on the exponential growth of the number of sampling points for a good discretization theorem in the uniform norm. In this paper we focus on two types of results, which allow us to obtain good discretization of the uniform norm with a number of points polynomial in $N$. First, we weaken the discretization inequality by allowing a bound of the uniform norm by the discrete norm multiplied by an extra factor, which may depend on $N$. Second, we impose restrictions on the finite-dimensional subspace under consideration. In particular, we prove a general result, which connects the upper bound on the number of sampling points in the discretization theorem for the uniform norm with the best $m$-term bilinear approximation of the Dirichlet kernel associated with the given subspace.
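In generic notation (a sketch; $X_N$ denotes the subspace and $\xi^1,\dots,\xi^m$ the sample points), the discretization inequality in question reads
\[
\|f\|_\infty \;\le\; C \max_{1\le j\le m} |f(\xi^j)| \qquad \text{for all } f \in X_N,
\]
and the first relaxation above allows the constant $C$ to be replaced by a factor $R(N)$ growing with $N$, in exchange for $m = \mathrm{poly}(N)$ sample points.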
We derive optimality conditions for the optimum sample allocation problem, formulated as the determination of the fixed strata sample sizes that minimize the total cost of the survey, under an assumed level of the variance of the stratified estimator and one-sided upper bounds imposed on sample sizes in strata. In this context, we assume that the variance function is of a generic form that includes, as a special case, the stratified $\pi$ estimator of the population total under stratified simple random sampling without replacement. The optimality conditions mentioned above are derived with the use of convex optimization theory and the Karush-Kuhn-Tucker conditions. Based on the established optimality conditions, we give a formal proof of the existing procedure, termed here LRNA, that solves the allocation problem considered. We formulate the LRNA in such a way that it also provides the solution to the classical optimum allocation problem (i.e. minimization of the estimator's variance under fixed total cost) under one-sided lower bounds imposed on sample sizes in strata. From this standpoint, the LRNA can be considered a counterpart to the popular recursive Neyman allocation procedure that is used to solve the classical problem of optimum sample allocation, but with one-sided upper bounds. A ready-to-use R implementation of the LRNA is available through our package stratallo, which is published on the Comprehensive R Archive Network (CRAN) package repository.
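For comparison, a sketch of the recursive Neyman idea mentioned above, for the classical problem with one-sided upper bounds $M_h$ on the stratum sample sizes (our illustrative Python implementation, not the stratallo code; allocations are left fractional, with rounding treated as a separate step):

```python
import numpy as np

def recursive_neyman(n, N, S, M):
    """Minimize the stratified-estimator variance at total sample size n:
    allocate n_h proportionally to N_h * S_h, fix any stratum that exceeds
    its upper bound M_h at that bound, and re-solve on the remaining strata."""
    assert n <= M.sum(), "total sample size exceeds the combined upper bounds"
    alloc = np.zeros(len(N))
    active = np.ones(len(N), dtype=bool)
    n_left = n
    while True:
        a = N[active] * S[active]
        cand = n_left * a / a.sum()      # Neyman shares on active strata
        over = cand > M[active]
        if not over.any():
            alloc[active] = cand
            return alloc
        idx = np.where(active)[0][over]  # bound-violating strata
        alloc[idx] = M[idx]              # take them at their upper bounds
        n_left -= M[idx].sum()
        active[idx] = False
```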
We consider the computational efficiency of Monte Carlo (MC) and Multilevel Monte Carlo (MLMC) methods applied to partial differential equations with random coefficients. These arise, for example, in groundwater flow modelling, where a commonly used model for the unknown parameter is a random field. We make use of the circulant embedding procedure for sampling from the aforementioned coefficient. To improve the computational complexity of the MLMC estimator in the case of highly oscillatory random fields, we devise and implement a smoothing technique integrated into the circulant embedding method. This allows us to choose the coarsest mesh on the first level of MLMC independently of the correlation length of the covariance function of the random field, leading to considerable savings in computational cost. We illustrate this with numerical experiments, where we observe savings of a factor of 5-10 in computational cost for accuracies of practical interest.
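A minimal 1-D sketch of the circulant embedding sampler of Dietrich and Newsam, without the smoothing step developed in the paper (the exponential covariance and the grid parameters below are illustrative choices):

```python
import numpy as np

def circulant_embedding_1d(cov, n, h, rng):
    """Sample a stationary Gaussian field on a uniform grid of n points with
    spacing h: embed the first covariance row in a circulant of size
    M = 2(n-1), diagonalize it with one FFT, and color white noise."""
    c = cov(h * np.arange(n))                 # first row of covariance matrix
    c_emb = np.concatenate([c, c[-2:0:-1]])   # symmetric circulant embedding
    lam = np.fft.fft(c_emb).real              # circulant eigenvalues
    if lam.min() < -1e-10 * lam.max():
        raise ValueError("embedding not positive semi-definite; enlarge it")
    M = len(c_emb)
    eps = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    w = np.fft.fft(np.sqrt(np.maximum(lam, 0.0) / M) * eps)
    return w.real[:n], w.imag[:n]             # two independent field samples

rng = np.random.default_rng(0)
x1, x2 = circulant_embedding_1d(lambda r: np.exp(-r / 0.3), 257, 1 / 256, rng)
```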
In 1954, Alston S. Householder published Principles of Numerical Analysis, one of the first modern treatments of matrix decomposition, which favored the (block) LU decomposition: the factorization of a matrix into the product of lower and upper triangular matrices. Matrix decomposition has since become a core technology in machine learning, largely due to the development of the backpropagation algorithm for fitting neural networks. The sole aim of this survey is to give a self-contained introduction to concepts and mathematical tools in numerical linear algebra and matrix analysis in order to seamlessly introduce matrix decomposition techniques and their applications in subsequent sections. However, we clearly realize our inability to cover all the useful and interesting results concerning matrix decomposition within this limited scope, e.g., a separate treatment of Euclidean spaces, Hermitian spaces, Hilbert spaces, and the complex domain. We refer the reader to the literature in the field of linear algebra for a more detailed introduction to the related fields.
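As a small concrete instance of the (block) LU factorization highlighted above, here is a minimal SciPy sketch (the matrix is an arbitrary example):

```python
import numpy as np
from scipy.linalg import lu

# Factor A as P @ L @ U: L unit lower triangular, U upper triangular,
# P the row permutation introduced by partial pivoting.
A = np.array([[4., 3., 2.],
              [6., 3., 1.],
              [8., 5., 9.]])
P, L, U = lu(A)
assert np.allclose(P @ L @ U, A)
```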