We study the spectral properties of a class of random matrices of the form $S_n^{-} = n^{-1}(X_1 X_2^* - X_2 X_1^*)$, where $X_k = \Sigma^{1/2}Z_k$ for $k=1,2$, the $Z_k$'s are independent $p\times n$ complex-valued random matrices, and $\Sigma$ is a $p\times p$ positive semi-definite matrix, independent of the $Z_k$'s. We assume that the $Z_k$'s have independent entries with zero mean and unit variance. The skew-symmetric/skew-Hermitian matrix $S_n^{-}$ will be referred to as a random commutator matrix associated with the samples $X_1$ and $X_2$. We show that, when the dimension $p$ and the sample size $n$ increase simultaneously so that $p/n \to c \in (0,\infty)$, there exists a limiting spectral distribution (LSD) for $S_n^{-}$, supported on the imaginary axis, under the assumptions that the spectral distribution of $\Sigma$ converges weakly and the entries of the $Z_k$'s have moments of sufficiently high order. This nonrandom LSD can be described through its Stieltjes transform, which satisfies coupled Mar\v{c}enko-Pastur-type functional equations. In the special case $\Sigma = I_p$, we show that the LSD of $S_n^{-}$ is a mixture of a degenerate distribution at zero (with positive mass if $c > 2$) and a continuous distribution with a symmetric density function supported on a compact interval on the imaginary axis. Moreover, we show that the companion matrix $S_n^{+} = n^{-1}\Sigma^{1/2}(Z_1Z_2^* + Z_2Z_1^*)\Sigma^{1/2} = n^{-1}(X_1X_2^* + X_2X_1^*)$, under identical assumptions, has an LSD supported on the real line, which can be similarly characterized.
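For reference, a minimal LaTeX snippet collecting the two companion matrices defined above together with the standard Stieltjes transform through which the LSD is characterized; the coupled functional equations themselves are stated in the paper and are not reproduced here.

```latex
% Commutator and anti-commutator matrices built from X_k = \Sigma^{1/2} Z_k
\[
  S_n^{-} = \frac{1}{n}\bigl(X_1 X_2^{*} - X_2 X_1^{*}\bigr), \qquad
  S_n^{+} = \frac{1}{n}\bigl(X_1 X_2^{*} + X_2 X_1^{*}\bigr).
\]
% Stieltjes transform of a spectral distribution F, evaluated off the support
\[
  s_F(z) = \int \frac{\mathrm{d}F(\lambda)}{\lambda - z}, \qquad z \in \mathbb{C}^{+}.
\]
```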
We initiate the study of the following general clustering problem. We seek to partition a given set $P$ of data points into $k$ clusters by finding a set $X$ of $k$ centers and assigning each data point to one of the centers. The cost of a cluster, represented by a center $x\in X$, is a monotone, symmetric norm $f$ (inner norm) of the vector of distances of points assigned to $x$. The goal is to minimize a norm $g$ (outer norm) of the vector of cluster costs. This problem, which we call $(f,g)$-Clustering, generalizes many fundamental clustering problems such as $k$-Center, $k$-Median, Min-Sum of Radii, and Min-Load $k$-Clustering. A recent line of research (Chakrabarty, Swamy [STOC'19]) studies norm objectives that are oblivious to the cluster structure, such as $k$-Median and $k$-Center. In contrast, our problem models cluster-aware objectives including Min-Sum of Radii and Min-Load $k$-Clustering. Our main results are as follows. First, we design a constant-factor approximation algorithm for $(\textsf{top}_\ell,\mathcal{L}_1)$-Clustering, where the inner norm ($\textsf{top}_\ell$) sums the $\ell$ largest distances. Second, we design a constant-factor approximation for $(\mathcal{L}_\infty,\textsf{Ord})$-Clustering, where the outer norm is a convex combination of $\textsf{top}_\ell$ norms (an ordered weighted norm).
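As an illustration of the objective only (not of the approximation algorithms), the sketch below evaluates the $(\textsf{top}_\ell,\mathcal{L}_1)$-Clustering cost for a fixed set of centers and a fixed assignment; with $\ell=1$ the inner norm is the cluster radius, recovering the Min-Sum of Radii objective for that assignment. The function names and the toy instance are illustrative.

```python
import numpy as np

def top_ell_norm(v, ell):
    """Inner norm f = top_ell: sum of the ell largest entries of |v|."""
    return np.sort(np.abs(v))[::-1][:ell].sum()

def fg_clustering_cost(points, centers, assignment, ell):
    """(top_ell, L1)-Clustering objective for a fixed assignment.

    points:     (n, d) array of data points
    centers:    (k, d) array of chosen centers
    assignment: length-n array mapping each point to a center index
    """
    cluster_costs = []
    for j in range(len(centers)):
        members = points[assignment == j]
        dists = np.linalg.norm(members - centers[j], axis=1)
        cluster_costs.append(top_ell_norm(dists, ell))  # inner norm f
    return np.sum(cluster_costs)                        # outer norm g = L1

# Tiny example: ell = 1 gives the sum of cluster radii for this assignment.
pts = np.array([[0., 0.], [1., 0.], [10., 0.], [11., 0.]])
ctr = np.array([[0.5, 0.], [10.5, 0.]])
asg = np.array([0, 0, 1, 1])
print(fg_clustering_cost(pts, ctr, asg, ell=1))  # 0.5 + 0.5 = 1.0
```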
The singular value decomposition (SVD) factors a matrix into a product of three matrices: a matrix of left singular vectors, a diagonal matrix of non-negative singular values, and a matrix of right singular vectors. There are two main approaches to computing the SVD: the classical method and the randomized method. The classical approach yields accurate singular values. The randomized approach is used especially for high-dimensional matrices and targets approximation accuracy without necessarily computing all singular values. In this paper, the SVD computation is formalized as an optimization problem solved with a gradient search algorithm. This results in a power method that retrieves all, or only the largest, singular values and their associated right singular vectors. In this iterative search, the accuracy of the singular values and of the associated matrix of singular vectors depends on the user settings. Two applications of the SVD are principal component analysis and the autoencoder used in neural network models.
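To make the power-method idea concrete, here is a minimal sketch using standard power iteration on $A^\top A$ (not the paper's specific gradient formulation) that retrieves the largest singular value and its right singular vector; subsequent triplets can be obtained by deflation. Tolerances and iteration counts are illustrative.

```python
import numpy as np

def leading_singular_triplet(A, n_iter=200, tol=1e-10, seed=None):
    """Power-method sketch for the largest singular value of A.

    Iterates v <- A^T A v / ||A^T A v||, which converges to the top right
    singular vector; sigma and u are then recovered from A v.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = A.T @ (A @ v)
        w_norm = np.linalg.norm(w)
        if w_norm == 0:
            break
        v_new = w / w_norm
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    sigma = np.linalg.norm(A @ v)
    u = A @ v / sigma if sigma > 0 else np.zeros(A.shape[0])
    return u, sigma, v

# Further singular triplets: deflate with A <- A - sigma * np.outer(u, v) and repeat.
A = np.random.default_rng(0).standard_normal((50, 20))
u, s, v = leading_singular_triplet(A, seed=1)
print(abs(s - np.linalg.svd(A, compute_uv=False)[0]))  # close to 0
```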
Consider an integer or non-negative integer solution $x$ to a system $Ax = b$ whose number of non-zero components is at most $n$. This paper addresses the following question: how closely can we approximate $b$ by $Ay$, where $y$ is an integer or non-negative integer solution constrained to have at most $k$ non-zero components, with $k<n$? We establish upper and lower bounds for this question in general. In specific cases, these bounds match. The key finding is that the quality of the approximation improves exponentially as $k$ approaches $n$.
We study the asymptotic discrepancy of $m \times m$ matrices $A_1,\ldots,A_n$ drawn from the Gaussian orthogonal ensemble, the class of random symmetric matrices whose entries on and above the diagonal are independent and normally distributed. In the setting $m^2 = o(n)$, our results show that there exists a signing $x \in \{\pm1\}^n$ such that the spectral norm of $\sum_{i=1}^n x_iA_i$ is $\Theta(\sqrt{nm}\,4^{-(1 + o(1))n/m^2})$ with high probability. This is best possible and settles a recent conjecture of Kunisky and Zhang.
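For intuition about the quantity being bounded, the brute-force sketch below draws a few small GOE matrices and minimizes the spectral norm over all signings. The GOE normalization used here is one standard choice (only the scale of the answer depends on it), and the exhaustive search is purely illustrative; it has nothing to do with how the asymptotic bound is proved.

```python
import itertools
import numpy as np

def goe(m, rng):
    """Draw an m x m GOE matrix: symmetric, with independent Gaussian entries
    on and above the diagonal (one standard normalization choice)."""
    G = rng.standard_normal((m, m))
    return (G + G.T) / np.sqrt(2)

def discrepancy(mats):
    """Brute-force min over signings x in {-1,+1}^n of ||sum_i x_i A_i||_op."""
    best = np.inf
    for signs in itertools.product([-1, 1], repeat=len(mats)):
        S = sum(s * A for s, A in zip(signs, mats))
        best = min(best, np.linalg.norm(S, 2))  # spectral norm
    return best

rng = np.random.default_rng(0)
mats = [goe(3, rng) for _ in range(10)]  # tiny instance: 2^10 signings
print(discrepancy(mats))
```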
The class of type-two basic feasible functionals ($\mathtt{BFF}_2$) is the analogue of $\mathtt{FP}$ (polynomial-time functions) for type-2 functionals, that is, functionals that can take (first-order) functions as arguments. $\mathtt{BFF}_2$ can be defined through oracle Turing machines with running time bounded by second-order polynomials. On the other hand, higher-order term rewriting provides an elegant formalism for expressing higher-order computation. We address the problem of characterizing $\mathtt{BFF}_2$ by higher-order term rewriting. Various kinds of interpretations for first-order term rewriting have been introduced in the literature for proving termination and characterizing first-order complexity classes. In this paper, we consider a recently introduced notion of cost-size interpretations for higher-order term rewriting and view second-order rewriting systems as a means of computing type-2 functionals. We then prove that the class of functionals represented by higher-order terms admitting polynomially bounded cost-size interpretations corresponds exactly to $\mathtt{BFF}_2$.
In this paper, we study the problem of recovering a ground-truth high-dimensional piecewise linear curve $C^*(t):[0, 1]\to\mathbb{R}^d$ from a high-noise Gaussian point cloud with covariance $\sigma^2I$ centered around the curve. We establish that the sample complexity of recovering $C^*$ from data scales at least with order $\sigma^6$. We then show that recovery of a piecewise linear curve from the third moment is locally well-posed, and hence $O(\sigma^6)$ samples are also sufficient for recovery. We propose methods to recover a curve from data based on fitting the third moment tensor, with a careful initialization strategy, and conduct numerical experiments verifying the ability of our methods to recover curves. All code for our numerical experiments is publicly available on GitHub.
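To fix ideas, the sketch below forms the empirical third moment tensor of a noisy point cloud around a toy two-segment curve; the curve, noise level, and sample size are illustrative, and the paper's fitting and initialization procedures are not reproduced here.

```python
import numpy as np

def third_moment_tensor(points):
    """Empirical third moment tensor T[a,b,c] = mean_i x_i[a] x_i[b] x_i[c].

    For data drawn as (point on the curve) + N(0, sigma^2 I), this is the
    statistic a moment-based fitting procedure would match.
    """
    return np.einsum('ia,ib,ic->abc', points, points, points) / len(points)

# Noisy samples around a toy piecewise linear curve in R^3
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=5000)
curve = np.where(t[:, None] < 0.5,
                 np.outer(t, [2.0, 0.0, 0.0]),                          # first segment
                 np.outer(t - 0.5, [0.0, 2.0, 0.0]) + [1.0, 0.0, 0.0])  # second segment
data = curve + 0.5 * rng.standard_normal(curve.shape)                   # sigma = 0.5
T = third_moment_tensor(data)
print(T.shape)  # (3, 3, 3)
```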
We study the problem of $(\epsilon,\delta)$-certified machine unlearning for minimax models. Most existing works focus on unlearning from standard statistical learning models that have a single variable, and their unlearning steps hinge on a direct Hessian-based conventional Newton update. We develop a new $(\epsilon,\delta)$-certified machine unlearning algorithm for minimax models. It performs a minimax unlearning step consisting of a total-Hessian-based complete Newton update and the Gaussian mechanism borrowed from differential privacy. To obtain the unlearning certification, our method injects calibrated Gaussian noise by carefully analyzing the "sensitivity" of the minimax unlearning step (i.e., the closeness between the minimax unlearning variables and the retraining-from-scratch variables). We derive generalization rates in terms of the population strong and weak primal-dual risk for three different cases of loss functions, i.e., (strongly-)convex-(strongly-)concave losses. We also provide the deletion capacity to guarantee that a desired population risk can be maintained as long as the number of deleted samples does not exceed the derived amount. With $n$ training samples and model dimension $d$, the deletion capacity is of order $\mathcal O(n/d^{1/4})$, which shows a strict gap over the baseline method of differentially private minimax learning, whose capacity is of order $\mathcal O(n/d^{1/2})$. In addition, our rates of generalization and deletion capacity match the state-of-the-art rates derived previously for standard statistical learning models.
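As a rough schematic of the "Newton correction plus Gaussian mechanism" pattern, the sketch below applies a single Newton-style update followed by calibrated noise for a generic single-variable model. It is not the paper's total-Hessian minimax update; the function name, arguments, and noise calibration are all illustrative assumptions.

```python
import numpy as np

def noisy_newton_unlearn(theta, grad_removed, hessian_full, n, m, noise_std, seed=None):
    """Schematic certified-unlearning update (illustrative only, single variable).

    theta:         parameters trained on the full dataset of n samples
    grad_removed:  gradient of the loss summed over the m deleted samples, at theta
    hessian_full:  Hessian of the empirical loss on the remaining data, at theta
    noise_std:     Gaussian noise scale, assumed calibrated to the update's sensitivity

    A Newton-style correction approximately removes the influence of the deleted
    samples; Gaussian noise then masks the residual dependence on them.
    """
    rng = np.random.default_rng(seed)
    correction = np.linalg.solve((n - m) * hessian_full, grad_removed)
    theta_unlearned = theta + correction
    return theta_unlearned + noise_std * rng.standard_normal(theta.shape)
```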
We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse, and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as the dot-product self-attention, the positional encoding, and the feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.
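For readers unfamiliar with the main component being analyzed, here is a minimal single-head dot-product self-attention in NumPy; the weight shapes and the toy input are illustrative, and the paper's theoretical setting may differ in normalization and positional-encoding details.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head dot-product self-attention on a sequence X of shape (T, d).

    Returns softmax(Q K^T / sqrt(d_k)) V, one of the Transformer components
    whose effect on expressive power is studied.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
T, d, dk = 6, 8, 4
X = rng.standard_normal((T, d))
out = self_attention(X, rng.standard_normal((d, dk)),
                        rng.standard_normal((d, dk)),
                        rng.standard_normal((d, dk)))
print(out.shape)  # (6, 4)
```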
We study the average-case version of the Orthogonal Vectors problem, in which one is given as input $n$ vectors from $\{0,1\}^d$ chosen randomly so that each coordinate is $1$ independently with probability $p$. Kane and Williams [ITCS 2019] showed how to solve this problem in time $O(n^{2 - \delta_p})$ for a constant $\delta_p > 0$ that depends only on $p$. However, it was previously unclear how to solve the problem faster in the hardest parameter regime, where $p$ may depend on $d$. The best prior algorithm was the worst-case algorithm of Abboud, Williams, and Yu [SODA 2014], which, in dimension $d = c \cdot \log n$, solves the problem in time $n^{2 - \Omega(1/\log c)}$. In this paper, we give a new algorithm which improves this to $n^{2 - \Omega(\log\log c /\log c)}$ in the average case for any parameter $p$. As in the prior work, our algorithm uses the polynomial method. We make use of a very simple polynomial over the reals and analyze its performance with a new method based on computing how its value degrades as the input vectors get farther from orthogonal. To demonstrate the generality of our approach, we also solve the average-case version of the closest pair problem in the same running time.
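To make the problem setting concrete, the sketch below samples an average-case instance from the stated distribution and runs the naive $O(n^2 d)$ baseline; it illustrates the input model only, not the subquadratic polynomial-method algorithm. The parameter values are arbitrary.

```python
import numpy as np

def random_ov_instance(n, d, p, seed=None):
    """n random vectors in {0,1}^d, each coordinate 1 independently w.p. p."""
    rng = np.random.default_rng(seed)
    return (rng.random((n, d)) < p).astype(np.int8)

def has_orthogonal_pair(V):
    """Naive O(n^2 d) baseline: is there a pair with zero inner product?"""
    G = V @ V.T              # Gram matrix of pairwise inner products
    np.fill_diagonal(G, 1)   # ignore a vector paired with itself
    return bool((G == 0).any())

V = random_ov_instance(n=200, d=20, p=0.3, seed=0)
print(has_orthogonal_pair(V))
```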
The problem of recovering a signal $\boldsymbol x\in \mathbb{R}^n$ from a quadratic system $\{y_i=\boldsymbol x^\top\boldsymbol A_i\boldsymbol x,\ i=1,\ldots,m\}$ with full-rank matrices $\boldsymbol A_i$ frequently arises in applications such as unassigned distance geometry and sub-wavelength imaging. With i.i.d. standard Gaussian matrices $\boldsymbol A_i$, this paper addresses the high-dimensional case where $m\ll n$ by incorporating prior knowledge of $\boldsymbol x$. First, we consider a $k$-sparse $\boldsymbol x$ and introduce the thresholded Wirtinger flow (TWF) algorithm, which does not require knowledge of the sparsity level $k$. TWF comprises two steps: a spectral initialization that identifies a point sufficiently close to $\boldsymbol x$ (up to a sign flip) when $m=O(k^2\log n)$, and a thresholded gradient descent which, when provided a good initialization, produces a sequence linearly converging to $\boldsymbol x$ with $m=O(k\log n)$ measurements. Second, we explore the generative prior, assuming that $\boldsymbol x$ lies in the range of an $L$-Lipschitz continuous generative model with $k$-dimensional inputs in an $\ell_2$-ball of radius $r$. Starting from an estimate correlated with the signal, we develop the projected gradient descent (PGD) algorithm, which also comprises two steps: a projected power method that provides an initial vector with $O\big(\sqrt{\frac{k \log L}{m}}\big)$ $\ell_2$-error given $m=O(k\log(Lnr))$ measurements, and a projected gradient descent that refines the $\ell_2$-error to $O(\delta)$ at a geometric rate when $m=O(k\log\frac{Lrn}{\delta^2})$. Experimental results corroborate our theoretical findings and show that: (i) our approach for the sparse case notably outperforms the existing provable algorithm, sparse power factorization; (ii) leveraging the generative prior allows for precise image recovery on the MNIST dataset from a small number of quadratic measurements.
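As a rough illustration of the refinement stage, the sketch below runs a hard-thresholded gradient scheme on the least-squares loss for the quadratic measurements $y_i=\boldsymbol x^\top\boldsymbol A_i\boldsymbol x$. Unlike the paper's TWF, which needs no knowledge of $k$, this simplified version hard-thresholds to the top-$k$ entries, and the spectral initialization is omitted; step size and iteration count are illustrative.

```python
import numpy as np

def quad_loss_grad(x, A_list, y):
    """Gradient of the least-squares loss (1/2m) sum_i (x^T A_i x - y_i)^2."""
    g = np.zeros_like(x)
    for A, yi in zip(A_list, y):
        r = x @ A @ x - yi
        g += r * (A + A.T) @ x
    return g / len(A_list)

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    out = x.copy()
    out[np.argsort(np.abs(x))[:-k]] = 0.0
    return out

def thresholded_gradient_descent(x0, A_list, y, k, step=0.01, n_iter=500):
    """Simplified thresholded gradient scheme for y_i = x^T A_i x.

    Mimics the structure of a gradient step followed by sparsification;
    assumes the sparsity level k is known, unlike the paper's TWF.
    """
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        x = hard_threshold(x - step * quad_loss_grad(x, A_list, y), k)
    return x
```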