In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V) := \mathrm{diag}(A {\bf 1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the `attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$. $\bullet$ If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error. $\bullet$ If $d = O(\log n)$ and $B = \Theta (\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2 - \Omega(1)}$. This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
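The definition above translates directly into a quadratic-time reference implementation. The following NumPy sketch (an illustration only, not the subquadratic algorithm of the paper) materializes the full $n \times n$ attention matrix $A$ and therefore runs in $\Omega(n^2)$ time.

```python
import numpy as np

def attention_naive(Q, K, V):
    """Quadratic-time reference: Att(Q, K, V) = diag(A 1_n)^{-1} A V,
    where A = exp(Q K^T / d) is applied entry-wise."""
    n, d = Q.shape
    A = np.exp(Q @ K.T / d)             # n x n attention matrix: the Omega(n^2) bottleneck
    row_sums = A @ np.ones(n)           # A 1_n
    return (A / row_sums[:, None]) @ V  # diag(A 1_n)^{-1} A V

# Small example with entries bounded by B = 1
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.uniform(-1.0, 1.0, (n, d)) for _ in range(3))
print(attention_naive(Q, K, V).shape)   # (8, 4)
```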
In the present work, we introduce a novel approach to enhance the precision of reduced order models by exploiting a multi-fidelity perspective and DeepONets. Reduced models provide a real-time numerical approximation by simplifying the original model. The error introduced by such a simplification is usually neglected, sacrificed in order to reach a fast computation. We propose to couple model reduction with machine-learning-based residual learning, so that the above-mentioned error can be learned by a neural network and inferred for new predictions. We emphasize that the framework maximizes the exploitation of high-fidelity information, using it both to build the reduced order model and to learn the residual. In this work, we explore the integration of proper orthogonal decomposition (POD), and gappy POD for sensor data, with the recent DeepONet architecture. Numerical investigations for a parametric benchmark function and a nonlinear parametric Navier-Stokes problem are presented.
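A minimal sketch of the overall idea follows, with the DeepONet replaced by a generic least-squares regressor and a synthetic parametric function standing in for the high-fidelity model; all names and data here are placeholders for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical snapshot data: columns are high-fidelity solutions u(mu) for parameters mu
mus = np.linspace(0.0, 1.0, 50)
x = np.linspace(0.0, 1.0, 200)
snapshots = np.array([np.sin(2 * np.pi * x * (1 + m)) + 0.1 * m * x for m in mus]).T

# Proper orthogonal decomposition (POD): truncated SVD of the snapshot matrix
U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
r = 3
basis = U[:, :r]                              # reduced basis
reduced = basis @ (basis.T @ snapshots)       # reduced-order reconstruction
residual = snapshots - reduced                # the error the ROM usually discards

# Regress reduced coefficients and residual on the parameter (DeepONet placeholder)
degree = 5
features = np.vander(mus, degree + 1)
coeffs_rom, *_ = np.linalg.lstsq(features, (basis.T @ snapshots).T, rcond=None)
coeffs_res, *_ = np.linalg.lstsq(features, residual.T, rcond=None)

# Corrected prediction for a new parameter: ROM output + learned residual
mu_new = 0.37
feat_new = np.vander([mu_new], degree + 1)
u_rom = basis @ (feat_new @ coeffs_rom).ravel()
u_corrected = u_rom + (feat_new @ coeffs_res).ravel()
```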
Many efficient approximate self-attention techniques have become prevalent since the inception of the transformer architecture. Two popular classes of these techniques are low-rank and kernel methods. Each of these methods has its own strengths. We observe these strengths synergistically complement each other and exploit these synergies to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA (Fast Low-Rank and Kernel Attention). FLuRKA provide sizable performance gains over these approximate techniques and are of high quality. We theoretically and empirically evaluate both the runtime performance and quality of FLuRKA. Our runtime analysis posits a variety of parameter configurations where FLuRKA exhibit speedups and our accuracy analysis bounds the error of FLuRKA with respect to full-attention. We instantiate three FLuRKA variants which experience empirical speedups of up to 3.3x and 1.7x over low-rank and kernel methods respectively. This translates to speedups of up to 30x over models with full-attention. With respect to model quality, FLuRKA can match the accuracy of low-rank and kernel methods on GLUE after pre-training on WikiText-103. When pre-training on a fixed time budget, FLuRKA yield better perplexity scores than models with full-attention.
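For concreteness, the sketch below shows the two generic ingredient classes referred to above: a Linformer-style low-rank projection of keys and values, and a linear-attention-style kernel feature map. The specific way FLuRKA fuses these ingredients is not reproduced here; this only illustrates the building blocks.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(Q, K, V, E, F):
    """Linformer-style: project the n key/value rows down to k << n rows."""
    Kp, Vp = E @ K, F @ V                        # (k, d)
    return softmax(Q @ Kp.T / np.sqrt(Q.shape[1])) @ Vp

def kernel_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Linear-attention-style: replace exp(q.k) by phi(q).phi(k), allowing reassociation."""
    Qf, Kf = phi(Q), phi(K)
    num = Qf @ (Kf.T @ V)                        # O(n d^2) instead of O(n^2 d)
    den = Qf @ Kf.sum(axis=0)
    return num / den[:, None]

rng = np.random.default_rng(2)
n, d, k = 64, 16, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
print(low_rank_attention(Q, K, V, E, F).shape, kernel_attention(Q, K, V).shape)
```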
Effective resistance (ER) is an attractive way to interrogate the structure of graphs. It is an alternative to computing the eigenvectors of the graph Laplacian. One attractive application of ER is to point clouds, i.e., graphs whose vertices correspond to IID samples from a distribution over a metric space. Unfortunately, it was shown that the ER between any two points converges to a trivial quantity that holds no information about the graph's structure as the size of the sample increases to infinity. In this study, we show that this trivial solution can be circumvented by considering a region-based ER between pairs of small regions rather than pairs of points, and by scaling the edge weights appropriately with respect to the underlying density in each region. By keeping the regions fixed, we show analytically that the region-based ER converges to a non-trivial limit as the number of points increases to infinity, namely to the ER on the underlying metric space. We support our theoretical findings with numerical experiments.
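As a point of reference, the standard pointwise ER between nodes $i$ and $j$ of a weighted graph is $(e_i - e_j)^\top L^{+} (e_i - e_j)$, where $L^{+}$ is the Moore-Penrose pseudoinverse of the graph Laplacian. The sketch below computes this quantity; the region-based, density-rescaled variant studied in the paper is not reproduced here.

```python
import numpy as np

def effective_resistance(W, i, j):
    """Pointwise effective resistance between nodes i and j of a weighted graph.

    W is a symmetric nonnegative adjacency matrix; L = D - W is the graph Laplacian.
    """
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.linalg.pinv(L)                 # Moore-Penrose pseudoinverse of L
    return Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]

# Example: path graph 0 - 1 - 2 with unit weights; ER(0, 2) = 2 (resistors in series)
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(effective_resistance(W, 0, 2))       # ~2.0
```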
We present a new approach to approximate nearest-neighbor queries in fixed dimension under a variety of non-Euclidean distances. We are given a set $S$ of $n$ points in $\mathbb{R}^d$, an approximation parameter $\varepsilon > 0$, and a distance function that satisfies certain smoothness and growth-rate assumptions. The objective is to preprocess $S$ into a data structure so that for any query point $q$ in $\mathbb{R}^d$, it is possible to efficiently report any point of $S$ whose distance from $q$ is within a factor of $1+\varepsilon$ of the distance to the actual closest point. Prior to this work, the most efficient data structures for approximate nearest-neighbor searching in spaces of constant dimensionality applied only to the Euclidean metric. This paper overcomes this limitation through a method called convexification. For admissible distance functions, the proposed data structures answer queries in logarithmic time using $O(n \log (1 / \varepsilon) / \varepsilon^{d/2})$ space, nearly matching the best known bounds for the Euclidean metric. These results apply to both convex scaling distance functions (including the Mahalanobis distance and weighted Minkowski metrics) and Bregman divergences (including the Kullback-Leibler divergence and the Itakura-Saito distance).
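To fix notation, the sketch below gives a brute-force baseline for the query problem under two of the distance families mentioned (a Mahalanobis distance and the Kullback-Leibler divergence); the logarithmic-time data structure based on convexification is, of course, not reproduced here, and an exact answer is trivially a $(1+\varepsilon)$-approximate one.

```python
import numpy as np

def mahalanobis(q, x, M):
    d = q - x
    return float(np.sqrt(d @ M @ d))

def kl_divergence(q, x):
    # Bregman divergence of the negative entropy; q, x are positive vectors summing to 1
    return float(np.sum(q * np.log(q / x)))

def nearest_neighbor(S, q, dist):
    """Exact brute-force nearest neighbor under an arbitrary distance function."""
    return min(S, key=lambda x: dist(q, x))

rng = np.random.default_rng(3)
S = [rng.random(4) + 0.1 for _ in range(100)]
S = [x / x.sum() for x in S]                   # normalize so the KL example is well defined
q = 0.5 * S[0] + 0.5 * S[1]
M = np.eye(4)
print(nearest_neighbor(S, q, lambda a, b: mahalanobis(a, b, M)))
print(nearest_neighbor(S, q, kl_divergence))
```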
In this paper we are concerned with Triebel-Lizorkin-Morrey spaces $\mathcal{E}^{s}_{u,p,q}(\Omega)$ of positive smoothness $s$ defined on (special or bounded) Lipschitz domains $\Omega\subset\mathbb{R}^d$ as well as on $\mathbb{R}^d$. For those spaces we prove new equivalent characterizations in terms of local oscillations which hold as long as some standard conditions on the parameters are fulfilled. As a byproduct, we also obtain novel characterizations of $\mathcal{E}^{s}_{u,p,q}(\Omega)$ using differences of higher order. Special cases include standard Triebel-Lizorkin spaces $F^s_{p,q} (\Omega)$ and hence classical $L_p$-Sobolev spaces $H^s_p(\Omega)$. Key words: Triebel-Lizorkin-Morrey space, Morrey space, Lipschitz domain, oscillations, higher order differences
In the MAXSAT problem, we are given a set $V$ of $m$ variables and a collection $C$ of $n$ clauses over $V$. We seek a truth assignment that maximizes the number of satisfied clauses. This problem is $\textit{NP}$-hard even for its restricted version, the 2-MAXSAT problem, in which every clause contains at most 2 literals. In this paper, we discuss an efficient algorithm to solve this problem. Its worst-case time complexity is bounded by $O(n^2 m^3 (\log_2 nm)^{\log_2 nm})$. This shows that the 2-MAXSAT problem can be solved in polynomial time.
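For reference, the problem itself admits the following trivial exponential-time checker (exhaustive enumeration over all $2^m$ assignments); this merely specifies 2-MAXSAT and is not the algorithm proposed in the paper.

```python
from itertools import product

def max_sat_brute_force(num_vars, clauses):
    """clauses: list of tuples of literals; literal k > 0 means variable k is true,
    literal -k means variable k is false. Returns the maximum number of
    simultaneously satisfiable clauses (exhaustive search, O(2^m * n) time)."""
    best = 0
    for assignment in product([False, True], repeat=num_vars):
        satisfied = sum(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        best = max(best, satisfied)
    return best

# 2-MAXSAT instance: every clause has at most 2 literals; at most 3 can be satisfied
clauses = [(1, 2), (-1, 2), (1, -2), (-1, -2)]
print(max_sat_brute_force(2, clauses))   # 3
```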
In quasi-Monte Carlo methods, generating high-dimensional low discrepancy sequences by generator matrices is a popular and efficient approach. Historically, constructing or finding such generator matrices has been a hard problem. In particular, it is challenging to take advantage of the intrinsic structure of a given numerical problem to design samplers of low discrepancy in certain subsets of dimensions. To address this issue, we devise a greedy algorithm allowing us to translate desired net properties into linear constraints on the generator matrix entries. Solving the resulting integer linear program yields generator matrices that satisfy the desired net properties. We demonstrate that our method finds generator matrices in challenging settings, offering low discrepancy sequences beyond the limitations of classic constructions.
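For context, the standard base-2 digital construction turns one generator matrix per dimension into point coordinates as follows; the identity matrix recovers the van der Corput sequence in that dimension. The second matrix below is chosen only for illustration: whether a given matrix yields good net properties is exactly the question the paper's integer linear program (not shown here) addresses.

```python
import numpy as np

def digital_point(index, gen_matrices, m):
    """Base-2 digital construction: the j-th coordinate of point `index` is obtained by
    multiplying generator matrix C_j with the binary digit vector of `index` (mod 2)
    and reading the result as binary digits after the radix point."""
    digits = np.array([(index >> k) & 1 for k in range(m)])   # least significant digit first
    point = []
    for C in gen_matrices:
        y = C.dot(digits) % 2
        point.append(float(np.sum(y * 2.0 ** -(np.arange(m) + 1))))
    return point

m = 4                                              # 2^m = 16 points
identity = np.eye(m, dtype=int)                    # van der Corput in dimension 1
upper = np.triu(np.ones((m, m), dtype=int))        # illustrative choice for dimension 2
points = [digital_point(i, [identity, upper], m) for i in range(2 ** m)]
print(points[:4])
```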
Motivated by questions in theoretical computer science and quantum information theory, we study the classical problem of determining linear spaces of matrices of bounded rank. Spaces of bounded rank three were classified in 1983, and it has been a longstanding problem to classify spaces of bounded rank four. Before our study, no non-classical example of such a space was known. We exhibit two non-classical examples of such spaces and give the full classification of basic spaces of bounded rank four. There are exactly four such spaces up to isomorphism. We also take steps to bring together the methods of the linear algebra community and the algebraic geometry community used to study spaces of bounded rank.
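As a standard illustration of the classical examples alluded to above (not one of the new spaces found in the paper), the space of $3 \times 3$ skew-symmetric matrices is a three-dimensional linear space of bounded rank two, since a skew-symmetric matrix always has even rank:
\[
  \mathcal{A} \;=\;
  \left\{
    \begin{pmatrix}
      0 & a & b \\
      -a & 0 & c \\
      -b & -c & 0
    \end{pmatrix}
    \;:\; a, b, c \in \mathbb{C}
  \right\},
  \qquad \operatorname{rank}(A) \le 2 \ \text{for all } A \in \mathcal{A}.
\]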
The attention mechanism is a central component of the transformer architecture which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are tunable parameters. We prove that running gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution that separates $\textit{locally-optimal}$ tokens from non-optimal ones. This clearly formalizes attention as a token separation mechanism. Remarkably, our results are applicable to general data and precisely characterize $\textit{optimality}$ of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and problem geometry. We also provide a broader regularization path analysis that establishes the margin maximizing nature of attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$ and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings via numerical experiments and provide insights.
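A minimal numerical illustration of the model follows, spelling out $f$ and $\nabla_{\boldsymbol{p}} f$ and taking gradient ascent steps on $\boldsymbol{p}$ alone; the paper's loss, data model, and convergence analysis are not reproduced, and the data here is synthetic.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def f(X, v, W, p):
    """Softmax-attention model f(X) = <X v, softmax(X W p)>."""
    return float((X @ v) @ softmax(X @ W @ p))

def grad_p(X, v, W, p):
    """Gradient of f w.r.t. p, using d softmax(z)/dz = diag(s) - s s^T with z = X W p."""
    s = softmax(X @ W @ p)
    a = X @ v
    return (X @ W).T @ ((np.diag(s) - np.outer(s, s)) @ a)

rng = np.random.default_rng(4)
T, d = 6, 4                                # T tokens of dimension d
X = rng.standard_normal((T, d))
v = rng.standard_normal(d)
W = rng.standard_normal((d, d))
p = rng.standard_normal(d)

print("f before:", f(X, v, W, p))
eta = 0.1
for _ in range(100):
    p = p + eta * grad_p(X, v, W, p)       # attention concentrates on a high-scoring token
print("f after :", f(X, v, W, p), " token scores:", np.sort(X @ v))
```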
The effect of the higher-order continuity of the solution field obtained with NURBS basis functions in isogeometric analysis (IGA) is investigated for an efficient mixed finite element formulation for elastostatic beams. It is based on the Hu-Washizu variational principle considering geometrical and material nonlinearities. Here we present basis functions of reduced degree for the additional fields of the stress resultants and strains of the beam, which are allowed to be discontinuous across elements. This approach turns out to significantly improve the computational efficiency and the accuracy of the results. We consider a beam formulation with extensible directors, where cross-sectional strains are enriched to avoid Poisson locking by an enhanced assumed strain method. In numerical examples, we show the superior per-degree-of-freedom accuracy of IGA over conventional finite element analysis, due to the higher-order continuity in the displacement field. We further verify the efficient rotational coupling between beams, as well as the path-independence of the results.
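As background for the continuity discussion, the generic sketch below evaluates B-spline basis functions (NURBS with unit weights) by the Cox-de Boor recursion; with simple interior knots, a degree-$p$ basis is $C^{p-1}$-continuous across element boundaries, which is the property exploited above. This snippet is only an illustration of the basis, not the beam formulation itself.

```python
def bspline_basis(i, p, knots, u):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree p."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] > knots[i]:
        left = (u - knots[i]) / (knots[i + p] - knots[i]) * bspline_basis(i, p - 1, knots, u)
    if knots[i + p + 1] > knots[i + 1]:
        right = (knots[i + p + 1] - u) / (knots[i + p + 1] - knots[i + 1]) * bspline_basis(i + 1, p - 1, knots, u)
    return left + right

# Quadratic (p = 2) basis on an open knot vector with one interior knot at u = 0.5:
# the basis is C^1-continuous there, unlike C^0 Lagrange finite element shape functions.
knots = [0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0]
p = 2
n_basis = len(knots) - p - 1
for u in (0.25, 0.5, 0.75):
    values = [bspline_basis(i, p, knots, u) for i in range(n_basis)]
    print(u, [round(v, 3) for v in values], "sum =", round(sum(values), 3))  # partition of unity
```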