In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V) := \mathrm{diag}(A {\bf 1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the `attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$. $\bullet$ If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error. $\bullet$ If $d = O(\log n)$ and $B = \Theta (\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2 - \Omega(1)}$. This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
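The definition above translates directly into a quadratic-time reference implementation. The following NumPy sketch (an illustration only, not the subquadratic algorithm of the paper) materializes the full $n \times n$ attention matrix $A$ and therefore runs in $\Omega(n^2)$ time.

```python
import numpy as np

def attention_naive(Q, K, V):
    """Quadratic-time reference: Att(Q, K, V) = diag(A 1_n)^{-1} A V,
    where A = exp(Q K^T / d) is applied entry-wise."""
    n, d = Q.shape
    A = np.exp(Q @ K.T / d)             # n x n attention matrix: the Omega(n^2) bottleneck
    row_sums = A @ np.ones(n)           # A 1_n
    return (A / row_sums[:, None]) @ V  # diag(A 1_n)^{-1} A V

# Small example with entries bounded by B = 1
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.uniform(-1.0, 1.0, (n, d)) for _ in range(3))
print(attention_naive(Q, K, V).shape)   # (8, 4)
```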
In the present work, we introduce a novel approach to enhance the precision of reduced order models by exploiting a multi-fidelity perspective and DeepONets. Reduced models provide a real-time numerical approximation by simplifying the original model. The error introduced by such a simplification is usually neglected, sacrificed in order to reach a fast computation. We propose to couple model reduction with machine-learning-based residual learning, so that the above-mentioned error can be learned by a neural network and inferred for new predictions. We emphasize that the framework maximizes the exploitation of high-fidelity information, using it both to build the reduced order model and to learn the residual. In this work, we explore the integration of proper orthogonal decomposition (POD), and gappy POD for sensor data, with the recent DeepONet architecture. Numerical investigations for a parametric benchmark function and a nonlinear parametric Navier-Stokes problem are presented.
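A minimal sketch of the overall idea follows, with the DeepONet replaced by a generic least-squares regressor and a synthetic parametric function standing in for the high-fidelity model; all names and data here are placeholders for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical snapshot data: columns are high-fidelity solutions u(mu) for parameters mu
mus = np.linspace(0.0, 1.0, 50)
x = np.linspace(0.0, 1.0, 200)
snapshots = np.array([np.sin(2 * np.pi * x * (1 + m)) + 0.1 * m * x for m in mus]).T

# Proper orthogonal decomposition (POD): truncated SVD of the snapshot matrix
U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
r = 3
basis = U[:, :r]                              # reduced basis
reduced = basis @ (basis.T @ snapshots)       # reduced-order reconstruction
residual = snapshots - reduced                # the error the ROM usually discards

# Regress reduced coefficients and residual on the parameter (DeepONet placeholder)
degree = 5
features = np.vander(mus, degree + 1)
coeffs_rom, *_ = np.linalg.lstsq(features, (basis.T @ snapshots).T, rcond=None)
coeffs_res, *_ = np.linalg.lstsq(features, residual.T, rcond=None)

# Corrected prediction for a new parameter: ROM output + learned residual
mu_new = 0.37
feat_new = np.vander([mu_new], degree + 1)
u_rom = basis @ (feat_new @ coeffs_rom).ravel()
u_corrected = u_rom + (feat_new @ coeffs_res).ravel()
```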
Many efficient approximate self-attention techniques have become prevalent since the inception of the transformer architecture. Two popular classes of these techniques are low-rank and kernel methods. Each of these methods has its own strengths. We observe these strengths synergistically complement each other and exploit these synergies to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA (Fast Low-Rank and Kernel Attention). FLuRKA provide sizable performance gains over these approximate techniques and are of high quality. We theoretically and empirically evaluate both the runtime performance and quality of FLuRKA. Our runtime analysis posits a variety of parameter configurations where FLuRKA exhibit speedups and our accuracy analysis bounds the error of FLuRKA with respect to full-attention. We instantiate three FLuRKA variants which experience empirical speedups of up to 3.3x and 1.7x over low-rank and kernel methods respectively. This translates to speedups of up to 30x over models with full-attention. With respect to model quality, FLuRKA can match the accuracy of low-rank and kernel methods on GLUE after pre-training on WikiText-103. When pre-training on a fixed time budget, FLuRKA yield better perplexity scores than models with full-attention.
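For concreteness, the sketch below shows the two generic ingredient classes referred to above: a Linformer-style low-rank projection of keys and values, and a linear-attention-style kernel feature map. The specific way FLuRKA fuses these ingredients is not reproduced here; this only illustrates the building blocks.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(Q, K, V, E, F):
    """Linformer-style: project the n key/value rows down to k << n rows."""
    Kp, Vp = E @ K, F @ V                        # (k, d)
    return softmax(Q @ Kp.T / np.sqrt(Q.shape[1])) @ Vp

def kernel_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Linear-attention-style: replace exp(q.k) by phi(q).phi(k), allowing reassociation."""
    Qf, Kf = phi(Q), phi(K)
    num = Qf @ (Kf.T @ V)                        # O(n d^2) instead of O(n^2 d)
    den = Qf @ Kf.sum(axis=0)
    return num / den[:, None]

rng = np.random.default_rng(2)
n, d, k = 64, 16, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
print(low_rank_attention(Q, K, V, E, F).shape, kernel_attention(Q, K, V).shape)
```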
Effective resistance (ER) is an attractive way to interrogate the structure of graphs. It is an alternative to computing the eigenvectors of the graph Laplacian. One attractive application of ER is to point clouds, i.e., graphs whose vertices correspond to IID samples from a distribution over a metric space. Unfortunately, it was shown that the ER between any two points converges to a trivial quantity that holds no information about the graph's structure as the size of the sample increases to infinity. In this study, we show that this trivial solution can be circumvented by considering a region-based ER between pairs of small regions rather than pairs of points, and by scaling the edge weights appropriately with respect to the underlying density in each region. By keeping the regions fixed, we show analytically that the region-based ER converges to a non-trivial limit as the number of points increases to infinity, namely to the ER on the underlying metric space. We support our theoretical findings with numerical experiments.
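As a point of reference, the standard pointwise ER between nodes $i$ and $j$ of a weighted graph is $(e_i - e_j)^\top L^{+} (e_i - e_j)$, where $L^{+}$ is the Moore-Penrose pseudoinverse of the graph Laplacian. The sketch below computes this quantity; the region-based, density-rescaled variant studied in the paper is not reproduced here.

```python
import numpy as np

def effective_resistance(W, i, j):
    """Pointwise effective resistance between nodes i and j of a weighted graph.

    W is a symmetric nonnegative adjacency matrix; L = D - W is the graph Laplacian.
    """
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.linalg.pinv(L)                 # Moore-Penrose pseudoinverse of L
    return Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]

# Example: path graph 0 - 1 - 2 with unit weights; ER(0, 2) = 2 (resistors in series)
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(effective_resistance(W, 0, 2))       # ~2.0
```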
We present a new approach to approximate nearest-neighbor queries in fixed dimension under a variety of non-Euclidean distances. We are given a set $S$ of $n$ points in $\mathbb{R}^d$, an approximation parameter $\varepsilon > 0$, and a distance function that satisfies certain smoothness and growth-rate assumptions. The objective is to preprocess $S$ into a data structure so that for any query point $q$ in $\mathbb{R}^d$, it is possible to efficiently report any point of $S$ whose distance from $q$ is within a factor of $1+\varepsilon$ of the distance to the actual closest point. Prior to this work, the most efficient data structures for approximate nearest-neighbor searching in spaces of constant dimensionality applied only to the Euclidean metric. This paper overcomes this limitation through a method called convexification. For admissible distance functions, the proposed data structures answer queries in logarithmic time using $O(n \log (1 / \varepsilon) / \varepsilon^{d/2})$ space, nearly matching the best known bounds for the Euclidean metric. These results apply to both convex scaling distance functions (including the Mahalanobis distance and weighted Minkowski metrics) and Bregman divergences (including the Kullback-Leibler divergence and the Itakura-Saito distance).
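To fix notation, the sketch below gives a brute-force baseline for the query problem under two of the distance families mentioned (a Mahalanobis distance and the Kullback-Leibler divergence); the logarithmic-time data structure based on convexification is, of course, not reproduced here, and an exact answer is trivially a $(1+\varepsilon)$-approximate one.

```python
import numpy as np

def mahalanobis(q, x, M):
    d = q - x
    return float(np.sqrt(d @ M @ d))

def kl_divergence(q, x):
    # Bregman divergence of the negative entropy; q, x are positive vectors summing to 1
    return float(np.sum(q * np.log(q / x)))

def nearest_neighbor(S, q, dist):
    """Exact brute-force nearest neighbor under an arbitrary distance function."""
    return min(S, key=lambda x: dist(q, x))

rng = np.random.default_rng(3)
S = [rng.random(4) + 0.1 for _ in range(100)]
S = [x / x.sum() for x in S]                   # normalize so the KL example is well defined
q = 0.5 * S[0] + 0.5 * S[1]
M = np.eye(4)
print(nearest_neighbor(S, q, lambda a, b: mahalanobis(a, b, M)))
print(nearest_neighbor(S, q, kl_divergence))
```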
In this paper we are concerned with Triebel-Lizorkin-Morrey spaces $\mathcal{E}^{s}_{u,p,q}(\Omega)$ of positive smoothness $s$ defined on (special or bounded) Lipschitz domains $\Omega\subset\mathbb{R}^d$ as well as on $\mathbb{R}^d$. For those spaces we prove new equivalent characterizations in terms of local oscillations which hold as long as some standard conditions on the parameters are fulfilled. As a byproduct, we also obtain novel characterizations of $\mathcal{E}^{s}_{u,p,q}(\Omega)$ using differences of higher order. Special cases include standard Triebel-Lizorkin spaces $F^s_{p,q} (\Omega)$ and hence classical $L_p$-Sobolev spaces $H^s_p(\Omega)$. Key words: Triebel-Lizorkin-Morrey space, Morrey space, Lipschitz domain, oscillations, higher order differences
In the MAXSAT problem, we are given a set $V$ of $m$ variables and a collection $C$ of $n$ clauses over $V$. We seek a truth assignment that maximizes the number of satisfied clauses. This problem is $\textit{NP}$-hard even for its restricted version, the 2-MAXSAT problem, in which every clause contains at most 2 literals. In this paper, we discuss an efficient algorithm to solve this problem. Its worst-case time complexity is bounded by $O(n^2 m^3 (\log_2 nm)^{\log_2 nm})$. This shows that the 2-MAXSAT problem can be solved in polynomial time.
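For reference, the problem itself admits the following trivial exponential-time checker (exhaustive enumeration over all $2^m$ assignments); this merely specifies 2-MAXSAT and is not the algorithm proposed in the paper.

```python
from itertools import product

def max_sat_brute_force(num_vars, clauses):
    """clauses: list of tuples of literals; literal k > 0 means variable k is true,
    literal -k means variable k is false. Returns the maximum number of
    simultaneously satisfiable clauses (exhaustive search, O(2^m * n) time)."""
    best = 0
    for assignment in product([False, True], repeat=num_vars):
        satisfied = sum(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        best = max(best, satisfied)
    return best

# 2-MAXSAT instance: every clause has at most 2 literals; at most 3 can be satisfied
clauses = [(1, 2), (-1, 2), (1, -2), (-1, -2)]
print(max_sat_brute_force(2, clauses))   # 3
```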
In quasi-Monte Carlo methods, generating high-dimensional low discrepancy sequences by generator matrices is a popular and efficient approach. Historically, constructing or finding such generator matrices has been a hard problem. In particular, it is challenging to take advantage of the intrinsic structure of a given numerical problem to design samplers of low discrepancy in certain subsets of dimensions. To address this issue, we devise a greedy algorithm allowing us to translate desired net properties into linear constraints on the generator matrix entries. Solving the resulting integer linear program yields generator matrices that satisfy the desired net properties. We demonstrate that our method finds generator matrices in challenging settings, offering low discrepancy sequences beyond the limitations of classic constructions.
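For context, the standard base-2 digital construction turns one generator matrix per dimension into point coordinates as follows; the identity matrix recovers the van der Corput sequence in that dimension. The second matrix below is chosen only for illustration: whether a given matrix yields good net properties is exactly the question the paper's integer linear program (not shown here) addresses.

```python
import numpy as np

def digital_point(index, gen_matrices, m):
    """Base-2 digital construction: the j-th coordinate of point `index` is obtained by
    multiplying generator matrix C_j with the binary digit vector of `index` (mod 2)
    and reading the result as binary digits after the radix point."""
    digits = np.array([(index >> k) & 1 for k in range(m)])   # least significant digit first
    point = []
    for C in gen_matrices:
        y = C.dot(digits) % 2
        point.append(float(np.sum(y * 2.0 ** -(np.arange(m) + 1))))
    return point

m = 4                                              # 2^m = 16 points
identity = np.eye(m, dtype=int)                    # van der Corput in dimension 1
upper = np.triu(np.ones((m, m), dtype=int))        # illustrative choice for dimension 2
points = [digital_point(i, [identity, upper], m) for i in range(2 ** m)]
print(points[:4])
```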
Motivated by questions in theoretical computer science and quantum information theory, we study the classical problem of determining linear spaces of matrices of bounded rank. Spaces of bounded rank three were classified in 1983, and it has been a longstanding problem to classify spaces of bounded rank four. Before our study, no non-classical example of such a space was known. We exhibit two non-classical examples of such spaces and give the full classification of basic spaces of bounded rank four. There are exactly four such spaces up to isomorphism. We also take steps to bring together the methods of the linear algebra community and the algebraic geometry community used to study spaces of bounded rank.
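As a standard illustration of the classical examples alluded to above (not one of the new spaces found in the paper), the space of $3 \times 3$ skew-symmetric matrices is a three-dimensional linear space of bounded rank two, since a skew-symmetric matrix always has even rank:
\[
  \mathcal{A} \;=\;
  \left\{
    \begin{pmatrix}
      0 & a & b \\
      -a & 0 & c \\
      -b & -c & 0
    \end{pmatrix}
    \;:\; a, b, c \in \mathbb{C}
  \right\},
  \qquad \operatorname{rank}(A) \le 2 \ \text{for all } A \in \mathcal{A}.
\]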
The attention mechanism is a central component of the transformer architecture which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are tunable parameters. We prove that running gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution that separates $\textit{locally-optimal}$ tokens from non-optimal ones. This clearly formalizes attention as a token separation mechanism. Remarkably, our results are applicable to general data and precisely characterize $\textit{optimality}$ of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and problem geometry. We also provide a broader regularization path analysis that establishes the margin maximizing nature of attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$ and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings via numerical experiments and provide insights.
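A minimal numerical illustration of the model follows, spelling out $f$ and $\nabla_{\boldsymbol{p}} f$ and taking gradient ascent steps on $\boldsymbol{p}$ alone; the paper's loss, data model, and convergence analysis are not reproduced, and the data here is synthetic.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def f(X, v, W, p):
    """Softmax-attention model f(X) = <X v, softmax(X W p)>."""
    return float((X @ v) @ softmax(X @ W @ p))

def grad_p(X, v, W, p):
    """Gradient of f w.r.t. p, using d softmax(z)/dz = diag(s) - s s^T with z = X W p."""
    s = softmax(X @ W @ p)
    a = X @ v
    return (X @ W).T @ ((np.diag(s) - np.outer(s, s)) @ a)

rng = np.random.default_rng(4)
T, d = 6, 4                                # T tokens of dimension d
X = rng.standard_normal((T, d))
v = rng.standard_normal(d)
W = rng.standard_normal((d, d))
p = rng.standard_normal(d)

print("f before:", f(X, v, W, p))
eta = 0.1
for _ in range(100):
    p = p + eta * grad_p(X, v, W, p)       # attention concentrates on a high-scoring token
print("f after :", f(X, v, W, p), " token scores:", np.sort(X @ v))
```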
The effect of the higher-order continuity of the solution field obtained with NURBS basis functions in isogeometric analysis (IGA) is investigated for an efficient mixed finite element formulation for elastostatic beams. It is based on the Hu-Washizu variational principle considering geometrical and material nonlinearities. Here we present basis functions of reduced degree for the additional fields of the stress resultants and strains of the beam, which are allowed to be discontinuous across elements. This approach turns out to significantly improve the computational efficiency and the accuracy of the results. We consider a beam formulation with extensible directors, where cross-sectional strains are enriched to avoid Poisson locking by an enhanced assumed strain method. In numerical examples, we show the superior per-degree-of-freedom accuracy of IGA over conventional finite element analysis, due to the higher-order continuity in the displacement field. We further verify the efficient rotational coupling between beams, as well as the path-independence of the results.
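As background for the continuity discussion, the generic sketch below evaluates B-spline basis functions (NURBS with unit weights) by the Cox-de Boor recursion; with simple interior knots, a degree-$p$ basis is $C^{p-1}$-continuous across element boundaries, which is the property exploited above. This snippet is only an illustration of the basis, not the beam formulation itself.

```python
def bspline_basis(i, p, knots, u):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree p."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] > knots[i]:
        left = (u - knots[i]) / (knots[i + p] - knots[i]) * bspline_basis(i, p - 1, knots, u)
    if knots[i + p + 1] > knots[i + 1]:
        right = (knots[i + p + 1] - u) / (knots[i + p + 1] - knots[i + 1]) * bspline_basis(i + 1, p - 1, knots, u)
    return left + right

# Quadratic (p = 2) basis on an open knot vector with one interior knot at u = 0.5:
# the basis is C^1-continuous there, unlike C^0 Lagrange finite element shape functions.
knots = [0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0]
p = 2
n_basis = len(knots) - p - 1
for u in (0.25, 0.5, 0.75):
    values = [bspline_basis(i, p, knots, u) for i in range(n_basis)]
    print(u, [round(v, 3) for v in values], "sum =", round(sum(values), 3))  # partition of unity
```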