In this paper, we apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need for the first time to a data-driven operator learning problem related to partial differential equations. We seek both to explain the heuristics of the attention mechanism and to improve its efficacy. By employing the operator approximation theory in Hilbert spaces, it is demonstrated for the first time that the softmax normalization in the scaled dot-product attention is sufficient but not necessary. Without softmax, the approximation capacity of a linearized Transformer variant can be proved to be comparable, layer-wise, to a Petrov-Galerkin projection, and the estimate is independent of the sequence length. A new layer normalization scheme mimicking the Petrov-Galerkin projection is proposed to allow a scaling to propagate through attention layers, which helps the model achieve remarkable accuracy in operator learning tasks with unnormalized data. Finally, we present three operator learning experiments, including the viscid Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem. The newly proposed simple attention-based operator learner, Galerkin Transformer, shows significant improvements in both training cost and evaluation accuracy over its softmax-normalized counterparts.
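The softmax-free idea in the abstract can be illustrated with a minimal NumPy sketch. It assumes a single head with no learned projections, and it assumes that layer normalization is applied to the keys and values before forming the product; the function names and this normalization placement are illustrative assumptions, not the paper's reference implementation. The point is that without softmax the product can be regrouped as Q(KᵀV), so the cost grows linearly in the sequence length n.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n): quadratic in n
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def layer_norm(X, eps=1e-5):
    """Normalize each position over the feature axis (no learned scale or shift)."""
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def softmax_free_attention(Q, K, V):
    """Softmax-free attention Q (K~^T V~) / n, with K~, V~ layer-normalized (assumed placement)."""
    n = Q.shape[0]
    context = layer_norm(K).T @ layer_norm(V)     # (d, d): cost O(n d^2)
    return Q @ context / n

rng = np.random.default_rng(0)
n, d = 256, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out_linear = softmax_free_attention(Q, K, V)
# Same quantity computed through the (n, n) matrix, for comparison only.
out_quadratic = (Q @ layer_norm(K).T) @ layer_norm(V) / n
out_softmax = softmax_attention(Q, K, V)
```

Both groupings give the same output up to rounding; only the linear one avoids building the n-by-n score matrix.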

Related content

Attention mechanism

The attention mechanism was first proposed in the field of computer vision, but it arguably took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly on machine translation tasks; their work is generally regarded as the first to apply attention to NLP. Extensions of attention-based RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the general trend of progress in attention research.

Estimation/estimator · Statistic · Taylor · Bootstrap · Moment

December 31, 2021

High-Order Statistical Functional Expansion and Its Application To Some Nonsmooth Problems

Fan Zhou,Ping Li,Cun-Hui Zhang

Let $\bx_j = \btheta +\bep_j, j=1,...,n$, be observations of an unknown parameter $\btheta$ in a Euclidean or separable Hilbert space $\scrH$, where $\bep_j$ are noises as random elements in $\scrH$ from a general distribution. We study the estimation of $f(\btheta)$ for a given functional $f:\scrH\rightarrow \RR$ based on $\bx_j$'s. The key element of our approach is a new method which we call High-Order Degenerate Statistical Expansion. It leverages the use of classical multivariate Taylor expansion and degenerate $U$-statistic and yields an elegant explicit formula. In the univariate case of $\scrH=\R$, the formula expresses the error of the proposed estimator as a sum of order $k$ degenerate $U$-products of the noises with coefficient $f^{(k)}(\btheta)/k!$ and an explicit remainder term in the form of the Riemann-Liouville integral as in the Taylor expansion around the true $\btheta$. For general $\scrH$, the formula expresses the estimation error in terms of the inner product of $f^{(k)}(\btheta)/k!$ and the average of the tensor products of $k$ noises with distinct indices and a parallel extension of the remainder term from the univariate case. This makes the proposed method a natural statistical version of the classical Taylor expansion. The proposed estimator can be viewed as a jackknife estimator of an ideal degenerate expansion of $f(\cdot)$ around the true $\btheta$ with the degenerate $U$-product of the noises, and can be approximated by bootstrap. Thus, the jackknife, bootstrap and Taylor expansion approaches all converge to the proposed estimator. We develop risk bounds for the proposed estimator and a central limit theorem under a second moment condition (even in expansions of higher than the second order). We apply this new method to generalize several existing results with smooth and nonsmooth $f$ to universal $\bep_j$'s with only minimum moment constraints.

Approximation · CASE · Performer · CASES · Optimizer

December 30, 2021

DPG methods for a fourth-order div problem

Thomas Führer,Pablo Herrera,Norbert Heuer

from arxiv, Supported by ANID-Chile through FONDECYT projects 1190009, 1210391

We study a fourth-order div problem and its approximation by the discontinuous Petrov-Galerkin method with optimal test functions. We present two variants, based on first and second-order systems. In both cases we prove well-posedness of the formulation and quasi-optimal convergence of the approximation. Our analysis includes the fully-discrete schemes with approximated test functions, for general dimension and polynomial degree in the first-order case, and for two dimensions and lowest-order approximation in the second-order case. Numerical results illustrate the performance for quasi-uniform and adaptively refined meshes.

Estimation/estimator · MoDELS · Model averaging · Optimizer · Weight

December 30, 2021

Optimal model averaging for single-index models with divergent dimensions

Jiahui Zou,Wendun Wang,Xinyu Zhang,Guohua Zou

This paper offers a new approach to address the model uncertainty in (potentially) divergent-dimensional single-index models (SIMs). We propose a model-averaging estimator based on cross-validation, which allows the dimension of covariates and the number of candidate models to increase with the sample size. We show that when all candidate models are misspecified, our model-averaging estimator is asymptotically optimal in the sense that its squared loss is asymptotically identical to that of the infeasible best possible averaging estimator. In a different situation where correct models are available in the model set, the proposed weighting scheme assigns all weights to the correct models in the asymptotic sense. We also extend our method to average regularized estimators and propose pre-screening methods to deal with cases with high-dimensional covariates. We illustrate the merits of our method via simulations and two empirical applications.

Linear · Threshold · Matrix theory · Identifiable · Uniform sampling

December 29, 2021

On Local Convergence of Iterative Hard Thresholding for Matrix Completion

Trung Vu,Raviv Raich

from arxiv, 14 pages in double-column format

Iterative hard thresholding (IHT) has gained in popularity over the past decades in large-scale optimization. However, convergence properties of this method have only been explored recently in non-convex settings. In matrix completion, existing works often focus on the guarantee of global convergence of IHT via standard assumptions such as incoherence property and uniform sampling. While such analysis provides a global upper bound on the linear convergence rate, it does not describe the actual performance of IHT in practice. In this paper, we provide a novel insight into the local convergence of a specific variant of IHT for matrix completion. We uncover the exact linear rate of IHT in a closed-form expression and identify the region of convergence in which the algorithm is guaranteed to converge. Furthermore, we utilize random matrix theory to study the linear rate of convergence of IHTSVD for large-scale matrix completion. We find that asymptotically, the rate can be expressed in closed form in terms of the relative rank and the sampling rate. Finally, we present various numerical results to verify the aforementioned theoretical analysis.
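As a rough illustration of what iterative hard thresholding looks like for matrix completion, the following NumPy sketch alternates a unit-step update on the observed entries with a truncated-SVD projection onto rank-r matrices. The unit step size, the zero initialization, the problem sizes, and the function names are all assumptions chosen for illustration; the specific variant analyzed in the paper may differ.

```python
import numpy as np

def project_rank(X, r):
    """Best rank-r approximation via truncated SVD (the hard-thresholding step)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def iht_matrix_completion(M_obs, mask, r, n_iter=200):
    """Iterate X <- P_r(X + mask * (M_obs - X)), starting from the zero matrix."""
    X = np.zeros_like(M_obs)
    for _ in range(n_iter):
        # Unit-step move toward the data on observed entries, then project.
        X = project_rank(X + mask * (M_obs - X), r)
    return X

rng = np.random.default_rng(0)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 ground truth
mask = rng.random((n, n)) < 0.6                                # uniform sampling pattern
M_obs = mask * M                                               # unobserved entries set to 0

X_hat = iht_matrix_completion(M_obs, mask, r)
residual_start = np.linalg.norm(M_obs)                 # observed-entry residual at X = 0
residual_end = np.linalg.norm(mask * (M - X_hat))      # observed-entry residual at X_hat
```

With a unit step, each iteration cannot increase the residual on the observed entries, which is why the final residual is compared against the starting one.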

Loss function (machine learning) · Functional · Networking · Neural Networks · Linear

December 29, 2021

Deep adaptive basis Galerkin method for high-dimensional evolution equations with oscillatory solutions

Yiqi Gu,Michael K. Ng

In this paper, we study deep neural networks (DNNs) for solving high-dimensional evolution equations with oscillatory solutions. Unlike deep least-squares methods, which treat the time and space variables simultaneously, we propose a deep adaptive basis Galerkin (DABG) method that employs the spectral-Galerkin method with a tensor-product basis for the time variable, to handle oscillatory solutions, and the deep neural network method for the high-dimensional space variables. The proposed method leads to a linear system of differential equations with unknown DNNs that can be trained via the loss function. We establish a posteriori estimates of the solution error, which is bounded by the minimal loss function and a term $O(N^{-m})$, where $N$ is the number of basis functions and $m$ characterizes the regularity of the equation, and show that if the true solution is a Barron-type function, the error bound converges to zero as $M=O(N^p)$ approaches infinity, where $M$ is the width of the networks used and $p$ is a positive constant. Numerical examples, including high-dimensional linear parabolic and hyperbolic equations and the nonlinear Allen-Cahn equation, are presented to demonstrate that the proposed DABG method performs better than existing DNNs.

Regularization term · Approximation · Estimation/estimator · Principle · CASE

December 27, 2021

A unified framework for the regularization of final value time-fractional diffusion equation

Walter Simo Tao Lee

This paper focuses on the regularization of the backward time-fractional diffusion problem on an unbounded domain. This problem is well known to be ill-posed, hence the need for a regularization method in order to recover a stable approximate solution. For the problem under consideration, we present a unified framework of regularization which covers techniques such as Fourier regularization [19], mollification [12] and approximate-inverse [7]. We investigate a regularization technique with two major advantages: the simplicity of computing the regularized solution and the avoidance of truncating high-frequency components (so as to avoid undesirable oscillation in the resulting approximate solution). Under classical Sobolev-smoothness conditions, we derive order-optimal error estimates between the approximate solution and the exact solution in the case where both the data and the model are only approximately known. In addition, an order-optimal a posteriori parameter choice rule based on the Morozov principle is given. Finally, via some numerical experiments in two-dimensional space, we illustrate the efficiency of our regularization approach and we numerically confirm the theoretical convergence rates established in the paper.

Global minimum · Optimizer · Minimum · Nonconvex · Approximation

March 24, 2021

Why Do Local Methods Solve Nonconvex Problems?

Tengyu Ma

from arxiv, This is Chapter 21 of the book "Beyond the Worst-Case Analysis of Algorithms"

Non-convex optimization is ubiquitous in modern machine learning. Researchers devise non-convex objective functions and optimize them using off-the-shelf optimizers such as stochastic gradient descent and its variants, which leverage the local geometry and update iteratively. Even though solving non-convex functions is NP-hard in the worst case, the optimization quality in practice is often not an issue -- optimizers are largely believed to find approximate global minima. Researchers hypothesize a unified explanation for this intriguing phenomenon: most of the local minima of the practically-used objectives are approximately global minima. We rigorously formalize it for concrete instances of machine learning problems.

Video captioning (Video Caption) · Transform · Sparse · Multimodal · Reducible

July 23, 2020

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Tao Jin,Siyu Huang,Ming Chen,Yingming Li,Zhongfei Zhang

from arxiv, Appearing at IJCAI 2020

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs a boundary-aware pooling operation for scores from multihead attention and selects diverse features from different scenarios. Also, SBAT includes a local correlation scheme to compensate for the local information loss brought by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics.

Importance sampling · Sample space · Variance reduction · Sample · Monte Carlo

August 23, 2018

Learning to Importance Sample in Primary Sample Space

Quan Zheng,Matthias Zwicker

from arxiv, Submitted to SIGGRAPH ASIA'18

Importance sampling is one of the most widely used variance reduction strategies in Monte Carlo rendering. In this paper, we propose a novel importance sampling technique that uses a neural network to learn how to sample from a desired density represented by a set of samples. Our approach considers an existing Monte Carlo rendering algorithm as a black box. During a scene-dependent training phase, we learn to generate samples with a desired density in the primary sample space of the rendering algorithm using maximum likelihood estimation. We leverage a recent neural network architecture that was designed to represent real-valued non-volume preserving ('Real NVP') transformations in high dimensional spaces. We use Real NVP to non-linearly warp primary sample space and obtain desired densities. In addition, Real NVP efficiently computes the determinant of the Jacobian of the warp, which is required to implement the change of integration variables implied by the warp. A main advantage of our approach is that it is agnostic of underlying light transport effects, and can be combined with many existing rendering techniques by treating them as a black box. We show that our approach leads to effective variance reduction in several practical scenarios.
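The variance-reduction effect of importance sampling can be seen in a one-dimensional toy problem. The sketch below does not use a neural network or Real NVP; it swaps in a hand-picked analytic proposal density p(x) = 2x on (0, 1] to estimate the integral of x² over [0, 1], whose exact value is 1/3. The integrand, the proposal, and the sample count are illustrative assumptions.

```python
import numpy as np

def integrand(x):
    """Toy integrand; its integral over [0, 1] is exactly 1/3."""
    return x ** 2

rng = np.random.default_rng(0)
n_samples = 100_000

# Baseline: uniform samples on (0, 1], density 1, so each term is just f(x).
u = 1.0 - rng.random(n_samples)
uniform_terms = integrand(u)

# Importance sampling: draw x from p(x) = 2x via the inverse CDF x = sqrt(u),
# then weight each sample by 1 / p(x).
x = np.sqrt(1.0 - rng.random(n_samples))
importance_terms = integrand(x) / (2.0 * x)

uniform_estimate = uniform_terms.mean()
importance_estimate = importance_terms.mean()
uniform_variance = uniform_terms.var()
importance_variance = importance_terms.var()
```

Because p(x) = 2x follows the shape of the integrand more closely than the uniform density does, the per-sample terms vary less; analytically the per-sample variance drops from 4/45 to 1/72 in this example.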

Smoothing · Attention mechanism · Backpropagation · Viterbi algorithm · Regularization term

February 20, 2018

Differentiable Dynamic Programming for Structured Prediction and Attention

Arthur Mensch,Mathieu Blondel

Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows us to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.
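One concrete instance of smoothing the max operator is to take the negative entropy as the strongly convex regularizer, which turns the max into a scaled log-sum-exp whose gradient is a softmax. The sketch below shows only this single operator, not the full smoothed Viterbi or DTW recursion; the function names and the temperature parameter `gamma` are illustrative assumptions.

```python
import numpy as np

def smoothed_max(x, gamma=1.0):
    """Entropy-regularized max: gamma * log(sum(exp(x / gamma))), computed stably."""
    x = np.asarray(x, dtype=float)
    shift = x.max()
    return shift + gamma * np.log(np.sum(np.exp((x - shift) / gamma)))

def smoothed_argmax(x, gamma=1.0):
    """Gradient of smoothed_max with respect to x: softmax(x / gamma)."""
    x = np.asarray(x, dtype=float)
    w = np.exp((x - x.max()) / gamma)
    return w / w.sum()

scores = np.array([1.0, 3.0, 2.0])
gamma = 0.1
value = smoothed_max(scores, gamma)        # slightly above the hard max
weights = smoothed_argmax(scores, gamma)   # soft one-hot vector, differentiable in scores
```

As `gamma` shrinks, `value` approaches the hard maximum and `weights` approaches a one-hot indicator of the argmax; the smoothed value always lies between max(x) and max(x) + gamma · log(len(x)).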