🗽Attention Is All You Need

The encoder and decoder rely entirely on an attention mechanism to draw global dependencies between input and output.
 
Learning process
  • Encoder & Decoder
  • Attention
  • Embeddings & Softmax
  • FFN
  • Positional Encoding

Model Architecture

[Figure: the Transformer model architecture]
 
Start from a standard encoder-decoder architecture: the encoder maps the input sequence to a sequence of continuous representations, and the decoder then generates the output one symbol at a time, auto-regressively consuming the symbols generated so far.

 

Encoder and Decoder Stacks

For each layer we use a residual connection around each of the two sub-layers, followed by layer normalization.
[Figure: residual connection and layer normalization around a sub-layer]
A residual connection means the sub-layer's input is added back to its output, so the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
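A minimal PyTorch sketch of this residual-plus-normalization wrapper, assuming the paper's post-norm ordering (the class and argument names are illustrative, not the paper's):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization.

    Computes LayerNorm(x + dropout(sublayer(x))); the paper applies dropout
    to the sub-layer output before it is added to the input.
    """
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # `sublayer` is any callable mapping (batch, seq, d_model) -> same shape,
        # e.g. a lambda wrapping a self-attention call (see the lambda note below).
        return self.norm(x + self.dropout(sublayer(x)))
```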
  • Encoder
Each layer = multi-head self-attention mechanism + position-wise fully connected feed-forward network
`lambda x:` — an anonymous function. In Python, an anonymous function (a lambda) is a function without a name, defined with the `lambda` keyword instead of a regular `def` statement. Lambdas are typically used for simple operations where a small function object is needed inline — for example, wrapping the self-attention call so it can be passed as the `sublayer` argument in the sketch above.
 
 
  • Decoder
Each layer = self-attention + source-attention (over the encoder output) + feed-forward
Mask: ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$ (see the sketch just below).
[Figure: the subsequent-position mask — each position may only attend to itself and earlier positions]
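A small sketch of how such a mask can be built (the name `subsequent_mask` and the convention that True marks allowed positions are illustrative; conventions vary across implementations):

```python
import torch

def subsequent_mask(size: int) -> torch.Tensor:
    """Boolean mask of shape (1, size, size): entry (i, j) is True
    iff position i may attend to position j, i.e. j <= i."""
    attn_shape = (1, size, size)
    # 1s strictly above the diagonal mark the "future" positions.
    future = torch.triu(torch.ones(attn_shape), diagonal=1)
    return future == 0  # True where attending is allowed

# subsequent_mask(3)[0] ->
# tensor([[ True, False, False],
#         [ True,  True, False],
#         [ True,  True,  True]])
```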

 

Attention

Reference video: 3Blue1Brown, 【官方双语】直观解释注意力机制,Transformer的核心 | 【深度学习第6章】 (bilibili) — an intuitive walkthrough of multi-head, self-, and cross-attention.
  • Scaled Dot-Product Attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
$Q$: queries, $K$: keys, $V$: values ($d_k$ is the dimension of the queries and keys)
Why divide by $\sqrt{d_k}$?
For large $d_k$, the dot products grow large in magnitude and push the softmax into regions with extremely small gradients; scaling by $1/\sqrt{d_k}$ counteracts this (a code sketch follows after the multi-head bullet below).
 
 
  • Multi-head attention
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$
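A compact PyTorch sketch of scaled dot-product attention as described above (function and argument names are illustrative; the mask convention matches the sketch in the decoder section, with 0 marking disallowed positions):

```python
import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    """Scaled dot-product attention.

    query, key: (..., seq_len, d_k); value: (..., seq_len, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = query.size(-1)
    # Similarity scores, scaled by sqrt(d_k) to keep the softmax well-behaved.
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a very large negative score.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = scores.softmax(dim=-1)
    if dropout is not None:
        weights = dropout(weights)
    return torch.matmul(weights, value), weights
```

Multi-head attention then runs $h$ such attention computations in parallel on learned linear projections of $Q$, $K$, and $V$ (each head of size $d_k = d_{\text{model}}/h$) and concatenates the results before a final linear projection $W^O$.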
 

 

Position-wise Feed-Forward Networks

The FFN is applied to each position separately and identically.
FFN = two linear transformations with a ReLU in between: $\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)W_2 + b_2$
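A minimal PyTorch sketch of this position-wise feed-forward network (dimensions follow the paper's base model, $d_{\text{model}} = 512$, $d_{ff} = 2048$; the class name is illustrative):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # nn.Linear acts on the last dimension, so every position is
        # transformed separately and identically.
        return self.w_2(self.dropout(self.w_1(x).relu()))
```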
 

 

Embeddings and Softmax

`lut` stands for “look-up table”, i.e. the embedding weight matrix.
In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
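A sketch of such an embedding layer in PyTorch, following the `lut` naming mentioned above (treat it as illustrative):

```python
import math
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.lut = nn.Embedding(vocab_size, d_model)  # the look-up table
        self.d_model = d_model

    def forward(self, tokens):
        # Scale the embedding weights by sqrt(d_model), as the paper describes.
        return self.lut(tokens) * math.sqrt(self.d_model)
```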

 

Positional Encoding

Since the model contains no recurrence and no convolution, to make use of the order of the sequence we
inject information about the relative or absolute position of the tokens, using sine and cosine functions of different frequencies: $PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$, $PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$.
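A sketch of the sinusoidal encoding, precomputed once and added to the embeddings (class name and buffer handling are illustrative):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        # 1 / 10000^(2i/d_model), computed in log space for numerical stability.
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))   # shape (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions.
        return x + self.pe[:, : x.size(1)]
```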
 

 

Full Model
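The sketches above can be wired together into the full encoder-decoder. Alternatively, PyTorch ships a ready-made `nn.Transformer` module whose defaults match the paper's base configuration; note that it covers only the encoder and decoder stacks, so embeddings, positional encoding, and the final projection/softmax must still be added around it. A minimal instantiation (hyperparameter names are PyTorch's, values are the paper's base model):

```python
import torch
import torch.nn as nn

# Base configuration from the paper: d_model=512, h=8, N=6 layers, d_ff=2048.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, source length, d_model)
tgt = torch.rand(2, 7, 512)   # (batch, target length, d_model)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(7)  # hide future positions
out = model(src, tgt, tgt_mask=tgt_mask)  # -> (2, 7, 512)
```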


 

Original Paper

Vaswani et al., “Attention Is All You Need”, NeurIPS 2017. arXiv:1706.03762
