🗽Attention Is All You Need

The encoder and decoder rely entirely on an attention mechanism to draw global dependencies between input and output.
 
Learning process
  • Encoder & Decoder
  • Attention
  • Embeddings & Softmax
  • FFN
  • Positional Encoding

Model Architecture

[Figure: the Transformer model architecture]
 
Start from a standard encoder-decoder architecture: the encoder maps the input sequence to a sequence of continuous representations, and the decoder then generates the output one symbol at a time, auto-regressively consuming the symbols generated so far.

 

Encoder and Decoder Stacks

For each layer we use a residual connection around each of the two sub-layers, followed by layer normalization.
[Figure: residual connection and layer normalization around a sub-layer]
A residual connection means the sub-layer's input is added back to its output, so the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
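A minimal PyTorch sketch of this residual-plus-normalization wrapper, assuming the paper's post-norm ordering (the class and argument names are illustrative, not the paper's):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization.

    Computes LayerNorm(x + dropout(sublayer(x))); the paper applies dropout
    to the sub-layer output before it is added to the input.
    """
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # `sublayer` is any callable mapping (batch, seq, d_model) -> same shape,
        # e.g. a lambda wrapping a self-attention call (see the lambda note below).
        return self.norm(x + self.dropout(sublayer(x)))
```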
  • Encoder
Each layer = multi-head self-attention mechanism + position-wise fully connected feed-forward network
`lambda x:` — an anonymous function. In Python, an anonymous function (a lambda) is a function without a name, defined with the `lambda` keyword instead of a regular `def` statement. Lambdas are typically used for simple operations where a small function object is needed inline — for example, wrapping the self-attention call so it can be passed as the `sublayer` argument in the sketch above.
 
 
  • Decoder
Each layer = self-attention + source-attention (over the encoder output) + feed-forward
Mask: ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$ (see the sketch just below).
[Figure: the subsequent-position mask — each position may only attend to itself and earlier positions]
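A small sketch of how such a mask can be built (the name `subsequent_mask` and the convention that True marks allowed positions are illustrative; conventions vary across implementations):

```python
import torch

def subsequent_mask(size: int) -> torch.Tensor:
    """Boolean mask of shape (1, size, size): entry (i, j) is True
    iff position i may attend to position j, i.e. j <= i."""
    attn_shape = (1, size, size)
    # 1s strictly above the diagonal mark the "future" positions.
    future = torch.triu(torch.ones(attn_shape), diagonal=1)
    return future == 0  # True where attending is allowed

# subsequent_mask(3)[0] ->
# tensor([[ True, False, False],
#         [ True,  True, False],
#         [ True,  True,  True]])
```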

 

Attention

Reference video: 3Blue1Brown, 【官方双语】直观解释注意力机制,Transformer的核心 | 【深度学习第6章】 (bilibili) — an intuitive walkthrough of multi-head, self-, and cross-attention.
  • Scaled Dot-Product Attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
$Q$: queries, $K$: keys, $V$: values ($d_k$ is the dimension of the queries and keys)
Why divide by $\sqrt{d_k}$?
For large $d_k$, the dot products grow large in magnitude and push the softmax into regions with extremely small gradients; scaling by $1/\sqrt{d_k}$ counteracts this (a code sketch follows after the multi-head bullet below).
 
 
  • Multi-head attention
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$
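A compact PyTorch sketch of scaled dot-product attention as described above (function and argument names are illustrative; the mask convention matches the sketch in the decoder section, with 0 marking disallowed positions):

```python
import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    """Scaled dot-product attention.

    query, key: (..., seq_len, d_k); value: (..., seq_len, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = query.size(-1)
    # Similarity scores, scaled by sqrt(d_k) to keep the softmax well-behaved.
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a very large negative score.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = scores.softmax(dim=-1)
    if dropout is not None:
        weights = dropout(weights)
    return torch.matmul(weights, value), weights
```

Multi-head attention then runs $h$ such attention computations in parallel on learned linear projections of $Q$, $K$, and $V$ (each head of size $d_k = d_{\text{model}}/h$) and concatenates the results before a final linear projection $W^O$.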
 

 

Position-wise Feed-Forward Networks

The FFN is applied to each position separately and identically.
FFN = two linear transformations with a ReLU in between: $\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)W_2 + b_2$
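A minimal PyTorch sketch of this position-wise feed-forward network (dimensions follow the paper's base model, $d_{\text{model}} = 512$, $d_{ff} = 2048$; the class name is illustrative):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # nn.Linear acts on the last dimension, so every position is
        # transformed separately and identically.
        return self.w_2(self.dropout(self.w_1(x).relu()))
```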
 

 

Embeddings and Softmax

`lut` stands for “look-up table”, i.e. the embedding weight matrix.
In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
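A sketch of such an embedding layer in PyTorch, following the `lut` naming mentioned above (treat it as illustrative):

```python
import math
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.lut = nn.Embedding(vocab_size, d_model)  # the look-up table
        self.d_model = d_model

    def forward(self, tokens):
        # Scale the embedding weights by sqrt(d_model), as the paper describes.
        return self.lut(tokens) * math.sqrt(self.d_model)
```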

 

Positional Encoding

Since the model contains no recurrence and no convolution, to make use of the order of the sequence we
inject information about the relative or absolute position of the tokens, using sine and cosine functions of different frequencies: $PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$, $PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$.
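A sketch of the sinusoidal encoding, precomputed once and added to the embeddings (class name and buffer handling are illustrative):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        # 1 / 10000^(2i/d_model), computed in log space for numerical stability.
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))   # shape (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions.
        return x + self.pe[:, : x.size(1)]
```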
 

 

Full Model
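The sketches above can be wired together into the full encoder-decoder. Alternatively, PyTorch ships a ready-made `nn.Transformer` module whose defaults match the paper's base configuration; note that it covers only the encoder and decoder stacks, so embeddings, positional encoding, and the final projection/softmax must still be added around it. A minimal instantiation (hyperparameter names are PyTorch's, values are the paper's base model):

```python
import torch
import torch.nn as nn

# Base configuration from the paper: d_model=512, h=8, N=6 layers, d_ff=2048.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, source length, d_model)
tgt = torch.rand(2, 7, 512)   # (batch, target length, d_model)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(7)  # hide future positions
out = model(src, tgt, tgt_mask=tgt_mask)  # -> (2, 7, 512)
```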


 

Original Paper

Vaswani et al., “Attention Is All You Need”, NeurIPS 2017. arXiv:1706.03762
