🗽Attention Is All You Need
The encoder and decoder rely entirely on an attention mechanism to draw global dependencies between input and output.
Learning process
- Encoder & Decoder
- Attention
- Embeddings & Softmax
- FFN
- Positional Encoding
Model Architecture

The Transformer starts from a standard encoder-decoder architecture.
Encoder and Decoder Stacks
In each layer, we use a residual connection around each of the two sub-layers, followed by layer normalization.

A residual connection means the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself.
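A minimal PyTorch sketch of this residual-plus-normalization wrapper. The class name `SublayerConnection` is borrowed from the Annotated Transformer, not the paper itself; the ordering here follows the paper's $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, with dropout applied to the sub-layer output before the residual add.

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection around a sub-layer, followed by layer normalization:
    output = LayerNorm(x + Dropout(Sublayer(x)))."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # `sublayer` is any callable mapping (batch, seq, d_model) to the same
        # shape, e.g. a lambda wrapping self-attention, or the feed-forward net
        return self.norm(x + self.dropout(sublayer(x)))

# Usage: x = sublayer_connection(x, lambda x: self_attn(x, x, x, mask))
```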
- Encoder
Each layer = multi-head self-attention mechanism + position-wise fully connected feed-forward network
lambda x: anonymous function
In Python, an anonymous function (a lambda function) is a function without a name. It is defined with the keyword lambda instead of the usual def statement. Anonymous functions are typically used for simple tasks, such as concisely defining a small function wherever a function object is needed (as in the residual-connection sketch above, where a lambda wraps the self-attention call).
- Decoder
Each layer = self-attn + src-attn (attention over the encoder output) + feed-forward
Mask: ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$ (see the sketch after this list).
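A minimal sketch of such a mask; the helper name `subsequent_mask` is borrowed from the Annotated Transformer, not the paper. Position $i$ is allowed to attend only to positions $\le i$.

```python
import torch

def subsequent_mask(size):
    """Mask out future positions: entry (i, j) is True only when j <= i."""
    attn_shape = (1, size, size)
    upper = torch.triu(torch.ones(attn_shape), diagonal=1)
    return upper == 0

# subsequent_mask(3)[0] ->
# tensor([[ True, False, False],
#         [ True,  True, False],
#         [ True,  True,  True]])
```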

Attention
- Scaled Dot-Product Attention

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q$: queries, $K$: keys, $V$: values.
Why scale by $\frac{1}{\sqrt{d_k}}$? For large $d_k$ the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients; scaling counteracts this effect. A code sketch follows below.
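A minimal PyTorch sketch of the formula above. The function name `attention`, the optional mask argument, and the $-10^9$ fill value for masked positions are implementation choices, not the paper's code.

```python
import math
import torch

def attention(query, key, value, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # a large negative score zeroes out masked positions after the softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    return torch.matmul(p_attn, value), p_attn
```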

- Multi-head attention
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
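In the paper, $\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W^O$ with $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. Below is a minimal sketch reusing the `attention` function above; the class name and the single fused `nn.Linear` projection per $Q$, $K$, $V$ are implementation choices, not the paper's code.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project Q, K, V into h subspaces, run scaled dot-product attention
    in each head in parallel, then concatenate and project back."""
    def __init__(self, h, d_model):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by the number of heads"
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # same mask applied to every head
        n_batch = query.size(0)
        # (batch, seq, d_model) -> (batch, h, seq, d_k)
        q = self.w_q(query).view(n_batch, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(key).view(n_batch, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(value).view(n_batch, -1, self.h, self.d_k).transpose(1, 2)
        # scaled dot-product attention (sketched earlier), applied per head
        x, _ = attention(q, k, v, mask=mask)
        # (batch, h, seq, d_k) -> (batch, seq, d_model)
        x = x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
        return self.w_o(x)
```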

Position-wise Feed-Forward Networks
FFNs are applied to each position separately and identically
FFN = 2 linear transformations with a ReLU in between: $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$
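A minimal sketch of this formula. The class name and the dropout between the two linear layers are implementation choices; the base model in the paper uses $d_{\text{model}} = 512$ and $d_{ff} = 2048$.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position identically."""
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(torch.relu(self.w_1(x))))
```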
Embeddings and Softmax
`lut` stands for "look-up table". In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
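A minimal sketch of such an embedding layer; the class name `Embeddings` and the attribute name `lut` follow the Annotated Transformer's conventions.

```python
import math
import torch.nn as nn

class Embeddings(nn.Module):
    """Token embedding scaled by sqrt(d_model); `lut` is the look-up table."""
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
```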
Positional Encoding
To make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.
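The paper uses fixed sinusoids: $PE_{(pos,\,2i)} = \sin\!\big(pos/10000^{2i/d_{\text{model}}}\big)$ and $PE_{(pos,\,2i+1)} = \cos\!\big(pos/10000^{2i/d_{\text{model}}}\big)$. Below is a minimal sketch in the style of the Annotated Transformer; the class name, `max_len=5000`, and the dropout applied to the sum are implementation choices.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Add fixed sinusoidal position information to the embeddings."""
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        # 10000^(-2i/d_model) for the even dimensions 0, 2, 4, ...
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```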
Full Model
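A sketch assembling the pieces above into a full encoder-decoder stack. It reuses the `SublayerConnection`, `MultiHeadAttention`, `PositionwiseFeedForward`, `Embeddings`, and `PositionalEncoding` modules sketched earlier; the class names `EncoderLayer`, `DecoderLayer`, `Transformer`, and the final linear + log-softmax generator are assumptions rather than the paper's code. Defaults follow the base model (N = 6, d_model = 512, d_ff = 2048, h = 8).

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention + feed-forward, each wrapped in a residual sub-layer."""
    def __init__(self, d_model, self_attn, feed_forward, dropout=0.1):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sub1 = SublayerConnection(d_model, dropout)
        self.sub2 = SublayerConnection(d_model, dropout)

    def forward(self, x, mask):
        x = self.sub1(x, lambda x: self.self_attn(x, x, x, mask))
        return self.sub2(x, self.feed_forward)

class DecoderLayer(nn.Module):
    """Masked self-attention + attention over the encoder output + feed-forward."""
    def __init__(self, d_model, self_attn, src_attn, feed_forward, dropout=0.1):
        super().__init__()
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sub1 = SublayerConnection(d_model, dropout)
        self.sub2 = SublayerConnection(d_model, dropout)
        self.sub3 = SublayerConnection(d_model, dropout)

    def forward(self, x, memory, src_mask, tgt_mask):
        x = self.sub1(x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sub2(x, lambda x: self.src_attn(x, memory, memory, src_mask))
        return self.sub3(x, self.feed_forward)

class Transformer(nn.Module):
    """N encoder layers + N decoder layers + linear/log-softmax over the target vocab."""
    def __init__(self, src_vocab, tgt_vocab, N=6, d_model=512,
                 d_ff=2048, h=8, dropout=0.1):
        super().__init__()
        self.src_embed = nn.Sequential(Embeddings(d_model, src_vocab),
                                       PositionalEncoding(d_model, dropout))
        self.tgt_embed = nn.Sequential(Embeddings(d_model, tgt_vocab),
                                       PositionalEncoding(d_model, dropout))
        self.encoder = nn.ModuleList([
            EncoderLayer(d_model, MultiHeadAttention(h, d_model),
                         PositionwiseFeedForward(d_model, d_ff, dropout), dropout)
            for _ in range(N)])
        self.decoder = nn.ModuleList([
            DecoderLayer(d_model, MultiHeadAttention(h, d_model),
                         MultiHeadAttention(h, d_model),
                         PositionwiseFeedForward(d_model, d_ff, dropout), dropout)
            for _ in range(N)])
        self.generator = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.src_embed(src)
        for layer in self.encoder:
            memory = layer(memory, src_mask)
        x = self.tgt_embed(tgt)
        for layer in self.decoder:
            x = layer(x, memory, src_mask, tgt_mask)
        return torch.log_softmax(self.generator(x), dim=-1)
```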
Original Paper: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), arXiv:1706.03762.


