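The softmax function can be sensitive to very large input values. Such inputs make the gradient vanish, so learning slows down or stops altogether. Since the average magnitude of the dot product grows with the embedding dimension k, it helps to scale the dot product back a little before feeding it to the softmax.
Previously, the raw weights going into the softmax were:
$$w'_{ij} = q_i^T k_j$$
Now they are:
$$w'_{ij} = \frac{q_i^T k_j}{\sqrt{k}}$$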
Why $\sqrt{k}$? Imagine a vector in $\mathbb{R}^k$ with values all c. Its Euclidean length is $\sqrt{k}c$. Therefore, we are dividing out the amount by which the increase in dimension increases the length of the average vectors.
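As a rough numerical check of this argument, the sketch below samples random query and key vectors and measures the spread of their dot products with and without the $\sqrt{k}$ scaling. It assumes the components are drawn independently from a standard normal distribution, which is an illustrative assumption and not something the text prescribes. The helper name `dot_product_std` is hypothetical, and plain Python is used to keep the example dependency-free.

```python
import math
import random


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def dot_product_std(k, scale, trials=500, seed=0):
    """Standard deviation of q.key over random vectors in R^k.

    Assumption: components are i.i.d. standard normal.
    If scale is True, each dot product is divided by sqrt(k).
    """
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(k)]
        key = [rng.gauss(0.0, 1.0) for _ in range(k)]
        w = dot(q, key)
        if scale:
            w /= math.sqrt(k)
        values.append(w)
    mean = sum(values) / trials
    return math.sqrt(sum((v - mean) ** 2 for v in values) / trials)


# Unscaled, the spread grows roughly like sqrt(k); scaled, it stays near 1.
raw_std = dot_product_std(256, scale=False)
scaled_std = dot_product_std(256, scale=True)
print(raw_std, scaled_std)
```

Under this assumption the unscaled spread for k = 256 comes out near 16, while the scaled spread stays close to 1, so the softmax inputs remain in a similar range whatever the embedding dimension.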
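3.4.3 Introducing Multi-Head Attention
Finally, we must account for the fact that the same word can mean different things depending on its neighbours. Consider the following sentence: mary, gave, roses, to, susan.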
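The simplest transformer we can build is a sequence classifier. We will use the IMDb (Internet Movie Database) sentiment classification dataset:
the instances are movie reviews;
they are tokenized into sequences of words;
the classification labels are positive and negative (whether the review was positive or negative about the movie).
The heart of the architecture is very simple: a long chain of transformer blocks. All that remains to work out is:
how to feed the input sequence into this chain;
how to transform the final output sequence into a single classification.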
4.3.1 Output: a single classification
The most common way to build a sequence classifier out of sequence-to-sequence layers is to apply global average pooling to the final output sequence and map the result to a softmaxed class vector.
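The pooling-and-projection step can be sketched in a few lines. This is a minimal illustration in plain Python, not the model code itself: the toy output sequence, the projection weights and the helper names `mean_pool`, `linear` and `softmax` are all hypothetical values chosen so the arithmetic is easy to follow.

```python
import math


def mean_pool(sequence):
    """Average t vectors of length k into a single vector of length k."""
    t = len(sequence)
    k = len(sequence[0])
    return [sum(vec[d] for vec in sequence) / t for d in range(k)]


def linear(vec, weights, bias):
    """Project a length-k vector to class logits (one weight row per class)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical final output sequence: t = 3 positions, k = 2 dimensions.
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Hypothetical projection to 2 classes (identity weights, zero bias).
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

pooled = mean_pool(outputs)                      # one vector for the whole sequence
probs = softmax(linear(pooled, weights, bias))   # class probabilities
print(pooled, probs)
```

Whatever the sequence length t, the pooled vector always has length k, which is what lets a single linear layer produce the class scores.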
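4.3.2 Input: using the positions
We have already discussed the principle of an embedding layer; that is what we will use to represent the words.
As mentioned earlier, we are stacking permutation equivariant layers, and the final global average pooling is permutation invariant, so the network as a whole is also permutation invariant. Put plainly: if we shuffle the words in a sentence, we get exactly the same classification, whatever weights we learn. Clearly, we want our state-of-the-art language model to have at least some sensitivity to word order, so this needs to be fixed.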
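The invariance argument above can be checked on a toy example. The sketch below uses a parameter-free self-attention layer (dot-product weights, softmax, weighted sum) followed by mean pooling; this is a simplification for illustration, with hypothetical input vectors, and it is written in plain Python rather than PyTorch.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Parameter-free self-attention: y_i = sum_j softmax_j(x_i . x_j / sqrt(k)) x_j."""
    k = len(xs[0])
    ys = []
    for xi in xs:
        raw = [sum(a * b for a, b in zip(xi, xj)) / math.sqrt(k) for xj in xs]
        w = softmax(raw)
        ys.append([sum(wj * xj[d] for wj, xj in zip(w, xs)) for d in range(k)])
    return ys


def mean_pool(sequence):
    t = len(sequence)
    return [sum(vec[d] for vec in sequence) / t for d in range(len(sequence[0]))]


# Hypothetical embeddings for a four-word sentence, and a shuffled copy.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
shuffled = [sentence[2], sentence[0], sentence[3], sentence[1]]

pooled_a = mean_pool(self_attention(sentence))
pooled_b = mean_pool(self_attention(shuffled))

# Equal up to floating-point rounding: the word order has no effect.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(pooled_a, pooled_b))
print(same)
```

The per-position outputs move along with the words (equivariance), and averaging over positions then discards the order entirely (invariance).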
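The solution is simple: we create a second vector, of the same length as the word embedding, that represents the position of the word in the current sentence, and add it to the word embedding. There are two options for implementing this. The code below uses a learned position embedding (`pos_emb`).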
    import torch
    import torch.nn.functional as F
    from torch import nn

    # TransformerBlock is assumed to be the block defined earlier in this document.

    class Transformer(nn.Module):
        def __init__(self, k, heads, depth, seq_length, num_tokens, num_classes):
            super().__init__()

            self.num_tokens = num_tokens
            self.token_emb = nn.Embedding(num_tokens, k)
            self.pos_emb = nn.Embedding(seq_length, k)

            # The sequence of transformer blocks that does all the heavy lifting
            tblocks = []
            for i in range(depth):
                tblocks.append(TransformerBlock(k=k, heads=heads))
            self.tblocks = nn.Sequential(*tblocks)

            # Maps the final output sequence to class logits
            self.toprobs = nn.Linear(k, num_classes)

        def forward(self, x):
            """
            :param x: A (b, t) tensor of integer values representing words (in some predetermined vocabulary).
            :return: A (b, c) tensor of log-probabilities over the classes (where c is the nr. of classes).
            """
            # generate token embeddings
            tokens = self.token_emb(x)
            b, t, k = tokens.size()

            # generate position embeddings (on the same device as the input)
            positions = torch.arange(t, device=x.device)
            positions = self.pos_emb(positions)[None, :, :].expand(b, t, k)

            # Why are the token and position embeddings simply added? There is no
            # particular theory behind it; it appears to just work well in practice.
            # See https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
            x = tokens + positions
            x = self.tblocks(x)

            # Average-pool over the t dimension and project to class probabilities
            x = self.toprobs(x.mean(dim=1))
            return F.log_softmax(x, dim=1)
1228X Human & Rousseau. Because many of his stories were originally published in long-forgotten magazines and journals, there are a number of [[anthology|anthologies]] by different collators each containing a different selection. His original books have been considered an anthologie in the [[Middle Ages]], and were likely to be one of the most common in the [[Indian Ocean]] in the [[1st century]]. As a result of his death, the Bible was recognised as a counter-attack by the [[Gospel of Matthew]] (1177-1133), and the [[Saxony|Saxons]] of the [[Isle of Matthew]] (1100-1138), the third was a topic of the [[Saxony|Saxon]] throne, and the [[Roman Empire|Roman]] troops of [[Antiochia]] (1145-1148). The [[Roman Empire|Romans]] resigned in [[1148]] and [[1148]] began to collapse. The [[Saxony|Saxons]] of the [[Battle of Valasander]] reported the y
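4.4.3 Analysis of the generated text
A few things are worth noting about the output above:
The Wikipedia link tag syntax is used correctly, and the text inside the links represents reasonable subjects for links.
The generated content is also broadly coherent thematically: the text keeps to the Bible and the Roman Empire as themes, using different related terms at different points.
One less obvious detail: the 'Battle of Valasander' seems to have been invented by the network itself.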
I suppose the layer normalization is also nonlinear, but that is one nonlinearity that actually helps to keep the gradient stable as it propagates back down the network.
BERT (Bidirectional Encoder Representations from Transformers) was one of the first models to show that transformers could reach human-level performance on a variety of language-based tasks: question answering, sentiment classification, or classifying whether two sentences naturally follow one another.
Masking: A certain number of words in the input sequence are masked out, replaced with a random word or kept as is. The model is then asked to predict, for these words, what the original words were. Note that the model doesn't need to predict the entire denoised sentence, just the modified words. Since the model doesn't know which words it will be asked about, it learns a representation for every word in the sequence.
Next Sequence Classification: Two sequences of about 256 words are sampled that either (a) follow each other directly in the corpus, or (b) are both taken from random places. The model must then predict whether a or b is the case.
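The masking task described above can be sketched as a small corruption routine. The text does not give the proportions; the values used here (15% of positions selected, of which 80% become a mask token, 10% a random word, 10% left unchanged) are the ones reported in the BERT paper and are hard-coded as assumptions. The function name `mask_tokens`, the `[MASK]` string and the toy vocabulary are illustrative only.

```python
import random

MASK = "[MASK]"  # assumed name for the mask token


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted tokens, {position: original token}) for the masking task.

    Only the selected positions appear in the targets, so the model is asked
    to predict just the modified words, not the whole sentence.
    """
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # position not selected
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK           # masked out
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # replaced with a random word
        # otherwise: kept as is, but still a prediction target
    return corrupted, targets


vocab = ["mary", "gave", "roses", "to", "susan"]  # toy vocabulary
tokens = ["mary", "gave", "roses", "to", "susan"] * 4
corrupted, targets = mask_tokens(tokens, vocab, rng=random.Random(1))
print(corrupted)
print(targets)
```

Because some selected positions are left unchanged or replaced by a plausible word, the model cannot tell from the input which positions it will be asked about, which matches the point made above about learning a representation for every word.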
BERT uses WordPiece tokenization, which is somewhere in between word-level and character level sequences. It breaks words like walking up into the tokens walk and ##ing. This allows the model to make some inferences based on word structure: two verbs ending in -ing have similar grammatical functions, and two verbs starting with walk- have similar semantic function.
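To make the walk / ##ing example concrete, here is a minimal sketch of greedy longest-match-first splitting over a fixed vocabulary. The tiny vocabulary, the `[UNK]` fallback and the function name `wordpiece_tokenize` are assumptions for illustration; the sketch covers only how a word is split once a vocabulary exists, not how that vocabulary is learned.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching pieces, left to right.

    Pieces that do not start the word carry a '##' prefix.
    If some remainder cannot be matched, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                      # try a shorter piece
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, hypothetical.
vocab = {"walk", "talk", "##ing", "##ed", "##s"}

print(wordpiece_tokenize("walking", vocab))
print(wordpiece_tokenize("talked", vocab))
print(wordpiece_tokenize("jumping", vocab))
```

With this vocabulary, "walking" and "talked" share suffix and stem pieces with other words, which is what lets the model generalise over word structure.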
The input is prepended with a special token. The output vector corresponding to this token is used as a sentence representation in sequence classification tasks like the next sentence classification (as opposed to the global average pooling over all vectors that we used in our classification model above).
After pretraining, a single task-specific layer is placed after the body of transformer blocks, which maps the general purpose representation to a task specific output. For classification tasks, this simply maps the first output token to softmax probabilities over the classes. For more complex tasks, a final sequence-to-sequence layer is designed specifically for the task.
The whole model is then re-trained to finetune the model for the specific task at hand.
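In an ablation experiment, the authors show that the largest improvement over previous models comes from the bidirectional nature of BERT. Previous models, such as GPT, used an autoregressive mask, which allows attention only over previous tokens. In BERT, all attention is over the whole sequence, and this is the main source of the improved performance.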
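The difference between the two attention patterns can be illustrated with a small sketch. It computes attention weights from a matrix of raw scores, once with an autoregressive mask and once without. The sketch assumes the mask lets each position attend to itself and to everything before it (where exactly the diagonal falls is an assumption here), and the all-zero score matrix and the function name `attention_weights` are illustrative.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, causal):
    """Row-wise softmax over raw scores.

    With causal=True, position i may only attend to positions j <= i:
    the other scores are set to -inf, so their weight becomes exactly 0.
    """
    t = len(scores)
    weights = []
    for i in range(t):
        row = [scores[i][j] if (not causal or j <= i) else float("-inf")
               for j in range(t)]
        weights.append(softmax(row))
    return weights


# Hypothetical raw scores for a sequence of three tokens (all equal).
scores = [[0.0] * 3 for _ in range(3)]

autoregressive = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)

for row in autoregressive:
    print(row)
for row in bidirectional:
    print(row)
```

In the autoregressive case the first token can only look at itself, while in the bidirectional case every token receives information from the whole sequence, including the words that follow it.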