Transformer Architecture Explained: From RNN Challenges to Self-Attention and Word Embeddings

Transformer Architecture Overview

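In 2017, a research team at Google published the landmark paper "Attention Is All You Need", which introduced the Transformer model. The architecture greatly advanced natural language processing (NLP) and became the foundation for later large language models (LLMs) such as the Generative Pre-trained Transformer (GPT) and the Pathways Language Model (PaLM). It also shifted research away from the earlier reliance on recurrent architectures: the Recurrent Neural Network (RNN) and its variants, Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).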

Challenges Facing RNNs

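A recurrent neural network (RNN) is a type of neural network (NN) designed for sequential data such as text, audio, and time series. Its distinguishing feature is a form of 'memory': a hidden state that lets the network retain information from earlier inputs. This memory matters most in tasks that depend on context, such as understanding and generating meaning in NLP.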

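Visually, a standard RNN looks like a single computational unit with a self-connected hidden state that loops information back into itself, so that information can be passed across time steps (St):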

RNN structure diagram

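As data flows through an RNN, the activations from the previous time step are fed in as input alongside the current data, so the model can combine the current input with the history of the sequence. This is especially important for many sequence-to-sequence (Seq2Seq) prediction tasks.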
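The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a vanilla RNN with a tanh activation; the function names (`rnn_step`, `rnn_forward`) and the toy weights are hypothetical and chosen only for this example, not taken from any library.

```python
import math


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Process the sequence one step at a time, starting from a zero state."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in inputs:
        # Each step depends on the previous hidden state, so the loop
        # cannot be parallelized across time steps.
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


# Toy weights (hypothetical values, for illustration only)
W_xh = [[0.5, -0.3], [0.1, 0.8]]   # input-to-hidden
W_hh = [[0.2, 0.0], [0.0, 0.2]]    # hidden-to-hidden (the "memory" path)
b_h = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = rnn_forward(sequence, W_xh, W_hh, b_h)
for t, h in enumerate(states):
    print(t, h)
```

Because `W_hh` feeds the previous state back in, the same input vector produces a different hidden state depending on what came before it in the sequence.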

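RNNs and their variants, LSTM and GRU, were once the core of sequence modeling, designed to process data in order and capture dependencies over time. They face several key challenges, however, that limit their effectiveness and efficiency: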

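Difficulty learning long-range dependencies:

Vanishing gradients: during backpropagation through time, gradients can shrink step by step until they effectively vanish, which makes it hard for the model to learn relationships between distant elements of a sequence.
Exploding gradients: conversely, gradients can grow rapidly, which destabilizes training.

Limits of sequential processing:

The inherently sequential computation limits parallelism, so training and inference are slow on long sequences.

Computational and memory burden:

High computational cost: LSTM and GRU add gating structures to mitigate vanishing gradients, and these make each step more expensive than in a plain RNN.
Memory limits: training on long sequences requires keeping the hidden states of every time step, which takes a lot of memory and makes scaling difficult.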
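The vanishing and exploding gradient problems listed above can be seen in a deliberately simplified setting. The sketch below assumes a scalar, linear recurrence h_t = w * h_{t-1} (no activation function, no inputs), which is an assumption made for illustration; in that case the gradient of the final state with respect to the initial state is just w multiplied by itself once per time step. The function name `gradient_through_time` is hypothetical.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T with respect to h_0 for h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad


for w in (0.5, 1.0, 1.5):
    print(f"w={w}: gradient after 50 steps = {gradient_through_time(w, 50):.3e}")
```

With a recurrent weight below 1 the gradient shrinks toward zero, and with a weight above 1 it grows without bound. Real RNNs use weight matrices and nonlinear activations, but the same repeated-multiplication effect is what makes long-range dependencies hard to learn.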

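Consider a simple task: an RNN has to predict the next word in a sentence. If it has seen only one word before predicting, it is unlikely to guess correctly. Letting it look at more of the preceding words improves the prediction but requires more computational resources. Even with more resources and data the RNN still struggles, because an accurate prediction can require understanding the whole sentence or even the whole document. Looking at the previous few words is not enough; the model needs the full context.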

Advantages of the Transformer Architecture

The Transformer, introduced by Vaswani et al. in "Attention Is All You Need", was designed to overcome these limitations and offers solutions to the problems RNNs face.

Transformer architecture diagram

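The Transformer architecture consists of an encoder and a decoder, each made up of multiple layers. These layers combine multi-head self-attention with feed-forward neural networks, which work together to improve efficiency and performance.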

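A Transformer processes all the words in a sentence at once rather than one word at a time as an RNN does. This lets it capture the contextual relationships and interactions between the words in a sentence more effectively, which is critical for understanding human language.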

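In addition, the Transformer uses a technique called self-attention, which assigns a different weight to each word in the sentence and focuses on the words most relevant to the task at hand. This mechanism lets the Transformer adapt to a wide range of tasks with high accuracy.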
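To make the weighting idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the queries, keys, and values are all the raw input vectors (a real Transformer applies learned projection matrices first and uses several heads), and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.

    Assumption: learned projections are omitted to keep the sketch short.
    """
    d = len(X[0])
    outputs, weights = [], []
    for q in X:  # every token's weights could be computed independently
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        # Output is the weighted average of all token vectors
        out = [sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Three toy token vectors; the first and third are identical
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(v, 3) for v in row])
```

Each row of `weights` shows how much one token attends to every token in the sentence. Here the first token gives equal weight to itself and the identical third token, and less weight to the second. Unlike the RNN loop, no row depends on the result of another row, which is what allows the computation to run in parallel.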

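Feature	RNN	Transformer
Processing	Sequential	Parallel
Long-context understanding	Struggles to capture long-range dependencies	Captures long-range dependencies through self-attention
Scalability	Poor	Good
Application areas	Natural language processing	Natural language processing, computer vision, speech recognition, and more
Training speed	Slow	Fast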

from gensim.utils import simple_preprocess
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def compute(documents):
    # Preprocess the text: simple_preprocess lowercases and tokenizes (it does not stem)
    processed_docs = [simple_preprocess(document) for document in documents]

    # Train with CBOW (sg=0)
    model_cbow = Word2Vec(sentences=processed_docs, window=5, vector_size=100,
                          workers=5, min_count=1, sg=0)

    # Train with Skip-Gram (sg=1)
    model_skip_gram = Word2Vec(sentences=processed_docs, window=5, vector_size=100,
                               workers=5, min_count=1, sg=1)

    # Look up the vector for one word in each model (shown for illustration; not used further)
    vector_cbow = model_cbow.wv['language']
    vector_skipgram = model_skip_gram.wv['language']

    return model_cbow, model_skip_gram


def visualize(model: Word2Vec):
    # Retrieve the word vectors and their corresponding word labels from the model
    word_vectors = model.wv.vectors
    words = model.wv.index_to_key  # list of words in the model's vocabulary

    # Use t-SNE to reduce the word vectors to 2 dimensions for plotting.
    # t-SNE is a nonlinear dimensionality-reduction method (PCA is a linear alternative).
    tsne = TSNE(n_components=2, random_state=0)
    word_vectors_2d = tsne.fit_transform(word_vectors)

    # Plot the 2D word vectors and annotate each point with its word
    plt.figure(figsize=(10, 10))
    for i, word in enumerate(words):
        plt.scatter(word_vectors_2d[i, 0], word_vectors_2d[i, 1])
        plt.text(word_vectors_2d[i, 0] + 0.03, word_vectors_2d[i, 1] + 0.03, word, fontsize=9)
    plt.show()


if __name__ == "__main__":
    # Sample dataset: sentences expressing liking, love, and interest in NLP
    sentences = [
        "The brilliant data scientist loves exploring the depths of NLP techniques.",
        "I find immense joy in unraveling the mysteries behind language models.",
        "NLP enthusiasts are fascinated by the way algorithms understand human language.",
        "There's a certain beauty in teaching machines to interpret the nuances of words.",
        "Discovering new applications for text embeddings fills me with excitement.",
        "The passion for semantic analysis drives researchers to innovate.",
        "She adores the challenge of making computers comprehend linguistic subtleties.",
        "Our team is dedicated to advancing the frontiers of NLP with each project.",
        "The breakthrough in sentiment analysis has captured the interest of many.",
        "Witnessing the evolution of NLP technologies sparks a profound sense of wonder.",
    ]
    cbow_model, skip_gram_model = compute(documents=sentences)
    visualize(cbow_model)