import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
# Load the IMDB dataset; fall back to a tiny in-memory sample so the
# rest of the script still runs when the CSV is not present.
try:
    df = pd.read_csv('IMDB Dataset.csv')
except FileNotFoundError:
    # Mock data for demonstration purposes.
    df = pd.DataFrame({'review': ['I love this movie', 'This is terrible'],
                       'sentiment': ['positive', 'negative']})
def clean_text(text):
    """Normalize a review string for tokenization.

    Strips HTML tags, drops every character that is not an ASCII letter
    or whitespace, and lowercases the result. Non-string input is
    coerced with ``str()`` first.
    """
    text = re.sub(r'<.*?>', '', str(text))        # remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)       # keep letters/whitespace only
    return text.lower()
# Clean the raw reviews and encode sentiment as a binary label
# (positive -> 1, negative -> 0).
df['clean_review'] = df['review'].apply(clean_text)
label_map = {'positive': 1, 'negative': 0}
df['label'] = df['sentiment'].map(label_map)
# Hold out 20% of the data as a test split; fixed seed for reproducibility.
features = df['clean_review']
labels = df['label']
X_train, X_test, y_train, y_test = train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=42,
)
# --- 3.2 Vocabulary and encoding ---
# Build a simple bag-of-words index.
from collections import Counter
# Count word frequencies over the training split only (no test leakage).
vocab = Counter()
for text in X_train:
    vocab.update(text.split())

# Keep at most the 5000 most frequent words. Index 0 is reserved for
# padding, so real words start at 1; <UNK> takes the next free index.
vocab_size = min(len(vocab), 5000)
vocab_list = [w for w, _ in vocab.most_common(vocab_size)]
word2idx = {w: i + 1 for i, w in enumerate(vocab_list)}
word2idx['<PAD>'] = 0
word2idx['<UNK>'] = len(word2idx)
def text_to_sequence(text, max_len=100):
    """Encode a whitespace-tokenized string as a fixed-length index array.

    Words missing from the module-level ``word2idx`` map to the <UNK>
    index. Shorter sequences are right-padded with 0 (the <PAD> index);
    longer ones are truncated to ``max_len``.
    """
    seq = [word2idx.get(w, word2idx['<UNK>']) for w in text.split()]
    if len(seq) < max_len:
        seq += [0] * (max_len - len(seq))
    else:
        seq = seq[:max_len]
    return np.array(seq)
# Encode every train/test review as a fixed-length (max_len=100) index row.
X_train_seq = np.array([text_to_sequence(t) for t in X_train])
X_test_seq = np.array([text_to_sequence(t) for t in X_test])