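In 2017, a Google research team published the landmark paper "Attention Is All You Need", which first proposed the Transformer model. This architecture greatly advanced natural language processing (NLP) and became the foundation for later large language models (LLMs) such as Generative Pre-trained Transformer (GPT) and Pathways Language Model (PaLM). It fundamentally changed a research direction that had previously relied on traditional neural networks such as the Recurrent Neural Network (RNN) and its variants, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU).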
from sklearn.feature_extraction.text import TfidfVectorizer


def compute_tfidf(documents):
    # Learn the vocabulary and IDF weights, then return the sparse TF-IDF matrix
    vectorizer = TfidfVectorizer()
    return vectorizer.fit_transform(raw_documents=documents)


if __name__ == "__main__":
    docs = [
        "I love natural language processing",
        "In natural language processing, the sentences are represented as embeddings or vectors",
        "The distance between the embedding vectors gives the contextual meaning between them",
    ]
    result = compute_tfidf(documents=docs)
    # One row per document, one column per vocabulary term
    print(result.toarray())
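For intuition about what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The whitespace tokenizer and the helper name `tfidf_by_hand` are illustrative assumptions, so the numbers need not match `TfidfVectorizer` on real text.

```python
import math
from collections import Counter


def tfidf_by_hand(documents):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times IDF, one entry per vocabulary term.
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows


vocab, rows = tfidf_by_hand(["the cat sat", "the dog sat", "the cat ran"])
for row in rows:
    print([round(v, 3) for v in row])
```

A term that appears in every document ("the") receives the lowest weight in each row, while a term unique to one document ("dog") receives the highest.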
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def compute(documents):
    # Preprocess the text: lowercase and tokenize (simple_preprocess does not stem)
    processed_docs = [simple_preprocess(document) for document in documents]
    # Train a CBOW model (sg=0)
    model_cbow = Word2Vec(sentences=processed_docs, window=5, vector_size=100, workers=5, min_count=1, sg=0)
    # Train a Skip-Gram model (sg=1)
    model_skip_gram = Word2Vec(sentences=processed_docs, window=5, vector_size=100, workers=5, min_count=1, sg=1)
    # Get the vector for a word from the CBOW model (shown for illustration)
    vector_cbow = model_cbow.wv['language']
    # Get the vector for a word from the Skip-Gram model (shown for illustration)
    vector_skipgram = model_skip_gram.wv['language']
    return model_cbow, model_skip_gram


def visualize(model: Word2Vec):
    # Retrieve word vectors and corresponding word labels from the model
    word_vectors = model.wv.vectors
    words = model.wv.index_to_key  # List of words in the model
    # Use t-SNE to reduce the word vectors to 2 dimensions for visualization;
    # this is dimensionality reduction, similar in purpose to PCA
    tsne = TSNE(n_components=2, random_state=0)
    word_vectors_2d = tsne.fit_transform(word_vectors)
    # Plot the 2D word vectors with annotations
    plt.figure(figsize=(10, 10))
    for i, word in enumerate(words):
        plt.scatter(word_vectors_2d[i, 0], word_vectors_2d[i, 1])
        plt.text(word_vectors_2d[i, 0] + 0.03, word_vectors_2d[i, 1] + 0.03, word, fontsize=9)
    plt.show()


if __name__ == "__main__":
    # Sample dataset: Expressing liking, love, and interest in NLP
    sentences = [
        "The brilliant data scientist loves exploring the depths of NLP techniques.",
        "I find immense joy in unraveling the mysteries behind language models.",
        "NLP enthusiasts are fascinated by the way algorithms understand human language.",
        "There's a certain beauty in teaching machines to interpret the nuances of words.",
        "Discovering new applications for text embeddings fills me with excitement.",
        "The passion for semantic analysis drives researchers to innovate.",
        "She adores the challenge of making computers comprehend linguistic subtleties.",
        "Our team is dedicated to advancing the frontiers of NLP with each project.",
        "The breakthrough in sentiment analysis has captured the interest of many.",
        "Witnessing the evolution of NLP technologies sparks a profound sense of wonder.",
    ]
    cbow_model, skip_gram_model = compute(documents=sentences)
    visualize(cbow_model)
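The `sg` flag in the listing above switches between two training objectives. As a rough illustration (not gensim's internal implementation), CBOW predicts a center word from its surrounding context, while Skip-Gram predicts each context word from the center word. The helper `training_pairs` and the tiny window below are hypothetical and only show how the training examples are formed.

```python
def training_pairs(tokens, window=2):
    """Build illustrative (input, target) examples for CBOW and Skip-Gram."""
    cbow, skip_gram = [], []
    for i, center in enumerate(tokens):
        # Context = up to `window` tokens on each side of the center word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        # CBOW: the whole context is the input, the center word is the target.
        cbow.append((context, center))
        # Skip-Gram: the center word is the input, each context word is a target.
        skip_gram.extend((center, word) for word in context)
    return cbow, skip_gram


cbow_pairs, sg_pairs = training_pairs(["i", "love", "natural", "language"], window=1)
print(cbow_pairs[1])  # (['i', 'natural'], 'love')
print(sg_pairs[:3])   # [('i', 'love'), ('love', 'i'), ('love', 'natural')]
```

Skip-Gram produces more, smaller training examples from the same text, which is one reason it is often described as working better for rare words, while CBOW is typically faster to train.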
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Path to the downloaded GloVe file (change as needed)
glove_input_file = 'glove.6B.100d.txt'
# Output file in Word2Vec format
word2vec_output_file = 'glove.6B.100d.word2vec.txt'
# Convert GloVe format to Word2Vec format
# (deprecated in gensim 4.x, where load_word2vec_format(..., no_header=True)
# can read the GloVe file directly)
glove2word2vec(glove_input_file, word2vec_output_file)
# Load the converted model
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
# Example: Retrieve the vector for the word 'language'
word_vector = model['language']
print(word_vector)
# Perform similarity operations
print(model.most_similar('language'))
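`most_similar` ranks words by the cosine similarity of their vectors. The sketch below shows that computation in plain Python on made-up three-dimensional vectors. The toy vectors and the helper names are assumptions for illustration, not values taken from GloVe.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (made up for illustration only).
toy_vectors = {
    "language": [0.9, 0.1, 0.0],
    "speech": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
ranking = most_similar("language", toy_vectors)
print(ranking[0][0])  # speech
```

Because cosine similarity depends only on direction, two vectors of very different lengths can still score close to 1 if they point the same way.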
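OpenAI lets you use the same "core" model and fine-tune it for different use cases without retraining the core model, which would take a great deal of time and money. This contributed to the rise of pre-trained models. These models belong to the GPT family, including GPT-3 and its latest iterations, all of which are available through OpenAI's API.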
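Google AI's BERT (Bidirectional Encoder Representations from Transformers) was one of the first popular models. Among OpenAI's embedding models, text-embedding-3-small and text-embedding-3-large are the newest and most capable; they introduce a new parameter, `dimensions`, that lets users control the overall size of the returned embedding.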
Code implementation:
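from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()


def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces before sending the text
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding


embedding = get_embedding("We are lucky to live in an age in which we are still making discoveries.")
print(len(embedding))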
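The size of the returned vector can be reduced by passing the `dimensions` parameter to the embeddings endpoint. As an offline sketch of the same idea, a full-length embedding can also be shortened afterwards by keeping its leading components and re-normalizing to unit length. The helper `shorten_embedding` and the toy vector below are hypothetical, and a vector shortened this way is only expected to remain useful for models trained to support it, such as the text-embedding-3 family.

```python
import math


def shorten_embedding(embedding, dimensions):
    """Keep the first `dimensions` components and rescale to unit length."""
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        # An all-zero vector cannot be normalized; return it unchanged.
        return truncated
    return [x / norm for x in truncated]


# Toy stand-in for a real embedding (assumption: real vectors are far longer).
full = [0.5, 0.5, 0.5, 0.5]
short = shorten_embedding(full, 2)
print(len(short))
```

Re-normalizing matters because similarity measures such as the dot product assume unit-length vectors; a truncated vector that is not rescaled would give systematically smaller scores.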