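Collaborative filtering: This type of recommendation system uses the ratings or feedback of other users with similar preferences. It assumes that users who liked certain items in the past will like similar items in the future. For example, if user A and user B both like movies X and Y, and user B also likes movie Z, the algorithm may recommend movie Z to user A.
Collaborative filtering can be further divided into two subtypes, user-based and item-based:
User-based collaborative filtering finds users similar to the target user and recommends the items they liked.
Item-based collaborative filtering finds items similar to those the target user liked and recommends them.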
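Content-based filtering: This type of recommendation system uses the features or attributes of the items themselves to recommend items similar to those the target user previously liked or interacted with. It assumes that a user who likes particular features of an item will like other items with similar features. The main difference from item-based collaborative filtering is that the latter relies on patterns of user behavior, whereas content-based filtering relies on information about the items themselves. For example, if user A likes movie X, a comedy starring actor Y, the algorithm may recommend movie Z, which is also a comedy starring actor Y.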
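To make the content-based idea concrete, here is a minimal sketch that ranks items purely by how many attributes they share with an item the user liked. The catalog, the feature names, and the choice of Jaccard overlap as the similarity measure are assumptions made for illustration; no user ratings are involved.

```python
# Hypothetical catalog: each movie is described only by its own attributes.
catalog = {
    "X": {"comedy", "actor_Y"},
    "Z": {"comedy", "actor_Y"},
    "W": {"horror", "actor_Q"},
}


def jaccard(a: set, b: set) -> float:
    """Share of features two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_by_content(liked: str, k: int = 1) -> list:
    """Rank the other items by feature overlap with the liked item."""
    scores = {
        title: jaccard(catalog[liked], features)
        for title, features in catalog.items()
        if title != liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(recommend_by_content("X"))  # prints ['Z']
```

Because the score depends only on item attributes, this approach can recommend a brand-new movie that nobody has rated yet, which collaborative filtering cannot do.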
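Knowledge-based filtering: This type of recommendation system uses explicit knowledge or rules about the domain and about the user's needs or preferences to recommend items that satisfy specific criteria or constraints. It does not rely on ratings or feedback from other users, but on the user's own input or query. For example, if user A wants to buy a laptop with certain specifications and within a certain budget, the algorithm may recommend a laptop that meets those criteria. Knowledge-based recommendation systems work well when little or no rating history is available, or when items are complex and customizable.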
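The laptop example can be sketched as a simple constraint filter. The `Laptop` fields, the catalog entries, and the rule "cheapest match first" are all hypothetical; a real knowledge-based system would typically encode richer domain rules.

```python
from dataclasses import dataclass


@dataclass
class Laptop:
    name: str
    ram_gb: int
    price: float


# Hypothetical catalog used only for illustration.
laptops = [
    Laptop("Model A", 8, 650.0),
    Laptop("Model B", 16, 1100.0),
    Laptop("Model C", 16, 1600.0),
]


def recommend_by_constraints(items, min_ram_gb, max_price):
    """Keep items that satisfy every explicit requirement, cheapest first."""
    matches = [i for i in items if i.ram_gb >= min_ram_gb and i.price <= max_price]
    return sorted(matches, key=lambda i: i.price)


for laptop in recommend_by_constraints(laptops, min_ram_gb=16, max_price=1200):
    print(laptop.name)  # prints Model B
```

Note that no rating history is needed: the recommendation follows entirely from the user's stated requirements.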
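Within these frameworks, various machine learning techniques can be used; we discuss them in the next section.
Existing recommendation systems
Modern recommendation systems use machine learning (ML) techniques to better predict users' preferences, based on the following data:
User behavior data: Insights into how users interact with products. This data can be gathered from signals such as user ratings, clicks, and purchase records.
User demographic data: Personal information about users, including details such as age, education, income level, and geographic location.
Product attribute data: Information about product characteristics, such as the genre of a book, the cast of a movie, or the specific cuisine of a food item.
As of today, some of the most popular ML techniques include K-nearest neighbors, dimensionality reduction, and neural networks. Let's look at these methods in detail.
K-nearest neighbors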
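K-nearest neighbors (KNN) is a machine learning algorithm that can be used for both classification and regression problems. It works by finding the k data points closest to a new data point (k is set by the user before the algorithm is initialized) and using their labels or values to make a prediction. KNN rests on the assumption that similar data points are likely to have similar labels or values.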
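As a rough illustration of the mechanism, the sketch below implements KNN regression with NumPy: it measures the Euclidean distance from the new point to every training point, keeps the k closest, and averages their values. The toy data and the choice of Euclidean distance are assumptions; for classification, the mean would be replaced by a majority vote over the neighbors' labels.

```python
import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a value for x_new as the mean of its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    return float(y_train[nearest].mean())


# Toy data: three points near (1, 1) and two points near (8, 8).
X = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0]]
y = [1.0, 2.0, 3.0, 10.0, 12.0]
print(knn_predict(X, y, [1.5, 1.5], k=3))  # prints 2.0
```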
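KNN can be applied to collaborative filtering in recommendation systems, in both its user-based and item-based forms:
User-based KNN is a collaborative filtering method that uses the ratings or feedback of other users with similar tastes or preferences to recommend items. For example, suppose we have three users: Alice, Bob, and Charlie. They all buy books online and rate them. Alice and Bob both like (rate highly) the Harry Potter series and The Hobbit. The system detects this pattern and considers Alice and Bob to have similar tastes. If Bob also likes A Game of Thrones, which Alice has not yet read, the system recommends it to Alice, on the assumption that since their tastes are similar, she is likely to enjoy it too.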
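Item-based KNN is another collaborative filtering method. Rather than comparing users, it compares items by the ratings they received (not by their content attributes) and recommends items similar to those the target user already likes. For example, consider the same users and their book ratings. The system notices that both the Harry Potter series and The Hobbit are liked by Alice and Bob, so it considers the two books similar. If Charlie has read and liked Harry Potter, the system recommends The Hobbit to Charlie, on the assumption that since the two books are similar (liked by the same users), Charlie is likely to enjoy The Hobbit too.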
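The two examples above can be sketched on a small rating matrix. The ratings are hypothetical, unrated books are stored as 0 (a simplification that treats a missing rating as a zero rating), and cosine similarity is an assumed choice of metric. User-based KNN compares rows (users), while item-based KNN compares columns (books).

```python
import numpy as np

users = ["Alice", "Bob", "Charlie"]
books = ["Harry Potter", "The Hobbit", "A Game of Thrones"]
# Hypothetical ratings on a 1-5 scale; 0 means "not rated yet".
ratings = np.array([
    [5, 5, 0],   # Alice
    [5, 4, 5],   # Bob
    [5, 0, 0],   # Charlie
], dtype=float)


def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def user_based(user, k=1):
    """Recommend the unrated book with the highest summed rating among the k most similar users."""
    u = users.index(user)
    sims = cosine_matrix(ratings)[u]
    sims[u] = -1.0                       # exclude the user themself
    neighbors = np.argsort(sims)[::-1][:k]
    scores = ratings[neighbors].sum(axis=0)
    scores[ratings[u] > 0] = -1.0        # skip books the user already rated
    return books[int(np.argmax(scores))]


def item_based(user):
    """Recommend the unrated book most similar to the user's top-rated book."""
    u = users.index(user)
    liked = int(np.argmax(ratings[u]))
    sims = cosine_matrix(ratings.T)[liked]   # rows of ratings.T are books
    sims[ratings[u] > 0] = -1.0              # skip books the user already rated
    return books[int(np.argmax(sims))]


print(user_based("Alice"))    # prints A Game of Thrones
print(item_based("Charlie"))  # prints The Hobbit
```

In both functions the similarity comes only from rating patterns, which is what distinguishes these methods from content-based filtering.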
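Despite the progress made in recent years, the techniques above still have drawbacks, chief among them that they are task-specific. For example, a recommendation system built for rating prediction cannot handle a task that requires recommending the top k items matching a user's taste. If we extend this limitation to other pre-LLM AI solutions, we find similar patterns: it is precisely this task specificity that LLMs (and, more broadly, Large Foundation Models) are revolutionizing, since they generalize well and can adapt to a wide range of tasks based on the user's prompts and instructions. As a result, extensive research is under way into how far LLMs can enhance existing recommendation models. In the following sections, we draw on recent papers and blog posts to discuss the theory behind these new approaches.
How LLMs are changing recommendation systems
In the previous chapters, we saw that an LLM can be customized in three main ways: pre-training, fine-tuning, and prompting. According to the paper Recommender systems in the Era of Large Language Models (LLMs) by Wenqi Fan et al., these techniques can also be used to tailor an LLM into a recommendation system: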
'Title: GoldenEye. Overview: James Bond must unmask the mysterious head of the Janus Syndicate and prevent the leader from utilizing the GoldenEye weapons system to inflict devastating revenge on Britain. Genres: Adventure, Action, Thriller. Rating: 6.173464373464373'
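In the previous section, we saved the embeddings to LanceDB. Now we will build a LangChain RetrievalQA chain, a chain component designed for question answering over an index. In our case, we will use the vector store as the index retriever. The idea is that the chain returns the top k movies most similar to the user's query, using cosine similarity as the distance metric (which is the default).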
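Before building the chain, it may help to see what "top k by cosine similarity" means numerically. The sketch below is not how the vector store is implemented internally; it is a plain NumPy illustration with made-up 3-dimensional vectors standing in for real embeddings, and `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np


def top_k_cosine(query_vec, doc_vecs, k=4):
    """Return the indices of the k documents most similar to the query (best first) and all scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity of each document to the query
    order = np.argsort(sims)[::-1][:k]
    return order.tolist(), sims


# Toy "embeddings", one row per document.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.2, 0.0])
indices, sims = top_k_cosine(query, docs, k=2)
print(indices)  # prints [1, 0]
```

For vectors normalized to unit length, ranking by cosine similarity and ranking by Euclidean distance produce the same order, so the choice between the two mainly affects the reported score rather than which documents come back.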
Let's start building this chain:
We use only the movie overview as the information input:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import LanceDB
import os
openai_api_key = os.environ["OPENAI_API_KEY"]  # fails fast with a KeyError if the key is not set
embeddings = OpenAIEmbeddings()
docsearch = LanceDB(connection=table, embedding=embeddings)
query = "I'm looking for an animated action movie. What could you suggest to me?"
docs = docsearch.similarity_search(query)
docs
Here is the corresponding output (I show a truncated version, with only the first of the four document sources):
[Document(page_content='Title: Hitman: Agent 47. Overview: An assassin teams up with a woman to help her find her father and uncover the mysteries of her ancestry. Genres: Action, Crime, Thriller. Rating: 5.365800865800866', metadata={'genres': array(['Action', 'Crime', 'Thriller'], dtype=object), 'title': 'Hitman: Agent 47', 'overview': 'An assassin teams up with a woman to help her find her father and uncover the mysteries of her ancestry.', 'weighted_rate': 5.365800865800866, 'n_tokens': 52, 'vector': array([-0.00566491, -0.01658553, ……])
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=True)
query = "I'm looking for an animated action movie. What could you suggest to me?"
result = qa({"query": query})
result['result']
Let's look at the output:
' I would suggest Transformers. It is an animated action movie with genres of Adventure, Science Fiction, and Action, and a rating of 6.'
Document(page_content='Title: Hitman: Agent 47. Overview: An assassin teams up with a woman to help her find her father and uncover the mysteries of her ancestry. Genres: Action, Crime, Thriller. Rating: 5.365800865800866', metadata={'genres': array(['Action', 'Crime', 'Thriller'], dtype=object), 'title': 'Hitman: Agent 47', 'overview': 'An assassin teams up with a woman to help her find her father and uncover the mysteries of her ancestry.', 'weighted_rate': 5.365800865800866, 'n_tokens': 52, 'vector': array([-0.00566491, -0.01658553, -0.02255735, ..., -0.01242317, -0.01303058, -0.00709073], dtype=float32), '_distance': 0.42414575815200806})
df_filtered = md[md['genres'].apply(lambda x: 'Comedy' in x)]
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff",
    retriever=docsearch.as_retriever(search_kwargs={'data': df_filtered}), return_source_documents=True)
query = "I'm looking for a movie with animals and an adventurous plot."
result = qa({"query": query})
from langchain.agents.agent_toolkits import create_retriever_tool
from langchain.agents.agent_toolkits import create_conversational_retrieval_agent
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0)
retriever = docsearch.as_retriever(return_source_documents=True)
tool = create_retriever_tool(
    retriever,
    "movies",
    "Searches and returns recommendations about movies."
)
tools = [tool]
agent_executor = create_conversational_retrieval_agent(llm, tools, verbose=True)
result = agent_executor({"input": "suggest me some action movies"})
Let's look at the chain of thought and the output (always based on the four most similar movies according to cosine similarity):
> Entering new AgentExecutor chain...
Invoking: `movies` with `{'genre': 'action'}`
[Document(page_content='The action continues from [REC], ……]
Here are some action movies that you might enjoy:
1. [REC]² - The action continues from [REC], with a medical officer and a SWAT team sent into a sealed-off apartment to control the situation. It is a thriller/horror movie.
2. The Boondock Saints - Twin brothers Conner and Murphy take swift retribution into their hands to rid Boston of criminals. It is an action/thriller/crime movie.
3. The Gamers - Four clueless players are sent on a quest to rescue a princess and must navigate dangerous forests, ancient ruins, and more. It is an action/comedy/thriller/foreign movie.
4. Atlas Shrugged Part III: Who is John Galt? - In a collapsing economy, one man has the answer while others try to control or save him. It is a drama/science fiction/mystery movie.
Please note that these recommendations are based on the genre "action" and may vary in terms of availability and personal preferences.
> Finished chain.
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:
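The template above has two placeholders, {context} and {question}. As a rough illustration of what a "stuff" chain does with them, the sketch below joins all retrieved texts into a single context string and fills the template with plain Python string formatting. The helper name `stuff_prompt`, the blank-line separator, and the shortened movie descriptions are assumptions made for this example, not the library's internal code.

```python
# Default-style template, copied from the text above.
template = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def stuff_prompt(template, retrieved_texts, question, separator="\n\n"):
    """Join all retrieved texts into one context string and fill the template."""
    context = separator.join(retrieved_texts)
    return template.format(context=context, question=question)


# Shortened versions of the movie descriptions, for illustration only.
retrieved = [
    "Title: GoldenEye. Genres: Adventure, Action, Thriller.",
    "Title: Hitman: Agent 47. Genres: Action, Crime, Thriller.",
]
print(stuff_prompt(template, retrieved, "Can you suggest an action movie?"))
```

Seen this way, customizing the recommender is largely a matter of changing the text around the two placeholders, which is what the custom template that follows does.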
from langchain.prompts import PromptTemplate
template = """You are a movie recommender system that help users to find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
For each question, suggest three movies, with a short description of the plot and the reason why the user might like it.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Your response:"""
PROMPT = PromptTemplate(
    template=template, input_variables=["context", "question"])
Now we need to pass it into our chain:
PROMPT = PromptTemplate(
    template=template, input_variables=["context", "question"])
chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=OpenAI(),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs)
query = "I'm looking for a funny action movie, any suggestion?"
result = qa({'query':query})
print(result['result'])
We get the following output:
1. A Good Day to Die Hard: An action-packed comedy directed by John Moore, this movie follows Iconoclastic, take-no-prisoners cop John McClane as he travels to Moscow to help his wayward son Jack. With the Russian underworld in pursuit, and battling a countdown to war, the two McClanes discover that their opposing methods make them unstoppable heroes.
2. The Hidden: An alien is on the run in America and uses the bodies of anyone in its way as a hiding place. With lots of innocent people dying in the chase, this action-packed horror movie is sure to keep you laughing.
3. District B13: Set in the ghettos of Paris in 2010, this action-packed science fiction movie follows an undercover cop and ex-thug as they try to infiltrate a gang in order to defuse a neutron bomb. A thrilling comedy that will keep you laughing.
from langchain.prompts import PromptTemplate
template_prefix = """You are a movie recommender system that help users to find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}"""
user_info = """This is what we know about the user, and you can use this information to better tune your research:
Age: {age}
Gender: {gender}"""
template_suffix = """Question: {question}
Your response:"""
user_info = user_info.format(age=18, gender='female')
COMBINED_PROMPT = template_prefix + '\n' + user_info + '\n' + template_suffix
print(COMBINED_PROMPT)
The output is as follows:
You are a movie recommender system that help users to find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
This is what we know about the user, and you can use this information to better tune your research:
Age: 18
Gender: female
Question: {question}
Your response:
' Sure, I can suggest some action movies for you. Here are a few examples: A Good Day to Die Hard, Goldfinger, Ong Bak 2, and The Raid 2. All of these movies have high ratings and feature thrilling action elements. I hope you find something that you enjoy!'
template_prefix = """You are a movie recommender system that help users to find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}"""
user_info = """This is what we know about the user, and you can use this information to better tune your research:
Age: {age}
Gender: {gender}
Movies already seen alongside with rating: {movies}"""
template_suffix= """Question: {question}
Your response:"""
We then format the user_info block as follows (assuming the user interacting with the system is Alice):
# Select Alice's row by position so the lookup does not depend on the index labels
alice = df.loc[df['username'] == 'Alice'].iloc[0]
age = alice['age']
gender = alice['gender']
movies = ''
# Iterate over the dictionary and output each movie name and rating
for movie, rating in alice['movies'].items():
    output_string = f"Movie: {movie}, Rating: {rating}" + "\n"
    movies += output_string
user_info = user_info.format(age=age, gender=gender, movies=movies)
COMBINED_PROMPT = template_prefix + '\n' + user_info + '\n' + template_suffix
print(COMBINED_PROMPT)
Here is the output:
You are a movie recommender system that help users to find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
This is what we know about the user, and you can use this information to better tune your research:
Age: 25
Gender: F
Movies already seen alongside with rating: Movie: Transformers: The Last Knight, Rating: 7
Movie: Pokémon: Spell of the Unknown, Rating: 5

Question: {question}
Your response:
Now let's use this prompt in our chain:
PROMPT = PromptTemplate(
    template=COMBINED_PROMPT, input_variables=["context", "question"])
chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=OpenAI(),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs)
query = "Can you suggest me some action movie based on my background?"
result = qa({'query': query})
result['result']
We get the following output:
" Based on your age, gender, and the movies you've already seen, I would suggest the following action movies: The Raid 2 (Action, Crime, Thriller; Rating: 6.71), Ong Bak 2 (Adventure, Action, Thriller; Rating: 5.24), Hitman: Agent 47 (Action, Crime, Thriller; Rating: 5.37), and Kingsman: The Secret Service (Crime, Comedy, Action, Adventure; Rating: 7.43)."
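As you can see, the model is now able to recommend a list of movies to Alice based on information about her past preferences, which is retrieved as context from the model's meta-prompt.
As with the Globebotter application in Chapter 6, you need to create a .py file and run it in your terminal via streamlit run file.py. In our case, the file will be named movieharbor.py.
Let's now summarize the key steps for building the front-end application:
Configure the application web page: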
import streamlit as st
st.set_page_config(page_title="MovieHarbor", page_icon="")
st.header(' Welcome to MovieHarbor, your favourite movie recommender')
# Create a sidebar for user input
st.sidebar.title("Movie Recommendation System")
st.sidebar.markdown("Please enter your details and preferences below:")
# Ask the user for their age, gender, and favourite movie genre
age = st.sidebar.slider("What is your age?", 1, 100, 25)
gender = st.sidebar.radio("What is your gender?", ("Male", "Female", "Other"))
genre = st.sidebar.selectbox("What is your favourite movie genre?", md.explode('genres')["genres"].unique())
# Filter the movies based on the user input
df_filtered = md[md['genres'].apply(lambda x: genre in x)]
Define the parameterized prompt blocks:
template_prefix = """You are a movie recommender system that helps users to find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}"""
user_info = """This is what we know about the user, and you can use this information to better tune your research:
Age: {age}
Gender: {gender}"""
template_suffix = """Question: {question}
Your response:"""
user_info = user_info.format(age=age, gender=gender)
COMBINED_PROMPT = template_prefix +'\n'+ user_info +'\n'+ template_suffix
print(COMBINED_PROMPT)
query = st.text_input('Enter your question:', placeholder='What action movies do you suggest?')
if query:
    result = qa({"query": query})
    st.write(result['result'])
That's it! You can see the final result by running streamlit run movieharbor.py in your terminal. The result looks like the following: