Llama-2 vs. Llama-3: A Tic-Tac-Toe Battle Between Models

Study notes on quality articles

05 Apr 2026 — 12 min read

Original article: towardsdatascience.com/llama-2-vs-llama-3-a-tic-tac-toe-battle-between-models-7301962ca65d

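About a week before I wrote this story, Meta released its new open Llama-3 models (ai.meta.com/blog/meta-llama-3/). Meta claims these are "the best models existing today at the 8B and 70B parameter scales." For example, according to the HuggingFace model page, Llama-3 8B scores 66.6 on the MMLU (Massive Multitask Language Understanding) benchmark, compared to 45.7 for Llama-2 7B. Llama-3 also scores 72.6 versus 57.6 on CommonSenseQA (a dataset of commonsense questions). The instruction-tuned Llama-3 8B model scores 30.0 on a math benchmark, compared to 3.8 for its Llama-2 counterpart, which is an impressive improvement.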

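Academic benchmarks matter, but can we see a real difference "in action"? We can, and it can be fun. Let's program a tic-tac-toe game between two models and see which one wins! Along the way, I will test the 7B, 8B, and 70B models and collect some data about model performance and system requirements. All tests can be run for free in Google Colab.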

Let's get started!

Loading the Models

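To test all the models, I will use the Llama-cpp Python library because it runs on both CPU and GPU. We need to run two LLMs in parallel. The 7B and 8B models fit easily on a free 16 GB Google Colab GPU instance, but the 70B models can only be tested on CPU; even an NVIDIA A100 does not have enough memory to run two of them at the same time.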
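As a rough sanity check of the memory claim above, here is a back-of-the-envelope estimate. The figure of 4.5 bits per weight is my assumption for 4-bit quantized files, not a number from the article, and the estimate ignores the context cache and runtime overhead, so treat the results as order-of-magnitude only.

```python
def approx_weights_gb(n_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough size of quantized weights in GB (bits_per_weight is an assumed value)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, size in [("7B", 7), ("8B", 8), ("70B", 70)]:
    print(f"{name}: roughly {approx_weights_gb(size):.1f} GB of weights")

# Two models have to be resident at once during a game
pair_small = approx_weights_gb(7) + approx_weights_gb(8)
pair_large = 2 * approx_weights_gb(70)
print(f"7B + 8B: roughly {pair_small:.1f} GB, 70B + 70B: roughly {pair_large:.1f} GB")
```

Under this assumption the small pair stays well below 16 GB, while two 70B models need far more memory than a single GPU typically offers.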

First, let's install Llama-cpp and download the 7B and 8B models (the process is the same for the 70B models; we only need to change the URLs):

# Note: newer llama-cpp-python releases may expect a different CMake flag for CUDA builds
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python -U
!pip3 install huggingface-hub hf-transfer sentence-transformers
!export HF_HUB_ENABLE_HF_TRANSFER="1" && huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False
!export HF_HUB_ENABLE_HF_TRANSFER="1" && huggingface-cli download QuantFactory/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False

Once the download is complete, let's load the models:

from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU
llama2 = Llama(
    model_path="/content/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=1024,
    echo=False)
llama3 = Llama(
    model_path="/content/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=1024,
    echo=False)

Now, let's write a function that executes a prompt:

def llm_make_move(model: Llama, prompt: str) -> str:
    """ Call a model with a prompt """
    res = model(prompt, stream=False, max_tokens=1024, temperature=0.8)
    return res["choices"][0]["text"]
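Since the function only relies on the model being callable and returning a dictionary with a `choices` list, the game logic can be tried without downloading any weights. The sketch below uses a hypothetical `FakeModel` class (my own stand-in, not part of Llama-cpp) that mimics that response shape.

```python
class FakeModel:
    """Hypothetical stand-in that mimics the response shape used above."""

    def __init__(self, canned_text: str):
        self.canned_text = canned_text

    def __call__(self, prompt: str, **kwargs) -> dict:
        # Same nesting as the real response: res["choices"][0]["text"]
        return {"choices": [{"text": self.canned_text}]}


def llm_make_move(model, prompt: str) -> str:
    """Call a model with a prompt (same body as the article's function)."""
    res = model(prompt, stream=False, max_tokens=1024, temperature=0.8)
    return res["choices"][0]["text"]


fake = FakeModel('{"ROW": 2, "COLUMN": 2}')
print(llm_make_move(fake, "What is your next move?"))
```

This kind of stub is handy for checking parsing and board updates quickly before spending GPU time on real models.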

Prompts

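Now it's time to code the tic-tac-toe game itself. The goal is to draw X's and O's on the board; the first player to complete a horizontal, vertical, or diagonal line wins: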

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/08d3696000103e2305b9cc8886c5c406.png

Image source: Wikipedia

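As we can see, the game is simple for humans but can be challenging for a language model; making the right move requires understanding the board space, the relationships between objects, and even some simple math.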

First, let's encode the board as a two-dimensional array. I will also create a method that converts the board to a string:

from typing import List

board = [["E", "E", "E"],
         ["E", "E", "E"],
         ["E", "E", "E"]]

def board_to_string(board_data: List) -> str:
    """ Convert board to the string representation """
    # One row per line, cells separated by spaces
    return "\n".join([" ".join(x) for x in board_data])

The output looks like this:

E E E
E E E
E E E
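To make the coordinate convention concrete, here is a small self-contained sketch that places two symbols and prints the board. The prompts use 1-based rows and columns, while the Python lists are 0-based, so "row 1, column 3" is `board[0][2]`.

```python
from typing import List


def board_to_string(board_data: List[List[str]]) -> str:
    """Convert the board to its string representation, one row per line."""
    return "\n".join(" ".join(row) for row in board_data)


board = [["E"] * 3 for _ in range(3)]
board[1][1] = "X"  # center: row 2, column 2 in prompt coordinates
board[0][2] = "O"  # top-right: row 1, column 3 in prompt coordinates
print(board_to_string(board))
```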

Now we are ready to create the model prompts:

sys_prompt1 = """You play a tic-tac-toe game. You make a move by placing X, your opponent plays by placing O. Empty cells are marked with E. You can place X only to the empty cell."""
sys_prompt2 = """You play a tic-tac-toe game. You make a move by placing O, your opponent plays by placing X. Empty cells are marked with E. You can place O only to the empty cell."""
game_prompt = """What is your next move? Think in steps. Each row and column should be in range 1..3. Write the answer in JSON as {"ROW": ROW, "COLUMN": COLUMN}."""

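Here, I created two system prompts, one for each model. As we can see, the sentences are the same; the only difference is that the first model moves by placing X's and the second by placing O's.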

The prompt formats for Llama-2 and Llama-3 are different:

template_llama2 = f"""<s>[INST]<<SYS>>{sys_prompt1}<</SYS>>
Here is the board image:
__BOARD__\n
{game_prompt}
[/INST]"""

template_llama3 = f"""<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>{sys_prompt2}<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Here is the board image:
__BOARD__\n
{game_prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

Let's also create two methods that use these prompts:

def make_prompt_llama2(board: List) -> str:
    """ Make Llama-2 prompt """
    return template_llama2.replace("__BOARD__", board_to_string(board))

def make_prompt_llama3(board: List) -> str:
    """ Make Llama-3 prompt """
    return template_llama3.replace("__BOARD__", board_to_string(board))
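The templates mix two substitution steps: the f-string fills in the fixed prompt text once, when the template is defined, and the `__BOARD__` placeholder is replaced on every turn. A minimal sketch of that pattern, using shortened prompt texts of my own (`SYSTEM`, `QUESTION`, and `render` are illustrative names, not from the article):

```python
SYSTEM = "You play a tic-tac-toe game."  # shortened for the example
QUESTION = "What is your next move?"     # shortened for the example

# The f-string is evaluated once; __BOARD__ stays as a literal placeholder
template = f"""<s>[INST]<<SYS>>{SYSTEM}<</SYS>>
Here is the board image:
__BOARD__

{QUESTION}
[/INST]"""


def render(board) -> str:
    """Fill the current board state into the prepared template."""
    rows = "\n".join(" ".join(r) for r in board)
    return template.replace("__BOARD__", rows)


prompt = render([["X", "E", "E"], ["E", "O", "E"], ["E", "E", "E"]])
print(prompt)
```

Keeping the board as a late replacement means the same template object can be reused for every move of every game.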

Coding the Game

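The prompts are ready; now it's time to code the game itself. In the prompt, I asked the model to return its answer as JSON. In practice, a model can answer like this: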

My next move would be to place my X in the top-right corner, on cell (3,1).
{"ROW": 3, "COLUMN": 1}

Let's create a method that extracts the JSON from this kind of string:

import json
from typing import Optional

def extract_json(response: str) -> Optional[dict]:
    """ Extract dictionary from a response string """
    try:
        # Models sometimes make a mistake, fix: {ROW: 1, COLUMN: 2} => {"ROW": 1, "COLUMN": 2}
        response = response.replace('ROW:', '"ROW":').replace('COLUMN:', '"COLUMN":')
        # Extract json from a response: take the last {...} fragment
        pos_end = response.rfind("}")
        pos_start = response.rfind("{")
        return json.loads(response[pos_start:pos_end + 1])
    except Exception as exp:
        print(f"extract_json::cannot parse output: {exp}")
    return None

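It turned out that the Llama-2 models do not always produce valid JSON; quite often they generate a response like "{ROW: 3, COLUMN: 3}". As we can see in the code, in that case I add the missing quotes around the keys before parsing.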
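For comparison, here is a slightly more tolerant variant based on regular expressions. It is an alternative sketch of my own, not the author's method: `extract_move` is a hypothetical name, and it assumes the answer is the last flat `{...}` fragment in the response.

```python
import json
import re
from typing import Optional


def extract_move(response: str) -> Optional[dict]:
    """Return the last flat {...} object in the response as a dict, or None."""
    candidates = re.findall(r"\{[^{}]*\}", response)
    if not candidates:
        return None
    raw = candidates[-1]
    # Quote bare keys only: {ROW: 3, COLUMN: 1} -> {"ROW": 3, "COLUMN": 1}
    raw = re.sub(r'(?<!")\b(ROW|COLUMN)\b(?!")', r'"\1"', raw)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


print(extract_move('My move is the corner (3,1).{"ROW":3,"COLUMN":1}'))
print(extract_move("I pick {ROW: 3, COLUMN: 3}"))
print(extract_move("I cannot decide."))
```

Unlike the plain string replacement, the lookarounds leave already quoted keys alone, so a valid response is never modified before parsing.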

Once we have the row and column, we can update the board:

def make_move(board_data: List, move: Optional[dict], symb: str):
    """ Update board with a new symbol """
    row, col = int(move["ROW"]), int(move["COLUMN"])
    if 1 <= row <= 3 and 1 <= col <= 3:
        # Prompt coordinates are 1-based, list indices are 0-based
        if board_data[row - 1][col - 1] == "E":
            board_data[row - 1][col - 1] = symb
        else:
            print(f"Wrong move: cell {row}:{col} is not empty")
    else:
        print("Wrong move: incorrect index")

We also need to check whether the game is over:

def check_for_end_game(board_data: List) -> bool:
    """ Check if there are no empty cells available """
    return board_to_string(board_data).find("E") == -1

def check_for_win(board_data: List) -> bool:
    """ Check if a player has completed a line """
    # Check horizontal and vertical lines
    for ind in range(3):
        if board_data[ind][0] == board_data[ind][1] == board_data[ind][2] and board_data[ind][0] != "E":
            print(f"{board_data[ind][0]} win!")
            return True
        if board_data[0][ind] == board_data[1][ind] == board_data[2][ind] and board_data[0][ind] != "E":
            print(f"{board_data[0][ind]} win!")
            return True
    # Check diagonals (both pass through the center cell)
    if ((board_data[0][0] == board_data[1][1] == board_data[2][2] and board_data[1][1] != "E") or
            (board_data[2][0] == board_data[1][1] == board_data[0][2] and board_data[1][1] != "E")):
        print(f"{board_data[1][1]} win!")
        return True
    return False

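Here, I check the horizontal and vertical lines in a loop and the two diagonals separately. A shorter solution probably exists, but this one is good enough for the task.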
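One possible shorter formulation is to list all eight winning lines up front and test each one as a set of cells. This is only a sketch of an alternative; `LINES` and `winner` are names I made up, and unlike the article's function it returns the winning symbol instead of printing it.

```python
from typing import List, Optional

# All eight winning lines as lists of (row, column) index pairs
LINES = (
    [[(r, c) for c in range(3)] for r in range(3)]                   # rows
    + [[(r, c) for r in range(3)] for c in range(3)]                 # columns
    + [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]   # diagonals
)


def winner(board: List[List[str]]) -> Optional[str]:
    """Return "X" or "O" if a line is complete, otherwise None."""
    for line in LINES:
        cells = {board[r][c] for r, c in line}
        # A winning line holds exactly one distinct symbol, and it is not "E"
        if len(cells) == 1 and "E" not in cells:
            return cells.pop()
    return None


print(winner([["X", "O", "E"], ["O", "X", "E"], ["E", "E", "X"]]))
```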

All our components are ready. Let's put them together:

import time

num_wins1, num_wins2 = 0, 0
times_1, times_2 = [], []

def run_game():
    """ Run a game between two models """
    # The counters are reassigned inside the function, so they must be declared global
    global num_wins1, num_wins2
    board = [["E", "E", "E"],
             ["E", "E", "E"],
             ["E", "E", "E"]]
    moves_limit = 20
    for step in range(moves_limit):
        print(f"Step {step+1}")

        # Move: Model-1
        t_start = time.monotonic()
        prompt = make_prompt_llama2(board)
        result_str = llm_make_move(llama2, prompt)
        times_1.append(time.monotonic() - t_start)

        new_data = extract_json(result_str)
        if new_data is not None:
            make_move(board, new_data, symb="X")
            if check_for_win(board):
                print('**Model 1 Won**')
                num_wins1 += 1
                break
            if check_for_end_game(board):
                break

        # Move: Model-2
        t_start = time.monotonic()
        prompt = make_prompt_llama3(board)
        result_str = llm_make_move(llama3, prompt)
        times_2.append(time.monotonic() - t_start)

        new_data = extract_json(result_str)
        if new_data is not None:
            make_move(board, new_data, symb="O")
            if check_for_win(board):
                print('**Model 2 Won**')
                num_wins2 += 1
                break
            if check_for_end_game(board):
                break
        print()

Here, I also record the execution time of every step and the number of wins for each model.
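The recorded times and win counters can then be turned into a short per-model summary. The sketch below is one way to do it; `summarize` is a hypothetical helper, and the numbers passed to it are made up purely to show the output format, not measurements from the article.

```python
from statistics import mean
from typing import List


def summarize(name: str, wins: int, times: List[float]) -> str:
    """Format the win count and average response time for one model."""
    avg = mean(times) if times else float("nan")
    return f"{name}: {wins} wins, {len(times)} moves, {avg:.1f} s per move on average"


# Made-up values, for illustration only
print(summarize("Model 1", 0, [2.4, 2.6, 2.5]))
print(summarize("Model 2", 1, []))
```

In the real experiment, the arguments would be `num_wins1` and `times_1` (and their Model 2 counterparts) after calling `run_game()` in a loop.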

Results

As for the results, they are interesting.

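For the 7B and 8B models, tic-tac-toe is challenging. The 7B Llama-2 model has some understanding of the rules, but it cannot grasp the idea of board coordinates and often places its "X" in the wrong position.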

Some requests and responses look like this:

>>> Prompt:
<s>[INST]<<SYS>>You play a tic-tac-toe game. You make a move by placing X, your opponent plays by placing O. Empty cells are marked with E. You can place X only to the empty cell.<</SYS>>
Here is the board image:
X X O
X O E
E E E

What is your next move? Think in steps. Each row and column should be in range 1..3. Write the answer in JSON as {"ROW": ROW, "COLUMN": COLUMN}.
[/INST]

>>> Response:
Since there are no empty cells on the board, I will place my X in the center cell: {"ROW":2,"COLUMN":1}.

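As we can see, this response contains several mistakes. First, there are plenty of empty cells on the board; second, position (2,1) is not the center; third, that cell is already occupied.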

The 70B Llama-2 model does better overall, but it also makes a lot of mistakes. Here is a typical request and response:

>>> Prompt:
<s>[INST]<<SYS>>You play a tic-tac-toe game. You make a move by placing X, your opponent plays by placing O. Empty cells are marked with E. You can place X only to the empty cell.<</SYS>>
Here is the board image:
O E E
E X E
E E E
...

>>> Response:
My next move would be to place my X in the center cell, which is empty. Here is the updated board:
O E E
E X X
E E E
JSON representation of my move: {"ROW":2,"COLUMN":2}

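As we can see, the 70B model correctly identified the "center" position, but the move itself is wrong; the model still did not "understand" that the center cell was already occupied. It also tried to "draw" the new board, and that drawing is wrong too.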

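Interestingly, ChatGPT 3.5 also gave a wrong answer to the same prompt, producing the same {"ROW": 2, "COLUMN": 2} result. Llama-3 70B got this one right. Still, it made similar mistakes in other positions, sometimes trying to place a symbol in an occupied cell. I did not record the total number of errors for each model, although that would be a useful improvement.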
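The missing error statistics could be collected by labelling every parsed move before it is applied. This is a sketch of that idea under my own naming (`classify_move` and the label strings are hypothetical); it reuses the board from the 70B example above.

```python
from collections import Counter
from typing import List, Optional


def classify_move(board: List[List[str]], move: Optional[dict]) -> str:
    """Label a parsed move as 'ok', 'unparsed', 'out_of_range', or 'occupied'."""
    if move is None:
        return "unparsed"
    try:
        row, col = int(move["ROW"]), int(move["COLUMN"])
    except (KeyError, TypeError, ValueError):
        return "unparsed"
    if not (1 <= row <= 3 and 1 <= col <= 3):
        return "out_of_range"
    if board[row - 1][col - 1] != "E":
        return "occupied"
    return "ok"


errors = Counter()
board = [["O", "E", "E"], ["E", "X", "E"], ["E", "E", "E"]]
errors[classify_move(board, {"ROW": 2, "COLUMN": 2})] += 1  # center is already taken
errors[classify_move(board, {"ROW": 1, "COLUMN": 2})] += 1  # a legal move
errors[classify_move(board, None)] += 1                     # JSON could not be parsed
print(dict(errors))
```

Keeping one `Counter` per model would make it possible to compare error rates alongside the win counts.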

As a bar chart, the results for the 7B and 8B models look like this:

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/7db87fb43ea9046dc587a3e1bd7c11fd.png

Game scores for the 7B and 8B models, image by author

The winner is clear: Llama-3 wins 10:0!

We can also look at the inference time of both models running on a 16 GB NVIDIA T4 GPU:

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/528dc1ec5eccfe18795ca31d69914149.png
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/191916013b4b43089c55ef65dc360dbb.png

Inference time for the 7B and 8B models, image by author

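As a minor drawback, Llama-3 is slower than the previous model (about 4.3 s versus 2.5 s). In practice, 4.3 s is good enough: in most cases the output is streamed, and nobody expects an instant answer.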

Llama-2 70B performs better. It managed to win twice, but Llama-3 still wins almost every game. The final score is 8:2 in favor of Llama-3!

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/42d8f010f0c29f4336abd8c1fb18705a.png

Scores for the 70B models, image by author

CPU inference for the 70B models is, naturally, not fast:

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/c10bb81d7b36b68bb2dcecc23de76c32.png

Inference time, image by author

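A batch of 10 games takes about an hour. That speed is not suitable for production, but it is fine for testing. Interestingly, Llama-cpp loads models through memory-mapped files, and even with two 70B models running in parallel, memory consumption did not exceed 12 GB. This means that even two 70B models can be tested on a PC with only 16 GB of RAM (sadly, this trick does not work for GPUs).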
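The general idea behind memory mapping can be illustrated with Python's standard `mmap` module. This is only an illustration of the operating-system mechanism, not Llama-cpp's actual loading code; the temporary file below is a tiny stand-in for a model file.

```python
import mmap
import os
import tempfile

# Create a small throwaway file standing in for a large model file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"HEAD" + b"\x00" * 1_000_000 + b"TAIL")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        size = len(mapped)
        # Slicing reads only the pages that back these bytes;
        # the rest of the file is not copied into process memory
        head = mapped[:4]
        tail = mapped[size - 4:size]

os.remove(path)
print(head, tail, size)
```

Because pages are loaded on demand and can be dropped again by the operating system, a mapped file can be much larger than the memory the process appears to use.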

Conclusion

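In this article, I built a tic-tac-toe game between two language models. Interestingly, this "benchmark" is challenging. It requires not only understanding the rules of the game but also handling coordinates, plus some "spatial" and "abstract" thinking when a 2D board is represented as a string.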

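As for the results, Llama-3 is the clear winner. It is undoubtedly the better model, but I must admit that both models made a lot of mistakes. That is interesting in itself: this small, unofficial "benchmark" is hard for LLMs and can also be used to test future models.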

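Thanks for reading. If you enjoyed this story, feel free to subscribe on Medium; you will be notified when I publish new articles and get full access to thousands of stories from other authors. You are also welcome to connect with me on LinkedIn. If you want the full source code for this and other posts, feel free to visit my Patreon page.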
