LLaMA 4与GPT-4o对比：哪个更适合RAG？

随着大型语言模型（LLMs）的不断快速发展，其最受瞩目的应用之一就是 RAG 系统。检索增强生成系统（RAG）将这些模型与外部信息源连接起来，从而提高了它们的可用性。这有助于将它们的答案建立在事实基础上，使其更加可靠。在本文中，我们将比较两个著名模型的性能和准确性： Meta 的 LLaMA 4 Scout 和 OpenAI 的 GPT-4o 在 RAG 系统中的性能和准确性。我们将首先使用 LangChain、FAISS 和 FastEmbed 等工具构建一个 RAG 系统，然后使用 RAGAS 框架进行评估和 LLaMA 4 与 GPT-4o 的比较。

了解模型

在深入比较之前，让我们简要介绍一下这两种模型：

LLaMA 4 Scout

Llama 4 Scout 是 Meta 最新发布的 LLaMA 4 系列中最高效的模型。该模型在基准测试中表现出色，最多可处理 1000 万个 token，规模相当大。与其他一些模型相比，它在处理敏感问题时被拒绝的次数也较少。Groq API 上的 LLaMA 4 也因其推理速度而备受关注。

由于 Meta 公开发布了权重，开发人员可以检查和使用其预训练参数。这种透明度使其对研究和定制开发具有吸引力。

推荐阅读： 如何通过 API 访问 Meta 的 Llama 4 模型

GPT-4o

GPT-4o 代表了 OpenAI 在 GPT 系列中的最新进展。它在推理能力、编码任务和响应的整体质量方面都有所改进。它的设计旨在高效利用计算资源，同时与其他顶级模型展开激烈竞争。

什么是RAGAS？

评估 RAG 系统包括检查它检索信息的能力，以及根据信息生成答案的能力。仅仅查看最终答案是不够的。

RAGAS（Retrieval-Augmented Generation Assessment Suite，检索增强生成评估套件）提供了评估 RAG 流程不同部分的指标，而不需要预先写好的完美答案。RAGAS 中使用的关键指标包括

忠实性：生成的答案是否准确地代表了检索文档中的信息？
答案相关性：答案是否真的与所提问题相关？
上下文精确度和召回率：检索步骤的有效性如何？是否找到了相关信息？

利用这些指标，我们可以更清楚地了解 RAG 系统的优势和不足。现在，让我们看看如何使用 RAGAS 实施 RAG 和评估模型。

使用RAGAS实施和评估RAG

在本节中，我们将首先深入研究 Jupyter Notebook 中用于设置 RAG 管道的步骤和相关代码。我们将通过 Groq 平台包含使用 GPT-4o 和 LLaMA 4 Scout 的聊天实例。然后，我们将在两个 RAG 系统上运行 RAGAS 评估。

构建RAG系统

以下是使用 GPT-4o 和 LLaMA 4 构建 RAG 系统的步骤。

使用 GPT-4o 和 LLaMA 4 构建 RAG 系统

1. 安装必要的库

首先，我们需要为 LangChain、Groq、OpenAI、向量存储 (FAISS)、PDF 处理 (PyMuPDF)、嵌入 (FastEmbed) 和评估 (Ragas) 安装所需的 Python 包。

!pip install -q langchain_groq langchain_community faiss-cpu pymupdf langchain fastembed langchain-openai

!pip install -q langchain_groq langchain_community faiss-cpu pymupdf langchain fastembed langchain-openai

2. 设置 API 密钥

接下来，我们必须为 OpenAI 和 Groq 配置 API 密钥。代码使用 Google Colab 的用户数据功能进行安全密钥管理。

import os

os.environ["OPENAI_API_KEY"] = “your_openai_api”

os.environ["GROQ_API_KEY"] = “your_groq_api”

import os os.environ["OPENAI_API_KEY"] = “your_openai_api” os.environ["GROQ_API_KEY"] = “your_groq_api”

import os
os.environ["OPENAI_API_KEY"] = “your_openai_api”
os.environ["GROQ_API_KEY"] = “your_groq_api”

3. 导入程序库

现在，我们将从已安装的库中导入所需的特定类和函数。

import os

import fitz

import numpy as np

import faiss

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from datasets import Dataset

from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_core.prompts import PromptTemplate

from langchain_core.output_parsers import StrOutputParser

from langchain_openai import ChatOpenAI

from langchain_groq import ChatGroq

from ragas import evaluate

from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

import os import fitz import numpy as np import faiss import pandas as pd import seaborn as sns import matplotlib.pyplot as plt from datasets import Dataset from langchain_community.embeddings.fastembed import FastEmbedEmbeddings from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain_core.prompts import PromptTemplate from langchain_core.output_parsers import StrOutputParser from langchain_openai import ChatOpenAI from langchain_groq import ChatGroq from ragas import evaluate from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

import os
import fitz
import numpy as np
import faiss
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datasets import Dataset
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_groq import ChatGroq
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

4. 初始化语言模型

现在是主要部分。我们需要创建要比较的聊天模型的实例： GPT-4o 和 LLaMA 4 Scout（通过 Groq）。在设置时，请注意 temperature=1 与 temperature=0 相比，回复的可变性更大。

chat_model_4o = ChatOpenAI(temperature=1, model_name="gpt-4o")

chat_model_llama = ChatGroq(temperature=1,

model_name="meta-llama/llama-4-scout-17b-16e-instruct")

chat_model_4o = ChatOpenAI(temperature=1, model_name="gpt-4o") chat_model_llama = ChatGroq(temperature=1, model_name="meta-llama/llama-4-scout-17b-16e-instruct")

chat_model_4o = ChatOpenAI(temperature=1, model_name="gpt-4o")
chat_model_llama = ChatGroq(temperature=1, 
model_name="meta-llama/llama-4-scout-17b-16e-instruct")

5. 初始化嵌入模型和文本分割器

初始化完成后，我们就可以建立将文本转换为向量的模型（FastEmbedEmbeddings）。我们还需要初始化将文档分割成小块的工具（RecursiveCharacterTextSplitter）。

embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

splitter = RecursiveCharacterTextSplitter(chunk_size=1000,

chunk_overlap=200)

embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5") splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, 
chunk_overlap=200)

说明

FastEmbedEmbeddings 使用 BAAI/bge-base-en-v1.5 模型初始化，将文本转换为数字嵌入。
RecursiveCharacterTextSplitter 已设置为创建 1000 个字符的文本块，并有 200 个字符的重叠。
如果没有配置 HF 标记，则会出现拥抱脸警告，但不会影响 BGE 等公共模型。

6. 加载和分块文档

这段代码将从指定数据文件夹中的 PDF 文件中提取文本，并将提取的文本分割成易于管理的块（您可以用自己的 PDF 文件替换）。这里，我们使用的是 SWE lancer 研究论文。

def extract_text_from_pdf(pdf_path):

doc = fitz.open(pdf_path)

return "\n".join([page.get_text() for page in doc])

folder_path = "./data/"

documents = [extract_text_from_pdf(os.path.join(folder_path, f)) for f in os.listdir(folder_path) if f.endswith(".pdf")]

all_chunks = [chunk for doc in documents for chunk in splitter.split_text(doc)]

def extract_text_from_pdf(pdf_path): doc = fitz.open(pdf_path) return "\n".join([page.get_text() for page in doc]) folder_path = "./data/" documents = [extract_text_from_pdf(os.path.join(folder_path, f)) for f in os.listdir(folder_path) if f.endswith(".pdf")] all_chunks = [chunk for doc in documents for chunk in splitter.split_text(doc)]

def extract_text_from_pdf(pdf_path):
doc = fitz.open(pdf_path)
return "\n".join([page.get_text() for page in doc])
folder_path = "./data/"
documents = [extract_text_from_pdf(os.path.join(folder_path, f)) for f in os.listdir(folder_path) if f.endswith(".pdf")]
all_chunks = [chunk for doc in documents for chunk in splitter.split_text(doc)]

解释：

extract_text_from_pdf 函数使用 fitz 库从 PDF 的所有页面中提取文本。
它会列出指定文件夹路径下的 PDF 文件（确保文件夹和文件存在）。
函数使用定义的分割器将提取的文本分割成小块。

7. 创建FAISS向量索引

然后，我们为所有文本块生成嵌入，并为快速相似性搜索建立 FAISS 索引。

embeddings = np.array(embed_model.embed_documents(all_chunks))

index = faiss.IndexFlatL2(embeddings.shape[1])

index.add(embeddings)

embeddings = np.array(embed_model.embed_documents(all_chunks)) index = faiss.IndexFlatL2(embeddings.shape[1]) index.add(embeddings)

embeddings = np.array(embed_model.embed_documents(all_chunks))
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

解释：

检查是否创建了 all_chunk，并使用 embed_model（FastEmbed BGE）将其转换为嵌入。
嵌入信息存储在一个 NumPy 数组中，并创建一个 FAISS 索引（IndexFlatL2）用于相似性搜索。
该索引由嵌入式数据填充，并对空块或嵌入式数据进行错误处理。

8. 定义RAG核心函数（检索和回答）

这些函数实现了 RAG 的核心逻辑：根据查询检索相关数据块，并以这些数据块为上下文使用 LLM 生成答案。

def retrieve_chunks(query, k=1):

query_embedding = np.array([embed_model.embed_query(query)])

_, I = index.search(query_embedding, k)

return [all_chunks[i] for i in I[0]]

def rag_answer(model, query, retrieved_docs):

prompt = PromptTemplate(

input_variables=["document", "question"],

template="""

You are a helpful AI assistant.

Use the CONTENT below to answer the QUESTION.

If the answer isn't in the content, reply: "I don't have the answer to the question."

CONTENT: {document}

QUESTION: {question}

"""

)

chain = prompt | model | StrOutputParser()

return chain.invoke({"document": "\n".join(retrieved_docs), "question": query})

def retrieve_chunks(query, k=1): query_embedding = np.array([embed_model.embed_query(query)]) _, I = index.search(query_embedding, k) return [all_chunks[i] for i in I[0]] def rag_answer(model, query, retrieved_docs): prompt = PromptTemplate( input_variables=["document", "question"], template=""" You are a helpful AI assistant. Use the CONTENT below to answer the QUESTION. If the answer isn't in the content, reply: "I don't have the answer to the question." CONTENT: {document} QUESTION: {question} """ ) chain = prompt | model | StrOutputParser() return chain.invoke({"document": "\n".join(retrieved_docs), "question": query})

def retrieve_chunks(query, k=1):
query_embedding = np.array([embed_model.embed_query(query)])
_, I = index.search(query_embedding, k)
return [all_chunks[i] for i in I[0]]
def rag_answer(model, query, retrieved_docs):
prompt = PromptTemplate(
input_variables=["document", "question"],
template="""
You are a helpful AI assistant.
Use the CONTENT below to answer the QUESTION.
If the answer isn't in the content, reply: "I don't have the answer to the question."
CONTENT: {document}
QUESTION: {question}
"""
)
chain = prompt | model | StrOutputParser()
return chain.invoke({"document": "\n".join(retrieved_docs), "question": query})

解释

retrieve_chunks 将查询转换为嵌入，并使用 FAISS 索引找到最接近的 k 个向量。
它会返回与最接近向量的索引相对应的文本块。
rag_answer 定义了一个提示模板，将其与模型和解析器相结合，并处理空检索结果。

现在，我们已经有了由 GPT-4o 和 LLaMA 4 支持的 RAG 系统，可以进行测试了。

使用RAGAS评估RAG系统

现在我们开始使用 RAGAS 进行评估。我们的目标是了解每个模型在特定设置中的表现，并根据观察到的结果获得实用的见解。以下是相关步骤：

使用 RAGAS 评估 RAG 系统

1. 确定评估问题和参考

为此，我们首先需要设置具体问题和相应的地面实况（参考）答案。

questions = [

"What is the main goal of the SWE Lancer system?",

"What problem does the SWE Lancer paper try to solve?",

"What are the key features of the SWE Lancer system?",

]

references = [

"The main goal of the SWE Lancer system is to improve software engineering productivity and automation.",

"The paper addresses the problem of inefficient software engineering workflows and proposes a machine learning-based solution.",

"Key features include modular design, machine learning integration, and scalability.",

]

questions = [ "What is the main goal of the SWE Lancer system?", "What problem does the SWE Lancer paper try to solve?", "What are the key features of the SWE Lancer system?", ] references = [ "The main goal of the SWE Lancer system is to improve software engineering productivity and automation.", "The paper addresses the problem of inefficient software engineering workflows and proposes a machine learning-based solution.", "Key features include modular design, machine learning integration, and scalability.", ]

questions = [
"What is the main goal of the SWE Lancer system?",
"What problem does the SWE Lancer paper try to solve?",
"What are the key features of the SWE Lancer system?",
]
references = [
"The main goal of the SWE Lancer system is to improve software engineering productivity and automation.",
"The paper addresses the problem of inefficient software engineering workflows and proposes a machine learning-based solution.",
"Key features include modular design, machine learning integration, and scalability.",
]

2. 测试 RAG 答案生成（单个查询）

在进行全面评估之前，让我们先用两个模型测试 rag_answer 函数对单个问题的原始输出。

GPT-4o 测试：

rag_answer(chat_model_4o, questions[2], retrieve_chunks(questions[2], k=1))

rag_answer(chat_model_4o, questions[2], retrieve_chunks(questions[2], k=1))

输出：

我没有问题的答案。

解释

使用 GPT-4o、第三个问题和与该问题最相关的语块调用rag_answer函数。
GPT-4o 使用检索到的上下文来回答，但如果上下文不够，它就会说明没有答案。
该模型遵循提示指令，并在内容不相关时进行确认。

LLaMA 4 Scout 测试：

rag_answer(chat_model_llama, questions[2], retrieve_chunks(questions[2], k=1))

rag_answer(chat_model_llama, questions[2], retrieve_chunks(questions[2], k=1))

输出：

SWE-Lancer 系统的主要特点是：\n\n1. 它依赖于一套全面的测试案例，而不是少数经过选择的案例。它在本质上更能抵制作弊。它可以准确地反映一个模型为现实世界的工程挑战提供真正的、有经济价值的解决方案的能力。

解释

使用 chat_model_llama（LLaMA 4 Scout，通过 Groq）调用 rag_answer，获得相同的问题和检索块。
LLaMA 4 会生成一个答案，答案可能来自检索到的语块，也可能是根据上下文推断出来的。
与 GPT-4o 不同的是，即使检索到的上下文不完全相关，LLaMA 4 也能提供答案。

3. 定义完整评估函数（evaluate_model）

该函数捆绑了针对给定模型通过 RAG 管道运行所有问题的过程，然后使用 RAGAS 对结果进行评分。

def evaluate_model(model, model_name):

answers, contexts = [], []

for q in questions:

docs = retrieve_chunks(q, k=1)

ans = rag_answer(model, q, docs)

answers.append(ans)

contexts.append(docs)

dataset = Dataset.from_dict({

"question": questions,

"answer": answers,

"contexts": contexts,

"reference": references, # required for some RAGAS metrics

})

metrics = [context_precision, context_recall, faithfulness, answer_relevancy]

result = evaluate(dataset=dataset, metrics=metrics)

df = result.to_pandas()

df["model"] = model_name

print(f"RAG OUTPUT FOR {model_name}:")

for q, a in zip(questions, answers):

print(f"\nQ: {q}\nA: {a}")

return df

def evaluate_model(model, model_name): answers, contexts = [], [] for q in questions: docs = retrieve_chunks(q, k=1) ans = rag_answer(model, q, docs) answers.append(ans) contexts.append(docs) dataset = Dataset.from_dict({ "question": questions, "answer": answers, "contexts": contexts, "reference": references, # required for some RAGAS metrics }) metrics = [context_precision, context_recall, faithfulness, answer_relevancy] result = evaluate(dataset=dataset, metrics=metrics) df = result.to_pandas() df["model"] = model_name print(f"RAG OUTPUT FOR {model_name}:") for q, a in zip(questions, answers): print(f"\nQ: {q}\nA: {a}") return df

def evaluate_model(model, model_name):
answers, contexts = [], []
for q in questions:
docs = retrieve_chunks(q, k=1)
ans = rag_answer(model, q, docs)
answers.append(ans)
contexts.append(docs)
dataset = Dataset.from_dict({
"question": questions,
"answer": answers,
"contexts": contexts,
"reference": references,  # required for some RAGAS metrics
})
metrics = [context_precision, context_recall, faithfulness, answer_relevancy]
result = evaluate(dataset=dataset, metrics=metrics)
df = result.to_pandas()
df["model"] = model_name
print(f"RAG OUTPUT FOR {model_name}:")
for q, a in zip(questions, answers):
print(f"\nQ: {q}\nA: {a}")
return df

解释

遍历每个问题，检索最重要的语块，并使用模型和 rag_answer 生成答案。
在 datasets.Dataset 中存储答案和上下文，计算评估指标，并调用 ragas.evaluate。
结果会以 pandas DataFrame（包含模型名称和原始问答输出）组织起来，并返回分数。

4. 运行完整评估并显示结果

我们对两个模型执行 evaluate_model 函数，并显示结果 DataFrame，其中包含 RAGAS 分数。

gpt4o_df = evaluate_model(chat_model_4o, "GPT-4o")

llama_df = evaluate_model(chat_model_llama, "LLaMA-4")

gpt4o_df = evaluate_model(chat_model_4o, "GPT-4o") llama_df = evaluate_model(chat_model_llama, "LLaMA-4")

gpt4o_df = evaluate_model(chat_model_4o, "GPT-4o")
llama_df = evaluate_model(chat_model_llama, "LLaMA-4")

输出

运行完整评估并显示结果

llama_df

llama_df

Gpt4o_df

Gpt4o_df

解释：

使用 evaluate_model 对 GPT-4o 和 LLaMA 4 Scout 进行评估，显示 RAGAS 进度和原始 Q&A 输出。
数据帧（gpt4o_df 和 llama_df）显示 context_precision 和 context_recall 为 0.0，表明检索失败。
GPT-4o 的忠实度较低（由于拒绝），但 LLaMA 4 的忠实度较高（答案一致）；LLaMA 4 的答案相关性较高。

现在测试部分已经完成，让我们来看看结果。

LLaMA 4与GPT-4o：结果与分析

通过 RAGAS 评估，代码的执行提供了明确的量化结果。

定性观察

LLaMA 4 Scout：从 RAG 输出部分和单项测试中可以看出，该模型为所有问题生成了答案，即使检索到的上下文可能不充分或不相关（RAGAS 分数显示）。它提供的答案听起来与所提问题相关。

GPT-4o：始终回答“我没有问题的答案”。这与在所提供的上下文中找不到答案时的提示指令一致，表明它正确地识别出检索到的上下文对回答具体问题没有帮助。

量化总结

下面是 RAGAS 数据框（gpt4_df、llama_df）显示的摘要：

指标值	LLaMA 4 Scout (Avg)	GPT-4o (Avg)	解释说明
上下文精确度	0.0	0.0	检索未找到相关数据块。
上下文召回率	0.0	0.0	检索未找到相关语块。
忠实度	1.0	~0.33 (Variable)	LLaMA 停留在（不相关的）上下文中。GPT-4o 拒绝。
答案相关性	~0.996	0.0	LLaMA 的回答听起来相关。GPT-4o 没有回答。

结果解读

通过解读 RAGAS 分数，我们可以深入了解 LLaMA 4 与 GPT-4o 在处理检索失败这一特定测试中的表现。

LLaMA 4 Scout的行为

尽管语境不佳，但 LLaMA 4 生成的答案被 RAGAS 认为高度相关（答案相关性 ~0.996）且完全忠实（忠实度 1.0）。这意味着它的答案虽然可能是基于其内部知识而非检索到的文本，但与所提供的单一（不相关）语块一致，而且听起来与问题相符。它优先考虑生成一个可信的答案。

GPT-4o 的行为

GPT-4o 严格遵守提示指令，只根据上下文作答。由于上下文毫无用处（精确度/召回率为 0.0），它正确地拒绝回答，导致答案相关性为 0.0。这凸显了 GPT-4o 与 LLaMA 4 在缺少上下文时的准确性策略上的明显差异；GPT-4o 更倾向于保持沉默，而不是因检索不准确而可能造成的不准确。GPT-4o 的平均忠实度得分较低，这反映出 RAGAS 有时会对这些拒绝进行惩罚，尽管在语境不佳的情况下，拒绝本身是忠实于指令的。它优先考虑事实基础和避免幻觉。

小结

本实验使用 RAGAS 框架，在特定的 RAG 设置上比较了 LLaMA 4 和 GPT-4o。通过实际测试，我们清楚地展示了 LLaMA 4 Scout 和 GPT-4o 之间的不同行为，尤其是在遇到检索失败时。

LLaMA 4 Scout 显示出一种倾向，即即使在上下文不充分的情况下，也能生成听起来合理、相关的答案。这一特点可能适用于头脑风暴等风险较低的应用。相反，GPT-4o 则表现出对指令的严格遵守，拒绝在没有足够检索信息的情况下生成答案。这种保守的方法使其更适合要求高可靠性和最小幻觉的应用场景。

事实证明，RAGAS 框架非常重要，它不仅能对输出结果进行评分，还能找出检索步骤失败的根本原因（上下文精确度/召回率 = 0.0），从而解释观察到的模型响应差异。利用这种设置，您可以比较任何 LLM 在实际用例中的性能。

LLaMA 4与GPT-4o对比：哪个更适合RAG？

了解模型

什么是RAGAS？

使用RAGAS实施和评估RAG

构建RAG系统

1. 安装必要的库

2. 设置 API 密钥

3. 导入程序库

4. 初始化语言模型

5. 初始化嵌入模型和文本分割器

6. 加载和分块文档

7. 创建FAISS向量索引

8. 定义RAG核心函数（检索和回答）

使用RAGAS评估RAG系统

1. 确定评估问题和参考

2. 测试 RAG 答案生成（单个查询）

3. 定义完整评估函数（evaluate_model）

4. 运行完整评估并显示结果

LLaMA 4与GPT-4o：结果与分析

定性观察

量化总结

结果解读

小结

评论留言

取消回复

文章目录

LLaMA 4与GPT-4o对比：哪个更适合RAG？

了解模型

什么是RAGAS？

使用RAGAS实施和评估RAG

构建RAG系统

1. 安装必要的库

2. 设置 API 密钥

3. 导入程序库

4. 初始化语言模型

5. 初始化嵌入模型和文本分割器

6. 加载和分块文档

7. 创建FAISS向量索引

8. 定义RAG核心函数（检索和回答）

使用RAGAS评估RAG系统

1. 确定评估问题和参考

2. 测试 RAG 答案生成（单个查询）

3. 定义完整评估函数（evaluate_model）

4. 运行完整评估并显示结果

LLaMA 4与GPT-4o：结果与分析

定性观察

量化总结

结果解读

小结

相关文章

评论留言

取消回复

文章目录