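LLaMA 4 vs. GPT-4o: Which Is Better for RAG?

As large language models (LLMs) continue to evolve rapidly, one of their most prominent applications is retrieval-augmented generation (RAG). A RAG system connects a model to external information sources, which grounds its answers in facts and makes them more reliable. In this article, we compare the performance and accuracy of two well-known models, Meta's LLaMA 4 Scout and OpenAI's GPT-4o, inside a RAG system. We first build a RAG system with LangChain, FAISS, and FastEmbed, then evaluate it with the RAGAS framework to compare LLaMA 4 and GPT-4o.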

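Understanding the Models

Before diving into the comparison, here is a brief introduction to the two models:

LLaMA 4 Scout

LLaMA 4 Scout is the most efficient model in Meta's recently released LLaMA 4 family. It performs well on benchmarks and supports a very large context window of up to 10 million tokens. Compared with some other models, it also declines to answer sensitive questions less often. LLaMA 4 on the Groq API has additionally drawn attention for its inference speed.

Because Meta has released the weights publicly, developers can inspect and use its pretrained parameters. This transparency makes it attractive for research and custom development.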

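GPT-4o

GPT-4o represents a recent advance in OpenAI's GPT series. It brings improvements in reasoning, coding tasks, and overall response quality. It is designed to use compute efficiently while competing closely with other top-tier models.

What Is RAGAS?

Evaluating a RAG system means checking both how well it retrieves information and how well it generates answers from that information. Looking only at the final answer is not enough.

RAGAS (Retrieval Augmented Generation Assessment) provides metrics for evaluating the different parts of a RAG pipeline, several of which do not require hand-written reference answers. The key metrics in RAGAS include:

Faithfulness: Does the generated answer accurately reflect the information in the retrieved documents?
Answer relevancy: Is the answer actually relevant to the question asked?
Context precision and recall: How effective was the retrieval step? Did it find the relevant information?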
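As a rough illustration of what these scores mean, the sketch below computes two of them as simple ratios. In RAGAS itself the per-statement verdicts come from an LLM judge; here they are hard-coded, hypothetical booleans, and the function name supported_ratio is ours rather than part of the library.

```python
def supported_ratio(verdicts):
    """Fraction of statements judged as supported; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Faithfulness: which claims in the generated answer are backed by the retrieved context?
answer_claims_supported = [True, True, False]  # hypothetical judge verdicts
faithfulness_score = supported_ratio(answer_claims_supported)

# Context recall: which statements in the reference answer can be found in the context?
reference_statements_found = [False, False]  # hypothetical judge verdicts
context_recall_score = supported_ratio(reference_statements_found)

print(faithfulness_score, context_recall_score)
```

A context recall of 0.0, as in the second case, means that none of the reference statements could be traced back to the retrieved text.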

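With these metrics we get a clearer picture of where a RAG system is strong and where it falls short. Now let's see how to implement RAG and evaluate the models with RAGAS.

Implementing and Evaluating RAG with RAGAS

In this section, we first walk through the steps and code used in a Jupyter notebook to set up the RAG pipeline, including chat model instances for GPT-4o and for LLaMA 4 Scout (served through Groq). We then run a RAGAS evaluation on both RAG systems.

Building the RAG System

The following steps build a RAG system with GPT-4o and LLaMA 4.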

1. Install the Necessary Libraries

First, we install the Python packages needed for LangChain, Groq, OpenAI, the vector store (FAISS), PDF processing (PyMuPDF), embeddings (FastEmbed), and evaluation (Ragas, together with the datasets package used to feed it).

!pip install -q langchain_groq langchain_community faiss-cpu pymupdf langchain fastembed langchain-openai ragas datasets

2. Set the API Keys

Next, we configure the API keys for OpenAI and Groq. The snippet below sets them as environment variables; replace the placeholder strings with your own keys, and avoid leaving real keys in a shared notebook (in Google Colab, the Secrets panel is one place to keep them).

import os
os.environ["OPENAI_API_KEY"] = "your_openai_api"
os.environ["GROQ_API_KEY"] = "your_groq_api"
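Since a missing or placeholder key only surfaces later as an authentication error, it can help to check the variables up front. The following is a minimal sketch; require_env is a hypothetical helper name, and treating values that start with "your_" as unset is our assumption based on the placeholders above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    # Assumption: placeholder values look like "your_openai_api" and count as unset.
    if not value or value.startswith("your_"):
        raise RuntimeError(f"{name} is not set; provide a real key before running the notebook")
    return value
```

Calling require_env("OPENAI_API_KEY") and require_env("GROQ_API_KEY") right after setting the keys makes a configuration mistake visible before any model call is made.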

3. Import the Libraries

Now we import the specific classes and functions we need from the installed libraries.

# Standard library, PDF parsing (PyMuPDF is imported as fitz), and numeric tools
import os
import fitz
import numpy as np
import faiss
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datasets import Dataset
# LangChain components: embeddings, splitter, prompt, parser, chat models
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_groq import ChatGroq
# RAGAS evaluation entry point and metrics
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

4. Initialize the Language Models

Now for the main part. We create instances of the two chat models we want to compare: GPT-4o and LLaMA 4 Scout (via Groq). Note that temperature=1 makes responses more variable than temperature=0, so answers and scores can differ from run to run.

chat_model_4o = ChatOpenAI(temperature=1, model_name="gpt-4o")
chat_model_llama = ChatGroq(
    temperature=1,
    model_name="meta-llama/llama-4-scout-17b-16e-instruct",
)
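To see why a higher temperature produces more varied replies, consider the standard temperature-scaled softmax used when sampling tokens. The sketch below uses made-up logits for three candidate tokens; it illustrates the general formula and is not a description of how either hosted API is implemented internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually treated as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))  # flatter distribution
print(softmax_with_temperature(logits, 0.2))  # mass concentrates on the top token
```

At a low temperature nearly all of the probability goes to the highest-scoring token, which is why a setting near 0 is the usual choice when reproducible evaluation runs matter.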

5. Initialize the Embedding Model and Text Splitter

With the chat models in place, we create the model that converts text into vectors (FastEmbedEmbeddings) and the tool that splits documents into smaller chunks (RecursiveCharacterTextSplitter).

embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

Explanation:

FastEmbedEmbeddings is initialized with the BAAI/bge-base-en-v1.5 model, which converts text into numeric embeddings.
RecursiveCharacterTextSplitter is configured to produce chunks of up to 1,000 characters with a 200-character overlap.
If no HF token is configured, a Hugging Face warning may appear, but it does not affect public models such as BGE.
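To make the chunk_size and chunk_overlap settings concrete, here is a deliberately simplified sliding-window splitter. It is only a sketch: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line boundaries, which this hypothetical split_with_overlap function ignores.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 2400  # stand-in for extracted PDF text
print([len(c) for c in split_with_overlap(sample)])
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.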

6. Load and Chunk the Documents

This code extracts text from the PDF files in the specified data folder and splits it into manageable chunks (you can substitute your own PDFs). Here we use the SWE-Lancer research paper.

def extract_text_from_pdf(pdf_path):
    # Open the PDF with PyMuPDF and join the text of every page.
    doc = fitz.open(pdf_path)
    return "\n".join([page.get_text() for page in doc])
folder_path = "./data/"
# Extract the text of every PDF in the data folder.
documents = [
    extract_text_from_pdf(os.path.join(folder_path, f))
    for f in os.listdir(folder_path)
    if f.endswith(".pdf")
]
# Split each document and flatten the result into one list of chunks.
all_chunks = [chunk for doc in documents for chunk in splitter.split_text(doc)]

Explanation:

extract_text_from_pdf uses the fitz library (PyMuPDF) to extract the text from every page of a PDF.
The first list comprehension lists the PDF files under the given folder path (make sure the folder and files exist).
The splitter defined earlier then breaks the extracted text into chunks.
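Because the loading code assumes that ./data/ exists and contains at least one PDF, a small pre-check can give a clearer error than a failure further down the pipeline. This is a sketch using only the standard library; list_pdfs is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"data folder not found: {root}")
    # Compare the suffix case-insensitively so that "paper.PDF" is also picked up.
    pdfs = sorted(p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")
    if not pdfs:
        raise ValueError(f"no PDF files found in {root}")
    return pdfs
```

The returned paths can be passed to extract_text_from_pdf one by one; sorting them keeps the order of all_chunks stable across runs.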

7. Create the FAISS Vector Index

Next, we generate embeddings for all the text chunks and build a FAISS index for fast similarity search.

# FAISS works with float32 vectors, so cast explicitly.
embeddings = np.array(embed_model.embed_documents(all_chunks), dtype="float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

Explanation:

embed_model (FastEmbed with BGE) converts every chunk in all_chunks into an embedding.
The embeddings are stored in a NumPy array, and a FAISS index (IndexFlatL2) is created for similarity search.
The index is then populated with the embeddings. The snippet has no error handling, so it fails if all_chunks is empty.
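IndexFlatL2 performs an exact, brute-force search: it compares the query with every stored vector and ranks them by squared Euclidean distance. The pure-Python sketch below mirrors that idea on toy 2-D vectors; it stands in for FAISS only for illustration and has none of its speed.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k=1):
    """Return (distances, indices) of the k stored vectors closest to `query`."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]  # toy 2-D "embeddings"
distances, indices = flat_l2_search(vectors, [0.9, 1.2], k=2)
print(indices, distances)
```

The nearest vector is always returned, however far away it is. A search with k=1 therefore yields some chunk even when nothing in the index is actually relevant to the query.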

8. Define the Core RAG Functions (Retrieve and Answer)

These functions implement the core RAG logic: retrieve the chunks relevant to a query, then use an LLM to generate an answer with those chunks as context.

def retrieve_chunks(query, k=1):
    # Embed the query and look up the k nearest chunk vectors in the FAISS index.
    query_embedding = np.array([embed_model.embed_query(query)], dtype="float32")
    _, I = index.search(query_embedding, k)
    return [all_chunks[i] for i in I[0]]
def rag_answer(model, query, retrieved_docs):
    prompt = PromptTemplate(
        input_variables=["document", "question"],
        template="""
You are a helpful AI assistant.
Use the CONTENT below to answer the QUESTION.
If the answer isn't in the content, reply: "I don't have the answer to the question."
CONTENT: {document}
QUESTION: {question}
""",
    )
    # Pipe the prompt into the model and parse the reply into a plain string.
    chain = prompt | model | StrOutputParser()
    return chain.invoke({"document": "\n".join(retrieved_docs), "question": query})

Explanation:

retrieve_chunks converts the query into an embedding and uses the FAISS index to find the k nearest vectors.
It returns the text chunks whose positions match the indices of those nearest vectors.
rag_answer defines a prompt template and chains it with the model and the output parser. It does not special-case an empty retrieval result; an empty list simply produces an empty CONTENT section.
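If you want to skip the model call when retrieval returns nothing, the prompt assembly can be separated out and guarded. The sketch below uses plain string formatting instead of PromptTemplate so that it runs without LangChain; build_prompt, TEMPLATE, and REFUSAL are hypothetical names introduced for illustration.

```python
REFUSAL = "I don't have the answer to the question."

TEMPLATE = """You are a helpful AI assistant.
Use the CONTENT below to answer the QUESTION.
If the answer isn't in the content, reply: "{refusal}"
CONTENT: {document}
QUESTION: {question}
"""


def build_prompt(retrieved_docs, question):
    """Return the filled-in prompt, or None when there is nothing to ground on."""
    docs = [d.strip() for d in retrieved_docs if d and d.strip()]
    if not docs:
        return None  # the caller can reply with REFUSAL without calling the model
    return TEMPLATE.format(refusal=REFUSAL, document="\n".join(docs), question=question)


print(build_prompt(["Chunk one.", "Chunk two."], "What is covered?"))
```

A caller would check for None first and only invoke the chat model when a prompt was actually built.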

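We now have RAG systems powered by GPT-4o and LLaMA 4 that are ready to test.

Evaluating the RAG System with RAGAS

Now we turn to the RAGAS evaluation. The goal is to see how each model behaves in this specific setup and to draw practical insights from the observed results. The steps are as follows: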

1. Define the Evaluation Questions and References

We first need a set of specific questions and the corresponding ground-truth (reference) answers.

questions = [
    "What is the main goal of the SWE Lancer system?",
    "What problem does the SWE Lancer paper try to solve?",
    "What are the key features of the SWE Lancer system?",
]
# One reference answer per question, in the same order.
references = [
    "The main goal of the SWE Lancer system is to improve software engineering productivity and automation.",
    "The paper addresses the problem of inefficient software engineering workflows and proposes a machine learning-based solution.",
    "Key features include modular design, machine learning integration, and scalability.",
]

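2. Test RAG Answer Generation (Single Query)

Before running the full evaluation, let's test the raw output of rag_answer on a single question with each model.

GPT-4o test:

rag_answer(chat_model_4o, questions[2], retrieve_chunks(questions[2], k=1))

Output:

I don't have the answer to the question.

Explanation:

rag_answer is called with GPT-4o, the third question, and the single chunk retrieved as most similar to that question.
GPT-4o answers from the retrieved context, and when the context is insufficient it says that it has no answer.
The model follows the prompt instruction and acknowledges when the content is not relevant.

LLaMA 4 Scout test:

rag_answer(chat_model_llama, questions[2], retrieve_chunks(questions[2], k=1))

Output:

The key features of the SWE-Lancer system are:\n\n1. It relies on a comprehensive suite of test cases rather than a handful of selected ones. It is inherently more resistant to cheating. It more accurately reflects a model's ability to deliver genuine, economically valuable solutions to real-world engineering challenges.

Explanation:

rag_answer is called with chat_model_llama (LLaMA 4 Scout via Groq) on the same question and retrieved chunk.
LLaMA 4 produces an answer, which may come from the retrieved chunk or may be inferred from the context.
Unlike GPT-4o, LLaMA 4 provides an answer even when the retrieved context is not fully relevant.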
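When comparing many answers, it is convenient to flag refusals automatically instead of reading each reply. The sketch below checks whether an answer contains the refusal sentence from the prompt, ignoring case, punctuation, and curly apostrophes; is_refusal and normalize are hypothetical helpers, and a model that rephrases its refusal would not be caught.

```python
import string

REFUSAL = "I don't have the answer to the question."


def normalize(text):
    """Lowercase, straighten curly apostrophes, drop ASCII punctuation, collapse spaces."""
    text = text.lower().replace("\u2019", "'")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def is_refusal(answer):
    """True when the answer contains the refusal sentence from the prompt."""
    return normalize(REFUSAL) in normalize(answer)


print(is_refusal("I don't have the answer to the question."))
print(is_refusal("The key features of the SWE-Lancer system are: ..."))
```

Counting refusals per model gives a quick qualitative check alongside the RAGAS scores.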

3. Define the Full Evaluation Function (evaluate_model)

This function bundles the whole process for a given model: run every question through the RAG pipeline, then score the results with RAGAS.

def evaluate_model(model, model_name):
    answers, contexts = [], []
    # Run every question through retrieval and generation.
    for q in questions:
        docs = retrieve_chunks(q, k=1)
        ans = rag_answer(model, q, docs)
        answers.append(ans)
        contexts.append(docs)
    # Collect everything RAGAS needs into a Hugging Face Dataset.
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "reference": references,  # required for some RAGAS metrics
    })
    metrics = [context_precision, context_recall, faithfulness, answer_relevancy]
    result = evaluate(dataset=dataset, metrics=metrics)
    df = result.to_pandas()
    df["model"] = model_name
    # Print the raw question/answer pairs for manual inspection.
    print(f"RAG OUTPUT FOR {model_name}:")
    for q, a in zip(questions, answers):
        print(f"\nQ: {q}\nA: {a}")
    return df

Explanation:

The function iterates over the questions, retrieves the top chunk for each, and generates an answer with the model through rag_answer.
It stores the questions, answers, contexts, and references in a datasets.Dataset and calls ragas.evaluate to compute the metrics.
The scores are returned as a pandas DataFrame with an added model column, and the raw Q&A output is printed along the way.

4. Run the Full Evaluation and Display the Results

We run evaluate_model for both models and display the resulting DataFrames, which contain the RAGAS scores.

gpt4o_df = evaluate_model(chat_model_4o, "GPT-4o")
llama_df = evaluate_model(chat_model_llama, "LLaMA-4")

To view the scores, display each DataFrame in its own notebook cell:

llama_df

gpt4o_df

Explanation:

Both GPT-4o and LLaMA 4 Scout are evaluated with evaluate_model, which shows the RAGAS progress and the raw Q&A output.
The DataFrames (gpt4o_df and llama_df) show context_precision and context_recall of 0.0, indicating that retrieval failed.
GPT-4o's faithfulness is low (because of its refusals), while LLaMA 4's is high (its answers are consistent with the context); LLaMA 4 also scores higher on answer relevancy.
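To compare models at a glance, the per-question scores can be averaged per model. The sketch below does this with the standard library on plain dictionaries; the rows are hypothetical placeholder values that only show the mechanics and are not the results of this experiment. With the real DataFrames, concatenating them and grouping by the model column in pandas would achieve the same thing.

```python
from statistics import mean

# Hypothetical per-question scores, only to show the mechanics.
rows = [
    {"model": "model-a", "faithfulness": 1.0, "answer_relevancy": 0.9},
    {"model": "model-a", "faithfulness": 0.5, "answer_relevancy": 0.7},
    {"model": "model-b", "faithfulness": 0.0, "answer_relevancy": 0.0},
]


def average_by_model(rows, metrics):
    """Group rows by their "model" field and average each metric."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["model"], []).append(row)
    return {
        model: {m: mean(r[m] for r in group) for m in metrics}
        for model, group in grouped.items()
    }


print(average_by_model(rows, ["faithfulness", "answer_relevancy"]))
```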

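With the testing done, let's look at the results.

LLaMA 4 vs. GPT-4o: Results and Analysis

Running the code with the RAGAS evaluation gives clear quantitative results.

Qualitative Observations

LLaMA 4 Scout: As the RAG output and the single-query test show, the model generated an answer for every question, even though the retrieved context was likely insufficient or irrelevant (as the RAGAS scores indicate). Its answers sound relevant to the questions asked.

GPT-4o: It consistently replied "I don't have the answer to the question." This matches the prompt instruction for cases where the answer cannot be found in the provided context, and indicates that it correctly recognized that the retrieved context did not help answer the specific question.

Quantitative Summary

The table below summarizes what the RAGAS DataFrames (gpt4o_df, llama_df) show:

Metric	LLaMA 4 Scout (Avg)	GPT-4o (Avg)	Interpretation
Context precision	0.0	0.0	Retrieval did not find relevant chunks.
Context recall	0.0	0.0	Retrieval did not find relevant chunks.
Faithfulness	1.0	~0.33 (variable)	LLaMA stayed within the (irrelevant) context. GPT-4o refused.
Answer relevancy	~0.996	0.0	LLaMA's answers sound relevant. GPT-4o did not answer.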

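Interpreting the Results

Reading the RAGAS scores gives deeper insight into how LLaMA 4 and GPT-4o behave in this particular test, which in effect measures how each model handles a retrieval failure.

LLaMA 4 Scout's Behavior

Despite the poor context, RAGAS judged LLaMA 4's answers to be highly relevant (answer relevancy ~0.996) and fully faithful (faithfulness 1.0). In other words, its answers, although they may draw on the model's internal knowledge rather than the retrieved text, were consistent with the single (irrelevant) chunk provided and sounded on-topic for the question. It prioritizes producing a plausible answer.

GPT-4o's Behavior

GPT-4o adhered strictly to the prompt instruction to answer only from the context. Because the context was useless (precision and recall of 0.0), it correctly declined to answer, which led to an answer relevancy of 0.0. This highlights a clear difference in strategy between GPT-4o and LLaMA 4 when context is missing: GPT-4o prefers to stay silent rather than risk an inaccurate answer caused by faulty retrieval. GPT-4o's lower average faithfulness score reflects the fact that RAGAS sometimes penalizes these refusals, even though refusing is itself faithful to the instructions when the context is poor. It prioritizes factual grounding and avoiding hallucination.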
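The answer relevancy of 0.0 for a refusal is easier to understand with a sketch of how such a score can be computed. To our understanding, RAGAS regenerates questions from the answer, compares their embeddings with the original question by cosine similarity, and scores evasive answers as zero. The code below is a simplified illustration under that assumption, with hand-written toy vectors in place of real embeddings and a hypothetical function name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relevancy_sketch(question_vec, regenerated_vecs, noncommittal):
    """Mean cosine similarity, forced to 0.0 when the answer is evasive."""
    if noncommittal or not regenerated_vecs:
        return 0.0
    sims = [cosine(question_vec, v) for v in regenerated_vecs]
    return sum(sims) / len(sims)


question = [1.0, 0.0]  # toy embedding of the original question
print(relevancy_sketch(question, [[0.9, 0.1], [1.0, 0.0]], noncommittal=False))
print(relevancy_sketch(question, [[0.9, 0.1]], noncommittal=True))
```

Under this reading, a refusal scores zero on relevancy by construction, regardless of whether refusing was the right call for the given context.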

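Summary

This experiment used the RAGAS framework to compare LLaMA 4 and GPT-4o on one specific RAG setup (three questions, one retrieved chunk each). The hands-on test clearly showed how differently LLaMA 4 Scout and GPT-4o behave, especially when retrieval fails.

LLaMA 4 Scout showed a tendency to produce plausible-sounding, relevant answers even when the context was insufficient. That trait may suit lower-risk applications such as brainstorming. GPT-4o, by contrast, adhered strictly to the instructions and declined to answer without sufficient retrieved information. This conservative approach makes it better suited to applications that demand high reliability and minimal hallucination.

The RAGAS framework proved valuable: beyond scoring the outputs, it pinpointed the root cause in the failed retrieval step (context precision and recall of 0.0), which explains the observed difference in the models' responses. With this setup, you can compare the performance of any LLMs on your own use case.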
