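How to Evaluate LLMs with Hugging Face Evaluate

Evaluating large language models (LLMs) is essential. You need to know how well they perform and confirm that they meet your standards. The Hugging Face Evaluate library provides a useful set of tools for this task. This guide uses practical code examples to show how to evaluate LLMs with it.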

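Understanding the Hugging Face Evaluate Library

The Hugging Face Evaluate library provides tools for different evaluation needs. They fall into three categories:

Metrics: These measure a model's performance by comparing its predictions with ground-truth labels. Examples include accuracy, F1 score, BLEU, and ROUGE.
Comparisons: These compare two models, typically by checking how well their predictions agree with each other or with reference labels.
Measurements: These examine properties of the dataset itself, such as text complexity or label distribution.

You can access all of these evaluation modules through a single function: evaluate.load().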

Getting Started

Installation

First, install the library. Open a terminal or command prompt and run:

pip install evaluate
pip install rouge_score # Needed for text generation metrics
pip install evaluate[visualization] # For plotting capabilities

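These commands install the core evaluate library, the rouge_score package (required by the ROUGE metric commonly used for summarization), and optional dependencies for visualizations such as radar plots.

Loading Evaluation Modules

To use a specific evaluation tool, load it by name. For example, to load the accuracy metric: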

import evaluate
accuracy_metric = evaluate.load("accuracy")
print("Accuracy metric loaded.")

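Output:

Accuracy metric loaded.

This code imports the evaluate library and loads the accuracy metric object, which you will use to compute accuracy scores.

Basic Evaluation Examples

Let's walk through some common evaluation scenarios.

Computing Accuracy Directly

You can compute a metric by supplying all references (ground truth) and predictions at once.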

import evaluate
# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")
# Sample ground truth and predictions
references = [0, 1, 0, 1]
predictions = [1, 0, 0, 1]
# Compute accuracy
result = accuracy_metric.compute(references=references, predictions=predictions)
print(f"Direct computation result: {result}")
# Example with exact_match metric
exact_match_metric = evaluate.load('exact_match')
match_result = exact_match_metric.compute(references=['hello world'], predictions=['hello world'])
no_match_result = exact_match_metric.compute(references=['hello'], predictions=['hell'])
print(f"Exact match result (match): {match_result}")
print(f"Exact match result (no match): {no_match_result}")

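Explanation:

We define two lists: references holds the correct labels and predictions holds the model's outputs.
The compute method calculates accuracy from these lists and returns the result as a dictionary. Two of the four predictions match their references, so the accuracy is 0.5.
We also show the exact_match metric, which checks whether a prediction matches its reference exactly: 'hello world' matches, while 'hell' does not match 'hello'.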
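To make the arithmetic concrete, here is a minimal pure-Python sketch of what these two metrics compute on the sample data. The helper names simple_accuracy and simple_exact_match are hypothetical illustrations, not part of the evaluate library, and the library's own implementations may normalize inputs or format results differently.

```python
def simple_accuracy(references, predictions):
    """Fraction of positions where the prediction equals the reference."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    correct = sum(r == p for r, p in zip(references, predictions))
    return correct / len(references)


def simple_exact_match(references, predictions):
    """Fraction of predictions that are string-identical to their reference."""
    matches = sum(r == p for r, p in zip(references, predictions))
    return matches / len(references)


# Same sample data as the library example (hypothetical helpers, not evaluate's API)
acc = simple_accuracy([0, 1, 0, 1], [1, 0, 0, 1])
em_hit = simple_exact_match(["hello world"], ["hello world"])
em_miss = simple_exact_match(["hello"], ["hell"])
print(acc, em_hit, em_miss)
```

Both helpers reduce to counting matching pairs and dividing by the number of examples, which is why a single wrong character is enough to turn an exact match into a miss.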

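Incremental Evaluation (Using add_batch)

For large datasets, processing predictions in batches is more memory-efficient. You can add batches incrementally and compute the final score at the end.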

import evaluate
# Load the accuracy metric
accuracy_metric = evaluate.load("accuracy")
# Sample batches of references and predictions
references_batch1 = [0, 1]
predictions_batch1 = [1, 0]
references_batch2 = [0, 1]
predictions_batch2 = [0, 1]
# Add batches incrementally
accuracy_metric.add_batch(references=references_batch1, predictions=predictions_batch1)
accuracy_metric.add_batch(references=references_batch2, predictions=predictions_batch2)
# Compute final accuracy
final_result = accuracy_metric.compute()
print(f"Incremental computation result: {final_result}")

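Explanation:

We simulate processing the data in two batches.
add_batch updates the metric's internal state with each batch.
Calling compute() with no arguments calculates the metric over all added batches. Here the first batch has no correct predictions and the second has two, so the accuracy over all four examples is 0.5.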
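The accumulate-then-compute pattern can be sketched in plain Python. RunningAccuracy below is a hypothetical stand-in written for illustration; it is an assumption about the general idea, not a description of how the evaluate library stores batches internally.

```python
class RunningAccuracy:
    """Hypothetical stand-in that mimics the add_batch/compute pattern."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, references, predictions):
        # Only two counters are kept, so memory use does not grow with the dataset.
        self.correct += sum(r == p for r, p in zip(references, predictions))
        self.total += len(references)

    def compute(self):
        if self.total == 0:
            raise ValueError("no batches were added")
        result = {"accuracy": self.correct / self.total}
        # Reset the counters so this sketch object can be reused for another run.
        self.correct = 0
        self.total = 0
        return result


running = RunningAccuracy()
running.add_batch(references=[0, 1], predictions=[1, 0])  # 0 of 2 correct
running.add_batch(references=[0, 1], predictions=[0, 1])  # 2 of 2 correct
final = running.compute()
print(final)
```

The design point is that the final score depends only on the accumulated state, so batches can be as small as memory requires without changing the result.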

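Combining Multiple Metrics

You often need to compute several metrics at once (for example accuracy, F1, precision, and recall for classification). The evaluate.combine function simplifies this.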

import evaluate
# Combine multiple classification metrics
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
# Sample data
predictions = [0, 1, 0]
references = [0, 1, 1] # Note: The last prediction is incorrect
# Compute all metrics at once
results = clf_metrics.compute(predictions=predictions, references=references)
print(f"Combined metrics result: {results}")

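Explanation:

evaluate.combine takes a list of metric names and returns a combined evaluation object.
Calling compute on that object calculates every specified metric from the same input data. For this sample, with label 1 as the positive class, accuracy is 2/3, precision is 1.0, recall is 0.5, and F1 is 2/3.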
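As a cross-check on those numbers, the following sketch derives all four values by hand from true positives, false positives, and false negatives. The function binary_report and its positive_label parameter are hypothetical names chosen for this illustration; it assumes binary classification with label 1 as the positive class.

```python
def binary_report(references, predictions, positive_label=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(references, predictions))
    tp = sum(r == positive_label and p == positive_label for r, p in pairs)
    fp = sum(r != positive_label and p == positive_label for r, p in pairs)
    fn = sum(r == positive_label and p != positive_label for r, p in pairs)
    correct = sum(r == p for r, p in pairs)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


report = binary_report(references=[0, 1, 1], predictions=[0, 1, 0])
print(report)
```

The single missed positive lowers recall but not precision, which is why the two diverge even on a three-example dataset.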

Using Measurements

Measurements can be used to analyze datasets. Here is how to use the word_length measurement:

import evaluate
# Load the word_length measurement
# Note: May require NLTK data download on first run
try:
   word_length = evaluate.load("word_length", module_type="measurement")
   data = ["hello world", "this is another sentence"]
   results = word_length.compute(data=data)
   print(f"Word length measurement result: {results}")
except Exception as e:
   print(f"Could not run word_length measurement, possibly NLTK data missing: {e}")
   print("Attempting NLTK download...")
   import nltk
   nltk.download('punkt') # Fetch the tokenizer data, then re-run this script

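Explanation:

We load word_length and specify module_type="measurement".
The compute method takes the dataset (here, a list of strings) as input.
It returns word-length statistics for the supplied data. (Note: this requires nltk and its 'punkt' tokenizer data.)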
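For intuition about what a dataset-level measurement looks like, here is a small sketch that averages the number of words per string. It assumes that statistic is the one of interest and uses plain whitespace splitting rather than the NLTK tokenizer, so its result can differ from the library's on text with punctuation. The name average_word_count is hypothetical.

```python
def average_word_count(texts):
    """Average number of whitespace-separated words per text."""
    if not texts:
        raise ValueError("texts must not be empty")
    return sum(len(text.split()) for text in texts) / len(texts)


data = ["hello world", "this is another sentence"]
avg_words = average_word_count(data)
print(f"Average words per text: {avg_words}")
```

Note that the function only looks at the inputs themselves; no predictions or references are involved, which is what separates a measurement from a metric.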

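Evaluating Specific NLP Tasks

Different NLP tasks call for specific metrics. Hugging Face Evaluate includes many standard ones.

Machine Translation (BLEU)

BLEU (Bilingual Evaluation Understudy) is a common metric for translation quality. It measures n-gram overlap between the model's translations (hypotheses) and reference translations.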

import evaluate
def evaluate_machine_translation(hypotheses, references):
   """Calculates BLEU score for machine translation."""
   bleu_metric = evaluate.load("bleu")
   results = bleu_metric.compute(predictions=hypotheses, references=references)
   # Extract the main BLEU score
   bleu_score = results["bleu"]
   return bleu_score
# Example hypotheses (model translations)
hypotheses = ["the cat sat on mat.", "the dog played in garden."]
# Example references (correct translations, can have multiple per hypothesis)
references = [["the cat sat on the mat."], ["the dog played in the garden."]]
bleu_score = evaluate_machine_translation(hypotheses, references)
print(f"BLEU Score: {bleu_score:.4f}") # Format for readability

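Explanation:

The function loads the BLEU metric.
It compares the predicted translations (hypotheses) with one or more correct references and computes a score.
A higher BLEU score (closer to 1.0) generally indicates better translation quality and more overlap with the references. A score of around 0.51 indicates moderate overlap.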
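The n-gram overlap behind that score can be reproduced with a short corpus-level BLEU sketch. This is a simplified illustration under stated assumptions: the regex tokenizer, the choice of the shortest reference length, and the absence of smoothing are choices made here, and the function names are hypothetical, so results from the library may differ on other inputs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_order=4):
    matches = [0] * max_order
    totals = [0] * max_order
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_tokens = tokenize(hyp)
        ref_token_lists = [tokenize(r) for r in refs]
        hyp_len += len(hyp_tokens)
        ref_len += min(len(r) for r in ref_token_lists)  # simplification
        for n in range(1, max_order + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            # Clip each n-gram by its highest count in any single reference.
            max_ref_counts = Counter()
            for ref_tokens in ref_token_lists:
                max_ref_counts |= ngram_counts(ref_tokens, n)
            matches[n - 1] += sum((hyp_counts & max_ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    # The brevity penalty punishes hypotheses shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


hypotheses = ["the cat sat on mat.", "the dog played in garden."]
references = [["the cat sat on the mat."], ["the dog played in the garden."]]
score = corpus_bleu(hypotheses, references)
print(f"Sketch BLEU: {score:.4f}")
```

On these two sentence pairs the sketch gives roughly 0.51: every unigram matches, but the missing "the" breaks several longer n-grams and makes the hypotheses shorter than the references, which triggers the brevity penalty.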

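Named Entity Recognition (NER, Using seqeval)

For sequence labeling tasks such as NER, metrics like precision, recall, and F1 score per entity type are useful. The seqeval metric handles this tag format (for example B-PER, I-PER, and O tags).

To run the following code you need the seqeval library. Install it with: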

pip install seqeval

Code:

import evaluate
# Load the seqeval metric
try:
   seqeval_metric = evaluate.load("seqeval")
   # Example labels (using IOB format)
   true_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-LOC', 'I-LOC', 'O']]
   predicted_labels = [['O', 'B-PER', 'I-PER', 'O'], ['B-LOC', 'I-LOC', 'O']] # Example: Perfect prediction here
   results = seqeval_metric.compute(predictions=predicted_labels, references=true_labels)
   print("Seqeval Results (per entity type):")
   # Print results nicely
   for key, value in results.items():
       if isinstance(value, dict):
           print(f"  {key}: Precision={value['precision']:.2f}, Recall={value['recall']:.2f}, F1={value['f1']:.2f}, Number={value['number']}")
       else:
           print(f"  {key}: {value:.4f}")
except ImportError:  # Also covers ModuleNotFoundError, which is a subclass
   print("Seqeval metric not installed. Run: pip install seqeval")

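Explanation:

We load the seqeval metric.
It takes a list of lists, where each inner list holds the labels for one sentence.
The compute method returns detailed precision, recall, and F1 scores for each entity type found (such as PER for person and LOC for location), along with overall scores. Because the predictions here are identical to the references, every score is 1.0.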
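Entity-level scoring differs from token-level accuracy because a prediction only counts when the whole span and its type are right. The sketch below illustrates that idea; extract_entities and entity_scores are hypothetical helpers, and the lenient handling of an I- tag without a preceding B- tag is an assumption of this sketch rather than documented seqeval behavior.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end) spans from one tagged sentence."""
    entities = set()
    start, current_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            entities.add((current_type, start, i))
            current_type = None
        if tag.startswith("B-") or (tag.startswith("I-") and current_type is None):
            start, current_type = i, tag[2:]
    return entities


def entity_scores(references, predictions):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    true_spans, pred_spans = set(), set()
    for idx, (ref_tags, pred_tags) in enumerate(zip(references, predictions)):
        true_spans |= {(idx, *e) for e in extract_entities(ref_tags)}
        pred_spans |= {(idx, *e) for e in extract_entities(pred_tags)}
    tp = len(true_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


true_labels = [["O", "B-PER", "I-PER", "O"], ["B-LOC", "I-LOC", "O"]]
# The person span is cut short by one token, so it no longer matches exactly.
predicted_labels = [["O", "B-PER", "O", "O"], ["B-LOC", "I-LOC", "O"]]
scores = entity_scores(true_labels, predicted_labels)
print(scores)
```

Here six of the seven tokens are tagged correctly, yet only one of the two entities matches, so entity-level precision, recall, and F1 all come out at 0.5.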

Text Summarization (ROUGE)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares a generated summary with a reference summary, focusing on overlapping n-grams and the longest common subsequence.

import evaluate
def simple_summarizer(text):
   """A very basic summarizer - just takes the first sentence."""
   try:
       sentences = text.split(".")
       return sentences[0].strip() + "." if sentences[0].strip() else ""
   except AttributeError:
       return "" # Handle non-string input such as None
# Load ROUGE metric
rouge_metric = evaluate.load("rouge")
# Example text and reference summary
text = "Today is a beautiful day. The sun is shining and the birds are singing. I am going for a walk in the park."
reference = "The weather is pleasant today."
# Generate summary using the simple function
prediction = simple_summarizer(text)
print(f"Generated Summary: {prediction}")
print(f"Reference Summary: {reference}")
# Compute ROUGE scores
rouge_results = rouge_metric.compute(predictions=[prediction], references=[reference])
print(f"ROUGE Scores: {rouge_results}")

Output:

Generated Summary: Today is a beautiful day.
Reference Summary: The weather is pleasant today.
ROUGE Scores: {'rouge1': np.float64(0.4000000000000001), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.20000000000000004), 'rougeLsum': np.float64(0.20000000000000004)}

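Explanation:

We load the rouge metric.
We define a simple summarizer for demonstration.
compute calculates several ROUGE scores: rouge1 (unigram overlap), rouge2 (bigram overlap), and rougeL and rougeLsum (both based on the longest common subsequence).
Scores close to 1.0 indicate high similarity to the reference summary. The low scores here reflect how basic our simple_summarizer is.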
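The rouge1 and rougeL values in the output can be retraced by hand. The sketch below computes F-measures from unigram overlap and from the longest common subsequence; the lowercase alphanumeric tokenizer and all function names are assumptions made for this illustration, and the library's implementation may apply additional options such as stemming.

```python
import re
from collections import Counter


def tokens(text):
    # Assumption: lowercase alphanumeric tokens, punctuation dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def f_measure(overlap, pred_len, ref_len):
    if overlap == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1(prediction, reference):
    pred, ref = tokens(prediction), tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return f_measure(overlap, len(pred), len(ref))


def lcs_length(a, b):
    # Classic dynamic-programming table for the longest common subsequence.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(prediction, reference):
    pred, ref = tokens(prediction), tokens(reference)
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))


prediction = "Today is a beautiful day."
reference = "The weather is pleasant today."
r1 = rouge1(prediction, reference)
rl = rouge_l(prediction, reference)
print(f"ROUGE-1: {r1:.2f}, ROUGE-L: {rl:.2f}")
```

The two summaries share the words "today" and "is", giving a unigram F-measure of 0.4, but the words appear in opposite order, so the longest common subsequence has length 1 and the LCS-based score drops to 0.2.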

Question Answering (SQuAD)

The SQuAD metric is used for extractive question answering benchmarks. It computes exact match (EM) and F1 scores.

import evaluate
# Load the SQuAD metric
squad_metric = evaluate.load("squad")
# Example predictions and references format for SQuAD
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(f"SQuAD Results: {results}")

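Explanation:

We load the squad metric.
It takes predictions and references in a specific dictionary format, including the predicted text and the ground-truth answers with their start positions.
exact_match: the percentage of predictions that exactly match one of the ground-truth answers.
f1: the average F1 score over all questions, which credits partial matches at the token level.
The prediction here is identical to the reference answer, so both scores are 100.0.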
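To show how partial matches earn credit, here is a sketch of SQuAD-style scoring for a single question. It follows the commonly used normalization (lowercasing, removing punctuation and English articles); the function names are hypothetical, and the code is an illustration under that assumption rather than the library's exact implementation.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))


def token_f1(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    common = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def squad_scores(prediction, truths):
    # Score against every acceptable answer and keep the best, on a 0-100 scale.
    return {
        "exact_match": 100.0 * max(exact_match(prediction, t) for t in truths),
        "f1": 100.0 * max(token_f1(prediction, t) for t in truths),
    }


perfect = squad_scores("1976", ["1976"])
partial = squad_scores("in 1976", ["1976"])
print(perfect)
print(partial)
```

The answer "in 1976" fails the exact match but still contains the correct token, so its F1 is about 66.7 instead of 0. This is why the two numbers are usually reported together.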

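Advanced Evaluation with the Evaluator Class

The Evaluator class streamlines the workflow by combining model loading, inference, and metric computation. It is especially useful for standard tasks such as text classification.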

# Note: Requires transformers and datasets libraries
# pip install transformers datasets torch # or tensorflow/jax
import evaluate
from evaluate import evaluator
from transformers import pipeline
from datasets import load_dataset
# Load a pre-trained text classification pipeline
# Using a smaller model for potentially faster execution
try:
   pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", device=-1) # Use CPU
except Exception as e:
   print(f"Could not load pipeline: {e}")
   pipe = None
if pipe:
   # Load a small subset of the IMDB dataset
   try:
       data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100)) # Smaller subset for speed
   except Exception as e:
       print(f"Could not load dataset: {e}")
       data = None
   if data:
       # Load the accuracy metric
       accuracy_metric = evaluate.load("accuracy")
       # Create an evaluator for the task
       task_evaluator = evaluator("text-classification")
       # Correct label_mapping for IMDB dataset
       label_mapping = {
           'NEGATIVE': 0,  # Map NEGATIVE to 0
           'POSITIVE': 1   # Map POSITIVE to 1
       }
       # Compute results
       eval_results = task_evaluator.compute(
           model_or_pipeline=pipe,
           data=data,
           metric=accuracy_metric,
           input_column="text",  # Specify the text column
           label_column="label", # Specify the label column
           label_mapping=label_mapping  # Pass the corrected label mapping
       )
       print("\nEvaluator Results:")
       print(eval_results)
       # Compute with bootstrapping for confidence intervals
       bootstrap_results = task_evaluator.compute(
           model_or_pipeline=pipe,
           data=data,
           metric=accuracy_metric,
           input_column="text",
           label_column="label",
           label_mapping=label_mapping,  # Pass the corrected label mapping
           strategy="bootstrap",
           n_resamples=10  # Use fewer resamples for faster demo
       )
       print("\nEvaluator Results with Bootstrapping:")
       print(bootstrap_results)

Output:

Device set to use cpu

Evaluator Results:
{'accuracy': 0.9, 'total_time_in_seconds': 24.277618517999997, 'samples_per_second': 4.119020155368932, 'latency_in_seconds': 0.24277618517999996}

Evaluator Results with Bootstrapping:
{'accuracy': {'confidence_interval': (np.float64(0.8703044820750653), np.float64(0.9335706530476571)), 'standard_error': np.float64(0.02412928142780514), 'score': 0.9}, 'total_time_in_seconds': 23.871316319000016, 'samples_per_second': 4.189128017226537, 'latency_in_seconds': 0.23871316319000013}

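Explanation:

We load a transformers pipeline for text classification and a sample of the IMDb dataset.
We create an evaluator dedicated to "text-classification".
The compute method feeds the data (the text column) to the pipeline, collects the predictions, compares them with the true labels (the label column) using the specified metric, and applies the label mapping.
It returns the metric score along with performance statistics such as total time and samples per second.
Using strategy="bootstrap" performs resampling to estimate a confidence interval and standard error for the metric, which indicates how stable the score is.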
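The idea behind bootstrapping can be sketched without any model: resample the per-example outcomes with replacement many times and look at how much the score varies. The code below uses a simple percentile interval, which is an assumption of this sketch; the library may use a different bootstrap variant, and bootstrap_accuracy with its parameters is a hypothetical helper.

```python
import random
import statistics


def bootstrap_accuracy(references, predictions, n_resamples=1000,
                       confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    outcomes = [int(r == p) for r, p in zip(references, predictions)]
    n = len(outcomes)
    scores = []
    for _ in range(n_resamples):
        # Draw n examples with replacement and score the resampled set.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(sample) / n)
    scores.sort()
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 + confidence) / 2 * n_resamples))
    return {
        "score": sum(outcomes) / n,
        "confidence_interval": (scores[lower_idx], scores[upper_idx]),
        "standard_error": statistics.stdev(scores),
    }


# Synthetic data for illustration: 100 examples, 90 of them predicted correctly.
references = [1] * 100
predictions = [1] * 90 + [0] * 10
boot = bootstrap_accuracy(references, predictions)
print(boot)
```

With only 100 examples the interval spans several percentage points around 0.9, which is a useful reminder that small evaluation sets give noisy scores. Using more resamples smooths the estimate of the interval but does not narrow it; only more data does.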

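Using Evaluation Suites

An evaluation suite bundles multiple evaluations, usually targeting a specific benchmark such as GLUE. This lets you run a model against a standard set of tasks.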

# Note: Running a full suite can be computationally intensive and time-consuming.
# This example only demonstrates the concept and skips model inference.
# A full suite downloads multiple datasets and may require specific model configurations.
import evaluate
try:
   print("\nLoading GLUE evaluation suite (this might download datasets)...")
   # Caveat: evaluate.load("glue", "mrpc") loads the GLUE metric configured for the
   # MRPC subset (other subsets include "sst2"); it is not a full EvaluationSuite.
   task = evaluate.load("glue", "mrpc")
   print("Task loaded.")
   # A real suite is loaded with evaluate.EvaluationSuite.load(...) and run on a
   # model with suite.run(...).
   # WARNING: Running a suite performs inference on every task and can take a long time.
   print("Skipping model inference for brevity in this example.")
   print("Refer to Hugging Face documentation for full EvaluationSuite usage.")
except Exception as e:
   print(f"Could not load or run evaluation suite: {e}")

Output:

Loading GLUE evaluation suite (this might download datasets)...
Task loaded.
Skipping model inference for brevity in this example.
Refer to Hugging Face documentation for full EvaluationSuite usage.

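Explanation:

EvaluationSuite.load loads a predefined set of evaluation tasks (the code here only demonstrates the MRPC task from the GLUE benchmark).
The suite.run("model_name") call typically runs the model on every dataset in the suite and computes the associated metrics.
The output is usually a list of dictionaries, each holding the results for one task in the suite. (Note: running it typically requires specific environment setup and substantial compute time.)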

Visualizing Evaluation Results

Visualization helps compare multiple models across different metrics. Radar plots work well for this.

# Sample data for multiple models across several metrics.
# Lower latency is better, so it is stored inverted (latency_inv) to make higher values better.
data = [
    {"accuracy": 0.99, "precision": 0.80, "f1": 0.95, "latency_inv": 1/33.6},
    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_inv": 1/11.2},
    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_inv": 1/87.6},
    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_inv": 1/101.6}
]
model_names = ["Model A", "Model B", "Model C", "Model D"]
try:
    # Import inside the try block so that a missing matplotlib is caught below
    import matplotlib.pyplot as plt
    from evaluate.visualization import radar_plot
    # Generate the radar plot; each dict in data corresponds to one entry in model_names
    plot = radar_plot(data=data, model_names=model_names)
    # Display the plot
    plt.show()  # Explicitly show the plot; may be necessary in some environments
    # To save the plot to a file (uncomment to use)
    # plot.savefig("model_comparison_radar.png")
    plt.close()  # Close the figure after showing/saving
except ImportError:
    print("Visualization requires matplotlib. Run: pip install matplotlib")
except Exception as e:
    print(f"Could not generate plot: {e}")

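Output:

[Figure: Visualizing evaluation results]

Explanation:

We prepare sample results for four models across accuracy, precision, F1, and inverted latency (so that higher is better on every axis).
radar_plot creates a chart in which each axis represents one metric, giving a visual comparison of the models.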
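The `latency_inv` values in the sample data are reciprocals of raw latencies, so that a larger value means a faster model. A minimal sketch of that preparation step follows. It assumes a hypothetical `latency` key holding each model's raw latency, and the helper name `with_inverted_latency` is illustrative, not part of the library:

```python
def with_inverted_latency(results, key="latency", new_key="latency_inv"):
    """Return copies of the result dicts with `key` replaced by its reciprocal.

    `key` and `new_key` are illustrative names, not part of evaluate.
    """
    prepared = []
    for row in results:
        if row[key] <= 0:
            raise ValueError(f"latency must be positive, got {row[key]}")
        # Copy every metric except the raw latency, then add the inverted value
        new_row = {k: v for k, v in row.items() if k != key}
        new_row[new_key] = 1.0 / row[key]
        prepared.append(new_row)
    return prepared


raw_results = [
    {"accuracy": 0.99, "precision": 0.80, "f1": 0.95, "latency": 33.6},
    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency": 11.2},
]
data = with_inverted_latency(raw_results)
# The model with the lower raw latency now has the larger latency_inv value.
print(data[1]["latency_inv"] > data[0]["latency_inv"])  # True
```

The radar_plot documentation also describes an `invert_range` argument for metrics where smaller is better. Whether it is available depends on the installed version, so it is worth checking before relying on it.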

Saving Evaluation Results

You can save evaluation results to a file (usually in JSON format) for record keeping or later analysis.

import json
from pathlib import Path
import evaluate
# Perform an evaluation
accuracy_metric = evaluate.load("accuracy")
result = accuracy_metric.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])
print(f"Result to save: {result}")
# Define hyperparameters or other metadata
hyperparams = {"model_name": "my_custom_model", "learning_rate": 0.001}
run_details = {"experiment_id": "run_42"}
# Combine results and metadata
save_data = {**result, **hyperparams, **run_details}
# Define the save directory
save_dir = Path("./evaluation_results")
save_dir.mkdir(exist_ok=True)  # Create directory if it doesn't exist
def save_json(path):
    # Manually write the combined dictionary as JSON
    with open(path, "w") as f:
        json.dump(save_data, f, indent=4)
    print(f"Results manually saved to: {path}")
# Use evaluate.save to store the results
try:
    # The target is the first positional parameter (path_or_file), not save_directory
    saved_path = evaluate.save(save_dir, **save_data)
    print(f"Results saved to: {saved_path}")
    # You can also manually save as JSON
    save_json(save_dir / "manual_results.json")
except Exception as e:
    # Fall back to a manual save if evaluate.save fails for any reason
    print(f"evaluate.save encountered an issue (possibly git related): {e}")
    print("Attempting manual JSON save instead.")
    save_json(save_dir / "manual_results_fallback.json")

Output (recorded from the original run, which called evaluate.save(save_directory=save_dir, **save_data)):

Result to save: {'accuracy': 0.5}
evaluate.save encountered an issue (possibly git related): save() missing 1 required positional argument: 'path_or_file'
Attempting manual JSON save instead.
Results manually saved to: evaluation_results/manual_results_fallback.json

Explanation:

We combine the computed result dictionary with other metadata such as hyperparameters.
evaluate.save writes this data to a JSON file. Its first parameter is path_or_file, so the code above passes the directory positionally. The log shows what happens when the directory is passed as save_directory= instead: the call raises a TypeError for the missing path_or_file argument and the fallback runs. That failure is not git related, despite the wording of the fallback message. When run inside a repository, evaluate.save may also try to record git commit information.
Saving the dictionary manually as JSON is provided as a fallback and is usually sufficient.
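Once results are stored as JSON, they can be read back for later analysis. A minimal sketch using only the standard library, assuming each file holds a flat dictionary like `save_data` above. The function name `load_results`, the temporary directory, and the second run's values are illustrative, not taken from the original:

```python
import json
import tempfile
from pathlib import Path


def load_results(directory):
    """Read every *.json file in `directory` into a list of dicts, sorted by filename."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.json"))
    ]


# Write two made-up runs to a temporary directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    runs = [
        {"accuracy": 0.5, "model_name": "my_custom_model", "experiment_id": "run_42"},
        {"accuracy": 0.75, "model_name": "my_custom_model", "experiment_id": "run_43"},
    ]
    for run in runs:
        path = Path(tmp) / f"{run['experiment_id']}.json"
        path.write_text(json.dumps(run, indent=4), encoding="utf-8")

    loaded = load_results(tmp)

# Pick the run with the highest accuracy.
best = max(loaded, key=lambda r: r["accuracy"])
print(best["experiment_id"])  # run_43
```

Storing the metadata alongside the score is what makes this comparison possible: each loaded dictionary carries enough context to tell the runs apart.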

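Choosing the Right Metric

Choosing an appropriate metric is critical. Consider the following:

Task type: Is it classification, translation, summarization, NER, or QA? Use the standard metric for the task (Accuracy/F1 for classification, BLEU/ROUGE for generation, seqeval for NER, SQuAD for QA).
Dataset: Some benchmarks (such as GLUE and SQuAD) have specific associated metrics. Leaderboards (for example, Papers With Code) usually show the metrics commonly used for a given dataset.
Goal: Which aspect of performance matters most?
- Accuracy: overall correctness (suitable for balanced classes).
- Precision/Recall/F1: important for imbalanced classes, or when false positives and false negatives carry different costs.
- BLEU/ROUGE: fluency and content overlap in text generation.
- Perplexity: how well a language model predicts a sample (lower is better; commonly used for generative models).
Metric cards: Read the Hugging Face metric cards (documentation) for detailed explanations, limitations, and appropriate use cases (for example, the BLEU card and the SQuAD card).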
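Two of the points above are easier to see with numbers: accuracy can look good on imbalanced classes while F1 exposes the problem, and perplexity is the exponential of the negative mean log-probability per token. The following is a plain-Python sketch of the definitions, not the library's implementation. The function names and the toy data are illustrative:

```python
import math


def accuracy(references, predictions):
    """Fraction of predictions that equal the reference label."""
    correct = sum(r == p for r, p in zip(references, predictions))
    return correct / len(references)


def precision_recall_f1(references, predictions, positive=1):
    """Precision, recall, and F1 for the `positive` class (0.0 when undefined)."""
    pairs = list(zip(references, predictions))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# 9 negatives, 1 positive; the model predicts "negative" for everything.
references = [0] * 9 + [1]
predictions = [0] * 10

print(accuracy(references, predictions))             # 0.9
print(precision_recall_f1(references, predictions))  # (0.0, 0.0, 0.0)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(round(perplexity([math.log(0.25)] * 5), 6))    # 4.0
```

The classifier scores 90% accuracy without ever finding the positive class, which is why precision, recall, and F1 are the better choice when classes are imbalanced.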

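Summary

The Hugging Face Evaluate library offers a versatile, user-friendly way to evaluate large language models and datasets. It provides standard metrics, dataset measurements, and tools such as Evaluator and EvaluationSuite to streamline the process. By using these tools and choosing metrics that fit your task, you can get a clear picture of your model's strengths and weaknesses.

For more details and advanced usage, consult the official resources:

Hugging Face Evaluate documentation: Quick Tour
GitHub repository: huggingface/evaluate
Kaggle Notebook example: LLM Evaluation Framework (source of some of the examples used here)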
