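The Perplexity Metric for LLM Evaluation

Evaluating language models has always been a challenging task. How do we measure whether a model truly understands language, generates coherent text, or produces accurate responses? Among the metrics developed for this purpose, perplexity is one of the most fundamental and widely used in natural language processing (NLP) and language model (LM) evaluation.

Perplexity has been in use since the early days of statistical language modeling, and it remains relevant in the era of large language models (LLMs). In this article we take a close look at perplexity: what it is, how it works, its mathematical foundations, implementation details, strengths, limitations, and how it compares with other evaluation metrics.

By the end of this article, you will have a thorough understanding of perplexity and be able to implement it yourself to evaluate language models.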

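What Is the Perplexity Metric?

Perplexity measures how well a probability model predicts a sample. For language models, it quantifies how "surprised" or "confused" the model is when it encounters a sequence of text. The lower the perplexity, the better the model predicts the sample text.

Low Perplexity vs. High Perplexity

More intuitively:

Low perplexity: the model is confident and accurate in predicting the next word in the sequence.
High perplexity: the model is uncertain and struggles to predict the next word in the sequence.

Think of perplexity as answering the question: "On average, how many different words could follow each word in this text, according to the model?" A perfect model would assign probability 1 to every correct word, giving a perplexity of 1 (the minimum possible). Real models spread probability across many possible words, which results in a higher perplexity.

Quick check: if a language model assigns equal probability to 10 possible next words at every step, what is its perplexity? (Answer: exactly 10.)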
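
The quick check can be verified with a short derivation. This is a sketch that assumes the correct next word is always one of the 10 equally likely candidates, so every observed word receives probability $1/10$; it uses the inverse-geometric-mean form of perplexity that the article introduces in a later section.

```latex
\mathrm{PPL}(W)
  = \left(\prod_{i=1}^{N} \frac{1}{10}\right)^{-\frac{1}{N}}
  = \left(10^{-N}\right)^{-\frac{1}{N}}
  = 10
```

The result does not depend on the sequence length $N$, which is why perplexity can be read as an effective number of choices per word.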

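How Does Perplexity Work?

Perplexity measures how well a language model predicts a test set. The process involves:

Training a language model on a text corpus
Evaluating the model on unseen data (the test set)
Computing how likely the model considers the test data to be

The basic idea is to use the model to assign a probability to each word in the test sequence, given the words that precede it. These probabilities are then combined into a single perplexity score.

For example, consider the sentence "The cat sat on the mat":

The model computes P("cat" | "The")
Then P("sat" | "The cat")
Then P("on" | "The cat sat")
And so on

Combining these probabilities gives the overall likelihood of the sentence, which is then converted into perplexity.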
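
To make the chain of conditional probabilities concrete, the sketch below lists every term the model would have to score for the example sentence. The function name `conditional_terms` is an illustrative assumption, and splitting on whitespace stands in for a real tokenizer.

```python
def conditional_terms(sentence):
    """Return (word, context) pairs, one per conditional probability in the chain rule."""
    words = sentence.split()
    # Every word is predicted from all of the words that come before it.
    return [(words[i], " ".join(words[:i])) for i in range(len(words))]


for word, context in conditional_terms("The cat sat on the mat"):
    if context:
        print(f'P("{word}" | "{context}")')
    else:
        print(f'P("{word}")')  # the first word has no preceding context
```

Multiplying the six terms gives the probability of the whole sentence under the model.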

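How Is Perplexity Calculated?

Let's dig into the mathematics behind perplexity. For a language model, perplexity is defined as the exponential of the average negative log-likelihood:

$$\mathrm{PPL}(W) = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i \mid w_1, w_2, \ldots, w_{i-1})}$$

where

$W$ is the test sequence $(w_1, w_2, \ldots, w_N)$
$N$ is the number of words in the sequence
$P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ is the conditional probability of word $w_i$ given all the preceding words

Alternatively, if we use the chain rule of probability to express the joint probability of the sequence, we get:

$$\mathrm{PPL}(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}}$$

where $P(w_1, w_2, \ldots, w_N)$ is the joint probability of the whole sequence.

Let's break these formulas down step by step:

Compute the probability of each word given its context (the preceding words)
Take the logarithm of each probability (usually base 2)
Average these log-probabilities over the whole sequence
Negate the average (because log-probabilities are negative)
Finally, raise 2 to that power

The resulting value is the perplexity score.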
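
The five steps above can be sketched as a small function. This is a minimal illustration, not a full language model: it assumes step 1 has already been done and takes the per-word probabilities as input. The name `perplexity_from_probs` and the example probabilities are made up for the demonstration.

```python
import math


def perplexity_from_probs(probs):
    """Perplexity of a sequence, given the probability the model assigned to each word."""
    # Step 1 is assumed done: `probs` holds P(w_i | context) for every word.
    log_probs = [math.log2(p) for p in probs]       # step 2: base-2 logarithm
    avg_log_prob = sum(log_probs) / len(log_probs)  # step 3: average over the sequence
    neg_avg = -avg_log_prob                         # step 4: negate the average
    return 2 ** neg_avg                             # step 5: raise 2 to that power


# Hypothetical per-word probabilities, chosen only for illustration.
example_probs = [0.5, 0.25, 0.125, 0.25]
print(perplexity_from_probs(example_probs))  # → 4.0
```

The log-probabilities here are -1, -2, -3 and -2, so the average is -2 and the perplexity is 2 squared.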

Try it: consider a simple model that assigns the probabilities P("the") = 0.2, P("cat") = 0.1, and P("sat") = 0.05 to "The cat sat". Compute the perplexity of this sequence.
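
One way to work this exercise, assuming the three values are the probabilities the model assigns to the three words in order, is to apply the joint-probability form of the formula with $N = 3$:

```latex
\begin{aligned}
P(W) &= 0.2 \times 0.1 \times 0.05 = 0.001 \\
\mathrm{PPL}(W) &= P(W)^{-\frac{1}{3}} = \left(10^{-3}\right)^{-\frac{1}{3}} = 10
\end{aligned}
```

So on this sequence the model is as uncertain as if it were choosing uniformly among 10 words at every step.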

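Other Ways to Express the Perplexity Metric

1. Perplexity in Terms of Entropy

Perplexity is directly related to the concept of entropy in information theory. If we let $H$ denote the entropy of the probability distribution, measured in bits (in practice, the cross-entropy of the model on the test data), then:

$$\mathrm{PPL} = 2^{H}$$

This relationship highlights that perplexity essentially measures the average uncertainty in predicting the next word in a sequence. The higher the entropy (uncertainty), the higher the perplexity.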
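
The entropy relationship can be checked numerically. The sketch below computes the Shannon entropy of a single next-word distribution and exponentiates it; it assumes the model's distribution matches the true one, so entropy and cross-entropy coincide. The function name and both distributions are illustrative.

```python
import math


def entropy_bits(dist):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


uniform = [1 / 8] * 8                  # eight equally likely next words
peaked = [0.5, 0.25, 0.125, 0.125]     # hypothetical distribution with a clear favorite

for name, dist in [("uniform", uniform), ("peaked", peaked)]:
    h = entropy_bits(dist)
    print(f"{name}: H = {h:.3f} bits, perplexity = {2 ** h:.3f}")
```

The uniform case has 3 bits of entropy and therefore a perplexity of 8, matching the "number of equally likely choices" reading. The peaked distribution has lower entropy and a lower perplexity.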

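2. Perplexity as a Multiplicative Inverse

Another way to understand perplexity is as the inverse of the geometric mean of the word probabilities:

$$\mathrm{PPL}(W) = \left(\prod_{i=1}^{N} P(w_i \mid w_1, w_2, \ldots, w_{i-1})\right)^{-\frac{1}{N}} = \frac{1}{\sqrt[N]{\prod_{i=1}^{N} P(w_i \mid w_1, w_2, \ldots, w_{i-1})}}$$

This formulation emphasizes that perplexity is inversely related to the model's confidence in its predictions. As the model becomes more confident (assigns higher probabilities to the observed words), perplexity decreases.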
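
The geometric-mean form and the log form describe the same quantity. The sketch below computes both for a made-up list of per-word probabilities and checks that they agree; the function names are illustrative assumptions.

```python
import math


def ppl_geometric(probs):
    """Perplexity as the reciprocal of the geometric mean of the word probabilities."""
    geometric_mean = math.prod(probs) ** (1 / len(probs))
    return 1 / geometric_mean


def ppl_log(probs):
    """Perplexity as 2 raised to the average negative log2-probability."""
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))


probs = [0.2, 0.5, 0.1, 0.4]  # hypothetical per-word probabilities
print(ppl_geometric(probs), ppl_log(probs))
assert math.isclose(ppl_geometric(probs), ppl_log(probs))
```

In practice the log form is usually preferred, because the raw product of many small probabilities can underflow to zero on long sequences.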

Implementing the Perplexity Metric from Scratch in Python

Let's implement the perplexity calculation in Python to solidify our understanding:

import numpy as np
from collections import Counter, defaultdict


class NgramLanguageModel:
    def __init__(self, n=2):
        self.n = n
        # Maps a context (tuple of n-1 tokens) to a Counter of the words that follow it
        self.context_counts = defaultdict(Counter)
        # Maps a context to the total number of times it was observed
        self.context_totals = defaultdict(int)

    def train(self, corpus):
        """Train the language model on a corpus"""
        # Add start and end tokens
        tokens = ['<s>'] * (self.n - 1) + corpus + ['</s>']
        # Count n-grams
        for i in range(len(tokens) - self.n + 1):
            context = tuple(tokens[i:i+self.n-1])
            word = tokens[i+self.n-1]
            self.context_counts[context][word] += 1
            self.context_totals[context] += 1

    def probability(self, word, context):
        """Calculate probability of word given context"""
        if self.context_totals[context] == 0:
            return 1e-10  # Tiny floor probability for unseen contexts
        # Add-one smoothing; the number of distinct contexts stands in for the vocabulary size
        return (self.context_counts[context][word] + 1) / (self.context_totals[context] + len(self.context_counts))

    def sequence_probability(self, sequence):
        """Calculate probability of entire sequence"""
        tokens = ['<s>'] * (self.n - 1) + sequence + ['</s>']
        prob = 1.0
        for i in range(len(tokens) - self.n + 1):
            context = tuple(tokens[i:i+self.n-1])
            word = tokens[i+self.n-1]
            prob *= self.probability(word, context)
        return prob

    def perplexity(self, test_sequence):
        """Calculate perplexity of a test sequence"""
        N = len(test_sequence) + 1  # +1 for the end token
        log_prob = 0.0
        tokens = ['<s>'] * (self.n - 1) + test_sequence + ['</s>']
        for i in range(len(tokens) - self.n + 1):
            context = tuple(tokens[i:i+self.n-1])
            word = tokens[i+self.n-1]
            prob = self.probability(word, context)
            log_prob += np.log2(prob)  # Sum log-probabilities to avoid underflow
        return 2 ** (-log_prob / N)


# Let's test our implementation
def tokenize(text):
    """Simple tokenization by splitting on spaces"""
    return text.lower().split()


# Example usage
corpus = tokenize("the cat sat on the mat the dog chased the cat the cat ran away")
test = tokenize("the cat sat on the floor")
model = NgramLanguageModel(n=2)
model.train(corpus)
print(f"Perplexity of test sequence: {model.perplexity(test):.2f}")

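This implementation builds a basic n-gram language model with simple smoothing to handle unseen words and contexts. Let's walk through what the code does:

We define an NgramLanguageModel class that stores counts of contexts and words.
The train method processes the corpus and builds the n-gram statistics.
The probability method computes P(word|context) with basic smoothing.
The sequence_probability method computes the joint probability of a sequence.
Finally, the perplexity method computes perplexity according to our formula.

Output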

Perplexity of test sequence: 129.42

Example and Output

Let's run a complete example using our implementation:

# Training corpus
train_corpus = tokenize("the cat sat on the mat the dog chased the cat the cat ran away")
# Test sequences
test_sequences = [
    tokenize("the cat sat on the mat"),
    tokenize("the dog sat on the floor"),
    tokenize("a bird flew through the window")
]
# Train a bigram model
model = NgramLanguageModel(n=2)
model.train(train_corpus)
# Calculate perplexity for each test sequence
for i, test in enumerate(test_sequences):
    ppl = model.perplexity(test)
    print(f"Test sequence {i+1}: '{' '.join(test)}'")
    print(f"Perplexity: {ppl:.2f}")
    print()

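Output

Test sequence 1: 'the cat sat on the mat'
Perplexity: 6.15

Test sequence 2: 'the dog sat on the floor'
Perplexity: 154.05

Test sequence 3: 'a bird flew through the window'
Perplexity: 28816455.70

Notice how perplexity increases as we move from test sequence 1 (which appears verbatim in the training data) to sequence 3 (which contains many words not seen in training). This illustrates how perplexity reflects the model's uncertainty.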

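Implementing the Perplexity Metric with NLTK

In practice, you may want to use an established library such as NLTK, which provides more sophisticated language model implementations and perplexity calculations: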

import nltk
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import word_tokenize
import math
# Download required resources
# (newer NLTK releases may need nltk.download('punkt_tab') instead)
nltk.download('punkt')
# Prepare the training data
train_text = "The cat sat on the mat. The dog chased the cat. The cat ran away."
train_tokens = [word_tokenize(train_text.lower())]
# Create n-grams and vocabulary
n = 2  # Bigram model
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokens)
# Train the model using Laplace smoothing
model = Laplace(n)  # Laplace (add-1) smoothing to handle unseen words
model.fit(train_data, padded_vocab)
# Test sentence
test_text = "The cat sat on the floor."
test_tokens = word_tokenize(test_text.lower())
# Prepare test data with padding
test_data = list(nltk.ngrams(test_tokens, n, pad_left=True, pad_right=True,
                             left_pad_symbol='<s>', right_pad_symbol='</s>'))
# Compute perplexity manually
log_prob_sum = 0
N = len(test_data)
for ngram in test_data:
    prob = model.score(ngram[-1], ngram[:-1])  # P(w_i | w_{i-1})
    log_prob_sum += math.log2(prob)  # Smoothing keeps prob > 0, so log2 is defined
# Compute final perplexity
perplexity = 2 ** (-log_prob_sum / N)
print(f"Perplexity (Laplace smoothing): {perplexity:.2f}")

Output: Perplexity (Laplace smoothing): 8.33

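In natural language processing (NLP), perplexity measures how well a language model predicts a sequence of words. A lower perplexity score indicates a better model. However, maximum likelihood estimation (MLE) models suffer from the out-of-vocabulary (OOV) problem: they assign zero probability to unseen words, which leads to infinite perplexity.

To address this, we use Laplace smoothing (add-1 smoothing), which assigns a small probability to unseen words and so avoids zero probabilities. The corrected code implements a bigram language model using NLTK's Laplace class instead of MLE. This guarantees a finite perplexity score even when the test sentence contains words that did not appear in training.

This technique is essential for building robust n-gram models for text prediction and speech recognition.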
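
The difference between MLE and Laplace smoothing can be seen without NLTK. The sketch below is a simplified, standard-library-only bigram model: it omits the padding symbols, treats the union of training and test words as a closed vocabulary, and uses illustrative function names. It is not how NLTK implements these estimators.

```python
import math
from collections import Counter

train = "the cat sat on the mat".split()
test = "the cat sat on the floor".split()

bigrams = Counter(zip(train, train[1:]))   # counts of (previous word, word) pairs
contexts = Counter(train[:-1])             # how often each word appears as a context
vocab = set(train) | set(test)             # assumption: closed vocabulary including test words


def mle(word, prev):
    # Unsmoothed relative frequency: zero for any unseen bigram.
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0


def laplace(word, prev):
    # Add-one smoothing: every possible bigram gets a pseudo-count of 1.
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def perplexity(prob_fn, tokens):
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = 0.0
    for prev, word in pairs:
        p = prob_fn(word, prev)
        if p == 0.0:
            return math.inf  # a single zero probability makes perplexity infinite
        log_sum += math.log2(p)
    return 2 ** (-log_sum / len(pairs))


print("MLE:    ", perplexity(mle, test))
print("Laplace:", perplexity(laplace, test))
```

The unseen bigram ("the", "floor") has zero probability under MLE, so the first line prints `inf`, while the smoothed model returns a finite score.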

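Advantages of the Perplexity Metric

As an evaluation metric for language models, perplexity has several advantages:

Interpretability: perplexity has a clear interpretation as the average branching factor of the prediction task.
Model-agnostic: it applies to any probabilistic language model that assigns probabilities to sequences.
No human annotation required: unlike many other evaluation metrics, perplexity does not need human-annotated reference text.
Efficiency: it is cheap to compute, especially compared with metrics that require generation or sampling.
Historical precedent: as one of the oldest metrics in language modeling, perplexity has well-established benchmarks and a rich research history.
Direct comparison: models that share the same vocabulary can be compared directly by their perplexity scores.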

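Limitations of the Perplexity Metric

Although perplexity is widely used, it has several important limitations:

Vocabulary dependence: perplexity scores are only comparable between models that use the same vocabulary.
Misalignment with human judgment: lower perplexity does not always translate into higher quality in human evaluations.
Limited for open-ended generation: perplexity evaluates how well a model predicts a specific text, not how coherent, diverse, or interesting its generated text is.
No semantic understanding: a model can achieve low perplexity by memorizing n-grams without real understanding.
Task-agnostic: perplexity does not measure performance on specific tasks (such as question answering or summarization).
Long-range dependency issues: traditional perplexity implementations struggle to evaluate long-range dependencies in text.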
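
The vocabulary-dependence point can be made concrete with a small calculation. The sketch below assumes two hypothetical models that assign the same total probability to the same sentence but split it into a different number of tokens. The bit count and token counts are made-up numbers, not measurements.

```python
def per_token_perplexity(total_log2_prob, num_tokens):
    """Perplexity normalized by the number of tokens the model's tokenizer produced."""
    return 2 ** (-total_log2_prob / num_tokens)


total_log2_prob = -40.0  # hypothetical: both models give the sentence probability 2**-40

word_level = per_token_perplexity(total_log2_prob, num_tokens=8)      # 8 word tokens
subword_level = per_token_perplexity(total_log2_prob, num_tokens=16)  # 16 subword tokens

print(f"word-level perplexity:    {word_level:.2f}")
print(f"subword-level perplexity: {subword_level:.2f}")
```

Both models are equally good at predicting the sentence as a whole, yet the reported per-token perplexities differ (32.00 versus about 5.66) purely because of the normalization. This is why scores should only be compared across models that share a vocabulary and tokenization.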

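Overcoming the Limitations with LLM-as-a-Judge

To address the limitations of perplexity, researchers have developed alternative evaluation methods, including using a large language model as a judge (LLM-as-a-Judge):

Principle: use a more capable LLM to evaluate the output of another language model.
Implementation:
- Generate text with the model being evaluated
- Give the text, together with the evaluation criteria, to the "judge" LLM
- Have the judge LLM score or rank the generated text
Advantages:
- Can evaluate aspects such as coherence, factuality, and relevance
- Aligns more closely with human judgment
- Can be customized for specific evaluation criteria
Example implementation: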

def llm_as_judge(generated_text, reference_text=None, criteria="coherence and fluency"):
    """Use a large language model to judge generated text"""
    # This is a simplified example - in practice, you'd call an actual LLM API
    if reference_text:
        prompt = f"""
        Please evaluate the following generated text based on {criteria}.
        Reference text: {reference_text}
        Generated text: {generated_text}
        Score from 1-10 and provide reasoning.
        """
    else:
        prompt = f"""
        Please evaluate the following generated text based on {criteria}.
        Generated text: {generated_text}
        Score from 1-10 and provide reasoning.
        """
    # In a real implementation, you would call your LLM API here
    # response = llm_api.generate(prompt)
    # return parse_score(response)
    # For demonstration purposes only:
    import random
    score = random.uniform(1, 10)
    return score

This approach complements perplexity by providing human-like judgments of text quality along multiple dimensions.
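
The commented-out lines in the example above refer to a `parse_score` helper that the article does not define. One hypothetical way to write it is sketched below. It assumes the judge has been instructed to reply with a line of the form "Score: <number>", and the regular expression and range check are illustrative choices rather than part of any LLM API.

```python
import re

# Matches "Score: 7", "score = 7.5", "SCORE: 8/10" and similar.
SCORE_PATTERN = re.compile(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_score(response, low=1.0, high=10.0):
    """Extract a numeric score from a judge's reply, or return None if none is usable."""
    match = SCORE_PATTERN.search(response)
    if match is None:
        return None  # the judge did not follow the expected format
    value = float(match.group(1))
    # Reject numbers outside the requested scale instead of silently clamping them.
    return value if low <= value <= high else None


print(parse_score("Score: 7.5. The text is fluent but repeats itself."))
print(parse_score("I cannot evaluate this text."))
```

Returning `None` for malformed replies lets the caller decide whether to retry the judge or drop the sample, rather than recording a misleading score.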

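Practical Applications

Perplexity is used in a variety of NLP tasks:

Language model evaluation: comparing different LM architectures or hyperparameter settings.
Domain adaptation: measuring how well a model adapts to a specific domain.
Out-of-distribution detection: identifying text that does not match the training distribution.
Data quality assessment: evaluating the quality of training or test data.
Text generation filtering: using perplexity to filter out low-quality generated text.
Anomaly detection: identifying unusual or anomalous text patterns.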
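
As an illustration of the filtering use case, the sketch below keeps only texts whose perplexity falls at or below a threshold. The function name, the threshold, and the scores are all hypothetical: a real pipeline would pass in a scorer backed by a trained language model, such as the `perplexity` method of the n-gram class earlier in the article.

```python
def filter_by_perplexity(texts, perplexity_fn, max_perplexity):
    """Keep only the texts whose perplexity is at or below the threshold."""
    kept = []
    for text in texts:
        if perplexity_fn(text) <= max_perplexity:
            kept.append(text)
    return kept


# Stand-in scores for demonstration only; these numbers are made up.
fake_scores = {
    "the cat sat on the mat": 6.2,
    "mat the on sat cat the": 950.0,
    "the dog chased the cat": 12.8,
}

kept = filter_by_perplexity(fake_scores, fake_scores.get, max_perplexity=50.0)
print(kept)
```

The same pattern, with the comparison reversed, flags unusually high-perplexity text for out-of-distribution or anomaly detection. A suitable threshold depends on the model and the domain and usually has to be tuned on held-out data.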

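Comparison with Other LLM Evaluation Metrics

Let's compare perplexity with other popular language model evaluation metrics:

Metric	What it measures	Strengths	Limitations
Perplexity	Predictive accuracy	No references needed; efficient	Vocabulary-dependent; not aligned with human judgment
BLEU	N-gram overlap with reference text	Well suited to translation and summarization	Requires references; a poor fit for creative text
ROUGE	Recall of n-grams from reference text	Well suited to summarization	Requires references; focuses on overlap
BERTScore	Semantic similarity using contextual embeddings	Better semantic understanding	Computationally intensive
Human Evaluation	Various aspects, judged by humans	Most reliable measure of quality	Expensive, time-consuming, subjective
LLM-as-Judge	Various aspects, judged by an LLM	Flexible and scalable	Depends on the quality of the judge model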

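To choose the right metric, consider:

Task: which aspect of language generation are you evaluating?
Reference availability: do you have reference texts?
Computational resources: how efficient does the evaluation need to be?
Interpretability: how important is it to understand the metric?

A hybrid approach often works best: use perplexity for its efficiency and combine it with other metrics for a comprehensive evaluation.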

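Conclusion

Perplexity has long been a key metric for evaluating language models, offering a clear, information-theoretic measure of how well a model predicts text. Despite its limitations, such as weak alignment with human judgment, it remains very useful when combined with newer methods such as reference-based scores, embedding similarity, and LLM-based evaluation.

As models become more advanced, evaluation will likely shift toward hybrid approaches that combine the efficiency of perplexity with metrics that align more closely with human judgment.

Bottom line: treat perplexity as one signal among many, and be aware of both its strengths and its blind spots.

A challenge for you: try computing perplexity on your own text corpus! Using the code in this article as a starting point, experiment with different n-gram sizes, smoothing techniques, and test sets. How do these changes affect the perplexity score?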
