4 LLM Compression Techniques to Make Models Smaller and Faster

Figure: LLM compression techniques

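LLMs from companies such as Google and OpenAI have shown remarkable capabilities. But that power comes at a cost. These huge models are slow, expensive to run, and hard to deploy on everyday devices. This is where LLM compression techniques come in. They shrink models so that they run faster and are easier to access, without a large drop in performance. This guide covers four key techniques: model quantization, model pruning, knowledge distillation for LLMs, and Low-Rank Adaptation (LoRA), with practical code examples.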

Why Do We Need LLM Compression?

Before getting into the "how", let's look at the "why". Compressing LLMs has clear benefits that make them more practical in real applications.

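Smaller model size: Smaller models need less storage, which makes them easier to host and distribute.
Faster inference: Compact models can generate responses more quickly. This improves the user experience in applications such as chatbots.
Lower cost: A smaller, faster model needs less memory and processing power. This reduces cloud computing and energy costs.
Better accessibility: Compression lets powerful models run on resource-constrained devices such as smartphones and laptops.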

Technique 1: Quantization - Doing More with Less

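Model quantization is one of the most popular and effective LLM compression techniques. It works by reducing the precision of the numbers (weights) that make up the model. Think of it as saving a high-resolution photo as a compressed JPEG: a little detail is lost, but the file becomes much smaller. Weights are traditionally stored as 32-bit floating-point numbers (FP32), and many recent LLMs ship in 16-bit formats. Quantization converts these values to smaller 8-bit integers (INT8) or even 4-bit integers.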
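The size reduction follows directly from the bit width. The sketch below estimates the storage needed for the weights alone at several precisions. The 7-billion-parameter count is a hypothetical figure chosen for illustration, and the helper name `weight_memory_mib` is not from any library.

```python
def weight_memory_mib(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in MiB."""
    return num_params * bits_per_weight / 8 / 1024**2

# Hypothetical model size, used only to illustrate the arithmetic.
NUM_PARAMS = 7_000_000_000

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = weight_memory_mib(NUM_PARAMS, bits) / 1024
    print(f"{name}: {gib:.1f} GiB")
```

Real footprints are somewhat larger than this estimate, because quantized formats also store scale factors and some layers are usually kept in higher precision.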

Figure: Model quantization

Source: Maarten Grootendorst

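This figure illustrates the quantization process: continuous, high-precision FP32 (32-bit floating-point) values are mapped onto a limited set of discrete, low-precision INT4 (4-bit integer) values. In essence, it shows how a large range of floating-point numbers is approximated by a small, fixed number of integer levels to reduce memory and computation, at the price of some loss in precision.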
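The mapping can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric "absmax" quantization with a single scale for the whole list. It is an assumption made for clarity, not the scheme bitsandbytes uses, and production libraries typically apply block-wise scales and other refinements. The function names and the sample weights are hypothetical.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit values
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.7, -0.3, 0.12, 0.0, -0.7]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)

print(codes)  # [7, -3, 1, 0, -7]
print([round(abs(w - r), 3) for w, r in zip(weights, restored)])  # per-weight error
```

The rounding error for each weight is at most half a step (`scale / 2`), which is the precision loss the figure describes.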

Hands-On: 4-Bit Quantization with Hugging Face

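Let's quantize a model using Hugging Face's Transformers and bitsandbytes libraries. This example shows how to load a model in 4-bit precision, which significantly reduces its memory footprint.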

Step 1: Install the libraries

First, make sure the required libraries are installed.

!pip install transformers torch accelerate bitsandbytes -q

Step 2: Load and compare the models

We will load a standard model and its quantized version to see the difference.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# We use a smaller, well-known model for this demonstration
model_id = "gpt2"
print(f"Loading tokenizer for model: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id)
print("\n-----------------------------------")
print("Loading original model in FP32...")
# Load the original model in full precision (Float32)
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)
# Check the memory footprint of the original model
print("\nOriginal model memory footprint:")
# Calculate memory footprint manually
mem_fp32 = sum(p.numel() * p.element_size() for p in model_fp32.parameters())
print(f"{mem_fp32 / 1024**2:.2f} MB")
print("\n-----------------------------------")
print("Loading model with 4-bit quantization...")
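# Load the same model with 4-bit quantization enabled.
# BitsAndBytesConfig is the supported way to request this; bitsandbytes
# 4-bit loading generally expects a CUDA-capable GPU.
from transformers import BitsAndBytesConfig
model_4bit = AutoModelForCausalLM.from_pretrained(
   model_id,
   quantization_config=BitsAndBytesConfig(load_in_4bit=True),
   device_map="auto" # Places the model on the available GPU
)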
# Check the memory footprint of the 4-bit model
print("\n4-bit quantized model memory footprint:")
# Calculate memory footprint manually
mem_4bit = sum(p.numel() * p.element_size() for p in model_4bit.parameters())
print(f"{mem_4bit / 1024**2:.2f} MB")
print("\nNotice the significant reduction in memory usage!")

Output:

Figure: Output of the 4-bit quantization example

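You will see the memory footprint drop substantially. Output quality usually stays close to the original on many tasks, but lower precision can cost some accuracy, so it is worth checking on your own workload.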

Technique 2: Pruning - Trimming Unused Connections

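Model pruning works by removing the parts of a neural network that contribute least to its output. It is like pruning a plant to encourage healthier growth. You can remove individual weights (unstructured pruning) or whole groups of neurons (structured pruning). Pruning is powerful, but implementing it correctly is complex.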

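For example, unstructured pruning removes individual weights based on their magnitude, producing a sparse model. This reduces the number of nonzero weights, but hardware may struggle to exploit the sparse structure. Structured pruning removes entire blocks, such as neurons or layers, which is usually more hardware-friendly.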
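The two styles can be contrasted on a tiny weight matrix. The sketch below is a simplified illustration in plain Python. It assumes each row of the matrix is one neuron and ranks neurons by L2 norm, which is one common criterion among several. The function names and values are hypothetical.

```python
import math

def magnitude_prune(matrix, sparsity):
    """Unstructured: zero out the smallest-magnitude fraction of individual weights."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)            # number of weights to remove
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]                  # ties at the threshold are also removed
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def prune_neurons(matrix, keep):
    """Structured: keep only the `keep` rows (neurons) with the largest L2 norm."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    return [matrix[i][:] for i in sorted(ranked[:keep])]

W = [
    [0.9, -0.05, 0.4],
    [0.02, 0.01, -0.03],
    [-0.6, 0.7, 0.08],
]

sparse_W = magnitude_prune(W, sparsity=0.5)  # same shape, about half the weights zeroed
small_W = prune_neurons(W, keep=2)           # smaller dense matrix, one neuron removed

print(sparse_W)
print(small_W)
```

The unstructured result keeps its 3x3 shape and only saves space or time if the runtime can skip zeros. The structured result is a genuinely smaller dense matrix, which is why it tends to map better onto standard hardware.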

Figure: Pruning - trimming unused connections

Source: Springer

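The figure shows different strategies for pruning components such as vision transformers (ViT) and large language models (LLMs), which use "pruning layers" to shrink the model and improve efficiency. Specifically, (a) shows pruning in the visual encoder; (b) focuses on pruning inside the LLM; and (c) introduces an "instruction-guided component" that dynamically prunes visual tokens based on text instructions, improving efficiency on tasks such as video understanding.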

Technique 3: Knowledge Distillation - The Student-Teacher Approach

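Knowledge distillation for LLMs is a fascinating process. A large, highly accurate "teacher" model trains a smaller "student" model. The student learns to mimic the teacher's reasoning (its output probabilities), not just its final answers. This often lets the smaller model perform much better than it would if trained on the data alone.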
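To make "mimicking the output probabilities" concrete, here is a minimal sketch of a distillation loss for a single prediction, written in plain Python. It assumes the common formulation that mixes a hard-label cross-entropy term with a KL-divergence term between temperature-softened distributions. The `temperature` and `alpha` defaults and the example logits are illustrative assumptions, not values from the article.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the true class.
    hard = -math.log(softmax(student_logits)[label])
    # Soft term: how far the student's softened distribution is from the teacher's.
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    # The temperature**2 factor keeps the soft term's scale comparable across temperatures.
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft

teacher = [4.0, 1.5, 0.2]
close_student = [3.5, 1.4, 0.3]
far_student = [0.1, 0.2, 3.0]

print(distillation_loss(close_student, teacher, label=0))
print(distillation_loss(far_student, teacher, label=0))
```

A student whose distribution is close to the teacher's gets a lower loss than one that disagrees, even before the hard-label term is taken into account.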

Figure: Knowledge distillation - the student-teacher approach

Source: Britannica

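This figure shows three knowledge distillation schemes in machine learning: offline, online, and self-distillation. Offline distillation uses a pre-trained teacher to train the student, online distillation trains both at the same time, and self-distillation uses a single model as both teacher and student (for example, deeper layers teaching shallower ones). The orange teacher model is pre-trained, while the blue student models (including the combined teacher/student in self-distillation) are "to be trained".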

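Hands-On: Conceptual Distillation with Hugging Face

Implementing a full distillation pipeline takes some time, but the core idea can be understood through the Hugging Face Trainer API.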

from transformers import TrainingArguments, Trainer
# This is a conceptual example to illustrate the process.
# To run this, you would need:
# 1. A defined 'teacher_model' (a large, pre-trained model).
# 2. A defined 'student_model' (a smaller model to be trained).
# 3. A 'your_dataset' object for training.
# Define Training Arguments
training_args = TrainingArguments(
   output_dir="./student_model_distilled",
   num_train_epochs=1, # Example value
   per_device_train_batch_size=8, # Example value
   # ... other training arguments
)
# Create a custom Trainer to modify the loss function.
# Sketch only: temperature and alpha are illustrative values, the teacher and
# student are assumed to share a tokenizer and device, and padding positions
# are not masked out of the distillation term.
import torch
import torch.nn.functional as F
class DistillationTrainer(Trainer):
   def __init__(self, *args, teacher_model, temperature=2.0, alpha=0.5, **kwargs):
       super().__init__(*args, **kwargs)
       self.teacher_model = teacher_model.eval()
       self.temperature = temperature
       self.alpha = alpha
   def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
       # **kwargs absorbs extra arguments that newer Trainer versions pass.
       # The loss is a weighted average of two components:
       #   a) The student's standard loss on the data (cross-entropy; needs labels in inputs).
       student_outputs = model(**inputs)
       student_loss = student_outputs.loss
       # The teacher is frozen, so no gradients are needed.
       with torch.no_grad():
           teacher_outputs = self.teacher_model(**inputs)
       #   b) The distillation loss, which measures how well the student's
       #      output distribution matches the teacher's.
       vocab_size = student_outputs.logits.size(-1)
       student_log_probs = F.log_softmax(student_outputs.logits / self.temperature, dim=-1)
       teacher_probs = F.softmax(teacher_outputs.logits / self.temperature, dim=-1)
       distillation_loss = F.kl_div(
           student_log_probs.reshape(-1, vocab_size),
           teacher_probs.reshape(-1, vocab_size),
           reduction="batchmean",
       ) * self.temperature ** 2
       combined_loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss
       return (combined_loss, student_outputs) if return_outputs else combined_loss
print("The DistillationTrainer class is defined.")
print("Running it still requires a teacher model, a student model, and a dataset.")

This process effectively transfers "knowledge" from the large model to the smaller one.

Technique 4: Low-Rank Adaptation (LoRA) - Efficient Fine-Tuning

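Low-Rank Adaptation (LoRA) is not a way to shrink the base model. Instead, it compresses the changes made during fine-tuning. Rather than retraining billions of parameters, LoRA freezes the original model and injects small, trainable "adapter" layers. These adapters are far smaller than the model, which makes fine-tuning faster and makes the resulting fine-tuned variants cheaper to store and switch between.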
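The saving comes from replacing a full weight update with two thin matrices. As a rough sketch, the snippet below compares the parameter count of one full weight matrix with that of a rank-r adapter pair. The dimensions are those of GPT-2 small's fused attention projection (768 inputs, 2304 outputs), biases are ignored, and the helper name is hypothetical.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (parameters in the full matrix, parameters in the LoRA pair A and B)."""
    full = d_in * d_out              # every entry of W would be trainable
    lora = rank * (d_in + d_out)     # A is rank x d_in, B is d_out x rank
    return full, lora

full, lora = lora_param_counts(d_in=768, d_out=2304, rank=8)
print(f"Full matrix: {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({100 * lora / full:.2f}% of the full matrix)")
```

Because the adapter grows with `rank * (d_in + d_out)` rather than `d_in * d_out`, the relative saving gets larger as the layers get wider.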

Figure: Low-Rank Adaptation (LoRA) - efficient fine-tuning

Source: IBM

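The figure explains how LoRA fine-tunes a model efficiently: during training, a small, trainable low-rank update (BA) is added to the frozen pre-trained weights (W). After training, the low-rank update is merged into the original weights, producing a specialized model (W + BA) without adding inference latency or memory use at deployment. Compared with full fine-tuning, this greatly reduces compute and storage requirements.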
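The merge step can be checked numerically on a toy example. The sketch below uses plain Python lists as matrices and assumes the usual LoRA convention of scaling the update by alpha / r. All matrix values and helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    return [[x * s for x in row] for row in a]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weights (2 x 2)
B = [[0.5], [-1.0]]            # trainable, 2 x r with r = 1
A = [[2.0, 0.0]]               # trainable, r x 2
alpha, r = 2.0, 1
x = [[1.0], [1.0]]             # one input column vector

# During training: frozen path plus the low-rank path, computed separately.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# After training: fold the update into W once, then use a single matrix.
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)

print(unmerged)  # [[5.0], [3.0]]
print(merged)    # [[5.0], [3.0]]
```

Both paths give the same output, which is why a merged model runs at the same cost as the original: the adapter no longer exists as a separate computation.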

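Hands-On: Fine-Tuning with LoRA and PEFT

The Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library makes it straightforward to apply LoRA.

Step 1: Install the library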

!pip install peft -q

Step 2: Apply LoRA and compare parameter counts

from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM
model_id = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_id)
# Define the LoRA configuration
lora_config = LoraConfig(
   task_type=TaskType.CAUSAL_LM, # Specify the task type
   r=8,  # Rank of the update matrices. Lower rank means fewer parameters.
   lora_alpha=32, # A scaling factor for the learned weights.
   lora_dropout=0.1, # Dropout probability for LoRA layers.
   target_modules=["c_attn"], # Apply LoRA to the attention layers of GPT-2.
   fan_in_fan_out=True # GPT-2's Conv1D layers store weights as (fan_in, fan_out).
)
# Count the original parameters first: get_peft_model injects the adapter
# layers into `model` in place, so counting afterwards would include them.
total_params = sum(p.numel() for p in model.parameters())
# Wrap the base model with the LoRA adapters
lora_model = get_peft_model(model, lora_config)
print("--- Original Model ---")
print(f"Total parameters: {total_params:,}")
print("\n--- LoRA Adapted Model ---")
# The PeftModel object has the print_trainable_parameters method
lora_model.print_trainable_parameters()
print("\nNote how LoRA reduces trainable parameters by over 99%!")
print("This makes fine-tuning much more efficient.")

Output:

Figure: Output of the LoRA and PEFT fine-tuning example

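The output shows a large reduction in the number of parameters that need to be trained and stored (often more than 99%). This lets us fine-tune and manage many versions of a model for different tasks without storing a huge model file for each one.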
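A back-of-the-envelope calculation shows what this means for storage. The sketch below assumes FP32 storage (4 bytes per parameter), GPT-2 small's 124,439,808 parameters, and the rank-8 adapter on `c_attn` across its 12 layers from the example above. The number of tasks is a hypothetical value.

```python
def storage_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Storage for a set of parameters, in MiB, assuming FP32 by default."""
    return num_params * bytes_per_param / 1024**2

BASE_PARAMS = 124_439_808               # GPT-2 small
ADAPTER_PARAMS = 12 * 8 * (768 + 2304)  # rank-8 LoRA on c_attn in each of 12 layers
NUM_TASKS = 5                           # hypothetical number of fine-tuned variants

full_copies = NUM_TASKS * storage_mib(BASE_PARAMS)
base_plus_adapters = storage_mib(BASE_PARAMS) + NUM_TASKS * storage_mib(ADAPTER_PARAMS)

print(f"{NUM_TASKS} fully fine-tuned copies: {full_copies:.0f} MiB")
print(f"One base model plus {NUM_TASKS} adapters: {base_plus_adapters:.0f} MiB")
```

With one shared base model, each additional task costs only the size of its adapter.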

You can find the complete Colab notebook here: Colab

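Summary

Large language models (LLMs) are here to stay, but their sheer size poses real challenges. LLM compression techniques are the key to unlocking their potential across a wider range of applications. Whether it is the simplicity of model quantization, the surgical precision of model pruning, the clever mentoring of knowledge distillation, or the efficiency of Low-Rank Adaptation (LoRA), these methods all make AI more practical. The right technique depends on your specific needs, but combining them often gives the best results.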
