4 LLM Compression Techniques to Make Models Smaller and Faster

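LLMs from companies such as Google and OpenAI have shown remarkable capabilities, but that power comes at a cost. These massive models are slow, expensive to run, and hard to deploy on everyday devices. This is where LLM compression techniques come in. They shrink a model so that it runs faster and is easier to deploy, without a large drop in quality. This guide covers four key techniques: model quantization, model pruning, knowledge distillation, and low-rank adaptation (LoRA), with practical code examples.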

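Why Do We Need LLM Compression?

Before diving into the "how", let's look at the "why". Compressing LLMs has clear benefits that make them more practical in real-world applications.

Smaller model size: Smaller models need less storage, which makes them easier to host and distribute.
Faster inference: Compact models can generate responses more quickly, which improves the user experience in applications such as chatbots.
Lower cost: A smaller, faster model needs less memory and compute, which reduces cloud and energy costs.
Better accessibility: Compression lets powerful models run on resource-constrained devices such as smartphones and laptops.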
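
To make the size argument concrete, here is a minimal sketch that estimates how much memory the weights alone occupy at different precisions. The 7-billion-parameter count is a hypothetical example, and the estimate ignores activations, the KV cache, and any quantization metadata.

```python
# Rough weight-only memory estimate at several precisions.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gib(num_params: int, bits: int) -> float:
    """Return the memory needed to store num_params weights, in GiB."""
    return num_params * bits / 8 / 1024**3

num_params = 7_000_000_000  # hypothetical 7B-parameter model (assumption)
estimates = {
    name: weight_memory_gib(num_params, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}
for name, gib in estimates.items():
    print(f"{name}: {gib:.1f} GiB")
```

Halving the number of bits per weight halves the weight storage, which is why 4-bit weights take roughly one eighth of the space of FP32 weights.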

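Technique 1: Quantization - Doing More with Less

Model quantization is one of the most popular and effective LLM compression techniques. It works by lowering the precision of the numbers (weights) that make up the model. Think of it as saving a high-resolution photo as a compressed JPEG: a little detail is lost, but the file becomes much smaller. Model weights are typically stored as 32-bit or 16-bit floating-point numbers (FP32, FP16, or BF16). Quantization converts them to smaller 8-bit integers (INT8) or even 4-bit values.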

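Figure: Model quantization (Source: Maartengrootendorst)

The figure illustrates the quantization process: continuous, high-precision FP32 (32-bit floating-point) values are mapped onto a limited set of discrete, low-precision INT4 (4-bit integer) values. In essence, it shows how a wide range of floating-point numbers is approximated by a small, fixed number of integer levels to cut memory and compute, at the cost of some precision.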
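
The mapping described above can be sketched with symmetric (absmax) quantization, one of the simplest schemes. This is an illustration only: the helper names and sample weights are made up for the example, and libraries such as bitsandbytes use more elaborate block-wise formats rather than this exact method.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric quantization: scale floats into a signed integer range."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the valid range
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.98, -0.55]  # made-up sample weights
q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)

# The rounding error is bounded by half a quantization step (scale / 2)
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
print("8-bit levels:", q8, f"max error {err8:.4f}")
print("4-bit levels:", q4, f"max error {err4:.4f}")
```

With fewer bits there are fewer levels, so each level covers a wider range of floats and the reconstruction error grows. That is the precision loss the figure refers to.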

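Hands-On: 4-Bit Quantization with Hugging Face

Let's quantize a model with the Hugging Face Transformers and bitsandbytes libraries. This example loads a model in 4-bit precision, which significantly reduces its memory footprint.

Step 1: Install the libraries

First, make sure the required libraries are installed.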

!pip install transformers torch accelerate bitsandbytes -q

Step 2: Load and compare the models

We will load a standard model and its quantized version to see the difference.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# We use a small, well-known model for this demonstration
model_id = "gpt2"
print(f"Loading tokenizer for model: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("\n-----------------------------------")
print("Loading original model in FP32...")
# Load the original model in full precision (float32)
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)
# get_memory_footprint() returns the size of parameters and buffers in bytes
mem_fp32 = model_fp32.get_memory_footprint()
print(f"Original model memory footprint: {mem_fp32 / 1024**2:.2f} MB")

print("\n-----------------------------------")
print("Loading model with 4-bit quantization...")
# Note: 4-bit loading through bitsandbytes generally requires a CUDA-capable GPU
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # Let accelerate place the weights on the available device(s)
)
mem_4bit = model_4bit.get_memory_footprint()
print(f"4-bit quantized model memory footprint: {mem_4bit / 1024**2:.2f} MB")
print(f"Reduction: {mem_fp32 / mem_4bit:.1f}x")

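Output:

Figure: Output of 4-bit quantization with Hugging Face

The memory footprint drops substantially. Output quality usually degrades only modestly, but the effect depends on the model and the task, so it is worth evaluating on your own workload.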

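Technique 2: Pruning - Trimming Unused Connections

Model pruning removes the parts of a neural network that contribute least to its output, much like trimming a plant to encourage healthier growth. You can remove individual weights (unstructured pruning) or whole groups of neurons (structured pruning). Pruning is powerful, but it is hard to get right.

Unstructured pruning, for example, removes individual weights based on their magnitude, producing a sparse model. The model has fewer nonzero weights, but hardware often struggles to exploit irregular sparsity. Structured pruning removes entire blocks, such as neurons or layers, which is usually more hardware-friendly.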
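
The two styles can be contrasted in a minimal, dependency-free sketch. The function names and the small weight matrix are hypothetical, chosen only for illustration; real pruning pipelines typically work on framework tensors and fine-tune the model afterwards to recover accuracy.

```python
def magnitude_prune(matrix, sparsity):
    """Unstructured pruning: zero out the smallest-magnitude weights.

    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be removed.
    """
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def prune_rows(matrix, keep):
    """Structured pruning: keep only the rows (neurons) with the largest L2 norm."""
    norms = [sum(w * w for w in row) ** 0.5 for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    keep_idx = sorted(ranked[:keep])  # preserve the original row order
    return [matrix[i][:] for i in keep_idx]

W = [
    [0.9, -0.05, 0.4],
    [0.02, 0.01, -0.03],
    [-0.7, 0.6, 0.08],
]
sparse_W = magnitude_prune(W, sparsity=0.5)  # same shape, many zeros
small_W = prune_rows(W, keep=2)              # smaller dense matrix
print(sparse_W)
print(small_W)
```

The unstructured result keeps the original shape and only helps if the runtime can skip zeros, while the structured result is a genuinely smaller dense matrix that any hardware can multiply faster.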

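Figure: Pruning - trimming unused connections (Source: Springer)

The figure shows different strategies for pruning components such as a vision transformer (ViT) and a large language model (LLM), using "pruning layers" to reduce model size and improve efficiency. Specifically, (a) shows pruning inside the vision encoder; (b) focuses on pruning inside the LLM; and (c) introduces an "instruction-guided component" that dynamically prunes visual tokens based on the text instruction, improving efficiency on tasks such as video understanding.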

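Technique 3: Knowledge Distillation - The Student-Teacher Approach

In knowledge distillation, a large, accurate "teacher" model is used to train a smaller "student" model. The student learns to mimic the teacher's output probability distribution, not just the final answer. This often lets the smaller model perform noticeably better than it would if it were trained on the data alone.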

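Figure: Knowledge distillation - the student-teacher approach (Source: Britannica)

The figure shows three knowledge distillation schemes in machine learning: offline, online, and self-distillation. Offline distillation uses a pre-trained teacher to train the student; online distillation trains both at the same time; and self-distillation uses a single model as both teacher and student (for example, deeper layers teach shallower ones). The orange teacher model is pre-trained, while the blue student models (including the combined teacher/student in self-distillation) are "to be trained".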
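
A common way to express the offline case is a loss that mixes the ordinary cross-entropy on the true label with a KL-divergence term between temperature-softened teacher and student distributions. The sketch below uses plain Python for a single example; the logits, the temperature of 2.0, and the 0.5 weighting are illustrative assumptions, not values from a specific recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    # Hard loss: cross-entropy of the student against the true label
    hard = -math.log(softmax(student_logits)[label])
    # Soft loss: how far the student's softened distribution is from the teacher's
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    ) * temperature ** 2  # rescale so both terms have comparable gradients
    return alpha * hard + (1 - alpha) * soft

teacher_logits = [4.0, 1.5, 0.2]  # made-up values
student_logits = [2.5, 1.0, 0.5]  # made-up values
loss = distillation_loss(student_logits, teacher_logits, label=0)
matched = distillation_loss(teacher_logits, teacher_logits, label=0)
print(f"student vs teacher: {loss:.4f}")
print(f"perfectly matched:  {matched:.4f}")
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted cross-entropy remains. A higher temperature flattens both distributions, exposing more of the teacher's relative preferences among wrong answers.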

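Hands-On: Conceptual Distillation with Hugging Face

Implementing a full distillation pipeline takes some effort, but the core idea can be understood through the Hugging Face Trainer API.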

import torch
import torch.nn.functional as F
from transformers import TrainingArguments, Trainer

# This is a conceptual example to illustrate the process.
# To run it, you need three objects that are not defined here:
# 1. 'teacher_model': a large, pre-trained model sharing the student's vocabulary,
#    placed on the same device as the training inputs.
# 2. 'student_model': a smaller model to be trained.
# 3. 'your_dataset': a tokenized training dataset that includes labels.

# Define training arguments
training_args = TrainingArguments(
    output_dir="./student_model_distilled",
    num_train_epochs=1,  # Example value
    per_device_train_batch_size=8,  # Example value
    # ... other training arguments
)

# Create a custom Trainer to modify the loss function
class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model, temperature=2.0, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher_model = teacher_model.eval()  # the teacher stays frozen
        self.temperature = temperature
        self.alpha = alpha

    # **kwargs absorbs extra arguments that newer Trainer versions pass in
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # a) The student's standard loss on the data (cross-entropy)
        student_outputs = model(**inputs)
        student_loss = student_outputs.loss
        # b) The distillation loss: how well the student's output
        #    distribution matches the teacher's
        with torch.no_grad():
            teacher_outputs = self.teacher_model(**inputs)
        t = self.temperature
        vocab = student_outputs.logits.size(-1)
        distillation_loss = F.kl_div(
            F.log_softmax(student_outputs.logits.reshape(-1, vocab) / t, dim=-1),
            F.softmax(teacher_outputs.logits.reshape(-1, vocab) / t, dim=-1),
            reduction="batchmean",
        ) * (t ** 2)
        # The final loss is a weighted average of the two components
        loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss
        return (loss, student_outputs) if return_outputs else loss

# Example usage (requires the three objects listed above):
# trainer = DistillationTrainer(
#     model=student_model,
#     args=training_args,
#     train_dataset=your_dataset,
#     teacher_model=teacher_model,
# )
# trainer.train()

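This process effectively transfers "knowledge" from the large model to the smaller one.

Technique 4: Low-Rank Adaptation (LoRA) - Efficient Fine-Tuning

Low-rank adaptation (LoRA) does not shrink the base model. Instead, it compresses the changes made during fine-tuning. Rather than retraining billions of parameters, LoRA freezes the original model and injects small trainable "adapter" layers. Because the adapters are tiny, fine-tuning is faster, and the resulting fine-tuned variants are cheaper to store and to swap in and out.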

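Figure: Low-rank adaptation (LoRA) - efficient fine-tuning (Source: IBM)

The figure explains how LoRA fine-tunes a model efficiently. During training, a small, trainable low-rank update (BA) is added alongside the frozen pre-trained weights (W). After training, the low-rank matrices can be merged into the original weights, producing a specialized model (W + BA) with no extra inference latency or memory footprint at deployment. Compared with full fine-tuning, this greatly reduces compute and storage requirements.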
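
The savings and the merge step can be checked with a small sketch. The 768 by 768 layer size is a hypothetical example, and the tiny matrices are made up so that the arithmetic can be followed by hand. The `scale` argument stands in for the scaling factor that implementations usually apply to the update.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_param_counts(d_out, d_in, r):
    """Parameters in a full weight update vs. a rank-r LoRA update."""
    full = d_out * d_in        # every entry of W is trainable
    lora = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    return full, lora

def merge(W, B, A, scale=1.0):
    """Fold the low-rank update into the frozen weights: W + scale * (B @ A)."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

full, lora = lora_param_counts(768, 768, r=8)  # hypothetical layer size
print(f"full update: {full:,} params, LoRA update: {lora:,} params")

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
B = [[1.0], [0.0], [2.0]]   # d_out x r, with r = 1
A = [[0.5, 0.0, -0.5]]      # r x d_in
merged = merge(W, B, A)
print(merged)
```

For this layer the rank-8 update trains about 2% of the parameters a full update would, and the merged matrix has exactly the same shape as W, which is why merging adds no inference cost.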

Hands-On: Fine-Tuning with LoRA and PEFT

The Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library makes LoRA easy to apply.

Step 1: Install the library

!pip install peft -q

Step 2: Apply LoRA and compare parameter counts

from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model_id = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_id)

# Count parameters before wrapping: get_peft_model modifies the base model
# in place, so counting afterwards would also include the adapter weights
total_params = sum(p.numel() for p in model.parameters())

# Define the LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # Specify the task type
    r=8,  # Rank of the update matrices. Lower rank means fewer parameters.
    lora_alpha=32,  # Scaling factor for the learned update
    lora_dropout=0.1,  # Dropout probability for LoRA layers
    target_modules=["c_attn"],  # Apply LoRA to GPT-2's fused attention projection
    fan_in_fan_out=True,  # GPT-2's c_attn is a Conv1D layer with transposed weights
)
# Wrap the base model with the LoRA adapters
lora_model = get_peft_model(model, lora_config)

print("--- Original Model ---")
print(f"Total parameters: {total_params:,}")
print("\n--- LoRA Adapted Model ---")
# PeftModel provides the print_trainable_parameters method
lora_model.print_trainable_parameters()
print("\nNote how LoRA reduces trainable parameters by over 99%!")
print("This makes fine-tuning much more efficient.")

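Output:

Figure: Output of fine-tuning with LoRA and PEFT

The output shows a large drop in the number of parameters that need to be trained and stored (typically more than 99%). This makes it practical to fine-tune and manage many task-specific versions of a model without storing a full copy of the model for each one.

You can find the complete Colab notebook here: Colab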

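Summary

Large language models are here to stay, but their sheer size poses real challenges. LLM compression techniques are key to unlocking their potential across a wider range of applications. Whether it is straightforward quantization, surgically precise pruning, the guided learning of knowledge distillation, or efficient low-rank adaptation (LoRA), each method makes AI more practical. The right technique depends on your specific needs, and combining several of them often gives the best results.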
