A Practical Guide to Microsoft Phi-4 Multimodal

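Microsoft has expanded its Phi-4 family with Phi-4-mini-instruct (3.8B) and Phi-4-multimodal (5.6B), which complement the previously released Phi-4 (14B), a model known for its advanced reasoning. The new models improve multilingual support, reasoning, and math skills, and they introduce multimodal capabilities.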

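Phi-4-multimodal is a lightweight open model that integrates text, vision, and speech processing, so a single model can work across these data formats. With a 128K-token context length and 5.6B parameters, it is designed for on-device execution and low-latency inference.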

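In this article, we take a close look at Phi-4-multimodal, a multimodal small language model (SLM) that accepts text, vision, and audio inputs. We also walk through a hands-on implementation to help developers integrate generative AI into real-world applications.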

Phi-4 Multimodal: A Leap Forward in AI

Figure: Phi-4 multimodal (source: Phi)

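Key Features of Phi-4 Multimodal

Phi-4 multimodal is designed to handle several input types in one model. Here is what sets it apart:

Unified multimodal processing: Unlike traditional systems that need a separate pipeline for each input type, Phi-4 uses a mixture of low-rank adapters (LoRAs) to bring speech, vision, and text into a single processing space.
Advanced training techniques: The model was trained with supervised fine-tuning, direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF) to improve accuracy and output safety.
Multilingual support: Text processing covers the 23 languages listed in the table below, while vision and audio inputs support a smaller set of languages.
Optimized efficiency: Built for on-device execution, Phi-4 keeps compute overhead low while maintaining strong performance.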

Supported Modalities and Languages

Phi-4 Multimodal handles text, vision, and audio inputs. Language support for each modality is as follows:

Modality	Supported languages
Text	Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
Vision	English
Audio	English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

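Architectural Advances in Phi-4 Multimodal

1. Unified Representation Space

Phi-4's mixture-of-LoRAs architecture lets it process speech, vision, and text together. Unlike earlier designs that relied on separate sub-models, Phi-4 handles all inputs within one framework, which improves efficiency and consistency.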

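2. Scalability and Efficiency

Optimized for low-latency inference, which suits mobile and edge computing applications.
Supports a larger vocabulary, which strengthens language reasoning across multimodal inputs.
Uses a compact parameterization (5.6B parameters) that allows efficient deployment while keeping performance strong.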
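To make the deployment point concrete, here is a back-of-the-envelope estimate of the memory needed just to hold 5.6B weights at different precisions. The bytes-per-parameter values are the usual sizes for each dtype; the function name is illustrative, and the estimate deliberately ignores activations, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate for a 5.6B-parameter model.
# Assumption: only the weights are counted (no activations or KV cache).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(5.6e9, dtype):.1f} GiB")
```

At 16-bit precision the weights alone come to a little over 10 GiB, which is why lower-precision formats matter for edge devices.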

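3. Improved AI Reasoning

Because it can combine vision and audio inputs, Phi-4 performs well on tasks that require chart and table understanding and document reasoning. The benchmarks reported later in this guide show Phi-4 ahead of several comparable multimodal models on structured-data interpretation, for example on ChartQA and DocVQA.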

Figure: Phi-4 multimodal architecture

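Vision Processing Pipeline

Vision encoder:

Processes image inputs and converts them into a sequence of feature representations (tokens).
Likely builds on a pretrained vision model (for example CLIP or a Vision Transformer).

Token merging: Reduces the number of visual tokens to improve efficiency while preserving information.

Vision projector: Maps the visual tokens into a format compatible with the tokenizer for further processing.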
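The merge-then-project idea can be sketched in a few lines. The article does not say how Phi-4 merges tokens, so this sketch assumes simple average pooling of neighbouring tokens followed by a linear projection; the function names, the group size, and the toy weight matrix are all hypothetical.

```python
def merge_tokens(tokens, group=2):
    """Average every `group` consecutive token vectors into one vector."""
    merged = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        merged.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return merged


def project(tokens, weight):
    """Linear projection: each token (length d_in) times weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(d_out)]
        for tok in tokens
    ]


if __name__ == "__main__":
    visual_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    merged = merge_tokens(visual_tokens)          # 4 tokens become 2
    toy_weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # 2-dim in, 3-dim out
    print(project(merged, toy_weight))
```

Merging halves the sequence length before the language model sees it, and the projection changes the feature width to whatever the language model expects.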

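Audio Processing Pipeline

Audio encoder:

Processes raw audio and converts it into a sequence of feature tokens.
Likely based on a speech-to-text or waveform model (for example Wav2Vec2 or Whisper).

Audio projector: Maps the audio embeddings into a compatible token space so they can be fused with the language model.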

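Tokenization and Fusion

The tokenizer integrates vision, audio, and text by inserting image and audio placeholders into the token sequence.
This unified representation is then passed to the language model.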
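A minimal sketch of placeholder-based fusion, assuming each placeholder token id is replaced by the projected embeddings of the matching image or audio clip. The ids, the tiny embedding table, and the `fuse` helper are hypothetical and only illustrate the mechanism described above.

```python
IMAGE_PLACEHOLDER = -1  # hypothetical id standing in for an image placeholder token
AUDIO_PLACEHOLDER = -2  # hypothetical id standing in for an audio placeholder token


def fuse(token_ids, text_embeddings, media_embeddings):
    """Build the embedding sequence fed to the language model.

    token_ids: token ids, including placeholder ids.
    text_embeddings: dict mapping a text token id to its vector.
    media_embeddings: dict mapping a placeholder id to a list of vectors.
    """
    sequence = []
    for token_id in token_ids:
        if token_id in media_embeddings:
            # One placeholder expands into many media tokens.
            sequence.extend(media_embeddings[token_id])
        else:
            sequence.append(text_embeddings[token_id])
    return sequence


if __name__ == "__main__":
    text = {10: [0.1], 11: [0.2]}
    media = {IMAGE_PLACEHOLDER: [[1.0], [2.0]]}
    print(fuse([10, IMAGE_PLACEHOLDER, 11], text, media))
```

The language model then sees one ordered sequence in which media tokens sit exactly where the placeholder appeared in the prompt.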

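Phi-4 Mini Model

The core Phi-4 Mini model handles reasoning, response generation, and fusion of the multimodal information.

Stacked Transformer layers: It uses a Transformer-based architecture to process the multimodal inputs.

LoRA (Low-Rank Adaptation):

Separate LoRA adapters adapt the model for vision (LoRAᵥ) and audio (LoRAₐ).
LoRA adjusts the pretrained weights efficiently without significantly increasing the model's size.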
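The low-rank update behind LoRA can be written as W' = W + scale * (B @ A), where A and B are much smaller than W. The sketch below shows that arithmetic with plain Python lists and a hypothetical per-modality adapter table; the matrices, ranks, and scale are toy values, not the ones Phi-4 uses.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(w, lora_b, lora_a, scale=1.0):
    """Return W + scale * (B @ A); the frozen W itself is not modified."""
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_param_count(d_in, d_out, rank):
    """Parameters added by one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical adapters that share one frozen weight matrix.
W = [[1.0, 0.0], [0.0, 1.0]]
ADAPTERS = {
    "vision": ([[1.0], [2.0]], [[3.0, 4.0]]),  # (B, A), rank 1
    "audio": ([[0.0], [1.0]], [[1.0, 1.0]]),
}


def weight_for(modality, scale=0.5):
    """Pick the adapter for a modality and return the adapted weight."""
    lora_b, lora_a = ADAPTERS[modality]
    return apply_lora(W, lora_b, lora_a, scale)


if __name__ == "__main__":
    print(weight_for("vision"))
    print(lora_param_count(4096, 4096, 16), "vs", 4096 * 4096)
```

For a 4096 x 4096 layer, a rank-16 adapter adds 131,072 parameters against 16,777,216 in the full matrix, which is why the adapters barely change the model's size.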

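How Does the Phi-4 Architecture Work?

Image and audio inputs are processed by their respective encoders.
Projection layers align the encoded representations with the language model's token space.
The tokenizer fuses the information and prepares it for the Phi-4 Mini model.
The LoRA-enhanced Phi-4 Mini model generates text output based on the multimodal context.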

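Phi-4 Multimodal Across Benchmarks

Phi-4 Multimodal Audio and Vision Benchmarks

These benchmarks evaluate the model on AI2D, ChartQA, DocVQA, and InfoVQA, which are standard datasets for evaluating multimodal models, particularly in visual question answering (VQA) and document understanding.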

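s_AI2D (AI2D benchmark)

Evaluates reasoning over diagrams and images.
Phi-4-multimodal-instruct (68.9) outperforms InternOmni-7B (53.9) and Gemini-2.0-Flash-Lite (62).
Gemini-2.0-Flash (69.4) is slightly ahead of Phi-4, while Gemini-1.5-Pro (67.7) is slightly behind.

s_ChartQA (chart question answering)

Focuses on interpreting charts.
Phi-4-multimodal-instruct (69) outperforms all other models.
InternOmni-7B (56.1) comes second, while Gemini-2.0-Flash (51.3) and Gemini-1.5-Pro (46.9) score much lower.

s_DocVQA (document VQA: reading documents and extracting information)

Evaluates the model's ability to understand documents and answer questions about them.
Phi-4-multimodal-instruct (87.3) leads by a clear margin.
Gemini-2.0-Flash (80.3) and Gemini-1.5-Pro (78.2) perform well but remain behind Phi-4.

s_InfoVQA (infographic visual question answering)

Tests the model's ability to extract and reason over information in images.
Phi-4-multimodal-instruct (63.7) is again among the strongest models, though not the top scorer.
Gemini-1.5-Pro (66.1) is slightly ahead, while the other Gemini models score lower than Gemini-1.5-Pro.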
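The scores quoted above are easier to compare side by side. The sketch below stores only the numbers stated in this section (models without a quoted score are omitted) and reports the leader on each benchmark plus Phi-4's gap to the best other model. The dictionary layout and helper names are illustrative.

```python
PHI4 = "Phi-4-multimodal-instruct"

# Only scores quoted in this article; missing entries were not reported here.
SCORES = {
    "s_AI2D": {PHI4: 68.9, "InternOmni-7B": 53.9, "Gemini-2.0-Flash-Lite": 62.0,
               "Gemini-2.0-Flash": 69.4, "Gemini-1.5-Pro": 67.7},
    "s_ChartQA": {PHI4: 69.0, "InternOmni-7B": 56.1,
                  "Gemini-2.0-Flash": 51.3, "Gemini-1.5-Pro": 46.9},
    "s_DocVQA": {PHI4: 87.3, "Gemini-2.0-Flash": 80.3, "Gemini-1.5-Pro": 78.2},
    "s_InfoVQA": {PHI4: 63.7, "Gemini-1.5-Pro": 66.1},
}


def leader(benchmark):
    """Return the model with the highest quoted score on a benchmark."""
    scores = SCORES[benchmark]
    return max(scores, key=scores.get)


def phi4_gap(benchmark):
    """Phi-4's score minus the best other quoted score (negative means behind)."""
    scores = SCORES[benchmark]
    best_other = max(v for name, v in scores.items() if name != PHI4)
    return round(scores[PHI4] - best_other, 1)


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name:<10} leader={leader(name):<28} phi4_gap={phi4_gap(name):+.1f}")
```

On the quoted numbers, Phi-4 leads on s_ChartQA and s_DocVQA and trails the best model on s_AI2D and s_InfoVQA.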

Figure: Multimodal model benchmark comparison

Phi-4 Multimodal Speech Benchmarks

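Phi-4-multimodal-instruct is strong at speech recognition, beating all competitors on FLEURS, OpenASR, and CommonVoice.
Phi-4 is weaker at speech translation, trailing WhisperV3, Qwen2-Audio, and the Gemini models.
Speech question answering (speech QA) is a weak point, with Gemini-2.0-Flash and GPT-4o-RT far ahead.
Phi-4 is competitive in audio understanding, though Gemini-2.0-Flash is slightly better.
Speech summarization is average, with GPT-4o-RT slightly ahead.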

Figure: Multimodal speech benchmark comparison

Phi-4 Multimodal Vision Benchmarks

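Phi-4 performs well in OCR, document intelligence, and science reasoning.
It is strong on multimodal tasks overall but falls behind in video perception and some math-related benchmarks.
It competes closely with models such as Gemini-2.0-Flash and GPT-4o, with room to improve on multi-image and object-presence tasks.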

Figure: Multimodal vision benchmark comparison

Phi-4 Multimodal Vision Quality Radar Chart

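Key Takeaways from the Radar Chart

1. Strengths of Phi-4-multimodal-instruct

Strong visual science reasoning: Phi-4 posts one of the top scores in this category, ahead of most competitors.
Strong on popular aggregated benchmarks: It ranks among the top models, which points to solid overall performance on multimodal tasks.
Competitive at object visual presence verification: Its ability to verify whether an object appears in an image is close to that of the top-ranked models.
Decent chart reasoning: Although not the best, Phi-4 remains competitive in this area.

2. Weaknesses of Phi-4

Weaker visual math reasoning: Gemini-2.0-Flash and GPT-4o do better in this area.
Behind in multi-image perception: Compared with models such as GPT-4o and Gemini-2.0-Flash, Phi-4 is weaker at multi-image and video-based perception.
Average document intelligence: It performs well, but some competitors score higher in this category.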

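Hands-On: Implementing Phi-4 Multimodal

Microsoft provides open-source resources that let developers explore Phi-4 multimodal. The steps below walk through a practical example.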

Required Packages

!pip install flash_attn==2.7.4.post1 torch==2.6.0 transformers==4.48.2 accelerate==1.3.0 soundfile==0.13.1 pillow==11.1.0 scipy==1.15.2 torchvision==0.21.0 backoff==2.2.1 peft==0.13.2

Required Imports

import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen

Define the Model Path

model_path = "microsoft/Phi-4-multimodal-instruct"

# Load the processor (tokenizer plus image/audio preprocessing) and the model.
# trust_remote_code=True is needed because the model ships custom code.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',  # requires the flash_attn package
).cuda()

Load the Generation Config

generation_config = GenerationConfig.from_pretrained(model_path)

Define the Prompt Structure

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
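These three markers are concatenated by hand in the examples in this guide. A small helper can make the pattern reusable. The sketch below assumes placeholders are numbered from 1 in the form `<|image_N|>` and `<|audio_N|>`, which matches the single-image and single-audio prompts used here; the `build_prompt` name and the multi-input numbering are assumptions, not something this guide confirms.

```python
USER = "<|user|>"
ASSISTANT = "<|assistant|>"
END = "<|end|>"


def build_prompt(text, num_images=0, num_audios=0):
    """Compose a single-turn prompt with optional image/audio placeholders."""
    images = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    audios = "".join(f"<|audio_{i}|>" for i in range(1, num_audios + 1))
    return f"{USER}{images}{audios}{text}{END}{ASSISTANT}"


if __name__ == "__main__":
    print(build_prompt("What is shown in this image?", num_images=1))
```

Ending the prompt with the assistant marker tells the model that the next tokens it generates are the reply.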

Image Processing

print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
# <|image_1|> marks where the image is inserted into the prompt
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

Download and Open the Image

# Stream the image from the URL and open it with PIL
image = Image.open(requests.get(image_url, stream=True).raw)
# Tokenize the prompt, preprocess the image, and move the tensors to the GPU
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

Generate the Response

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
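The slice `generate_ids[:, inputs['input_ids'].shape[1]:]` is worth a closer look: for a causal language model, `generate` returns the prompt tokens followed by the new tokens, so the prompt length is used as the cut point. The sketch below reproduces that slicing with plain lists and made-up token ids; `strip_prompt` is an illustrative name.

```python
def strip_prompt(generated, prompt_len):
    """Keep only the tokens produced after the prompt, for every batch row."""
    return [row[prompt_len:] for row in generated]


if __name__ == "__main__":
    prompt_ids = [101, 7, 8]               # made-up ids for the prompt
    generated = [[101, 7, 8, 42, 43, 2]]   # prompt followed by three new tokens
    print(strip_prompt(generated, len(prompt_ids)))
```

Without this step the decoded response would begin with a copy of the prompt text.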

Input Image

Figure: Photo of a Chinese gate (the image at image_url)

Output

The image shows a street scene with a red stop sign in the foreground. The stop sign is mounted on a pole with a decorative top. Behind the stop sign, there is a traditional Chinese building with red and green colors and Chinese characters on the signboard. The building has a tiled roof and is adorned with red lanterns hanging from the eaves. There are several people walking on the sidewalk in front of the building. A black SUV is parked on the street, and there are two trash cans on the sidewalk. The street is lined with various shops and signs, including one for 'Optus' and another for 'Kuo'. The overall scene appears to be in an urban area with a mix of modern and traditional elements.

You can process audio in the same way:

print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
# <|audio_1|> marks where the audio clip is inserted into the prompt
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open the audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
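Because the speech prompt asks for `<sep>` between the transcript and the French translation, the response can be split afterwards. This is a small sketch that assumes the model follows that instruction; it falls back to treating the whole response as the transcript when no separator is present. The function name and sample strings are illustrative.

```python
def split_transcript(response, sep="<sep>"):
    """Split a response into (transcript, translation) around the separator."""
    transcript, found, translation = response.partition(sep)
    if not found:
        # The model did not emit the separator; return everything as transcript.
        return response.strip(), None
    return transcript.strip(), translation.strip()


if __name__ == "__main__":
    sample = "Hello world <sep> Bonjour le monde"
    print(split_transcript(sample))
```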

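Use cases:

AI-assisted news reporting with real-time speech transcription.
Voice-controlled virtual assistants with intelligent interaction.
Real-time multilingual audio translation for global communication.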

More Results from Phi-4 Multimodal

1. Phi-4 multimodal image analysis

2. Phi-4 multimodal math image analysis

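The Future of Multimodal AI and Edge Applications

A standout feature of Phi-4 multimodal is its ability to run on edge devices, which makes it a good fit for IoT applications and environments with limited compute.

Potential edge deployments:

Smart home assistants: Integration into IoT devices for advanced home automation.
Healthcare applications: Better diagnostics and patient monitoring through multimodal analysis.
Industrial automation: AI-driven monitoring and anomaly detection in manufacturing.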

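Summary

Microsoft's Phi-4 Multimodal integrates text, vision, and speech processing into one compact, high-performance model. It suits AI assistants, document processing, and multilingual applications, and it opens new possibilities for intelligent, intuitive AI solutions.

For developers and researchers, working hands-on with Phi-4 enables applications ranging from code generation to real-time speech translation and IoT, helping move multimodal AI forward.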
