什么是gpt-oss-safeguard？OpenAI的策略驱动安全模型

gpt-oss-safeguard

告别内容审核吧！一种全新的开放模型横空出世，它们能够真正理解你的规则，而不是盲目猜测。隆重介绍 gpt-oss-safeguard：这些模型能够解读你的规则，并以清晰的推理过程强制执行。无需大规模重新训练，也无需黑箱安全调用。取而代之的是，由你掌控的灵活开放的系统。在本文中，我们将深入剖析 safeguard 模型是什么、它们如何运作、它们的优势（以及不足之处），并指导你如何立即开始测试自己的策略。

什么是gpt-oss-safeguard？

这些模型基于 gpt-oss 架构构建，总共拥有 200 亿个参数（另一个变体拥有 1200 亿个参数），并针对安全分类任务进行了专门的微调，同时支持 Harmony 响应格式。Harmony 响应格式将推理过程分离到专用通道中，以提高可审计性和透明度。该模型体现了 OpenAI 对深度防御的理念。

什么是gpt-oss-safeguard？

Source: OpenAI

该模型同时接收两个输入：

策略（~系统指令）
作为该策略目标的内容（~查询）

处理完这些输入后，模型会得出关于内容归属的结论，并给出相应的推理。

如何访问？

您可以在 Hugging Face 的 HuggingFace Collections 中访问 gpt-oss-safeguard 模型。或者，您也可以通过提供 Playground 的在线平台访问，例如 Groq、OpenRouter 等。

本文中的演示均在 Groq 提供的 gpt-oss-safeguard Playground 上完成。

Groq 提供的 gpt-oss-safeguard

Source: Groq

实践操作：在我们自己的策略上测试模型

为了测试模型（20b 变体）在输出清理中对策略的理解和运用情况，我使用一条专门用于过滤动物名称的策略对其进行了测试：

Policy: Animal Name Detection v1.0ObjectiveDecide if the input text contains one or more animal names. Return a label and the list of detected names.LabelsANIMAL_NAME_PRESENT — At least one animal name is present.ANIMAL_NAME_ABSENT — No animal names are present.UNCERTAIN — Ambiguous; the model cannot confidently decide.DefinitionsAnimal: Any member of kingdom Animalia (mammals, birds, reptiles, amphibians, fish, insects, arachnids, mollusks, etc.), including extinct species (e.g., dinosaur names) and zodiac animals.What counts as a “name”: Canonical common names (dog, African grey parrot), scientific/Latin binomials (Canis lupus), multiword names (sea lion), slang/colloquialisms (kitty, pup), and animal emojis (🐶, 🐍).Morphology: Case-insensitive; singular/plural both count; hyphenation and spacing variants count (sea-lion/sea lion).Languages: Apply in any language; if the word is an animal in that language, it counts (e.g., perro, gato).Exclusions / DisambiguationSubstrings inside unrelated words do not count (cat in “catastrophe”, ant in “antique”).Food dishes or products only count if an animal name appears as a standalone token or clear multiword name (e.g., “chicken curry” → counts; “hotdog” → does not).Brands/teams/models (Jaguar car, Detroit Lions) count only if the text clearly references the animal, not the product/entity. If ambiguous → UNCERTAIN.Proper names/nicknames (Tiger Woods) → mark ANIMAL_NAME_PRESENT (animal token “tiger” exists), but note it’s a proper noun.Fictional/cryptids (dragon, unicorn) → do not count unless your use case explicitly wants them. If unsure → UNCERTAIN.Required Output Format (JSON){  "label": "ANIMAL_NAME_PRESENT | ANIMAL_NAME_ABSENT | UNCERTAIN",  "animals_detected": ["list", "of", "normalized", "names"],  "notes": "brief justification; mention any ambiguities",  "confidence": 0.0}Decision RulesTokenize text; look for standalone animal tokens, valid multiword animal names, scientific names, or animal emojis.Normalize matches (lowercase; strip punctuation; collapse hyphens/spaces).Apply exclusions; if only substrings or ambiguous brand/team references remain, use ANIMAL_NAME_ABSENT or UNCERTAIN accordingly.If at least one valid match remains → ANIMAL_NAME_PRESENT.Set confidence higher when the match is unambiguous (e.g., “There’s a dog and a cat here.”), lower when proper nouns or brands could confuse the intent.Examples“Show me pictures of otters.” → ANIMAL_NAME_PRESENT; ["otter"]“The Lions won the game.” → UNCERTAIN (team vs animal)“I bought a Jaguar.” → UNCERTAIN (car vs animal)“I love 🐘 and giraffes.” → ANIMAL_NAME_PRESENT; ["elephant","giraffe"]“This is a catastrophe.” → ANIMAL_NAME_ABSENT“Cook chicken with rosemary.” → ANIMAL_NAME_PRESENT; ["chicken"]“Canis lupus populations are rising.” → ANIMAL_NAME_PRESENT; ["canis lupus"]“Necesito adoptar un perro o un gato.” → ANIMAL_NAME_PRESENT; ["perro","gato"]“I had a hotdog.” → ANIMAL_NAME_ABSENT“Tiger played 18 holes.” → ANIMAL_NAME_PRESENT; ["tiger"] (proper noun; note in notes)

查询： “The quick brown fox jumps over the lazy dog.”

回复：

结果正确，并按照我之前概述的格式提供。我本可以进行更极端的测试，但有限的测试本身已经令人满意。此外，由于模型的一个限制（详见“限制”部分），使用密集型模型行不通。

基准测试：gpt-oss-safeguard的性能表现

我们使用 OpenAI 的内部和外部评估数据集对 safeguard 模型进行了评估。

内部审核评估

Source: OpenAI

safeguard 模型和内部安全推理器在多策略准确率方面优于 gpt-5-thinking 和 gpt-oss 开源模型。考虑到 safeguard 模型的参数数量较少，其性能优于 gpt-5-thinking 尤其令人惊讶。

外部审核评估

ToxicChat 测试

Source: OpenAI

在 ToxicChat 测试中，内部开发的 Safety Reasoner 排名最高，其次是 gpt-5-thinking。这两个模型都略微优于 gpt-oss-safeguard-120b 和 20b。尽管如此，由于其体积更小、部署效率更高（与那些庞大的模型相比），safeguard 仍然适用于此任务。

OpenAI 的内部安全模型

Source: OpenAI

使用内部安全策略，gpt-oss-safeguard 的性能略优于其他测试模型，包括内部开发的 Safety Reasoner（OpenAI 的内部安全模型）。与 Safety Reasoner 的差距并不具有统计学意义，但 safety 在该基准测试中领先。

局限性

性能低于专用分类器：专为故障保护和内容审核设计的分类器性能远超安全防护模型。
计算成本：与轻量级分类器相比，这些模型需要更多的计算资源（时间、硬件）。如果可扩展性是必要条件，这一点尤其令人担忧。
推理链中的错觉：即使结论令人信服，也不能保证其推理过程完全正确。如果策略较为简短，这种情况尤为突出。可疑的推理却得出了正确的结论：

可疑的推理却得出了正确的结论

多语言兼容性缺陷：安全保障模型的有效性仅限于英语这种通用语言。因此，如果您的内容或政策环境涵盖英语以外的其他语言，则可能会出现性能下降的情况。

gpt-oss-safeguard应用案例

以下是此基于策略的安全机制的一些应用案例：

信任与安全内容审核：结合上下文审查用户内容，发现违规行为，并集成到实时审核系统和审核工具中。
基于策略的分类：直接应用已编写的策略来指导决策，无需重新训练即可即时更改规则。
自动化分诊和审核助手：作为推理辅助工具，解释决策、引用所用准则，并将疑难案例上报给人工审核。
策略测试和实验：预览新规则的运行效果，在真实环境中测试不同版本，并及早发现不明确或过于严格的策略。

小结

这是朝着安全可靠的LLM（语言学习模型）迈出的正确一步。就目前而言，它并没有什么实际意义。该模型显然是为特定用户群体量身定制的，并非面向普通用户。对于大多数用户来说，Gpt-oss-safeguard 可以与 gpt-oss 相提并论。但它为未来开发安全响应机制提供了一个有用的框架。与其说它是一个完整的模型，不如说它是gpt-oss的版本升级。但它所提供的，是在无需大量硬件的情况下，实现安全模型使用的承诺。

gpt-oss-safeguard OpenAI

什么是gpt-oss-safeguard？OpenAI的策略驱动安全模型

文章目录

什么是gpt-oss-safeguard？

如何访问？

实践操作：在我们自己的策略上测试模型

基准测试：gpt-oss-safeguard的性能表现

内部审核评估

外部审核评估

局限性

gpt-oss-safeguard应用案例

小结

评论留言

取消回复

什么是gpt-oss-safeguard？OpenAI的策略驱动安全模型

文章目录

什么是gpt-oss-safeguard？

如何访问？

实践操作：在我们自己的策略上测试模型

基准测试：gpt-oss-safeguard的性能表现

内部审核评估

外部审核评估

局限性

gpt-oss-safeguard应用案例

小结

相关文章

评论留言

取消回复