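Zero-Day Attack Detection on UNSW-NB15 with a Denoising Autoencoder

Zero-day attacks are among the most serious cybersecurity threats. They exploit previously unknown vulnerabilities and so slip past existing intrusion detection systems (IDS). Traditional signature-based IDSs fail here because they rely on known attack patterns. To detect such attacks, a model needs to learn what normal network behavior looks like and automatically flag traffic that deviates from it.

A promising approach is the denoising autoencoder (DAE), an unsupervised deep learning model designed to learn robust representations of normal traffic. The main idea is that, by slightly corrupting the input during training, the DAE learns to reconstruct the original, clean version of the data. This forces the model to capture the essential structure of the data rather than memorize noise. When the model encounters an unseen zero-day attack, its reconstruction error (the loss) spikes, and that spike is what flags the anomaly. In this article, we look at how to use a DAE for zero-day attack detection on the UNSW-NB15 dataset.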

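Denoising Autoencoders: The Core Idea

In a denoising autoencoder, we deliberately add noise to the input before passing it to the encoder. The network then learns to reconstruct the original, clean input. Corrupting the input with random noise keeps the model focused on meaningful features rather than incidental detail. We express this mathematically as:

[Equation image: denoising autoencoder]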
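One standard way to write the denoising objective is sketched below. The symbols q (corruption process), f_θ (encoder), g_φ (decoder), and x̃ (corrupted input) are notation assumed here for illustration and may differ from the article's own notation.

```latex
\tilde{x} \sim q(\tilde{x} \mid x), \qquad
h = f_\theta(\tilde{x}), \qquad
\hat{x} = g_\phi(h), \qquad
\mathcal{L}(x, \hat{x}) = \lVert x - \hat{x} \rVert_2^2
```

Note that the loss compares the reconstruction with the clean input x, not with the corrupted x̃.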

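The reconstruction loss, which serves as the training loss function, measures the difference between the original input x and the reconstructed output x̂. A low reconstruction error indicates that the model has ignored the noise and retained the essential features of the input. The figure below shows a schematic of a denoising autoencoder.

[Figure: schematic of a denoising autoencoder]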
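To make the reconstruction error concrete, here is a minimal sketch that computes the mean squared error for a single sample. The vectors are hypothetical values chosen only for illustration, and `reconstruction_error` is a helper name introduced here, not part of any library.

```python
def reconstruction_error(x, x_hat):
    """Mean squared error between one input vector and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Hypothetical values, chosen only for illustration.
x = [0.2, -1.0, 0.5, 1.5]
good = [0.25, -0.9, 0.5, 1.4]  # close reconstruction
bad = [1.0, 0.0, -0.5, 0.0]    # poor reconstruction

print("error of close reconstruction:", reconstruction_error(x, good))
print("error of poor reconstruction: ", reconstruction_error(x, bad))
```

The poor reconstruction yields a much larger error, which is the signal an anomaly detector built on a DAE relies on.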

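Example: The Binary Input Case

Consider binary inputs, where each bit satisfies x ∈ {0,1}. With probability q we flip a bit or set it to 0; otherwise we leave it unchanged. If we let the model minimize the error against the corrupted input (written x̃ here), it would simply learn to copy the corruption. Because we force it to reconstruct the true x instead, it must infer the missing information from the relationships between features. This pushes the DAE beyond memorization toward the deeper structure of the input, which improves generalization at test time. In cybersecurity, this is what allows a denoising autoencoder to detect unknown or zero-day attacks that deviate from normal patterns.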
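The corruption step for binary inputs can be sketched as follows. This is a minimal illustration of the "set to 0 with probability q" variant only; `mask_bits` and the example vector are hypothetical names and values introduced here.

```python
import random


def mask_bits(x, q, rng):
    """Set each bit of x to 0 with probability q; otherwise keep it unchanged."""
    return [0 if rng.random() < q else bit for bit in x]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [1, 0, 1, 1, 0, 1, 1, 1]
x_tilde = mask_bits(x, q=0.3, rng=rng)

# The training pair is (x_tilde, x): corrupted input, clean target.
print("clean    :", x)
print("corrupted:", x_tilde)
```

Because the target is always the clean vector, the model cannot lower its loss by copying the masked bits and has to predict them from the bits that survived.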

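Case Study: Zero-Day Attack Detection with a Denoising Autoencoder

This example shows how a denoising autoencoder can detect zero-day attacks in the UNSW-NB15 dataset. We train the model to learn the underlying structure of normal traffic, without letting anomalous data influence it. At inference time, traffic that deviates markedly from normal patterns, such as traffic associated with a zero-day attack, produces a high reconstruction error, and that is what enables anomaly detection.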

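Step 1. Dataset Overview

UNSW-NB15 is a benchmark dataset for evaluating intrusion detection systems. It contains normal samples and nine attack categories, including Fuzzers, Shellcode, and Exploits. To simulate a zero-day attack, we train only on normal traffic and hold out the Shellcode attacks for testing. This ensures the model is evaluated on attack behavior it has never seen.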
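The hold-out idea can be sketched on a handful of toy records. The records below are hypothetical and carry only the two label columns; the real dataset has many more columns.

```python
# Hypothetical toy records; only the two label columns are shown.
records = [
    {"label": 0, "attack_cat": "Normal"},
    {"label": 1, "attack_cat": "Fuzzers"},
    {"label": 0, "attack_cat": "Normal"},
    {"label": 1, "attack_cat": "Shellcode"},
    {"label": 1, "attack_cat": "Exploits"},
    {"label": 1, "attack_cat": "Shellcode"},
]

HELD_OUT = "Shellcode"  # category treated as the unseen "zero-day" attack

# Training uses benign traffic only; the held-out category is used only for testing.
train_normal = [r for r in records if r["label"] == 0]
zero_day_test = [r for r in records if r["attack_cat"] == HELD_OUT]

print(len(train_normal), "normal records for training")
print(len(zero_day_test), "held-out records for testing")
```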

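Step 2. Import Libraries and Load the Dataset

We import the required libraries and load the UNSW-NB15 dataset. In the next step we preprocess the numeric features, separate the labels and categorical features, and keep only normal traffic for training.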

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import roc_curve, auc
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import EarlyStopping
# Load UNSW-NB15 dataset
df = pd.read_csv("UNSW_NB15.csv")
print("Dataset shape:", df.shape)
# Peek at the binary label and the attack category of the first rows
print(df[['label', 'attack_cat']].head())

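Output:

Dataset shape: (254004, 43)
First five rows of ['label','attack_cat']:
   label attack_cat
0      0     Normal
1      0     Normal
2      0     Normal
3      0     Normal
4      1  Shellcode

The output shows that the dataset has 254,004 rows and 43 columns. A label of 0 denotes normal traffic and 1 denotes attack traffic. The fifth row is a Shellcode attack, the category we hold out to stand in for a zero-day attack.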

Step 3. Preprocess the Data

# Define target; drop both label columns so 'attack_cat' cannot leak into the features
y = df['label']
X = df.drop(columns=['label', 'attack_cat'])
# Normal traffic for training
normal_data = X[y == 0]
# Zero-day traffic (Shellcode) for testing, with the same feature columns as normal_data
zero_day_data = df[df['attack_cat'] == 'Shellcode'].drop(columns=['label','attack_cat'])
# Identify numeric and categorical features
numeric_features = normal_data.select_dtypes(include=['int64','float64']).columns
categorical_features = normal_data.select_dtypes(include=['object']).columns
# Preprocessing pipeline: scale numerics, one-hot encode categoricals
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    # scikit-learn >= 1.2 names this argument sparse_output (older releases use sparse)
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features)
])
# Fit only on normal traffic
X_normal = preprocessor.fit_transform(normal_data)
# Train-validation split
X_train, X_val = train_test_split(X_normal, test_size=0.2, random_state=42)
print("Training data shape:", X_train.shape)
print("Validation data shape:", X_val.shape)

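Output:

Training data shape: (160000, 71)
Validation data shape: (40000, 71)

The labels are dropped and only benign samples, those with label == 0, are selected. There are 37 numeric features; the 4 categorical features are one-hot encoded and account for the remaining 34 of the 71 input dimensions.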
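To see how standardization and one-hot encoding determine the input width, here is a dependency-free sketch on toy rows. The rows and values are hypothetical, and `transform` is a helper written for this illustration that mimics, rather than calls, the scikit-learn pipeline.

```python
from statistics import mean, pstdev

# Hypothetical toy rows: two numeric columns and one categorical column.
rows = [
    {"dur": 0.1, "sbytes": 200.0, "proto": "tcp"},
    {"dur": 0.4, "sbytes": 800.0, "proto": "udp"},
    {"dur": 0.7, "sbytes": 500.0, "proto": "tcp"},
]
numeric_cols = ["dur", "sbytes"]
categorical_cols = ["proto"]

# "Fit" step: mean and population std per numeric column, category list per categorical column.
stats = {c: (mean([r[c] for r in rows]), pstdev([r[c] for r in rows])) for c in numeric_cols}
categories = {c: sorted({r[c] for r in rows}) for c in categorical_cols}


def transform(row):
    """Standardize numeric values and one-hot encode categorical values."""
    scaled = [(row[c] - stats[c][0]) / stats[c][1] for c in numeric_cols]
    # A category not seen during fitting becomes all zeros.
    one_hot = [1.0 if row[c] == cat else 0.0
               for c in categorical_cols for cat in categories[c]]
    return scaled + one_hot


encoded = [transform(r) for r in rows]
width = len(numeric_cols) + sum(len(cats) for cats in categories.values())
print("encoded width:", width)
print(encoded[0])
```

The encoded width is the number of numeric columns plus the total number of distinct categories, which is the same arithmetic behind the 71 input dimensions reported above.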

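Step 4. Define the Optimized Denoising Autoencoder

We add Gaussian noise to the input to force the network to learn robust features. Batch normalization stabilizes training, and a small bottleneck layer (16 units) encourages a compact latent representation.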
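The effect of the noise step can be sketched without any deep learning library. The helper below is a hypothetical stand-in that adds zero-mean Gaussian noise during training and passes the input through unchanged otherwise, which is also how a Keras `GaussianNoise` layer behaves (it is active only at training time).

```python
import random


def add_gaussian_noise(x, stddev, rng, training=True):
    """Add zero-mean Gaussian noise during training; pass through otherwise."""
    if not training:
        return list(x)
    return [v + rng.gauss(0.0, stddev) for v in x]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
x = [0.0, 1.0, -0.5, 2.0]  # hypothetical standardized feature values

noisy = add_gaussian_noise(x, stddev=0.1, rng=rng)
clean = add_gaussian_noise(x, stddev=0.1, rng=rng, training=False)

print("training input :", noisy)
print("inference input:", clean)
```

With a standard deviation of 0.1 on standardized features, the corruption is small relative to the unit spread of each feature, so the clean signal remains recoverable.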

input_dim = X_train.shape[1]
inp = layers.Input(shape=(input_dim,))
noisy = layers.GaussianNoise(0.1)(inp)  # Corrupt input slightly (active only during training)
# Encoder
x = layers.Dense(64, activation='relu')(noisy)
x = layers.BatchNormalization()(x)  # Stabilize training
bottleneck = layers.Dense(16, activation='relu')(x)
# Decoder
x = layers.Dense(64, activation='relu')(bottleneck)
x = layers.BatchNormalization()(x)
out = layers.Dense(input_dim, activation='linear')(x)  # Use linear for standardized input
autoencoder = Model(inputs=inp, outputs=out)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.summary()

Output:

Model: "model"
_________________________________________________________________
Layer (type)                                 Output Shape    Param #
=================================================================
input_1 (InputLayer)                         [(None, 71)]    0
gaussian_noise (GaussianNoise)               (None, 71)      0
dense (Dense)                                (None, 64)      4,608
batch_normalization (BatchNormalization)     (None, 64)      256
dense_1 (Dense)                              (None, 16)      1,040
dense_2 (Dense)                              (None, 64)      1,088
batch_normalization_1 (BatchNormalization)   (None, 64)      256
dense_3 (Dense)                              (None, 71)      4,615
=================================================================
Total params: 11,863
Trainable params: 11,607
Non-trainable params: 256
_________________________________________________________________

Step 5. Train the Model with Early Stopping

# Early stopping to avoid overfitting
es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
print("Training started...")
history = autoencoder.fit(
    X_train, X_train,  # clean data is both input and target; noise is added inside the model
    epochs=50,
    batch_size=512,  # larger batch for faster training
    validation_data=(X_val, X_val),
    shuffle=True,
    callbacks=[es]
)
print("Training completed!")
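The stopping rule itself is simple enough to sketch in plain Python. This is a simplified illustration of patience-based early stopping, not the Keras implementation, and the validation losses are hypothetical values rather than numbers from the run in this article.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, not taken from the run in this article.
val_losses = [0.020, 0.015, 0.012, 0.013, 0.0125, 0.014, 0.011]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print("stopped at epoch", stop_epoch, "with best epoch", best_epoch)
```

Restoring the weights from the best epoch, as `restore_best_weights=True` requests, means the model used for detection is the one with the lowest validation loss rather than the last one trained.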

Training loss curve

plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.xlabel("Epochs")
plt.ylabel("MSE Loss")
plt.legend()
plt.title("Training vs Validation Loss")
plt.show()

Output:

Training started...
Epoch 1/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0254 - val_loss: 0.0181
Epoch 2/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0158 - val_loss: 0.0145
Epoch 3/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0123 - val_loss: 0.0127
Epoch 4/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0106 - val_loss: 0.0108
Epoch 5/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0094 - val_loss: 0.0097
Epoch 6/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0086 - val_loss: 0.0085
Epoch 7/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0082 - val_loss: 0.0083
Epoch 8/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0080 - val_loss: 0.0086
Restoring model weights from the end of the best epoch: 7.
Epoch 00008: early stopping
Training completed!

Step 6. Zero-Day Attack Detection

# Transform datasets
X_normal_test = preprocessor.transform(normal_data)
X_zero_day_test = preprocessor.transform(zero_day_data)
# Compute reconstruction errors
recon_normal = np.mean(np.square(X_normal_test - autoencoder.predict(X_normal_test, batch_size=512)), axis=1)
recon_zero = np.mean(np.square(X_zero_day_test - autoencoder.predict(X_zero_day_test, batch_size=512)), axis=1)
# Threshold: 95th percentile of normal errors
threshold = np.percentile(recon_normal, 95)
print("Threshold:", threshold)
print("False Alarm Rate (Normal flagged as anomaly):", np.mean(recon_normal > threshold))
print("Detection Rate (Zero-Day detected):", np.mean(recon_zero > threshold))

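Output:

Threshold: 0.0121
False Alarm Rate (normal→anomaly): 0.0480
Detection Rate (Shellcode zero-day): 0.9150

We set the threshold at the 95th percentile of the reconstruction errors on benign traffic. About 4.8% of normal traffic is flagged as anomalous (false alarms), close to the 5% that a 95th-percentile cutoff implies, while roughly 91.5% of Shellcode traffic exceeds the threshold and is correctly detected (true positives).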
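The thresholding logic can be reproduced on a small scale with the standard library. The error values below are hypothetical, and `percentile_threshold` and `flagged_fraction` are helper names introduced for this sketch.

```python
from statistics import quantiles


def percentile_threshold(errors, pct):
    """Return the pct-th percentile (integer 1..99) using linear interpolation."""
    return quantiles(errors, n=100, method="inclusive")[pct - 1]


def flagged_fraction(errors, threshold):
    """Fraction of samples whose error exceeds the threshold."""
    return sum(e > threshold for e in errors) / len(errors)


# Hypothetical reconstruction errors, for illustration only.
normal_errors = [i / 1000 for i in range(1, 101)]  # 0.001 .. 0.100
attack_errors = [0.05, 0.09, 0.12, 0.20, 0.31]

threshold = percentile_threshold(normal_errors, 95)
print("threshold:", threshold)
print("false alarm rate:", flagged_fraction(normal_errors, threshold))
print("detection rate:", flagged_fraction(attack_errors, threshold))
```

Lowering the percentile catches more attacks at the cost of more false alarms, so the cutoff is a tuning decision rather than a fixed constant.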

Step 7. Visualization

Reconstruction error histogram

plt.figure(figsize=(8,5))
plt.hist(recon_normal, bins=50, alpha=0.6, label="Normal")
plt.hist(recon_zero, bins=50, alpha=0.6, label="Zero-Day (Shellcode)")
plt.axvline(threshold, color='red', linestyle='--', label='Threshold')
plt.xlabel("Reconstruction Error")
plt.ylabel("Frequency")
plt.legend()
plt.title("Normal vs Zero-Day Error Distribution")
plt.show()

Output:

[Figure: overlaid histograms of reconstruction error for benign (blue) and zero-day (orange) traffic]

ROC curve

y_true = np.concatenate([np.zeros_like(recon_normal), np.ones_like(recon_zero)])
y_scores = np.concatenate([recon_normal, recon_zero])
fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0,1],[0,1],'--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.title("ROC Curve for Zero-Day Detection")
plt.show()

Output:

[Figure: ROC curve]

The ROC curve plots the true positive rate against the false positive rate; AUC = 0.93.
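One way to read the AUC is as the probability that a randomly chosen attack sample receives a higher anomaly score than a randomly chosen normal sample. The sketch below computes it that way on hypothetical scores; `pairwise_auc` is a helper written for this illustration and is not how `sklearn.metrics.auc` is implemented.

```python
def pairwise_auc(normal_scores, attack_scores):
    """AUC as the fraction of (attack, normal) pairs ranked correctly; ties count half."""
    wins = 0.0
    for a in attack_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(attack_scores) * len(normal_scores))


# Hypothetical reconstruction errors used as anomaly scores.
normal_scores = [0.01, 0.02, 0.03, 0.08]
attack_scores = [0.05, 0.09, 0.20]

print("AUC:", pairwise_auc(normal_scores, attack_scores))
```

Unlike the detection rate at a single threshold, this number summarizes ranking quality across all possible thresholds.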

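Limitations

The approach has the following limitations:

A DAE can detect anomalies but cannot classify the type of attack.
Choosing a suitable threshold depends on the dataset and may require tuning.
It works best when trained only on normal traffic, which requires training data free of attack samples.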

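Key Takeaways

Denoising autoencoders can effectively detect previously unseen zero-day attacks.
Batch normalization, a larger batch size, and early stopping improve training stability.
Visualizations (loss curves, error histograms, ROC) make the model's behavior easier to interpret.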

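Summary

This tutorial showed how to use a denoising autoencoder to detect zero-day attacks in network traffic on the UNSW-NB15 dataset. By learning robust patterns of normal traffic, the model can flag anomalies in unseen attack data. On its own, a DAE provides a solid foundation for a modern intrusion detection system, and it can be combined with more advanced architectures or supervised classifiers to build a more comprehensive one.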
