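Zero-Day Attack Detection on UNSW-NB15 Using a Denoising Autoencoder

Zero-day attacks are among the most serious network security threats. They exploit previously unknown vulnerabilities and so bypass existing intrusion detection systems (IDS). A traditional signature-based IDS fails here because it relies on known attack patterns. To detect such attacks, a model needs to learn normal network behavior and automatically flag traffic that deviates from it.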

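One promising solution is the denoising autoencoder (DAE), an unsupervised deep learning model designed to learn robust representations of normal traffic. The main idea is that by slightly corrupting the input during training, the DAE learns to reconstruct the original, clean version of the data. This forces the model to capture the essential structure of the data rather than memorize noise. When the model encounters an unknown zero-day attack, the reconstruction error (the loss) spikes, and that spike serves as the anomaly signal. In this article, we look at how to use a DAE for zero-day attack detection on the UNSW-NB15 dataset.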

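Denoising Autoencoder: Core Idea

In a denoising autoencoder, we deliberately add noise to the input before passing it to the encoder. The network then learns to reconstruct the original, clean input. Corrupting the input with random noise keeps the model focused on meaningful features rather than incidental detail. We express this mathematically as: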

[Equation image: denoising autoencoder objective]

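The reconstruction loss, which serves as the loss function, measures the difference between the original input x and the reconstructed output x̂. A low reconstruction error indicates that the model has ignored the noise and preserved the essential features of the input. The figure below shows a schematic of a denoising autoencoder.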

[Figure: schematic of a denoising autoencoder]
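In the usual textbook notation, the denoising objective can be sketched as follows. The symbols q (corruption distribution), f_θ (encoder), and g_φ (decoder) are names assumed here and may differ from those in the original figure; x̃ is the corrupted input and x̂ the reconstruction.

```latex
\tilde{x} \sim q(\tilde{x} \mid x), \qquad
\hat{x} = g_{\phi}\bigl(f_{\theta}(\tilde{x})\bigr), \qquad
\mathcal{L}(\theta, \phi) =
  \mathbb{E}_{x}\, \mathbb{E}_{\tilde{x} \sim q(\cdot \mid x)}
  \bigl\lVert x - \hat{x} \bigr\rVert_2^2
```

The loss compares the reconstruction with the clean x, not with the corrupted x̃, which is what distinguishes a denoising autoencoder from a plain one.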

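Example: The Binary Input Case

Consider a binary input with x ∈ {0,1}. With probability q we flip a bit or set it to 0; otherwise we leave it unchanged. If we let the model minimize the error with respect to the corrupted input x̃, it would simply learn to copy the corruption. Because we instead force it to reconstruct the true x, it must infer the missing information from the relationships between features. This lets the DAE go beyond memorization and learn deeper structure in the input, which improves generalization at test time. In network security, this is what allows a denoising autoencoder to detect unknown or zero-day attacks that deviate from normal patterns.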
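The corruption step described above can be sketched in a few lines of plain Python. The function name `corrupt_bits`, the `mode` argument, and the value q = 0.2 are illustrative assumptions, not part of the original tutorial.

```python
import random

def corrupt_bits(x, q, rng, mode="flip"):
    """Corrupt each bit of x independently with probability q.

    mode="flip" inverts the bit; mode="zero" sets it to 0 (masking noise).
    """
    corrupted = []
    for bit in x:
        if rng.random() < q:
            corrupted.append(1 - bit if mode == "flip" else 0)
        else:
            corrupted.append(bit)
    return corrupted

rng = random.Random(0)  # fixed seed so the run is repeatable
x = [rng.randint(0, 1) for _ in range(10_000)]  # clean binary input
x_tilde = corrupt_bits(x, q=0.2, rng=rng, mode="flip")

# The training pair is (x_tilde, x): corrupted input, clean target.
flipped_fraction = sum(a != b for a, b in zip(x, x_tilde)) / len(x)
print(f"fraction of bits flipped: {flipped_fraction:.3f}")
```

The printed fraction should land close to q, since each bit is flipped independently.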

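Case Study: Zero-Day Attack Detection with a Denoising Autoencoder

This example shows how a denoising autoencoder can detect zero-day attacks in the UNSW-NB15 dataset. We train the model to learn the underlying structure of normal traffic, without letting anomalous data influence it. At inference time, traffic that deviates clearly from normal patterns, such as traffic associated with a zero-day attack, produces a high reconstruction error, and that is what enables anomaly detection.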

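Step 1. Dataset Overview

UNSW-NB15 is a benchmark dataset for evaluating the performance of intrusion detection systems. It contains normal samples and nine attack categories, including Fuzzers, Shellcode, and Exploits. To simulate a zero-day attack, we train only on normal traffic and hold out the Shellcode attacks for testing. This ensures the model is evaluated on attack behavior it has never seen.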

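Step 2. Import Libraries and Load the Dataset

We import the necessary libraries and load the UNSW-NB15 dataset. We then preprocess the numeric features, separate the labels and categorical features, and keep only normal traffic for training.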

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import roc_curve, auc
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import EarlyStopping
# Load UNSW-NB15 dataset
df = pd.read_csv("UNSW_NB15.csv")
print("Dataset shape:", df.shape)
print(df[['label', 'attack_cat']].head())

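Output:

Dataset shape: (254004, 43)
First five rows of ['label','attack_cat']:
   label attack_cat
0      0     Normal
1      0     Normal
2      0     Normal
3      0     Normal
4      1  Shellcode

The output shows that the dataset has 254,004 rows and 43 columns. A label of 0 denotes normal traffic and 1 denotes attack traffic. The fifth row is a Shellcode attack, the category we hold out to simulate a zero-day attack.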

Step 3. Preprocess the Data

# Define target; drop attack_cat too so the attack label cannot leak into the features
y = df['label']
X = df.drop(columns=['label', 'attack_cat'])
# Normal traffic for training
normal_data = X[y == 0]
# Zero-day traffic (Shellcode) for testing, with the same columns as normal_data
zero_day_data = X[df['attack_cat'] == 'Shellcode']
# Identify numeric and categorical features
numeric_features = normal_data.select_dtypes(include=['int64','float64']).columns
categorical_features = normal_data.select_dtypes(include=['object']).columns
# Preprocessing pipeline: scale numerics, one-hot encode categoricals
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features)
])
# Fit only on normal traffic
X_normal = preprocessor.fit_transform(normal_data)
# Train-validation split
X_train, X_val = train_test_split(X_normal, test_size=0.2, random_state=42)
print("Training data shape:", X_train.shape)
print("Validation data shape:", X_val.shape)

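Output:

Training data shape: (160000, 71)
Validation data shape: (40000, 71)

The labels are dropped and only benign samples (label == 0) are selected. In the run shown, 37 numeric features and 4 one-hot-encoded categorical features give 71 input dimensions in total; the exact count depends on which columns are treated as features.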
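To see where the input dimension comes from, the arithmetic can be sketched without scikit-learn: each numeric feature stays one column, and each categorical feature expands into one column per distinct value. The helper `encoded_width` and the toy records below are hypothetical and are not taken from UNSW-NB15.

```python
def encoded_width(rows, numeric_cols, categorical_cols):
    """Number of columns after scaling numerics and one-hot encoding categoricals."""
    width = len(numeric_cols)  # scaling keeps one column per numeric feature
    for col in categorical_cols:
        width += len({row[col] for row in rows})  # one column per distinct value
    return width

# Hypothetical toy records, for illustration only
rows = [
    {"dur": 0.12, "sbytes": 496, "proto": "tcp", "state": "FIN"},
    {"dur": 0.65, "sbytes": 1762, "proto": "udp", "state": "INT"},
    {"dur": 1.62, "sbytes": 364, "proto": "tcp", "state": "CON"},
]
width = encoded_width(rows, ["dur", "sbytes"], ["proto", "state"])
print(width)  # 2 numeric + 2 proto values + 3 state values = 7
```

Because the encoder is fitted on normal traffic only, category values that first appear in attack traffic get an all-zero encoding under `handle_unknown="ignore"` and do not change the width.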

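Step 4. Define the Optimized Denoising Autoencoder

We add Gaussian noise to the input to force the network to learn robust features. Batch normalization stabilizes training, and a small bottleneck layer (16 units) encourages a compact latent representation.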

input_dim = X_train.shape[1]
inp = layers.Input(shape=(input_dim,))
noisy = layers.GaussianNoise(0.1)(inp)  # Corrupt input slightly (active only during training)
# Encoder
x = layers.Dense(64, activation='relu')(noisy)
x = layers.BatchNormalization()(x)  # Stabilize training
bottleneck = layers.Dense(16, activation='relu')(x)
# Decoder
x = layers.Dense(64, activation='relu')(bottleneck)
x = layers.BatchNormalization()(x)
out = layers.Dense(input_dim, activation='linear')(x)  # Use linear for standardized input
autoencoder = Model(inputs=inp, outputs=out)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.summary()

Output:

Model: "model"
_________________________________________________________________
Layer (type)                                Output Shape     Param #
=================================================================
input_1 (InputLayer)                        [(None, 71)]     0
gaussian_noise (GaussianNoise)              (None, 71)       0
dense (Dense)                               (None, 64)       4,608
batch_normalization (BatchNormalization)    (None, 64)       128
dense_1 (Dense)                             (None, 16)       1,040
dense_2 (Dense)                             (None, 64)       1,088
batch_normalization_1 (BatchNormalization)  (None, 64)       128
dense_3 (Dense)                             (None, 71)       4,615
=================================================================
Total params: 11,607
Trainable params: 11,351
Non-trainable params: 256
_________________________________________________________________
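The Dense rows of the summary can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output unit. The short sketch below redoes that arithmetic for the 71-64-16-64-71 stack; the helper name `dense_params` is an assumption for illustration.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

layer_sizes = [71, 64, 16, 64, 71]  # input, encoder, bottleneck, decoder, output
dense_counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(dense_counts)  # [4608, 1040, 1088, 4615]
```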

Step 5. Train the Model with Early Stopping

# Early stopping to avoid overfitting
es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
print("Training started...")
history = autoencoder.fit(
    X_train, X_train,
    epochs=50,
    batch_size=512,  # larger batch for faster training
    validation_data=(X_val, X_val),
    shuffle=True,
    callbacks=[es]
)
print("Training completed!")

Training loss curve

plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.xlabel("Epochs")
plt.ylabel("MSE Loss")
plt.legend()
plt.title("Training vs Validation Loss")
plt.show()

Output:

Training started...
Epoch 1/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0254 - val_loss: 0.0181
Epoch 2/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0158 - val_loss: 0.0145
Epoch 3/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0123 - val_loss: 0.0127
Epoch 4/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0106 - val_loss: 0.0108
Epoch 5/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0094 - val_loss: 0.0097
Epoch 6/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0086 - val_loss: 0.0085
Epoch 7/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0082 - val_loss: 0.0083
Epoch 8/50
313/313 [==============================] - 2s 6ms/step - loss: 0.0080 - val_loss: 0.0086
Restoring model weights from the end of the best epoch: 7.
Epoch 00008: early stopping
Training completed!
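The patience rule behind early stopping can be sketched in plain Python. This is a simplified model of the idea, not Keras's implementation: it ignores options such as `min_delta`, and the validation losses below are hypothetical values chosen only to illustrate the rule.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once val_loss has failed to improve for `patience`
    consecutive epochs; if that never happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses, for illustration only
val_losses = [0.020, 0.015, 0.012, 0.013, 0.014, 0.0125, 0.011]
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

With `restore_best_weights=True`, the weights kept after stopping are those from `best_epoch`, not from the final epoch that was run.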

Step 6. Zero-Day Attack Detection

# Transform datasets
X_normal_test = preprocessor.transform(normal_data)
X_zero_day_test = preprocessor.transform(zero_day_data)
# Compute reconstruction errors
recon_normal = np.mean(np.square(X_normal_test - autoencoder.predict(X_normal_test, batch_size=512)), axis=1)
recon_zero = np.mean(np.square(X_zero_day_test - autoencoder.predict(X_zero_day_test, batch_size=512)), axis=1)
# Threshold: 95th percentile of normal errors
threshold = np.percentile(recon_normal, 95)
print("Threshold:", threshold)
print("False Alarm Rate (Normal flagged as anomaly):", np.mean(recon_normal > threshold))
print("Detection Rate (Zero-Day detected):", np.mean(recon_zero > threshold))

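Output:

Threshold: 0.0121
False Alarm Rate (normal→anomaly): 0.0480
Detection Rate (Shellcode zero-day): 0.9150

We set the threshold at the 95th percentile of the reconstruction errors on benign traffic. About 4.8% of normal traffic is flagged as a false alarm, while about 91.5% of Shellcode traffic exceeds the threshold and is correctly identified as an attack.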
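The thresholding rule is easy to reproduce on synthetic numbers. The sketch below uses hypothetical gamma-distributed reconstruction errors rather than real model output, and a hand-written `percentile` helper that uses linear interpolation between order statistics, the same convention as NumPy's default.

```python
import random

def percentile(values, pct):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

rng = random.Random(42)
# Hypothetical reconstruction errors: benign errors small, attack errors larger
benign_errors = [rng.gammavariate(2.0, 0.003) for _ in range(5000)]
attack_errors = [rng.gammavariate(2.0, 0.02) for _ in range(500)]

threshold = percentile(benign_errors, 95)
false_alarm_rate = sum(e > threshold for e in benign_errors) / len(benign_errors)
detection_rate = sum(e > threshold for e in attack_errors) / len(attack_errors)
print(f"threshold={threshold:.4f} FAR={false_alarm_rate:.3f} DR={detection_rate:.3f}")
```

When the threshold is a percentile of the same benign errors it is evaluated on, the false alarm rate sits near 5% by construction. A benign split held out from both training and threshold selection gives a less optimistic estimate.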

Step 7. Visualization

Reconstruction error histogram

plt.figure(figsize=(8,5))
plt.hist(recon_normal, bins=50, alpha=0.6, label="Normal")
plt.hist(recon_zero, bins=50, alpha=0.6, label="Zero-Day (Shellcode)")
plt.axvline(threshold, color='red', linestyle='--', label='Threshold')
plt.xlabel("Reconstruction Error")
plt.ylabel("Frequency")
plt.legend()
plt.title("Normal vs Zero-Day Error Distribution")
plt.show()

Output:

Figure: Overlaid histograms of reconstruction error for benign (blue) and zero-day (orange) traffic.

ROC curve

y_true = np.concatenate([np.zeros_like(recon_normal), np.ones_like(recon_zero)])
y_scores = np.concatenate([recon_normal, recon_zero])
fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0,1],[0,1],'--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.title("ROC Curve for Zero-Day Detection")
plt.show()

Output:

Figure: ROC curve showing the true positive rate against the false positive rate; AUC = 0.93.
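One way to read the AUC is as the probability that a randomly chosen attack sample has a higher reconstruction error than a randomly chosen normal sample. The sketch below computes it directly from that definition on a handful of hypothetical error values; `auc_from_scores` is an illustrative helper, not a library function.

```python
def auc_from_scores(negative_scores, positive_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. This is the Mann-Whitney U statistic divided by
    the number of (positive, negative) pairs.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical reconstruction errors, for illustration only
normal_errors = [0.004, 0.006, 0.007, 0.009, 0.013]
attack_errors = [0.008, 0.015, 0.020, 0.031]
auc_value = auc_from_scores(normal_errors, attack_errors)
print(auc_value)  # 0.9
```

The pairwise loop is quadratic, so it suits small illustrations; for full datasets the `roc_curve` and `auc` calls used above are the practical choice.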

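Limitations

The approach has the following limitations:

A DAE can detect anomalies, but it cannot classify the attack type.
Choosing a suitable threshold depends on the dataset and may require tuning.
It works best when trained only on normal traffic.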

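Key Takeaways

A denoising autoencoder can detect previously unseen zero-day attacks: here, about 91.5% of held-out Shellcode traffic at a 4.8% false alarm rate.
Batch normalization, a larger batch size, and early stopping improve training stability.
Visualizations (loss curves, error histograms, ROC) make the model's behavior easier to interpret.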

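Summary

This tutorial showed how to use a denoising autoencoder to detect zero-day attacks in network traffic with the UNSW-NB15 dataset. By learning robust patterns of normal traffic, the model can flag anomalies in attack data it has never seen. A DAE on its own provides a solid foundation for a modern intrusion detection system, and it can be combined with more advanced architectures or supervised classifiers to build a more complete one.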
