Multivariate anomaly detection with isolation forest

01/23/2024

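This article shows how to use SynapseML on Apache Spark for multivariate anomaly detection. Multivariate anomaly detection finds anomalies across many variables or time series, taking into account the correlations and dependencies between the variables. In this scenario, we use SynapseML to train an isolation forest model for multivariate anomaly detection, and then use the trained model to infer multivariate anomalies in a dataset that contains synthetic measurements from three IoT sensors.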

To learn more about the isolation forest model, see the original paper by Liu et al.
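As background for the outlier score used later in this article, the following sketch shows the anomaly score as defined in the Liu et al. paper: s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of a point across the trees and c(n) is the average path length for a sample of n points. This is an illustration of the published formula only; the function names are hypothetical, and it is not SynapseML's internal implementation.

```python
import math

EULER_MASCHERONI = 0.5772156649015329


def average_path_length(n: int) -> float:
    """c(n) from the isolation forest paper: the average path length of an
    unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    # H(i) is approximated as ln(i) + Euler-Mascheroni constant, as in the paper.
    harmonic = math.log(n - 1) + EULER_MASCHERONI
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an anomaly,
    values well below 0.5 suggest a normal point."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


# n = 256 matches the max_samples value used in this article.
n = 256
short_path = anomaly_score(1.0, n)   # isolated after very few splits
long_path = anomaly_score(20.0, n)   # needed many splits to isolate
print(short_path > 0.5, long_path < 0.5)
```

Points that are isolated after only a few random splits get a short mean path length and therefore a score closer to 1.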

Prerequisites

Attach your notebook to a lakehouse. On the left side, select Add to add an existing lakehouse or create a new one.

Library imports

from IPython import get_ipython
from IPython.terminal.interactiveshell import TerminalInteractiveShell
import uuid
import mlflow

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
from pyspark.ml import Pipeline

from synapse.ml.isolationforest import *

from synapse.ml.explainers import *

%matplotlib inline

from pyspark.sql import SparkSession

# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()

from synapse.ml.core.platform import *

if running_on_synapse():
    shell = TerminalInteractiveShell.instance()
    shell.define_macro("foo", """a,b=10,20""")

Input data

# Table inputs
timestampColumn = "timestamp"  # str: the name of the timestamp column in the table
inputCols = [
    "sensor_1",
    "sensor_2",
    "sensor_3",
]  # list(str): the names of the input variables

# Training and inference windows (ISO 8601 timestamps, UTC):
trainingStartTime = (
    "2022-02-24T06:00:00Z"  # datetime: datetime for when to start the training
)
trainingEndTime = (
    "2022-03-08T23:55:00Z"  # datetime: datetime for when to end the training
)
inferenceStartTime = (
    "2022-03-09T09:30:00Z"  # datetime: datetime for when to start the inference
)
inferenceEndTime = (
    "2022-03-20T23:55:00Z"  # datetime: datetime for when to end the inference
)

# Isolation Forest parameters
contamination = 0.021
num_estimators = 100
max_samples = 256
max_features = 1.0

Read data

df = (
    spark.read.format("csv")
    .option("header", "true")
    .load(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/generated_sample_mvad_data.csv"
    )
)

Cast the columns to appropriate data types

df = (
    df.orderBy(timestampColumn)
    .withColumn("timestamp", F.date_format(timestampColumn, "yyyy-MM-dd'T'HH:mm:ss'Z'"))
    .withColumn("sensor_1", F.col("sensor_1").cast(DoubleType()))
    .withColumn("sensor_2", F.col("sensor_2").cast(DoubleType()))
    .withColumn("sensor_3", F.col("sensor_3").cast(DoubleType()))
    .drop("_c5")
)

display(df)

Training data preparation

# filter to data with timestamps within the training window
df_train = df.filter(
    (F.col(timestampColumn) >= trainingStartTime)
    & (F.col(timestampColumn) <= trainingEndTime)
)
display(df_train)

Test data preparation

# filter to data with timestamps within the inference window
df_test = df.filter(
    (F.col(timestampColumn) >= inferenceStartTime)
    & (F.col(timestampColumn) <= inferenceEndTime)
)
display(df_test)
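The two filters above compare the timestamp column, which was formatted as a string earlier, against string bounds. The following plain-Python sketch illustrates why that works: fixed-width ISO 8601 strings in the same time zone sort in chronological order, so string comparison and datetime comparison agree. The helper names and sample timestamps are illustrative and are not part of the original notebook.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # same layout as yyyy-MM-dd'T'HH:mm:ss'Z'


def parse_utc(ts: str) -> datetime:
    """Parse a fixed-width ISO 8601 UTC timestamp string."""
    return datetime.strptime(ts, ISO_FORMAT).replace(tzinfo=timezone.utc)


def in_window(ts: str, start: str, end: str) -> bool:
    """Inclusive window test using plain string comparison."""
    return start <= ts <= end


training_window = ("2022-02-24T06:00:00Z", "2022-03-08T23:55:00Z")
inference_window = ("2022-03-09T09:30:00Z", "2022-03-20T23:55:00Z")

samples = [
    "2022-02-24T05:59:59Z",  # just before the training window
    "2022-02-24T06:00:00Z",  # training start boundary
    "2022-03-08T23:55:00Z",  # training end boundary
    "2022-03-09T00:00:00Z",  # between the two windows
    "2022-03-09T09:30:00Z",  # inference start boundary
    "2022-03-21T00:00:00Z",  # after the inference window
]

for ts in samples:
    for start, end in (training_window, inference_window):
        by_string = in_window(ts, start, end)
        by_datetime = parse_utc(start) <= parse_utc(ts) <= parse_utc(end)
        # String order and chronological order agree for this format.
        assert by_string == by_datetime
```

Note that rows between the end of the training window and the start of the inference window fall into neither DataFrame.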

Train an isolation forest model

isolationForest = (
    IsolationForest()
    .setNumEstimators(num_estimators)
    .setBootstrap(False)
    .setMaxSamples(max_samples)
    .setMaxFeatures(max_features)
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(contamination)
    .setContaminationError(0.01 * contamination)
    .setRandomSeed(1)
)
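The contamination parameter is the expected fraction of outliers in the training data. Conceptually, it turns continuous outlier scores into 0/1 labels by choosing a score cut-off. The sketch below shows that idea in plain Python with an exact cut-off. It is an assumption that the estimator behaves similarly, using an approximate quantile whose tolerance is governed by the contamination error setting; the function names and the synthetic scores are hypothetical.

```python
def threshold_for_contamination(scores, contamination):
    """Return the score cut-off such that about `contamination` of the
    scores lie at or above it (exact, not approximate)."""
    if not 0.0 < contamination < 1.0:
        raise ValueError("contamination must be strictly between 0 and 1")
    ordered = sorted(scores, reverse=True)
    n_outliers = max(1, round(contamination * len(ordered)))
    return ordered[n_outliers - 1]


def label_outliers(scores, threshold):
    """1 marks a predicted outlier, 0 a predicted inlier."""
    return [1 if s >= threshold else 0 for s in scores]


# Synthetic, evenly spaced scores purely for illustration.
scores = [i / 1000 for i in range(1000)]
threshold = threshold_for_contamination(scores, contamination=0.021)
labels = label_outliers(scores, threshold)
print(threshold, sum(labels))
```

With contamination set to 0.021, roughly 2.1% of the training points end up above the cut-off.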

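Next, we create an ML pipeline to train the isolation forest model. We also demonstrate how to create an MLflow experiment and register the trained model.

MLflow model registration is strictly required only if you need to access the trained model at a later time. To train the model and perform inference in the same notebook, the model object is sufficient.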

va = VectorAssembler(inputCols=inputCols, outputCol="features")
pipeline = Pipeline(stages=[va, isolationForest])
model = pipeline.fit(df_train)
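The cell above fits the pipeline without the MLflow steps mentioned earlier. The following is a minimal sketch of how the fit could be wrapped in an MLflow run and the model registered, then loaded again later. The helper names, the experiment name, the artifact path, and the registered model name are all hypothetical; the sketch assumes MLflow with Spark support is available in the notebook environment, and parameter names may differ between MLflow versions.

```python
def model_uri(registered_model_name: str, version: int) -> str:
    """Build the URI MLflow uses to address a registered model version."""
    return f"models:/{registered_model_name}/{version}"


def fit_and_register(pipeline, df_train, experiment_name, artifact_path,
                     registered_model_name):
    """Fit the pipeline inside an MLflow run and register the fitted model."""
    # Imported here so this sketch can be defined without MLflow installed.
    import mlflow.spark

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        fitted = pipeline.fit(df_train)
        mlflow.spark.log_model(
            fitted,
            artifact_path=artifact_path,
            registered_model_name=registered_model_name,
        )
    return fitted


def load_registered(registered_model_name: str, version: int):
    """Load a previously registered Spark pipeline model."""
    import mlflow.spark

    return mlflow.spark.load_model(model_uri(registered_model_name, version))
```

For example, `fit_and_register(pipeline, df_train, "isolation_forest_experiment", "isolationforest", "isolation_forest_model")` would replace the plain `pipeline.fit(df_train)` call, and `load_registered("isolation_forest_model", 1)` would retrieve the first registered version in a later session.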

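Perform inferencing

Load the trained isolation forest model, then run inference on the test data. When training and inference happen in the same notebook, the in-memory model object from the previous step can be used directly.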

df_test_pred = model.transform(df_test)
display(df_test_pred)

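Premade anomaly detectors

Azure AI Anomaly Detector

Anomaly status of the latest point: generates a model from the preceding points and determines whether the latest point is anomalous (Scala, Python)
Find anomalies: generates a model from the entire series and finds the anomalies in the series (Scala, Python)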
