Tutorial: Learn Multivariate Anomaly Detection in one hour
Anomaly Detector with Multivariate Anomaly Detection (MVAD) is an advanced AI tool for detecting anomalies from a group of metrics in an unsupervised manner.
In general, you could take these steps to use MVAD:
- Create an Anomaly Detector resource that supports MVAD on Azure.
- Prepare your data.
- Train an MVAD model.
- Query the status of your model.
- Detect anomalies with the trained MVAD model.
- Retrieve and interpret the inference results.
In this tutorial, you'll:
- Understand how to prepare your data in a correct format.
- Understand how to train a model and run inference with MVAD.
- Understand the input parameters and how to interpret the output in inference results.
1. Create an Anomaly Detector resource that supports MVAD
- Create an Azure subscription if you don't have one - you can create one for free.
- Once you have your Azure subscription, create an Anomaly Detector resource in the Azure portal to get your API key and API endpoint.
Note
During the preview stage, MVAD is available in limited regions only. Please bookmark What's new in Anomaly Detector to keep up to date with MVAD region roll-outs. You could also file a GitHub issue or contact us at AnomalyDetector@microsoft.com to request support for specific regions.
2. Data preparation
Then you need to prepare your training data (and inference data).
Input data schema
MVAD detects anomalies from a group of metrics, and we call each metric a variable or a time series.
You could download the sample data file from Microsoft to check the accepted schema from: https://aka.ms/AnomalyDetector/MVADSampleData
- Each variable must have two and only two fields, `timestamp` and `value`, and should be stored in a comma-separated values (CSV) file.
- The column names of the CSV file should be precisely `timestamp` and `value`, case-sensitive.
- The `timestamp` values should conform to ISO 8601; the `value` could be integers or decimals with any number of decimal places. A good example of the content of a CSV file:

| timestamp | value |
|-----------|-------|
| 2019-04-01T00:00:00Z | 5 |
| 2019-04-01T00:01:00Z | 3.6 |
| 2019-04-01T00:02:00Z | 4 |
| ... | ... |

Note
If your timestamps have hours, minutes, and/or seconds, ensure that they're properly rounded to your data frequency before calling the APIs.
For example, if your data frequency is supposed to be one data point every 30 seconds, but you're seeing timestamps like "12:00:01" and "12:00:28", it's a strong signal that you should pre-process the timestamps to new values like "12:00:00" and "12:00:30".
For details, please refer to the "Timestamp round-up" section in the best practices document.
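As a sketch of such pre-processing, the snippet below rounds ISO 8601 timestamps to the nearest multiple of the data frequency (the function name and the 30-second granularity are only an example; adapt both to your own data):

```python
from datetime import datetime, timezone

def round_timestamp(ts: str, granularity_seconds: int) -> str:
    """Round an ISO 8601 timestamp to the nearest multiple of the data frequency."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    epoch = dt.timestamp()
    rounded = round(epoch / granularity_seconds) * granularity_seconds
    return datetime.fromtimestamp(rounded, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(round_timestamp("2019-04-01T12:00:01Z", 30))  # 2019-04-01T12:00:00Z
print(round_timestamp("2019-04-01T12:00:28Z", 30))  # 2019-04-01T12:00:30Z
```
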
The name of the CSV file will be used as the variable name and should be unique. For example, "temperature.csv" and "humidity.csv".
Variables for training and variables for inference should be consistent. For example, if you are using `series_1`, `series_2`, `series_3`, `series_4`, and `series_5` for training, you should provide exactly the same variables for inference.

CSV files should be compressed into a zip file and uploaded to an Azure blob container. The zip file can have whatever name you want.
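Before zipping, you may want to sanity-check each CSV file against the schema above. A minimal sketch using only the standard library (the helper name is our own, not part of the service):

```python
import csv
import io

def validate_variable_csv(text: str) -> list:
    """Return a list of problems found in one variable's CSV content (empty list = OK)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    problems = []
    # Column names must be exactly 'timestamp' and 'value', case-sensitive.
    if header != ["timestamp", "value"]:
        problems.append(f"columns must be exactly ['timestamp', 'value'], got {header}")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, got {len(row)}")
    return problems

print(validate_variable_csv("timestamp,value\n2019-04-01T00:00:00Z,5\n"))  # []
```
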
Folder structure
A common mistake in data preparation is extra folders in the zip file. For example, assume the name of the zip file is series.zip. Then after decompressing the files to a new folder ./series, the correct path to CSV files is ./series/series_1.csv and a wrong path could be ./series/foo/bar/series_1.csv.
The correct example of the directory tree after decompressing the zip file in Windows
.
└── series
├── series_1.csv
├── series_2.csv
├── series_3.csv
├── series_4.csv
└── series_5.csv
An incorrect example of the directory tree after decompressing the zip file in Windows
.
└── series
└── series
├── series_1.csv
├── series_2.csv
├── series_3.csv
├── series_4.csv
└── series_5.csv
Tools for zipping and uploading data
In this section, we share some sample code and tools that you can copy and edit to fit into your own application logic that deals with MVAD input data.
Compressing CSV files in *nix
zip -j series.zip series/*.csv
Compressing CSV files in Windows
- Navigate into the folder with all the CSV files.
- Select all the CSV files you need.
- Right click on one of the CSV files and select `Send to`.
- Select `Compressed (zipped) folder` from the drop-down.
- Rename the zip file as needed.
Python code zipping & uploading data to Azure Blob Storage
You could refer to this doc to learn how to upload a file to Azure Blob.
Or, you could refer to the sample code below that can do the zipping and uploading for you. You could copy and save the Python code in this section as a .py file (for example, zipAndUpload.py) and run it using command lines like these:
python zipAndUpload.py -s "foo\bar" -z test123.zip -c {azure blob connection string} -n container_xxx

This command will compress all the CSV files in `foo\bar` into a single zip file named `test123.zip` and upload `test123.zip` to the container `container_xxx` in your blob storage.

python zipAndUpload.py -s "foo\bar" -z test123.zip -c {azure blob connection string} -n container_xxx -r

This command will do the same thing as above, but it will delete the zip file `test123.zip` after a successful upload.
Arguments:
- `--source-folder`, `-s`, path to the source folder containing CSV files
- `--zipfile-name`, `-z`, name of the zip file
- `--connection-string`, `-c`, connection string to your blob storage
- `--container-name`, `-n`, name of the container
- `--remove-zipfile`, `-r`, if on, remove the zip file after uploading
import os
import argparse
import sys
from azure.storage.blob import BlobClient
import zipfile


class ZipError(Exception):
    pass


class UploadError(Exception):
    pass


def zip_file(root, name):
    try:
        z = zipfile.ZipFile(name, "w", zipfile.ZIP_DEFLATED)
        for f in os.listdir(root):
            if f.endswith(".csv"):
                # Store each CSV at the zip root, without folder paths (like `zip -j`)
                z.write(os.path.join(root, f), f)
        z.close()
        print("Compress files success!")
    except Exception as ex:
        raise ZipError(repr(ex))


def upload_to_blob(file, conn_str, cont_name, blob_name):
    try:
        blob_client = BlobClient.from_connection_string(conn_str, container_name=cont_name, blob_name=blob_name)
        with open(file, "rb") as f:
            blob_client.upload_blob(f, overwrite=True)
        print("Upload Success!")
    except Exception as ex:
        raise UploadError(repr(ex))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--source-folder", "-s", type=str, required=True, help="path to source folder")
    parser.add_argument("--zipfile-name", "-z", type=str, required=True, help="name of the zip file")
    parser.add_argument("--connection-string", "-c", type=str, help="connection string")
    parser.add_argument("--container-name", "-n", type=str, help="container name")
    parser.add_argument("--remove-zipfile", "-r", action="store_true", help="whether to delete the zip file after uploading")
    args = parser.parse_args()

    try:
        zip_file(args.source_folder, args.zipfile_name)
        upload_to_blob(args.zipfile_name, args.connection_string, args.container_name, args.zipfile_name)
    except ZipError as ex:
        print(f"Failed to compress files. {repr(ex)}")
        sys.exit(-1)
    except UploadError as ex:
        print(f"Failed to upload files. {repr(ex)}")
        sys.exit(-1)
    except Exception as ex:
        print(f"Exception encountered. {repr(ex)}")
        sys.exit(-1)

    try:
        if args.remove_zipfile:
            os.remove(args.zipfile_name)
    except Exception as ex:
        print(f"Failed to delete the zip file. {repr(ex)}")
3. Train an MVAD Model
Here is a sample request body and the sample code in Python to train an MVAD model.
// Sample Request Body
{
"slidingWindow": 200,
"alignPolicy": {
"alignMode": "Outer",
"fillNAMethod": "Linear",
"paddingValue": 0
},
// This could be your own ZIP file of training data stored on Azure Blob and a SAS url could be used here
"source": "https://aka.ms/AnomalyDetector/MVADSampleData",
"startTime": "2021-01-01T00:00:00Z",
"endTime": "2021-01-02T12:00:00Z",
"displayName": "Contoso model"
}
# Sample Code in Python
########### Python 3.x #############
import http.client, urllib.parse

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': '{API key}',
}

params = urllib.parse.urlencode({})

try:
    conn = http.client.HTTPSConnection('{endpoint}')
    conn.request("POST", "/anomalydetector/v1.1-preview/multivariate/models?%s" % params, "{request body}", headers)
    response = conn.getresponse()
    data = response.read()
    print(data)
    conn.close()
except Exception as e:
    print(repr(e))
####################################
Response code 201 indicates a successful request.
Input parameters
Required parameters
These three parameters are required in training and inference API requests:
- `source` - The link to your zip file located in Azure Blob Storage with Shared Access Signatures (SAS).
- `startTime` - The start time of data used for training or inference. If it's earlier than the actual earliest timestamp in the data, the actual earliest timestamp will be used as the starting point.
- `endTime` - The end time of data used for training or inference, which must be later than or equal to `startTime`. If `endTime` is later than the actual latest timestamp in the data, the actual latest timestamp will be used as the ending point. If `endTime` equals `startTime`, it means inference of one single data point, which is often used in streaming scenarios.
Optional parameters for training API
Other parameters for training API are optional:
- `slidingWindow` - How many data points are used to determine anomalies. An integer between 28 and 2,880. The default value is 300. If `slidingWindow` is `k` for model training, then at least `k` points should be accessible from the source file during inference to get valid results.

  MVAD takes a segment of data points to decide if the next data point is an anomaly. The length of the segment is `slidingWindow`. Please keep two things in mind when choosing a `slidingWindow` value:

  - The properties of your data: whether it's periodic and the sampling rate. When your data is periodic, you could set the length of 1 - 3 cycles as the `slidingWindow`. When your data is at a high frequency (small granularity) like minute-level or second-level, you could set a relatively higher value of `slidingWindow`.
  - The trade-off between training/inference time and potential performance impact. A larger `slidingWindow` may cause longer training/inference time. There is no guarantee that a larger `slidingWindow` will lead to accuracy gains. A small `slidingWindow` may make it difficult for the model to converge to an optimal solution. For example, it is hard to detect anomalies when `slidingWindow` has only two points.

- `alignMode` - How to align multiple variables (time series) on timestamps. There are two options for this parameter, `Inner` and `Outer`, and the default value is `Outer`.

  This parameter is critical when there is misalignment between the timestamp sequences of the variables. The model needs to align the variables onto the same timestamp sequence before further processing. `Inner` means the model will report detection results only on timestamps on which every variable has a value, i.e. the intersection of all variables. `Outer` means the model will report detection results on timestamps on which any variable has a value, i.e. the union of all variables.

  Here is an example to explain different `alignMode` values.

  Variable-1

  | timestamp | value |
  |-----------|-------|
  | 2020-11-01 | 1 |
  | 2020-11-02 | 2 |
  | 2020-11-04 | 4 |
  | 2020-11-05 | 5 |

  Variable-2

  | timestamp | value |
  |-----------|-------|
  | 2020-11-01 | 1 |
  | 2020-11-02 | 2 |
  | 2020-11-03 | 3 |
  | 2020-11-04 | 4 |

  `Inner` join two variables

  | timestamp | Variable-1 | Variable-2 |
  |-----------|------------|------------|
  | 2020-11-01 | 1 | 1 |
  | 2020-11-02 | 2 | 2 |
  | 2020-11-04 | 4 | 4 |

  `Outer` join two variables

  | timestamp | Variable-1 | Variable-2 |
  |-----------|------------|------------|
  | 2020-11-01 | 1 | 1 |
  | 2020-11-02 | 2 | 2 |
  | 2020-11-03 | nan | 3 |
  | 2020-11-04 | 4 | 4 |
  | 2020-11-05 | 5 | nan |

- `fillNAMethod` - How to fill `nan` in the merged table. There might be missing values in the merged table and they should be properly handled. We provide several methods to fill them up. The options are `Linear`, `Previous`, `Subsequent`, `Zero`, and `Fixed`, and the default value is `Linear`.

  | Option | Method |
  |--------|--------|
  | `Linear` | Fill `nan` values by linear interpolation |
  | `Previous` | Propagate the last valid value to fill gaps. Example: `[1, 2, nan, 3, nan, 4]` -> `[1, 2, 2, 3, 3, 4]` |
  | `Subsequent` | Use the next valid value to fill gaps. Example: `[1, 2, nan, 3, nan, 4]` -> `[1, 2, 3, 3, 4, 4]` |
  | `Zero` | Fill `nan` values with 0 |
  | `Fixed` | Fill `nan` values with a specified valid value that should be provided in `paddingValue` |

- `paddingValue` - Padding value is used to fill `nan` when `fillNAMethod` is `Fixed` and must be provided in that case. In other cases it is optional.

- `displayName` - This is an optional parameter which is used to identify models. For example, you can use it to mark parameters, data sources, and any other metadata about the model and its input data. The default value is an empty string.
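The fill behaviors above can be sketched in plain Python. This only illustrates the documented semantics (with `None` standing in for `nan`); it is not the service's implementation:

```python
def fill_na(values, method, padding_value=None):
    """Fill None gaps in a list, mimicking the documented fillNAMethod options."""
    out = list(values)
    n = len(out)
    if method == "Zero":
        return [0 if v is None else v for v in out]
    if method == "Fixed":
        return [padding_value if v is None else v for v in out]
    if method == "Previous":
        # Propagate the last valid value forward.
        for i in range(n):
            if out[i] is None and i > 0:
                out[i] = out[i - 1]
        return out
    if method == "Subsequent":
        # Propagate the next valid value backward.
        for i in range(n - 1, -1, -1):
            if out[i] is None and i < n - 1:
                out[i] = out[i + 1]
        return out
    if method == "Linear":
        # Interpolate between the nearest valid neighbors.
        for i in range(n):
            if out[i] is None:
                lo, hi = i - 1, i
                while hi < n and out[hi] is None:
                    hi += 1
                if lo >= 0 and hi < n:
                    step = (out[hi] - out[lo]) / (hi - lo)
                    out[i] = out[lo] + step * (i - lo)
        return out
    raise ValueError(f"unknown method: {method}")

print(fill_na([1, 2, None, 3, None, 4], "Previous"))    # [1, 2, 2, 3, 3, 4]
print(fill_na([1, 2, None, 3, None, 4], "Subsequent"))  # [1, 2, 3, 3, 4, 4]
```
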
4. Get model status
As the training API is asynchronous, you won't get the model immediately after calling the training API. However, you can query the status of models either by API key, which will list all the models, or by model ID, which will list information about the specific model.
List all the models
You may refer to this page for information about the request URL and request headers. Notice that we only return 10 models ordered by update time, but you can visit other models by setting the $skip and the $top parameters in the request URL. For example, if your request URL is https://{endpoint}/anomalydetector/v1.1-preview/multivariate/models?$skip=10&$top=20, then we will skip the latest 10 models and return the next 20 models.
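A small sketch of building and issuing the paged list request (the helper names are our own; the path matches the v1.1-preview endpoint used elsewhere in this tutorial, and `endpoint`/`api_key` are your resource's values):

```python
import http.client
import json

def list_models_path(skip=0, top=10):
    """Build the request path for the paged model-list endpoint."""
    return f"/anomalydetector/v1.1-preview/multivariate/models?$skip={skip}&$top={top}"

def list_models(endpoint, api_key, skip=0, top=10):
    """Call the list-models API; endpoint is the host name of your resource."""
    conn = http.client.HTTPSConnection(endpoint)
    conn.request("GET", list_models_path(skip, top),
                 headers={"Ocp-Apim-Subscription-Key": api_key})
    body = json.loads(conn.getresponse().read())
    conn.close()
    return body

print(list_models_path(10, 20))  # skips the latest 10 models, returns the next 20
```

To page through all models, keep following the `nextLink` field of each response until it is empty.
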
A sample response is
{
"models": [
{
"createdTime":"2020-12-01T09:43:45Z",
"displayName":"DevOps-Test",
"lastUpdatedTime":"2020-12-01T09:46:13Z",
"modelId":"b4c1616c-33b9-11eb-824e-0242ac110002",
"status":"READY",
"variablesCount":18
},
{
"createdTime":"2020-12-01T09:43:30Z",
"displayName":"DevOps-Test",
"lastUpdatedTime":"2020-12-01T09:45:10Z",
"modelId":"ab9d3e30-33b9-11eb-a3f4-0242ac110002",
"status":"READY",
"variablesCount":18
}
],
"currentCount": 1,
"maxCount": 50,
"nextLink": "<link to more models>"
}
The response contains 4 fields: `models`, `currentCount`, `maxCount`, and `nextLink`.

- `models` contains the created time, last updated time, model ID, display name, variable counts, and the status of each model.
- `currentCount` contains the number of trained multivariate models.
- `maxCount` is the maximum number of models supported by this Anomaly Detector resource.
- `nextLink` could be used to fetch more models.
Get models by model ID
This page describes the request URL to query model information by model ID. A sample response looks like this
{
"modelId": "45aad126-aafd-11ea-b8fb-d89ef3400c5f",
"createdTime": "2020-06-30T00:00:00Z",
"lastUpdatedTime": "2020-06-30T00:00:00Z",
"modelInfo": {
"slidingWindow": 300,
"alignPolicy": {
"alignMode": "Outer",
"fillNAMethod": "Linear",
"paddingValue": 0
},
"source": "<TRAINING_ZIP_FILE_LOCATED_IN_AZURE_BLOB_STORAGE_WITH_SAS>",
"startTime": "2019-04-01T00:00:00Z",
"endTime": "2019-04-02T00:00:00Z",
"displayName": "Devops-MultiAD",
"status": "READY",
"errors": [],
"diagnosticsInfo": {
"modelState": {
"epochIds": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
"trainLosses": [0.6291328072547913, 0.1671326905488968, 0.12354248017072678, 0.1025966405868533,
0.0958492755889896, 0.09069952368736267,0.08686016499996185, 0.0860302299260931,
0.0828735455870684, 0.08235538005828857],
"validationLosses": [1.9232804775238037, 1.0645641088485718, 0.6031560301780701, 0.5302737951278687,
0.4698025286197664, 0.4395163357257843, 0.4182931482799006, 0.4057914316654053,
0.4056498706340729, 0.3849248886108984],
"latenciesInSeconds": [0.3398594856262207, 0.3659665584564209, 0.37360644340515137,
0.3513407707214355, 0.3370304107666056, 0.31876277923583984,
0.3283309936523475, 0.3503587245941162, 0.30800247192382812,
0.3327946662902832]
},
"variableStates": [
{
"variable": "ad_input",
"filledNARatio": 0,
"effectiveCount": 1441,
"startTime": "2019-04-01T00:00:00Z",
"endTime": "2019-04-02T00:00:00Z",
"errors": []
},
{
"variable": "ad_ontimer_output",
"filledNARatio": 0,
"effectiveCount": 1441,
"startTime": "2019-04-01T00:00:00Z",
"endTime": "2019-04-02T00:00:00Z",
"errors": []
},
// More variables
]
}
}
}
You will receive more detailed information about the queried model. The response contains meta information about the model, its training parameters, and diagnostic information. Diagnostic Information is useful for debugging and tracing training progress.
- `epochIds` indicates how many epochs the model has been trained, out of a total of 100 epochs. For example, if the model is still in training status, `epochIds` might be `[10, 20, 30, 40, 50]`, which means that it has completed its 50th training epoch and is halfway through.
- `trainLosses` and `validationLosses` are used to check whether the optimization progress converges, in which case the two losses should decrease gradually.
- `latenciesInSeconds` contains the time cost for each epoch and is recorded every 10 epochs. In this example, the 10th epoch takes approximately 0.34 seconds. This would be helpful to estimate the completion time of training.
- `variableStates` summarizes information about each variable. It is a list ranked by `filledNARatio` in descending order. It tells how many data points are used for each variable, and `filledNARatio` tells how many points are missing. Usually we need to reduce `filledNARatio` as much as possible. Too many missing data points will deteriorate model accuracy.
- Errors during data processing will be included in the `errors` field.
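Because training runs in the background, a common pattern is to poll the model status until it reaches a terminal state. A sketch (the helper and its defaults are our own; `get_status` is any callable that fetches the model's current status string, for example by wrapping the get-model-by-ID call above):

```python
import time

def wait_until_done(get_status, timeout=3600, interval=30):
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("READY", "FAILED"):
            return status
        time.sleep(interval)
    raise TimeoutError("model did not finish training within the timeout")
```

In a real application, `get_status` would issue the GET request described above and return the `status` field of the response.
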
5. Inference with MVAD
To perform inference, simply provide the blob source to the zip file containing the inference data, the start time, and end time.
Inference is also asynchronous, so the results are not returned immediately. Notice that you need to save the link to the results from the response header, which contains the resultId, so that you know where to get the results afterwards.
Failures are usually caused by model issues or data issues. You cannot perform inference if the model is not ready or the data link is invalid. Make sure that the training data and inference data are consistent, meaning they should be exactly the same variables but with different timestamps. More variables, fewer variables, or inference with a different set of variables will not pass the data verification phase and errors will occur. Data verification is deferred, so you will get error messages only when you query the results.
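For example, a minimal way to pull the resultId out of the results link from the response header (the URL shown is illustrative; the only assumption is that the link ends with the resultId):

```python
def result_id_from_location(location: str) -> str:
    """Extract the trailing resultId from the results link returned in the response header."""
    return location.rstrip("/").rsplit("/", 1)[-1]

link = "https://{endpoint}/anomalydetector/v1.1-preview/multivariate/results/663884e6-b117-11ea-b3de-0242ac130004"
print(result_id_from_location(link))  # 663884e6-b117-11ea-b3de-0242ac130004
```
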
6. Get inference results
You need the resultId to get results. resultId is obtained from the response header when you submit the inference request. This page contains instructions to query the inference results.
A sample response looks like this
{
"resultId": "663884e6-b117-11ea-b3de-0242ac130004",
"summary": {
"status": "READY",
"errors": [],
"variableStates": [
{
"variable": "ad_input",
"filledNARatio": 0,
"effectiveCount": 26,
"startTime": "2019-04-01T00:00:00Z",
"endTime": "2019-04-01T00:25:00Z",
"errors": []
},
{
"variable": "ad_ontimer_output",
"filledNARatio": 0,
"effectiveCount": 26,
"startTime": "2019-04-01T00:00:00Z",
"endTime": "2019-04-01T00:25:00Z",
"errors": []
},
// more variables
],
"setupInfo": {
"source": "https://aka.ms/AnomalyDetector/MVADSampleData",
"startTime": "2019-04-01T00:15:00Z",
"endTime": "2019-04-01T00:40:00Z"
}
},
"results": [
{
"timestamp": "2019-04-01T00:15:00Z",
"errors": [
{
"code": "InsufficientHistoricalData",
"message": "historical data is not enough."
}
]
},
// more results
{
"timestamp": "2019-04-01T00:20:00Z",
"value": {
"contributors": [],
"isAnomaly": false,
"severity": 0,
"score": 0.17805261260751692
}
},
// more results
{
"timestamp": "2019-04-01T00:27:00Z",
"value": {
"contributors": [
{
"contributionScore": 0.0007775013367514271,
"variable": "ad_ontimer_output"
},
{
"contributionScore": 0.0007989604079048129,
"variable": "ad_series_init"
},
{
"contributionScore": 0.0008900927229851369,
"variable": "ingestion"
},
{
"contributionScore": 0.008068144477478554,
"variable": "cpu"
},
{
"contributionScore": 0.008222036467507165,
"variable": "data_in_speed"
},
{
"contributionScore": 0.008674941549594993,
"variable": "ad_input"
},
{
"contributionScore": 0.02232242629793674,
"variable": "ad_output"
},
{
"contributionScore": 0.1583773213660846,
"variable": "flink_last_ckpt_duration"
},
{
"contributionScore": 0.9816531517495176,
"variable": "data_out_speed"
}
],
"isAnomaly": true,
"severity": 0.42135109874230336,
"score": 1.213510987423033
}
},
// more results
]
}
The response contains the result status, variable information, inference parameters, and inference results.
- `variableStates` lists the information of each variable in the inference request.
- `setupInfo` is the request body submitted for this inference.
- `results` contains the detection results. There are three typical types of detection results:
  - Error code `InsufficientHistoricalData`. This usually happens only with the first few timestamps because the model inferences data in a window-based manner and it needs historical data to make a decision. For the first few timestamps, there is insufficient historical data, so inference cannot be performed on them. In this case, the error message can be ignored.
  - `"isAnomaly": false` indicates the current timestamp is not an anomaly. `severity` indicates the relative severity of the anomaly and for normal data it is always 0. `score` is the raw output of the model on which it makes its decision, and it could be non-zero even for normal data points.
  - `"isAnomaly": true` indicates an anomaly at the current timestamp. `severity` indicates the relative severity of the anomaly and for abnormal data it is always greater than 0. `score` is the raw output of the model on which it makes its decision; `severity` is a value derived from `score`. Every data point has a `score`. `contributors` is a list containing the contribution score of each variable. Higher contribution scores indicate a higher possibility of being the root cause. This list is often used for interpreting anomalies as well as diagnosing root causes.
Note
A common pitfall is taking all data points with isAnomaly=true as anomalies. That may end up with too many false positives.
You should use both isAnomaly and severity (or score) to sift out anomalies that are not severe and (optionally) use grouping to check the duration of the anomalies to suppress random noise.
Please refer to the FAQ in the best practices document for the difference between severity and score.
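As a sketch of such filtering, the snippet below keeps only flagged timestamps whose severity passes a threshold (the `results` shape matches the sample response above; the 0.3 threshold is only an example and should be tuned for your data):

```python
def significant_anomalies(results, min_severity=0.3):
    """Keep timestamps with isAnomaly=true and severity above the threshold; error entries are skipped."""
    picked = []
    for r in results:
        value = r.get("value", {})  # entries with only "errors" have no "value"
        if value.get("isAnomaly") and value.get("severity", 0) >= min_severity:
            picked.append(r["timestamp"])
    return picked

results = [
    {"timestamp": "2019-04-01T00:15:00Z", "errors": [{"code": "InsufficientHistoricalData"}]},
    {"timestamp": "2019-04-01T00:20:00Z", "value": {"isAnomaly": False, "severity": 0, "score": 0.178}},
    {"timestamp": "2019-04-01T00:27:00Z", "value": {"isAnomaly": True, "severity": 0.421, "score": 1.214}},
]
print(significant_anomalies(results))  # ['2019-04-01T00:27:00Z']
```
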