Redact faces with Azure Media Analytics

Article
10/05/2022

Overview

Azure Media Redactor is an Azure Media Analytics media processor (MP) that offers scalable face redaction in the cloud. Face redaction enables you to modify your video in order to blur faces of selected individuals. You may want to use the face redaction service in public safety and news media scenarios. A few minutes of footage that contains multiple faces can take hours to redact manually, but with this service the face redaction process will require just a few simple steps.

This article gives details about Azure Media Redactor and shows how to use it with Media Services SDK for .NET.

Face redaction modes

Face redaction works by detecting faces in every frame of video and tracking each face object both forward and backward in time, so that the same individual can be blurred from other angles as well. The automated redaction process is complex and does not always produce the desired output. For this reason, Media Analytics provides a couple of ways to modify the final output.

In addition to a fully automatic mode, there is a two-pass workflow that lets you select or deselect detected faces via a list of IDs. To make arbitrary per-frame adjustments, the MP also uses a metadata file in JSON format. This workflow is split into Analyze and Redact modes. You can also run both tasks in a single pass within one job; this mode is called Combined.

Note

The Face Detector media processor was deprecated as of June 2020 (see Azure Media Services legacy components). Consider using the Azure Media Services v3 API. There is no planned replacement for the China region.

Combined mode

This mode produces a redacted MP4 automatically, without any manual input.

Stage	File Name	Notes
Input asset	foo.bar	Video in WMV, MOV, or MP4 format
Input config	Job configuration preset	{'version':'1.0', 'options': {'mode':'combined'}}
Output asset	foo_redacted.mp4	Video with blurring applied

Analyze mode

The Analyze pass of the two-pass workflow takes a video input and produces a JSON file of face locations and a JPG image of each detected face.

Stage	File Name	Notes
Input asset	foo.bar	Video in WMV, MOV, or MP4 format
Input config	Job configuration preset	{'version':'1.0', 'options': {'mode':'analyze'}}
Output asset	foo_annotations.json	Annotation data of face locations in JSON format. This can be edited by the user to modify the blurring bounding boxes. See sample below.
Output asset	foo_thumb%06d.jpg [foo_thumb000001.jpg, foo_thumb000002.jpg]	A cropped jpg of each detected face, where the number indicates the labelId of the face

Output example

{
  "version": 1,
  "timescale": 24000,
  "offset": 0,
  "framerate": 23.976,
  "width": 1280,
  "height": 720,
  "fragments": [
    {
      "start": 0,
      "duration": 48048,
      "interval": 1001,
      "events": [
        [],
        [],
        [],
        [],
        [],
        [],
        [],
        [],
        [],
        [],
        [],
        [],
        [],
        [
          {
            "index": 13,
            "id": 1138,
            "x": 0.29537,
            "y": -0.18987,
            "width": 0.36239,
            "height": 0.80335
          },
          {
            "index": 13,
            "id": 2028,
            "x": 0.60427,
            "y": 0.16098,
            "width": 0.26958,
            "height": 0.57943
          }
        ],

    ... truncated
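
Before choosing which faces to redact, it can help to list the face IDs that appear in the annotations file. The following Python sketch does that for a parsed annotations document. The function name `count_face_ids` and the inline `SAMPLE` data are illustrative assumptions, not part of the service; the sample is a shortened, complete document in the same shape as the truncated output above.

```python
import json
from collections import Counter


def count_face_ids(annotations: dict) -> Counter:
    """Return how many event entries each face ID appears in."""
    counts = Counter()
    for fragment in annotations.get("fragments", []):
        # Each entry in "events" is a list of faces for one interval;
        # an empty list means no faces were detected in that interval.
        for faces in fragment.get("events", []):
            for face in faces:
                counts[face["id"]] += 1
    return counts


# Hypothetical, shortened sample shaped like the Analyze output.
SAMPLE = json.loads("""
{
  "version": 1, "timescale": 24000, "offset": 0, "framerate": 23.976,
  "width": 1280, "height": 720,
  "fragments": [
    {"start": 0, "duration": 48048, "interval": 1001,
     "events": [
       [],
       [{"index": 1, "id": 1138, "x": 0.29, "y": 0.18, "width": 0.36, "height": 0.80},
        {"index": 1, "id": 2028, "x": 0.60, "y": 0.16, "width": 0.26, "height": 0.57}],
       [{"index": 2, "id": 1138, "x": 0.30, "y": 0.18, "width": 0.36, "height": 0.80}]
     ]}
  ]
}
""")

if __name__ == "__main__":
    for face_id, occurrences in sorted(count_face_ids(SAMPLE).items()):
        print(face_id, occurrences)
```

With a real file, pass the result of `json.load` on foo_annotations.json instead of `SAMPLE`.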

Redact mode

The second pass of the workflow takes a larger number of inputs that must be combined into a single asset.

This includes a list of IDs to blur, the original video, and the annotations JSON. This mode uses the annotations to apply blurring on the input video.

The output from the Analyze pass does not include the original video. The video needs to be uploaded into the input asset for the Redact mode task and selected as the primary file.

Stage	File Name	Notes
Input asset	foo.bar	Video in WMV, MOV, or MP4 format. The same video used in the Analyze pass.
Input asset	foo_annotations.json	Annotations metadata file from the Analyze pass, with optional modifications.
Input asset	foo_IDList.txt (Optional)	Optional newline-separated list of face IDs to redact. If left blank, all faces are blurred.
Input config	Job configuration preset	{'version':'1.0', 'options': {'mode':'redact'}}
Output asset	foo_redacted.mp4	Video with blurring applied based on annotations

Example output

This is the output from an IDList with one ID selected.

Example foo_IDList.txt

1
2
3
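
An ID list like the one above can be generated rather than typed by hand. This is a minimal Python sketch; `format_id_list` is a hypothetical helper, and it assumes the file is plain text with one decimal ID per line, as in the example.

```python
def format_id_list(face_ids) -> str:
    """Return newline-separated face IDs, de-duplicated and sorted."""
    unique_ids = sorted({int(face_id) for face_id in face_ids})
    return "\n".join(str(face_id) for face_id in unique_ids)


if __name__ == "__main__":
    # Writing the result to foo_IDList.txt is left to the caller.
    print(format_id_list([3, 1, 2, 1]))
```

An empty input yields an empty string. According to the table above, a blank list causes all faces to be blurred, so callers may want to treat that case explicitly.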

Blur types

In Combined or Redact mode, you can choose among five blur types via the JSON input configuration: Low, Med, High, Box, and Black. Med is the default.

You can find samples of the blur types below.

Example JSON

{
    'version':'1.0',
    'options': {
        'Mode': 'Combined',
        'BlurType': 'High'
    }
}
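
A preset like this can also be generated programmatically, which makes it easier to reject an unsupported blur type before submitting a job. The sketch below is a Python illustration under stated assumptions: `build_preset` is a hypothetical helper, the key casing mirrors the example above, and the output is standard double-quoted JSON (whether the MP requires the single-quoted form shown above is not stated here).

```python
import json

# Blur types and modes named in this article.
BLUR_TYPES = ("Low", "Med", "High", "Box", "Black")
BLUR_MODES = ("Combined", "Redact")


def build_preset(mode: str, blur_type: str = "Med") -> str:
    """Build a job configuration preset string for a blurring mode."""
    if mode not in BLUR_MODES:
        raise ValueError(f"blur types apply only to modes {BLUR_MODES}, got {mode!r}")
    if blur_type not in BLUR_TYPES:
        raise ValueError(f"unknown blur type {blur_type!r}; expected one of {BLUR_TYPES}")
    preset = {
        "version": "1.0",
        "options": {"Mode": mode, "BlurType": blur_type},
    }
    return json.dumps(preset, indent=4)


if __name__ == "__main__":
    print(build_preset("Combined", "High"))
```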

Low

Med

High

Box

Black

Elements of the output JSON file

The Redaction MP provides high precision face location detection and tracking that can detect up to 64 human faces in a video frame. Frontal faces provide the best results, while side faces and small faces (less than or equal to 24x24 pixels) are challenging.

The job produces a JSON output file that contains metadata about detected and tracked faces. The metadata includes coordinates indicating the location of each face, as well as a face ID number that tracks that individual. Face ID numbers can reset when the frontal face is lost or overlapped in the frame, so some individuals may be assigned multiple IDs.

The output JSON includes the following elements:

Root JSON elements

Element	Description
version	This refers to the version of the Video API.
timescale	"Ticks" per second of the video.
offset	This is the time offset for timestamps. In version 1.0 of Video APIs, this will always be 0. In future scenarios we support, this value may change.
width, height	The width and height of the output video frame, in pixels.
framerate	Frames per second of the video.

Fragments JSON elements

Element	Description
start	The start time of the first event in "ticks."
duration	The length of the fragment, in "ticks."
index	(Applies to Azure Media Redactor only) Defines the frame index of the current event.
interval	The interval of each event entry within the fragment, in "ticks."
events	Each event contains the faces detected and tracked within that time duration. It is an array of events. The outer array represents one interval of time. The inner array consists of 0 or more events that happened at that point in time. An empty bracket [] means no faces were detected.
id	The ID of the face that is being tracked. This number may inadvertently change if a face becomes undetected. A given individual should have the same ID throughout the overall video, but this cannot be guaranteed due to limitations in the detection algorithm (occlusion, etc.).
x, y	The upper left X and Y coordinates of the face bounding box in a normalized scale of 0.0 to 1.0. X and Y coordinates are always relative to landscape orientation, so if you have a portrait video (or upside-down, in the case of iOS), you'll have to transpose the coordinates accordingly.
width, height	The width and height of the face bounding box in a normalized scale of 0.0 to 1.0.
facesDetected	This is found at the end of the JSON results and summarizes the number of faces that the algorithm detected during the video. Because the IDs can be reset inadvertently if a face becomes undetected (e.g., the face goes off screen, looks away), this number may not always equal the true number of faces in the video.
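
To make the element descriptions concrete, the following Python sketch converts each face entry into a pixel bounding box and a timestamp in seconds. It is an illustration only: `faces_in_pixels` is a hypothetical helper, and it assumes that the n-th entry of `events` begins n intervals after the fragment's `start`, and that dividing ticks by `timescale` gives seconds. Both assumptions are inferred from the tables above rather than stated outright.

```python
def faces_in_pixels(annotations: dict) -> list:
    """Return one dict per face entry with pixel coordinates and a timestamp."""
    frame_width = annotations["width"]
    frame_height = annotations["height"]
    timescale = annotations["timescale"]  # ticks per second

    results = []
    for fragment in annotations.get("fragments", []):
        start = fragment["start"]
        interval = fragment.get("interval", 0)
        for position, faces in enumerate(fragment.get("events", [])):
            # Assumption: entry number `position` starts at start + position * interval.
            seconds = (start + position * interval) / timescale
            for face in faces:
                results.append({
                    "id": face["id"],
                    "seconds": seconds,
                    # Normalized values are scaled by the frame size in pixels.
                    "left": round(face["x"] * frame_width),
                    "top": round(face["y"] * frame_height),
                    "width": round(face["width"] * frame_width),
                    "height": round(face["height"] * frame_height),
                })
    return results


if __name__ == "__main__":
    # Hypothetical minimal document with one face in the second interval.
    example = {
        "timescale": 24000, "width": 1280, "height": 720,
        "fragments": [{
            "start": 0, "interval": 1001,
            "events": [[], [{"index": 1, "id": 7, "x": 0.5, "y": 0.25,
                             "width": 0.25, "height": 0.5}]],
        }],
    }
    print(faces_in_pixels(example))
```

The sketch does not clamp boxes to the frame or handle portrait and rotated video; those cases need the coordinate transposition described in the x, y row.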