Azure Media Services REST v3 API enables you to analyze audio and video content. To analyze your content, you create a Transform and submit a Job that uses one of these presets: AudioAnalyzerPreset or VideoAnalyzerPreset.
AudioAnalyzerPreset enables you to extract multiple audio insights from an audio or video file. The output includes a JSON file (with all the insights) and a VTT file for the audio transcript. This preset accepts a property that specifies the language of the input file in the form of a BCP47 string. The audio insights include:
Audio transcription – a transcript of the spoken words with timestamps. Multiple languages are supported.
Speaker indexing – a mapping of the speakers and the corresponding spoken words
Speech sentiment analysis – the output of sentiment analysis performed on the audio transcription
Keywords – keywords that are extracted from the audio transcription.
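As a sketch of how a Transform using this preset might be defined, the following Python snippet assembles the REST request body for a Transform whose output uses AudioAnalyzerPreset with the language property set. The `@odata.type` and `audioLanguage` names follow the v3 REST schema, but treat the exact shape as an assumption and check the current REST reference before relying on it.

```python
import json

# Hedged sketch of the request body for creating a Transform via the
# v3 REST API (PUT .../transforms/{transformName}?api-version=...).
# "@odata.type" and "audioLanguage" are assumed from the v3 schema.
transform_body = {
    "properties": {
        "description": "Extract audio insights",
        "outputs": [
            {
                "preset": {
                    "@odata.type": "#Microsoft.Media.AudioAnalyzerPreset",
                    # BCP47 tag for the language spoken in the input file
                    "audioLanguage": "en-US",
                }
            }
        ],
    }
}

print(json.dumps(transform_body, indent=2))
```

The same body, with the preset swapped, works for a video analysis Transform.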
VideoAnalyzerPreset enables you to extract multiple audio and video insights from a video file. The output includes a JSON file (with all the insights), a VTT file for the video transcript, and a collection of thumbnails. This preset also accepts a BCP47 string (representing the language of the video) as a property. The video insights include all the audio insights mentioned above and the following additional items:
Face tracking – the time during which faces are present in the video. Each face has a face ID and a corresponding collection of thumbnails
Visual text – the text that is detected via optical character recognition. The text is time-stamped and also used to extract keywords (in addition to the audio transcript)
Keyframes – a collection of keyframes that are extracted from the video
Visual content moderation – the portions of the video that have been flagged as adult or racy in nature
Annotation – the result of annotating the video based on a pre-defined object model
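Once a Transform with VideoAnalyzerPreset exists, analysis runs by submitting a Job against it. The following Python snippet is a hedged sketch of the Job request body; the `@odata.type` values follow the v3 REST schema, and the asset names are placeholders, not values the service defines.

```python
import json

# Hedged sketch of the request body for submitting a Job via the v3 REST API
# (PUT .../transforms/{transformName}/jobs/{jobName}?api-version=...).
# "my-input-asset" and "my-output-asset" are placeholder names.
job_body = {
    "properties": {
        "input": {
            "@odata.type": "#Microsoft.Media.JobInputAsset",
            "assetName": "my-input-asset",
        },
        "outputs": [
            {
                "@odata.type": "#Microsoft.Media.JobOutputAsset",
                # The analyzer writes insights.json, the VTT transcript,
                # and the thumbnail collection into this asset.
                "assetName": "my-output-asset",
            }
        ],
    }
}

print(json.dumps(job_body, indent=2))
```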
The output includes a JSON file (insights.json) with all the insights that were found in the video or audio. For example, each transcript line in the JSON contains the following elements:
The line ID.
The transcript itself.
The transcript language. Intended to support transcripts in which each line can be in a different language.
A list of time ranges where this line appeared. If the element is a transcript line, it has only one instance.
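To make the shape of these elements concrete, here is a hedged Python sketch that walks a transcript entry. The field names (`id`, `text`, `language`, `instances`) and the sample values are assumptions inferred from the elements listed above, not verbatim output of the service.

```python
import json

# Assumed shape of one transcript line in insights.json, based on the
# elements described above: line ID, text, language, and time ranges.
sample = json.loads("""
{
  "transcript": [
    {
      "id": 1,
      "text": "Hello and welcome.",
      "language": "en-US",
      "instances": [
        {"start": "0:00:00.5", "end": "0:00:02.1"}
      ]
    }
  ]
}
""")

for line in sample["transcript"]:
    # A transcript line is expected to carry exactly one time range.
    (span,) = line["instances"]
    print(f'[{span["start"]} - {span["end"]}] ({line["language"]}) {line["text"]}')
```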