Get facial pose events

Note

Viseme events are currently available only for en-US (English, United States) neural voices.

A viseme is the visual description of a phoneme in spoken language. It defines the position of the face and mouth while a person is speaking. Each viseme depicts the key facial poses for a specific set of phonemes. Visemes can be used to control the movement of 2D and 3D avatar models, so that mouth movements match the synthesized speech.

Visemes make avatars easier to use and control. Using visemes, you can:

  • Create an animated virtual voice assistant for intelligent kiosks, building multi-mode integrated services for your customers.
  • Build immersive news broadcasts and improve audience experiences with natural face and mouth movements.
  • Generate more interactive gaming avatars and cartoon characters that can speak with dynamic content.
  • Make more effective language teaching videos that help language learners to understand the mouth behavior of each word and phoneme.
  • Help people with hearing impairment pick up sounds visually and "lip-read" speech content that shows visemes on an animated face.

For a demonstration, see the viseme introduction video.

Azure neural TTS can produce visemes with speech

A neural voice turns input text or SSML (Speech Synthesis Markup Language) into synthesized speech. The speech audio output can be accompanied by viseme IDs and their offset timestamps. Each viseme ID represents a pose in observed speech, such as the position of the lips, jaw, and tongue when producing a particular phoneme. Using a 2D or 3D rendering engine, you can use these viseme events to animate your avatar.

The overall workflow of viseme is depicted in the flowchart below.

(Flowchart: the overall workflow of viseme)

Parameter | Description
Viseme ID | An integer that specifies a viseme. For English (United States), 22 different visemes depict the mouth shapes for specific sets of phonemes. There is no one-to-one correspondence between visemes and phonemes: often several phonemes map to a single viseme, because they look the same on the face when produced, such as s and z. See the table that maps viseme IDs to phonemes later in this article.
Audio offset | The start time of each viseme, in ticks (1 tick = 100 nanoseconds).
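
Because offsets arrive in ticks, a common first step is to convert them to milliseconds and collect the events into a timeline that an animation layer can replay. The following minimal Python sketch does just that; the viseme_timeline list and the on_viseme helper are illustrative names, not part of the Speech SDK, while evt.audio_offset and evt.viseme_id match the Python SDK viseme event used later in this article.

# A minimal sketch, not part of the Speech SDK: buffer viseme events as
# (offset_ms, viseme_id) pairs for later animation.
TICKS_PER_MS = 10_000  # 1 tick = 100 nanoseconds, so 10,000 ticks = 1 millisecond

viseme_timeline = []  # [(offset_ms, viseme_id), ...]

def on_viseme(evt):
    # evt.audio_offset (in ticks) and evt.viseme_id follow the Python
    # Speech SDK viseme event shown later in this article.
    offset_ms = evt.audio_offset / TICKS_PER_MS
    viseme_timeline.append((offset_ms, evt.viseme_id))

You could connect on_viseme to viseme_received, as shown in the Python snippet later in this article, and then replay the buffered timeline against the synthesized audio.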

Get viseme events with the Speech SDK

To get viseme events with your synthesized speech, subscribe to the VisemeReceived event in the Speech SDK. The following snippets show how to subscribe to the viseme event in C#, C++, Java, Python, JavaScript, and Objective-C.

C#

using (var synthesizer = new SpeechSynthesizer(speechConfig, audioConfig))
{
    // Subscribes to viseme received event
    synthesizer.VisemeReceived += (s, e) =>
    {
        Console.WriteLine($"Viseme event received. Audio offset: " +
            $"{e.AudioOffset / 10000}ms, viseme id: {e.VisemeId}.");
    };

    var result = await synthesizer.SpeakSsmlAsync(ssml);
}

C++

auto synthesizer = SpeechSynthesizer::FromConfig(speechConfig, audioConfig);

// Subscribes to viseme received event
synthesizer->VisemeReceived += [](const SpeechSynthesisVisemeEventArgs& e)
{
    cout << "viseme event received. "
        // The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
        << "Audio offset: " << e.AudioOffset / 10000 << "ms, "
        << "viseme id: " << e.VisemeId << "." << endl;
};

auto result = synthesizer->SpeakSsmlAsync(ssml).get();

Java

SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);

// Subscribes to viseme received event
synthesizer.VisemeReceived.addEventListener((o, e) -> {
    // The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
    System.out.print("Viseme event received. Audio offset: " + e.getAudioOffset() / 10000 + "ms, ");
    System.out.println("viseme id: " + e.getVisemeId() + ".");
});

SpeechSynthesisResult result = synthesizer.SpeakSsmlAsync(ssml).get();

Python

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

# Subscribes to viseme received event
speech_synthesizer.viseme_received.connect(lambda evt: print(
    "Viseme event received: audio offset: {}ms, viseme id: {}.".format(evt.audio_offset / 10000, evt.viseme_id)))

result = speech_synthesizer.speak_ssml_async(ssml).get()

JavaScript

var synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);

// Subscribes to viseme received event
synthesizer.visemeReceived = function (s, e) {
    window.console.log("(Viseme), Audio offset: " + e.audioOffset / 10000 + "ms. Viseme ID: " + e.visemeId);
}

synthesizer.speakSsmlAsync(ssml);

Objective-C

SPXSpeechSynthesizer *synthesizer =
    [[SPXSpeechSynthesizer alloc] initWithSpeechConfiguration:speechConfig
                                           audioConfiguration:audioConfig];

// Subscribes to viseme received event
[synthesizer addVisemeReceivedEventHandler: ^ (SPXSpeechSynthesizer *synthesizer, SPXSpeechSynthesisVisemeEventArgs *eventArgs) {
    NSLog(@"Viseme event received. Audio offset: %fms, viseme id: %lu.", eventArgs.audioOffset/10000., eventArgs.visemeId);
}];

[synthesizer speakSsml:ssml];

Here is an example of the viseme output.

(Viseme), Viseme ID: 1, Audio offset: 200ms.
(Viseme), Viseme ID: 5, Audio offset: 850ms.
...
(Viseme), Viseme ID: 13, Audio offset: 2350ms.

After you obtain the viseme output, you can use these events to drive character animation. You can build your own characters and automatically animate them.

For 2D characters, you can design a character that suits your scenario and use Scalable Vector Graphics (SVG) for each viseme ID to get a time-based face position. With the temporal tags provided in viseme events, these well-designed SVGs are processed with smoothing modifications to provide robust animation to users. For example, the following illustration shows a red-lipped character designed for language learning.

(Illustration: 2D render example)
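
As a rough sketch of this 2D approach (in Python, with a hypothetical file-naming scheme and a caller-supplied show_frame renderer standing in for your own assets and engine), you might select one pre-drawn SVG frame per viseme ID at its audio offset:

import time

# Hypothetical naming scheme for pre-drawn SVG frames, one per viseme ID.
def svg_for_viseme(viseme_id):
    return f"assets/visemes/viseme_{viseme_id}.svg"

def play_2d_animation(viseme_timeline, show_frame):
    """viseme_timeline: [(offset_ms, viseme_id), ...] sorted by offset.
    show_frame: engine-specific callable that renders one SVG file."""
    start = time.monotonic()
    for offset_ms, viseme_id in viseme_timeline:
        # Sleep until this viseme's start time relative to playback start.
        delay = offset_ms / 1000 - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        show_frame(svg_for_viseme(viseme_id))

In practice, you would also apply the smoothing mentioned above rather than switching frames abruptly.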

For 3D characters, think of the characters as string puppets. The puppet master pulls the strings from one state to another, and the laws of physics do the rest, driving the puppet to move fluidly. The viseme output acts as a puppet master that provides an action timeline. The animation engine defines the physical laws of motion. By interpolating frames with easing algorithms, the engine can generate high-quality animations.
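
The easing idea can be sketched independently of any particular engine. The Python snippet below blends between two key poses with a smoothstep curve; the pose vectors (for example, blend-shape weights) and their values are placeholders for whatever your animation engine consumes.

def smoothstep(t):
    # Ease-in/ease-out curve for t in [0, 1].
    return t * t * (3 - 2 * t)

def interpolate_pose(pose_a, pose_b, t):
    """Blend two key poses (equal-length lists of blend-shape weights)
    at eased parameter t in [0, 1]."""
    e = smoothstep(t)
    return [a + (b - a) * e for a, b in zip(pose_a, pose_b)]

# Example: five in-between frames from one hypothetical viseme pose to the next.
pose_open = [0.0, 0.8, 0.1]    # placeholder weights for an open-mouth pose
pose_closed = [0.9, 0.0, 0.0]  # placeholder weights for a closed-mouth pose
frames = [interpolate_pose(pose_open, pose_closed, i / 4) for i in range(5)]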

Map phonemes to visemes

Visemes vary by language. Each language has a set of visemes that correspond to its specific phonemes. The following table shows the correspondence between International Phonetic Alphabet (IPA) phonemes and viseme IDs for English (United States).

IPA | Example | Viseme ID
i | eat | 6
ɪ | if | 6
eɪ | ate | 4
ɛ | every | 4
æ | active | 1
ɑ | obstinate | 2
ɔ | cause | 3
ʊ | book | 4
oʊ | old | 8
u | Uber | 7
ʌ | uncle | 1
aɪ | ice | 11
aʊ | out | 9
ɔɪ | oil | 10
ju | Yuma | [6, 7]
ə | ago | 1
ɪɹ | ears | [6, 13]
ɛɹ | airplane | [4, 13]
ʊɹ | cure | [4, 13]
aɪ(ə)ɹ | Ireland | [11, 13]
aʊ(ə)ɹ | hours | [9, 13]
ɔɹ | orange | [3, 13]
ɑɹ | artist | [2, 13]
ɝ | earth | [5, 13]
ɚ | allergy | [1, 13]
w | with, suede | 7
j | yard, few | 6
p | put | 21
b | big | 21
t | talk | 19
d | dig | 19
k | cut | 20
g | go | 20
m | mat, smash | 21
n | no, snow | 19
ŋ | link | 20
f | fork | 18
v | value | 18
θ | thin | 17
ð | then | 17
s | sit | 15
z | zap | 15
ʃ | she | 16
ʒ | Jacques | 16
h | help | 12
tʃ | chin | 16
dʒ | joy | 16
l | lid, glad | 14
ɹ | red, bring | 13
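
If you do your own phoneme-level processing (for example, when you author SSML with phoneme tags), you can turn a phoneme sequence into viseme IDs with a simple lookup built from the table above. The dictionary below is a small hand-copied subset for illustration only; it is not an SDK API.

# Hand-copied subset of the en-US table above; extend it as needed.
PHONEME_TO_VISEME = {
    "i": [6], "ɪ": [6], "eɪ": [4], "æ": [1],
    "s": [15], "z": [15], "tʃ": [16], "dʒ": [16],
    "ju": [6, 7], "ɪɹ": [6, 13],
}

def visemes_for_phonemes(phonemes):
    """Flatten a phoneme sequence into the viseme IDs it maps to."""
    ids = []
    for p in phonemes:
        ids.extend(PHONEME_TO_VISEME.get(p, []))
    return ids

print(visemes_for_phonemes(["s", "i"]))  # [15, 6]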

Next steps