Use cases for Speaker Recognition

What is a Transparency Note?

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it's deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance.

Microsoft provides Transparency Notes to help you understand how our AI technology works. They include the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.

Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice. To find out more, see Microsoft's AI principles.

Introduction to Speaker Recognition

Speaker Recognition is an AI service that can identify an individual speaking in an audio clip. The human voice has unique characteristics that can be associated with an individual. Speaker Recognition can recognize speakers by comparing the unique voice characteristics of incoming speech with registered voice signatures. For more information, see Speaker Recognition.

The basics of Speaker Recognition

Speaker Recognition capabilities are provided through two APIs:

  • Speaker Verification allows you to determine scenarios like “Is this Anna speaking?”. It verifies an individual’s identity by comparing the voice characteristics of their speech with the registered voice signature of the claimed identity.

    Diagram that shows how Speaker Verification works.

  • Speaker Identification allows you to determine scenarios like “Who is speaking, Anna, Isha, or Jing?”. It attributes speech to individual speakers within a group of enrolled individuals.

Term Definition
Voice signature Also called template or voiceprint. It's a numeric vector that represents an individual’s voice characteristics, extracted from audio recordings of a person speaking. The original audio recordings can't be interpreted or reconstructed based on a voice signature. Voice signature quality is a key determinant of how accurate your results are.
Enrollment Enrollment is the process of creating voice signatures from the audio files of individuals’ speech, so they can be recognized at a later time. When a person is enrolled to a recognition system, that person's template is also associated with a primary identifier1 that will be used to determine which voice signature to compare with the recognition speech input.
Recognition During Recognition, audio of a person speaking is compared against one or more voice signatures. The process is called verification if the audio is compared with one specific voice signature. It's called identification if the audio is compared with more than one voice signatures to identify the speaker.
Text-dependent speaker verification Also called active verification. The speaker chooses a specific passphrase (set of words) to be spoken during both enrollment and verification phases. During verification, the system recognizes the passphrase text and compares it with the enrollment passphrase. The result is based on both voice signature match and passphrase match.
Passphrase signature In the enrollment audio of text-dependent APIs, the chosen passphrase is recognized to text. Then both the voice signature and the passphrase text are stored. The unique passphrase, such as "My voice is my passport verify me," is called a passphrase signature. The passphrase signature is also compared with the text of speech audio input during recognition.
Text-independent speaker verification Also called passive verification. Speakers aren't required to speak pre-defined words, instead speakers can use any phrase. The voice signature is used during verification, but the speech content isn't considered. During recognition, speakers don't necessarily need to use the same phrase they did during enrollment. Longer audio recordings are recommended during enrollment to achieve reliable performance
Activation phrase It is a predefined phrase that the speaker has to read in the beginning of the enrollment when using text-independent APIs. Although speakers can use any phrases during the recognition process in the text-independent speaker verification or identification, Microsoft requires the speaker to read this activation phrase first. After the activation step, the speaker can continue enrollment by using any phrases.

1 Developers can associate the GUID (globally unique identifiers) generated by Microsoft with an individual’s primary identifier to support verification of that individual. Speaker Recognition service doesn't store primary identifiers, such as customer IDs, with voice signatures. Instead, Microsoft associates stored voice signatures with random GUIDs.

Use cases

Here are some examples of when you might use Speaker Recognition:

  • Customer identity verification: Call center or interactive voice response systems can use speaker verification as an extra security measure, when coupled with a phone number, PIN, or other type of authentication data. This helps to verify a customer’s identity when they request access to information or to conduct transactions.
  • Smart device personalization: Voice-enabled interaction devices, such as smart vehicles or smart speakers, can use Speaker Recognition to provide personalized content. For example, you can play movies or music in response to voice commands by using the text-independent Speaker Verification API.
  • Multi-factor authentication: A multi-factor authentication system can use voice as one factor to enhance security. For example, you could approve or deny employee access to secured facility by using both the Azure Face service and the text-dependent Speaker Verification API.
  • Speaker identification for meetings: The Speaker Identification API can be used to identify individual speakers as part of meeting transcription. In the transcription of audio from a meeting, speech can be attributed to the correct speaker or “guest” if no match is found. In this scenario, input audio must be separated by speaker before using Speaker Identification API.

Considerations when choosing a use case

Consider the following factors when you choose a use case:

  • Avoid using for recognizing multiple speakers in a speech input: Speaker Recognition can't recognize more than one person in a single speech input. The service is intended to take in one person’s speech input and compare it to one or more voice signatures.
  • Avoid using as a sole factor in authentication where security is important: Speaker Recognition is not designed to differentiate a synthesized voice or recordings of a voice from a live human speaker. Carefully consider scenarios with a risk of spoofing. Speaker Recognition shouldn't be used as the sole factor in authenticating a user in applications where security is the goal, such as access to financial information or physical security.
  • Ensure active enrollment of the users: Voice signatures contain speakers' biometric voiceprint characteristics. To help prevent misuse of this service, Microsoft requires the customer to have active enrollment of users of text-independent APIs through its activation step. The activation step indicates the speakers' active participation in creating their voice signatures and is intended to help avoid the scenario in which speakers are enrolled without their awareness. Be advised that this mandatory activation step does not alleviate customer’s legal obligations to ensure it has received all necessary permissions and consents from its users for the purposes of its processing, retention, and intended uses of speaker signatures created.
  • Limit the number of candidates for speaker identification: Speaker Identification API can only take up to 50 candidates to compare the speech input against in an API call.

Next steps