Disclosure for voice talent

About this note

The goal of this note is to help voice talent understand the technology behind the text-to-speech (TTS) capabilities that their voices help create.

Microsoft is committed to designing AI responsibly. We hope this note will foster a greater shared understanding among tech builders, voice talent, and the general public about the intended and beneficial uses of this technology.

Key TTS terms

  • Voice font: A text-to-speech computer model that can mimic unique vocal characteristics of a target speaker.
  • Voice talent: Individuals whose voices are recorded and synthesized to create voice fonts.

Two broad categories of text-to-speech (TTS)

Standard TTS

How it works: The standard, or "traditional," method of TTS breaks down spoken language into phonetic snippets that can be remixed and matched using classical programming or statistical methods.

What to know about it: Standard TTS requires a large volume of voice data—in the range of 10,000 lines or more—to produce a more human-like sounding voice font. With fewer recorded lines, a standard TTS voice font will tend to sound more obviously robotic.

Examples of how Microsoft uses it:

  • Platform Voice is a feature of the Speech Service on Azure that offers "off-the-shelf" voice fonts for customers and developers to use. Platform Voices are also used in several Microsoft products including the Edge Browser, Narrator, Office, and Teams.
  • Custom Voice is a feature of the Speech Service on Azure that allows customers to build a synthetic voice font using recordings from a voice talent to represent a specific persona for a corporation/enterprise.
  • Windows system voices are included in the Windows operating system. They are used in several applications such as Narrator, Cortana, Edge Read Aloud, and Teams.

What to expect when recording: Contributing at least 6,000 lines to produce a good-quality voice font.

Neural TTS

How it works: Neural TTS synthesizes speech using deep neural networks that have "learned" the way phonetics are combined in natural human speech rather than using classical programming or statistical methods. In addition to the recordings of a target voice talent, neural TTS uses a source library that contains voice recordings from many different speakers.

What to know about it: Because of the way it synthesizes voices, neural TTS can produce styles of speech that weren't part of the original recordings, such as changes in tone of voice and affectation. Neural TTS voices sound fluid and are good at replicating the natural pauses, idiosyncrasies, and hesitancy that people express when they're speaking. Those who hear synthetic voices made via neural TTS tend to rate them closer to human speech than standard TTS voices.

Examples of how Microsoft uses it:

  • Platform Voice is a feature of the Speech Service on Azure that offers "off-the-shelf" voice fonts for customers and developers to use. Platform Voices are also used in several Microsoft products including the Edge Browser, Narrator, Office, and Teams.

  • Custom Neural Voice is a feature of the Speech Service on Azure that allows customers to create a one-of-a-kind custom synthetic voice font for their brand. The following capabilities are used to produce Custom Neural Voices:

    • Language transfer can express speech in a language different from that of the original voice recordings.
    • Style transfer can express speech in a style of speaking different from that of the original voice recordings. For example, a newscaster voice.
    • Voice transformation can express speech in a manner different from that of the original voice recordings. For example, modifying tone or pitch to create different character voices.
  • Other voices used in Microsoft's products and services, such as Cortana and Xiaoice.

What to expect when recording: Contributing at least 500 lines for a proof of concept voice font and at least 2,000 lines to produce a new voice font.
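
As a rough illustration only, the recording minimums quoted in this note can be written as a small lookup. This is a minimal sketch: the function name and the labels are hypothetical and are not part of any Microsoft product or API. The thresholds are the figures stated above (500 and 2,000 lines for neural TTS, 6,000 lines for standard TTS).

```python
from __future__ import annotations

# Minimum line counts as quoted in this note; the names are illustrative only.
NEURAL_PROOF_OF_CONCEPT = 500
NEURAL_NEW_VOICE_FONT = 2_000
STANDARD_GOOD_QUALITY = 6_000


def feasible_voice_fonts(recorded_lines: int) -> list[str]:
    """Return the kinds of voice font whose stated minimum is met."""
    if recorded_lines < 0:
        raise ValueError("recorded_lines must be non-negative")
    options = []
    if recorded_lines >= NEURAL_PROOF_OF_CONCEPT:
        options.append("neural proof of concept")
    if recorded_lines >= NEURAL_NEW_VOICE_FONT:
        options.append("neural voice font")
    if recorded_lines >= STANDARD_GOOD_QUALITY:
        options.append("standard voice font")
    return options


if __name__ == "__main__":
    for count in (300, 500, 2_000, 6_000):
        print(count, feasible_voice_fonts(count))
```

These are minimums rather than guarantees of quality. As noted earlier, a standard TTS voice font may need 10,000 lines or more to sound more human-like.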

Voice talent and synthetic voices: an evolving relationship

Recognizing the integral relationship between voice talent and synthetic voices, Microsoft interviewed voice talent to better understand their perspectives on new developments in the technology. Our 2019 study showed that voice talent saw potential benefit from the capabilities introduced by neural TTS, such as saving studio time to complete recording jobs, and adding capacity to complete more voice acting assignments. At the same time, there were varying degrees of awareness about how developments in TTS technology could potentially impact their profession.

Overall, voice talent expressed a desire for transparency and clarity about:

  • Limits on what their voice likeness could and could not be used to express.
  • The duration of allowable use of their voice likeness.
  • Potential impact on future recording opportunities.
  • The persona that would be associated with their voice likeness.

Synthetic voice in wider use

Traditionally, TTS systems were somewhat limited in adoption due to their robotic sound. Most were used to support accessibility, for example, as a screen reader for people who are Blind or have low vision. TTS has also been used by people with a speech impairment. For instance, the late Stephen Hawking used a TTS-generated voice.

Now, with increasingly realistic-sounding synthetic voices and the uptick in more familiar, everyday interactions between machines and humans, the uses of this technology have proliferated and expanded. TTS systems power voice assistants across an array of devices and applications. They read out news, search results, public service announcements, educational content, and much more.

Microsoft's approach to responsible use of TTS

Every day, people find new ways to apply TTS technology, and not all are for the good of individuals or society. If misused, believably human-sounding TTS voices, especially a custom voice that mimics a real person, could cause harm. For example, a misinformation campaign could become much more potent if it used the voice of a well-known public figure.

We recognize that there's no perfect way to prevent media from being modified or to unequivocally prove where it came from. Therefore, our approach to responsible use has focused on being transparent about neural TTS, evaluating appropriate use, and demonstrating our values through action.

We ask customers to do the following:

  • Obtain explicit written permission from voice talent to use that person's voice for the purpose of creating a custom voice.
  • Provide this document to voice talent so they can understand how TTS works, and how it may be used once they complete the audio recording process.
  • Share the intended contexts of use with voice talent so they are aware of who will hear their voice, in what scenarios, and whether/how people will be able to interact with it.
  • Ensure voice talent are aware that a voice font made from their recordings can say things they didn't specifically record in the studio.
  • Discuss whether there's anything they'd be uncomfortable with the voice font being used to say.

We will continue to identify and be explicit about the beneficial and intended uses of TTS, based on existing social norms and the expectations people have of media when they believe it to be real or fake.

Guidelines for responsible deployment

Because TTS is an adaptable technology, there are grey areas in determining how it should or shouldn't be used. To navigate these, we've formulated the following guidelines for using synthetic voice fonts:

  • Protect owners of voices from misuse or identity theft.
  • Prevent the proliferation of fake and misleading content.
  • Encourage use in scenarios where consumers expect to be interacting with synthetic content.
  • Encourage use in scenarios where consumers observe the generation of the synthetic content.

Examples of inappropriate use

TTS must not be used to:

  • deceive people and/or intentionally misinform;
  • claim to be from any person, company, government body, or entity without explicit permission to make that representation and/or impersonate to gain unauthorized information or privileges;
  • create, incite, or disguise hate speech, discrimination, defamation, terrorism, or acts of violence;
  • exploit or manipulate children;
  • make unsolicited phone calls, bulk communications, posts, or messages;
  • disguise policy positions or political ideologies;
  • disseminate unattributed content or misrepresent the source.

Examples of appropriate use

Appropriate TTS use cases could include, but are not limited to:

  • Virtual agents based on fictional personas (e.g., on-demand web searching, IoT control, or customer support provided by a company's branded character)
  • Entertainment media for use in fictional content (e.g., movies, video games, TV, recorded music, or audio books)
  • Accredited educational institutions or educational media (e.g., interactive lesson plans or guided museum tours)
  • Assistive technology and real-time translation (e.g., ALS-afflicted individuals preserving their voices)
  • Public service announcements using fictional personas (e.g., airport or train terminal announcements)

About this document

(c) 2020 Microsoft Corporation. All rights reserved. This document is provided "as-is" and for informational purposes only. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. Some examples are for illustration only and are fictitious. No real association is intended or inferred.