Disclosure for voice talent
The goal of this article is to help voice talent understand the technology behind the text-to-speech (TTS) capabilities that their voices help create. It also contains important privacy disclosures for voice talent about how Microsoft may process, use and retain audio files containing voice talent’s recorded statements and Custom Neural Voice voice models to help Microsoft prevent, and/or respond to complaints of, misuse of Cognitive Services or Custom Neural Voice services.
Microsoft is committed to designing AI responsibly. We hope this note will foster a greater shared understanding among tech builders, voice talent, and the general public about the intended and beneficial uses of this technology.
Key TTS terms
Voice model: A text-to-speech computer model that can mimic the unique vocal characteristics of a target speaker. A voice model is also called a voice font or synthetic voice. A voice model is a set of parameters in binary format that is not human readable and does not contain audio recordings. It cannot be reverse engineered to derive or construct the audio recordings of a human being speaking.
Voice talent: Individuals or target speakers whose voices are recorded and used to create voice models that are intended to sound like the voice talent’s voice.
Two broad categories of text-to-speech (TTS)
The following are the two broad categories of text-to-speech (TTS): standard TTS and neural TTS.
Standard TTS
How it works: The standard, or "traditional," method of TTS breaks down spoken language into phonetic snippets that can be remixed and matched using classical programming or statistical methods.
What to know about it: Standard TTS requires a large volume of voice data—in the range of 10,000 lines or more—to produce a more human-like voice model. With fewer recorded lines, a standard TTS voice model will tend to sound more obviously robotic.
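As a rough intuition for the "remix and match" idea, the toy sketch below joins placeholder snippets in order. It is not how any Microsoft system is implemented: the unit names, the `RECORDED_SNIPPETS` inventory, and the `synthesize` function are all hypothetical, and real systems join audio waveforms rather than strings.

```python
# Toy model of standard (snippet-based) TTS. Everything here is hypothetical:
# each "snippet" is a labeled placeholder string, not real audio.

# Hypothetical inventory: phonetic unit -> snippet cut from a recording session.
RECORDED_SNIPPETS = {
    "HH": "[hh]",
    "EH": "[eh]",
    "L": "[l]",
    "OW": "[ow]",
}


def synthesize(units, inventory):
    """Join recorded snippets in order; fail if any unit was never recorded."""
    missing = [u for u in units if u not in inventory]
    if missing:
        raise KeyError(f"no recorded snippet for: {missing}")
    return "".join(inventory[u] for u in units)


# "hello" is fully covered by the inventory above.
print(synthesize(["HH", "EH", "L", "OW"], RECORDED_SNIPPETS))  # prints: [hh][eh][l][ow]
```

In this toy model, a word that needs a unit missing from the inventory cannot be produced at all. That gives one intuition for why a snippet-based approach benefits from a large volume of recorded lines.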
Examples of how Microsoft uses it:
- Platform Voice is a feature of the Speech Service on Azure that offers "off-the-shelf" voice models for customers and developers to use. Platform Voices are also used in several Microsoft products including the Edge Browser, Narrator, Office, and Teams.
- Custom Voice is a feature of the Speech Service on Azure that allows customers to build a synthetic voice model using recordings from a voice talent to represent a specific persona for a corporation/enterprise.
- Microsoft and/or Windows system voices are included in the Windows operating system. They are also used in several applications such as Narrator, Cortana, Edge Read Aloud, and Teams.
What to expect when recording: Expect to contribute at least 6,000 lines to produce a good-quality voice font.
Neural TTS
How it works: Neural TTS synthesizes speech using deep neural networks that have "learned" the way phonetics are combined in natural human speech rather than using classical programming or statistical methods. In addition to the recordings of a target voice talent, neural TTS uses a source library that contains voice recordings from many different speakers.
What to know about it: Because of the way it synthesizes voices, neural TTS can produce styles of speech that weren't part of the original recordings, such as changes in tone of voice and affectation. Neural TTS voices sound fluid and are good at replicating the natural pauses, idiosyncrasies, and hesitancy that people express when they're speaking. Those who hear synthetic voices made via neural TTS tend to rate them closer to human speech than standard TTS voices.
Examples of how Microsoft uses it:
- Platform Voice is a feature of the Speech Service on Azure that offers "off-the-shelf" voice models for customers and developers to use. Platform Voices are also used in several Microsoft products including the Edge Browser, Narrator, Office, and Teams.
- Custom Neural Voice is a feature of the Speech Service on Azure that allows customers to create a one-of-a-kind custom synthetic voice model for their brand. The following capabilities are used to produce Custom Neural Voices:
  - Language transfer lets the voice speak in a language different from that of the original voice recordings.
  - Style transfer lets the voice speak in a style different from that of the original voice recordings, for example, a newscaster voice.
  - Voice transformation lets the voice speak in a manner different from that of the original voice recordings, for example, by modifying tone or pitch to create different character voices.
- Other voices used in Microsoft's products and services, such as Cortana.
What to expect when recording: Expect to contribute at least 300 lines for a proof-of-concept voice model and about 2,000 lines to produce a new voice model for production use.
Voice talent and synthetic voices: an evolving relationship
Recognizing the integral relationship between voice talent and synthetic voices, Microsoft interviewed voice talent to better understand their perspectives on new developments in the technology. Research we conducted in 2019 showed that voice talent saw potential benefit from the capabilities introduced by neural TTS, such as saving studio time to complete recording jobs, and adding capacity to complete more voice acting assignments. At the same time, there were varying degrees of awareness about how developments in TTS technology could potentially impact their profession.
Overall, voice talent expressed a desire for transparency and clarity about:
- Limits on what their voice likeness could and could not be used to express.
- The duration of allowable use of their voice likeness.
- Potential impact on future recording opportunities.
- The persona that would be associated with their voice likeness.
Synthetic voice in wider use
Traditionally, TTS systems were somewhat limited in adoption due to their robotic sound. Most were used to support accessibility—for example as a screen reader for people who are Blind or have low vision. TTS has also been used by people with a speech impairment. For instance, the late Stephen Hawking used a TTS-generated voice.
Now, with increasingly realistic-sounding synthetic voices and the uptick in more familiar, everyday interactions between machines and humans, the uses of this technology have proliferated and expanded. TTS systems power voice assistants across an array of devices and applications. They read out news, search results, public service announcements, educational content, and much more.
Microsoft's approach to responsible use of TTS
Every day, people find new ways to apply TTS technology, and not all are for the good of individuals or society. If misused, believably human-sounding TTS voices, especially a custom voice that mimics a real person, could cause harm. For example, a misinformation campaign could become much more potent if it used the voice of a well-known public figure.
We recognize that there's no perfect way to prevent media from being modified or to unequivocally prove where it came from. Therefore, our approach to responsible use has focused on being transparent about neural TTS, evaluating appropriate use, and demonstrating our values through action.
Requirements and tips for meaningful consent from voice talent
We contractually require customers who use Custom Neural Voice to do the following:
- Obtain explicit written permission from voice talent to use that person's voice for the purpose of creating a custom voice.
- Provide this document to voice talent so they can understand how TTS works, and how it may be used once they complete the audio recording process.
- Get necessary permissions from voice talent for Microsoft’s processing, use and retention of voice talent’s audio files to perform speaker verification against training data and our use and retention of voice models as described below.
We recommend customers also do the following:
- Share the intended contexts of use with voice talent so they are aware of who will hear their voice, in what scenarios, and whether/how people will be able to interact with it.
- Ensure voice talent are aware that a voice model made from their recordings can say things they didn't specifically record in the studio.
- Discuss whether there's anything they'd be uncomfortable with the voice model being used to say.
Microsoft’s processing, use and retention of voice talent data
Microsoft’s use of Voice talent audio files for Speaker Verification
In addition to requiring customers to obtain permission from voice talent to create custom voice models, we require Cognitive Services customers to upload to Speech Studio (the Custom Neural Voice training portal) an audio file containing a recorded statement from their voice talent acknowledging the customer’s use of their voice to create a synthetic voice. Microsoft reserves the right to use its speaker recognition technology on this recorded statement and verify it against the training audio data, in order to provide some assurance that the voices came from the same speaker, or as otherwise necessary to investigate misuse of the services.
We require customers to obtain permission from voice talent for this biometric use. This technical safeguard is intended to help prevent misuse of our service, for example, by preventing customers from training voice models with audio recordings and using them to spoof a voice without the speaker’s knowledge or consent.
The speaker’s biometric voice signatures created from the recorded statement files and training audio data are used by Microsoft solely for the purposes stated above. This verification may be performed from time to time, as necessary, and the voice signatures are not retained after it is performed. Microsoft will retain the recorded statement file for as long as necessary in order to preserve the security and integrity of Microsoft’s Azure Cognitive Services. Learn more about how we process, use and retain this data in the Data and Privacy section.
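For readers curious about what comparing two voice signatures can look like in general, here is a minimal sketch. It does not describe Microsoft’s speaker recognition technology, whose details are not given in this article. It assumes, purely for illustration, that a voice signature is a list of numbers and that two signatures are compared by cosine similarity against a threshold. The function names and the 0.8 threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("signatures must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(statement_signature, training_signature, threshold=0.8):
    """Hypothetical check: treat the two recordings as the same speaker
    when their signatures are at least `threshold` similar."""
    return cosine_similarity(statement_signature, training_signature) >= threshold


# Made-up signatures for illustration only.
print(same_speaker([1.0, 0.0], [1.0, 0.0]))  # identical signatures
print(same_speaker([1.0, 0.0], [0.0, 1.0]))  # unrelated signatures
```

A check of this kind gives a degree of assurance rather than certainty, which matches the wording above that verification provides "some assurance" that the voices came from the same speaker.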
Microsoft’s use of Custom Neural Voice models
While the customer maintains the exclusive rights to use the Custom Neural Voice model created at the Customer’s instruction, Microsoft may independently retain a copy of Custom Neural Voice models for as long as necessary and use it for the sole purpose of protecting the security and integrity of Microsoft Azure Cognitive Services.
Microsoft will secure and store a copy of Voice Talent’s recorded statement file(s) and Custom Neural Voice model(s) with the same high level of security that it uses for its other Azure Services. Learn more at Microsoft Trust Center.
We will continue to identify and be explicit about the intentional, beneficial, and intended uses of TTS that are based upon existing social norms and expectations people have around media when they believe it to be real or fake. In line with Microsoft’s trust principles, Microsoft does not actively monitor or moderate the audio content generated by its Customers’ use of the Custom Neural Voice capability. Customer is solely responsible for ensuring its uses comply with all applicable laws and regulations and are in accordance with the terms of its agreement with voice talent.
Guidelines for responsible deployment
Because TTS is an adaptable technology, there are grey areas in determining how it should or shouldn't be used. To navigate these, we've formulated the following guidelines for using synthetic voice models:
- Protect owners of voices from misuse or identity theft.
- Prevent the proliferation of fake and misleading content.
- Encourage use in scenarios where consumers expect to be interacting with synthetic content.
- Encourage use in scenarios where consumers observe the generation of the synthetic content.
Examples of inappropriate use
TTS must not be used to:
- Deceive people and/or intentionally misinform.
- Claim to be from any person, company, government body, or entity without explicit permission to make that representation and/or impersonate to gain unauthorized information or privileges.
- Create, incite, or disguise hate speech, discrimination, defamation, terrorism, or acts of violence.
- Exploit or manipulate children.
- Make unsolicited phone calls, bulk communications, posts, or messages.
- Disguise policy positions or political ideologies.
- Disseminate unattributed content or misrepresent the source.
Examples of appropriate use
Appropriate TTS use cases could include, but are not limited to:
- Virtual agents based on fictional personas (e.g., on-demand web searching, IoT control, or customer support provided by a company's branded character).
- Entertainment media for use in fictional content (e.g., movies, video games, TV, recorded music, or audio books).
- Accredited educational institutions or educational media (e.g., interactive lesson plans or guided museum tours).
- Assistive technology and real-time translation (e.g., ALS-afflicted individuals preserving their voices).
- Public service announcements using fictional personas (e.g., airport or train terminal announcements).