Data, privacy, and security for Custom Neural Voice
This article explains how the Custom Neural Voice data that you provide is processed, used, and stored. As an important reminder, you are responsible for your use and implementation of this technology. You are required to obtain all necessary permissions from voice talent for the processing of their voice data to develop a synthetic voice, as well as any licenses, permissions, or other proprietary rights required for the content you input into the text-to-speech (“TTS”) service, part of Speech in Azure Cognitive Services, to generate audio content in the synthetic voice. Some jurisdictions may impose special legal requirements for the collection, processing, and storage of certain categories of data, such as biometric data, and may mandate disclosing the use of synthetic voices to users. Before using Custom Neural Voice and the TTS service for the processing and storage of data and the creation of synthetic speech, you must ensure compliance with any such legal requirements that may apply to you.
What data do Custom Neural Voice and TTS process?
Custom Neural Voice processes the following types of data:
Recorded statement file of the voice talent. When using Speech Studio, customers are required to upload a recorded statement in which the voice talent acknowledges that their voice will be used by the customer to create synthetic voice(s).
When preparing your recording script, make sure you include the sentence below to obtain the voice talent's acknowledgement.
“I [state your first and last name] am aware that recordings of my voice will be used by [state the name of the company] to create and use a synthetic version of my voice.”
Different versions of this statement are provided, depending on the language you select for your custom neural voice.
Training data (including audio files and related text transcripts). This includes audio recordings from the voice talent, who has agreed to the use of their voice for model training, and the related text transcripts. You can provide your own text transcriptions of the audio or use the automated speech recognition transcription feature available within Speech Studio to generate a text transcription of the audio. Both the audio recordings and the text transcription files will be used as the voice model training data.
Text as the test script. You can upload your own text-based scripts to evaluate and test the quality of the custom voice model by generating speech synthesis audio samples.
Text input for speech synthesis. This is the text you select and send to TTS to generate audio content using your custom neural voice.
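As a minimal sketch, the acknowledgement sentence required for the recorded statement could be filled in programmatically when you assemble a recording script. The template text is the English version quoted in this article; the function name and the example names are hypothetical and used for illustration only.

```python
# English template quoted from this article; the bracketed placeholders
# in the original sentence are replaced by the two named fields below.
STATEMENT_TEMPLATE = (
    "I {talent_name} am aware that recordings of my voice will be used by "
    "{company_name} to create and use a synthetic version of my voice."
)


def build_acknowledgement(talent_name: str, company_name: str) -> str:
    """Fill in the acknowledgement sentence for a recording script."""
    talent_name = talent_name.strip()
    company_name = company_name.strip()
    if not talent_name or not company_name:
        raise ValueError("both the talent name and the company name are required")
    return STATEMENT_TEMPLATE.format(
        talent_name=talent_name, company_name=company_name
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_acknowledgement("Jordan Rivera", "Contoso"))
```

Other languages would need their own template, because the statement text differs by language.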
How do Custom Neural Voice and TTS process data?
The diagram below illustrates how your data is processed. This diagram covers three different types of processing: how Microsoft verifies voice files of the voice talent prior to the custom neural voice model training, how Microsoft creates a custom neural voice model with your training data, and how TTS processes your text input to generate audio content.
Voice file verification
Microsoft requires customers to upload to Speech Studio an audio file containing a recorded statement from their voice talent acknowledging the customer’s use of the talent's voice. Microsoft may use its speech-to-text (speech recognition) technology to transcribe this recorded statement and verify that the content of the recording matches the predefined script provided by Microsoft. This audio statement, along with the description information you provide with the audio, is used to create a voice talent profile. You must associate training data with the relevant voice talent profile when initiating custom neural voice training.
Microsoft may process biometric voice signatures from the recorded voice statement file of the voice talent and from randomly selected audio files from the training datasets. The purpose is to ascertain, with reasonable confidence, that the voice signature in each of the audio recordings matches the same speaker, using the Speaker Verification feature of Speech in Azure Cognitive Services. A voice signature may also be called a “voice template” or “voiceprint”. It is a numeric vector that represents an individual’s voice characteristics and is extracted from audio recordings of a person speaking. This technical safeguard is intended to help prevent misuse of Custom Neural Voice by, for example, preventing customers from training voice models with audio recordings and using them to spoof a voice without the speaker’s knowledge or consent.
The voice signatures are used by Microsoft solely for the purposes of speaker verification or as otherwise necessary to investigate misuse of the services.
The Online Services Data Protection Addendum (“DPA”) sets forth customers’ and Microsoft’s obligations with respect to the processing and security of Customer Data and Personal Data in connection with Azure and is incorporated by reference into customers’ enterprise agreements for Azure services. Microsoft’s data processing in this section is governed under the Legitimate Interest Business operations section of the Data Protection Addendum.
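To illustrate what comparing voice signatures as numeric vectors can look like, the following sketch scores vectors with cosine similarity, a common way to compare embeddings. This is an assumption made for illustration: this article does not say how the Speaker Verification feature computes or compares signatures, and the threshold, vector values, and function names here are hypothetical.

```python
import math
from typing import Sequence

# Hypothetical decision threshold. The article does not define what
# "reasonable confidence" means numerically.
MATCH_THRESHOLD = 0.75


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction to compare")
    return dot / (norm_a * norm_b)


def same_speaker(
    statement_signature: Sequence[float],
    sample_signatures: Sequence[Sequence[float]],
    threshold: float = MATCH_THRESHOLD,
) -> bool:
    """True only if every sampled signature matches the statement signature."""
    if not sample_signatures:
        raise ValueError("at least one sample signature is required")
    return all(
        cosine_similarity(statement_signature, sample) >= threshold
        for sample in sample_signatures
    )


if __name__ == "__main__":
    # Toy three-dimensional vectors; real signatures would be much larger.
    statement = [0.9, 0.1, 0.3]
    similar_sample = [0.8, 0.2, 0.35]
    different_sample = [-0.2, 0.9, -0.1]
    print(same_speaker(statement, [similar_sample]))
    print(same_speaker(statement, [similar_sample, different_sample]))
```

Requiring every sampled recording to match, rather than an average, reflects the stated goal that each audio recording matches the same speaker.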
Training a custom neural voice model
The training data (speech audio) submitted to Speech Studio is pre-processed using automated tools for quality checking, including data format checks, pronunciation scoring, noise detection, and script mapping. The training data is then imported to the model training component of the custom voice platform. During the training process, the training data (both voice audio and text transcriptions) is decomposed into fine-grained mappings of voice acoustics and text, such as a sequence of phonemes. Through further complex machine learning modeling, these mappings are built into a voice model, which can then be used to generate speech that sounds like the voice talent. The voice model is a text-to-speech computer model that can mimic the unique vocal characteristics of a target speaker. It represents a set of parameters in binary format that is not human readable and does not contain audio recordings.
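One of the automated checks mentioned above is a data format check. The sketch below shows what such a check might look like for WAV files, using only the Python standard library, so that you could screen recordings before uploading them. The required sample rate, channel count, and sample width are hypothetical values chosen for illustration, not the service's documented requirements, and this is not the tooling that Speech Studio itself runs.

```python
import io
import wave
from typing import List

# Hypothetical requirements, for illustration only. Check the current
# Custom Neural Voice documentation for the real ones.
REQUIRED_SAMPLE_RATE_HZ = 24_000
REQUIRED_CHANNELS = 1
REQUIRED_SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM


def check_wav_format(source) -> List[str]:
    """Return a list of format problems in a WAV file (empty if none).

    `source` is a file path string or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != REQUIRED_SAMPLE_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != REQUIRED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
        if wav.getsampwidth() != REQUIRED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth()} bytes")
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems


def make_silent_wav(sample_rate_hz: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a short silent mono 16-bit WAV in memory, to try out the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * int(sample_rate_hz * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav(24_000)))
    print(check_wav_format(make_silent_wav(16_000)))
```

Checks such as pronunciation scoring, noise detection, and script mapping need speech models and are outside the scope of this sketch.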
Customer’s training data is only used to develop customer’s custom voice model and is not used by Microsoft to train or improve any Microsoft TTS voice models.
Speech synthesis/audio content generation
Once the voice model is created, you can use it to create audio content through the TTS service with two different options.
For real-time speech synthesis, you send the input text to the TTS service via the TTS SDK or RESTful API. TTS processes the input text and returns the output audio content in real time to the application that made the request.
For asynchronous synthesis of long audio (batch synthesis), you submit the input text files to the TTS batch service via the Long Audio API to asynchronously create audio longer than 10 minutes (for example, audio books or lectures). Unlike synthesis performed using the text-to-speech API, responses aren't returned in real time with the Long Audio API. Audio files are created asynchronously, and you can access and download the synthesized audio when it is made available by the batch synthesis service.
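For the real-time option, the sketch below builds (but does not send) a REST request that wraps input text in SSML for a custom voice. The endpoint pattern, the deploymentId query parameter, and the header names are assumptions based on the public text-to-speech REST reference and should be verified against it. The voice name, region, deployment ID, and key are placeholders.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice_name: str, lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document for one voice."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )


def build_synthesis_request(
    text: str, voice_name: str, region: str, deployment_id: str, key: str
) -> urllib.request.Request:
    """Build a POST request for real-time synthesis without sending it."""
    # Assumed endpoint pattern for a deployed custom voice.
    url = (
        f"https://{region}.voice.speech.microsoft.com"
        f"/cognitiveservices/v1?deploymentId={deployment_id}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # assumed header name
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",  # assumed
        "User-Agent": "cnv-sketch",
    }
    body = build_ssml(text, voice_name).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    request = build_synthesis_request(
        "Tom & Jerry say hello.",
        voice_name="MyCustomVoice",       # placeholder
        region="westus2",                 # placeholder
        deployment_id="your-deployment",  # placeholder
        key="your-key",                   # placeholder
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it and return the audio bytes in the response body; that step is left out here because it needs a real deployment and key. Escaping the text matters because characters such as `&` and `<` would otherwise make the SSML invalid.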
You can also use your custom voice to generate audio content through a no-code Audio Content Creation tool, and choose to save your text input or output audio content with the tool in Azure storage.
Data storage and retention
Recorded statement and Speaker Verification data: The voice signatures are used by Microsoft solely for the purposes of speaker verification or as otherwise necessary to investigate misuse of the services. The voice signatures will be retained only for the time duration necessary to perform such speaker verification, which may occur from time to time. Microsoft may require this verification prior to enabling customers to train or retrain custom voice models in the Speech Studio, or as otherwise necessary. Microsoft will retain the recorded statement file and voice talent profile data for as long as necessary in order to preserve the security and integrity of Speech in Azure Cognitive Services.
Custom Neural Voice models: While the customer maintains the exclusive right to use the Custom Neural Voice model created at the customer’s instruction, Microsoft may independently retain and use a copy of Custom Neural Voice models for as long as necessary for the sole purpose of protecting the security and integrity of Azure Speech Services. Microsoft’s retention and processing of the Custom Neural Voice models for the purpose stated in this section is governed under the Legitimate Interest Business operations section of the Online Services Data Protection Addendum.
Microsoft will secure and store a copy of Voice Talent’s recorded statement file(s) and Custom Neural Voice model(s) with the same high-level security that it uses for its other Azure Services. To learn more about Microsoft's privacy and security commitments, visit the Microsoft Trust Center.
Training data: Customers submit voice training data to generate voice models via Speech Studio, and it will be retained and stored by default in Azure storage (see Azure Storage encryption for data at rest for details). Customers can access and delete any of the training data used to build the voice models via Speech Studio.
Customers may also choose to use their own storage via BYOS (Bring Your Own Storage) to manage storage of their training data. With this storage method, training data may be accessed only for the purposes of voice model training and will otherwise be stored via BYOS.
Text input for speech synthesis: Microsoft does not retain or store the input text provided by customers with the real-time synthesis TTS API. If you are using the Long Audio TTS API, the scripts are stored in Azure storage to process the batch synthesis request. The input text can be deleted via the delete API at any time.
Output audio content: Microsoft does not store the audio content that is generated with the real-time synthesis API. If you are using the Long Audio API, the output audio content is stored in Azure storage. These audio files can be removed at any time via the delete operation.
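The retention rules for synthesis input and output described above can be summarized as a small lookup table, for example to drive a cleanup checklist in your own tooling. The class, key, and function names below are hypothetical; the values restate only what this article says.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Retention:
    stored_by_service: bool
    location: str
    removal: str


# Keys are (data kind, synthesis mode). Values restate the article.
RETENTION = {
    ("input_text", "real_time"): Retention(False, "not retained", "not applicable"),
    ("input_text", "long_audio"): Retention(True, "Azure storage", "delete API"),
    ("output_audio", "real_time"): Retention(False, "not retained", "not applicable"),
    ("output_audio", "long_audio"): Retention(True, "Azure storage", "delete operation"),
}


def items_needing_cleanup(mode: str) -> List[str]:
    """List the data kinds that remain stored after synthesis in `mode`."""
    return sorted(
        kind
        for (kind, rule_mode), rule in RETENTION.items()
        if rule_mode == mode and rule.stored_by_service
    )


if __name__ == "__main__":
    for mode in ("real_time", "long_audio"):
        print(mode, items_needing_cleanup(mode))
```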
To learn more about Microsoft's privacy and security commitments visit the Microsoft Trust Center.