Prepare data to create a custom voice

When you're ready to create a custom text-to-speech voice for your application, the first step is to gather audio recordings and their associated scripts to start training the voice model. The Speech service uses this data to create a unique voice tuned to match the voice in the recordings. After you've trained the voice, you can start synthesizing speech in your applications.

You can start with a small amount of data to create a proof of concept. However, the more data you provide, the more natural your custom voice will sound. Before you can train your own text-to-speech voice model, you'll need audio recordings and the associated text transcriptions. This page reviews the data types, how each is used, and how to manage each.

Data types

A voice training dataset includes audio recordings and a text file with the associated transcriptions. Each audio file should contain a single utterance (a single sentence or a single turn for a dialog system) and be less than 15 seconds long.

In some cases, you may not have the right dataset ready and will want to test custom voice training with the audio files you have, short or long, with or without transcripts. We provide tools (beta) to help you segment your audio into utterances and prepare transcripts using the Batch Transcription API.

This table lists the data types and how each is used to create a custom text-to-speech voice model.

Data type | Description | When to use | Additional service required | Quantity for training a model | Locale(s)
Individual utterances + matching transcript | A collection (.zip) of audio files (.wav) as individual utterances. Each audio file should be 15 seconds or less in length, paired with a formatted transcript (.txt). | Professional recordings with matching transcripts | Ready for training. | No hard requirement for en-US and zh-CN. More than 2,000 distinct utterances for other locales. | All Custom Voice locales
Long audio + transcript (beta) | A collection (.zip) of long, unsegmented audio files (longer than 20 seconds), paired with a transcript (.txt) that contains all spoken words. | You have audio files and matching transcripts, but they are not segmented into utterances. | Segmentation (using batch transcription). Audio format transformation where required. | No hard requirement | All Custom Voice locales
Audio only (beta) | A collection (.zip) of audio files without a transcript. | You only have audio files available, without transcripts. | Segmentation + transcript generation (using batch transcription). Audio format transformation where required. | No hard requirement | All Custom Voice locales

Files should be grouped by type into a dataset and uploaded as a zip file. Each dataset can contain only a single data type.

Note

The maximum number of datasets that can be imported per subscription is 10 .zip files for free subscription (F0) users and 500 for standard subscription (S0) users.

Individual utterances + matching transcript

You can prepare recordings of individual utterances and the matching transcript in two ways. Either write a script and have it read by a voice talent, or use publicly available audio and transcribe it to text. If you do the latter, edit disfluencies out of the audio files, such as "um" and other filler sounds, stutters, mumbled words, or mispronunciations.

To produce a good voice font, create the recordings in a quiet room with a high-quality microphone. Consistent volume, speaking rate, speaking pitch, and expressive mannerisms of speech are essential.

Tip

To create a voice for production use, we recommend you use a professional recording studio and voice talent. For more information, see How to record voice samples for a custom voice.

Audio files

Each audio file should contain a single utterance (a single sentence or a single turn of a dialog system) and be less than 15 seconds long. All files must be in the same spoken language. Multi-language custom text-to-speech voices are not supported, with the exception of Chinese-English bilingual voices. Each audio file must have a unique numeric file name with the extension .wav.

Follow these guidelines when preparing audio.

Property | Value
File format | RIFF (.wav), grouped into a .zip file
Sampling rate | At least 16,000 Hz
Sample format | PCM, 16-bit
File name | Numeric, with .wav extension. No duplicate file names allowed.
Audio length | Shorter than 15 seconds
Archive format | .zip
Maximum archive size | 2048 MB

Note

.wav files with a sampling rate lower than 16,000 Hz will be rejected. If a .zip file contains .wav files with different sampling rates, only those equal to or higher than 16,000 Hz will be imported. The portal currently imports .zip archives up to 200 MB. However, multiple archives can be uploaded.
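
The audio guidelines above can be checked locally before you build the archive. The following is a minimal sketch using Python's standard `wave` module; the function name `check_utterance_wav` and the wording of the messages are illustrative assumptions, not part of the Speech service. It reads "shorter than 15 seconds" strictly, so a file of exactly 15 seconds is flagged.

```python
import re
import wave
from pathlib import Path

MIN_SAMPLE_RATE_HZ = 16_000   # files below this rate are rejected on import
SAMPLE_WIDTH_BYTES = 2        # 16-bit PCM
MAX_DURATION_SECONDS = 15.0   # each utterance must be shorter than this


def check_utterance_wav(path):
    """Return a list of guideline problems for one utterance file (empty if none).

    Hypothetical helper for local pre-checks; the service performs its own validation.
    """
    path = Path(path)
    problems = []

    # File name must be numeric with a .wav extension, for example 0000000001.wav.
    if not re.fullmatch(r"\d+\.wav", path.name):
        problems.append("file name is not numeric with a .wav extension")

    try:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # The wave module only reads PCM RIFF files, so anything else fails here.
        return problems + [f"not a readable PCM RIFF file: {exc}"]

    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width != SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width is {8 * width}-bit, expected 16-bit")
    if rate and frames / rate >= MAX_DURATION_SECONDS:
        problems.append(f"duration {frames / rate:.1f} s is not shorter than 15 s")

    return problems
```

Running the check over every file in a recording folder and fixing anything it reports avoids uploading an archive in which some files are silently skipped.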

Transcripts

The transcription file is a plain text file. Use these guidelines to prepare your transcriptions.

Property | Value
File format | Plain text (.txt)
Encoding format | ANSI/ASCII, UTF-8, UTF-8-BOM, UTF-16-LE, or UTF-16-BE. For zh-CN, ANSI/ASCII and UTF-8 encodings are not supported.
# of utterances per line | One. Each line of the transcription file should contain the name of one of the audio files, followed by the corresponding transcription. The file name and transcription should be separated by a tab (\t).
Maximum file size | 2048 MB

Below is an example of how the transcripts are organized, utterance by utterance, in one .txt file:

0000000001[tab] This is the waistline, and it's falling.
0000000002[tab] We have trouble scoring.
0000000003[tab] It was Janet Maslin.

It's important that the transcripts are 100% accurate transcriptions of the corresponding audio. Errors in the transcripts will introduce quality loss during training.
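
The one-utterance-per-line layout described above can be sanity-checked with a few lines of code. This is a sketch only: `parse_transcript` is a hypothetical helper name, and the duplicate-ID check is an assumption derived from the rule that audio file names must be unique.

```python
def parse_transcript(text):
    """Map each audio file ID to its transcription.

    Expects one '<file name><tab><transcription>' entry per line and raises
    ValueError for lines that do not follow that layout.
    """
    entries = {}
    # Drop a leading byte order mark if the file was saved as UTF-8-BOM.
    for line_number, line in enumerate(text.lstrip("\ufeff").splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        file_id, separator, transcription = line.partition("\t")
        file_id = file_id.strip()
        transcription = transcription.strip()
        if not separator or not file_id or not transcription:
            raise ValueError(f"line {line_number}: expected '<id><tab><text>'")
        if file_id in entries:
            raise ValueError(f"line {line_number}: duplicate ID {file_id}")
        entries[file_id] = transcription
    return entries
```

Comparing the returned IDs with the .wav file names in the audio archive (without their extensions) is a quick way to spot recordings that have no transcript line, or lines that have no recording.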

Tip

When building production text-to-speech voices, select utterances (or write scripts) that take into account both phonetic coverage and efficiency. Having trouble getting the results you want? Contact the Custom Voice team to find out more about having us consult.

Long audio + transcript (beta)

In some cases, you may not have segmented audio available. We provide a service (beta) through the custom voice portal to help you segment long audio files and create transcriptions. Keep in mind that this service is charged toward your speech-to-text subscription usage.

Note

The long-audio segmentation service uses the batch transcription feature of speech-to-text, which supports only standard subscription (S0) users. While the segmentation is processed, your audio files and transcripts are also sent to the Custom Speech service to refine the recognition model so that accuracy can be improved for your data. No data is retained during this process. After the segmentation is done, only the segmented utterances and their mapped transcripts are stored for you to download and use in training.

Audio files

Follow these guidelines when preparing audio for segmentation.

Property | Value
File format | RIFF (.wav) with a sampling rate of at least 16 kHz, 16-bit PCM, or .mp3 with a bit rate of at least 256 kbps, grouped into a .zip file
File name | ASCII and Unicode characters supported. No duplicate names allowed.
Audio length | Longer than 20 seconds
Archive format | .zip
Maximum archive size | 2048 MB

All audio files should be grouped into a zip file. It's OK to put .wav files and .mp3 files into one audio zip. For example, you can upload a zip file containing a 45-second audio file named 'kingstory.wav' and a 200-second audio file named 'queenstory.mp3'. All .mp3 files will be transformed into the .wav format after processing.

Transcripts

Transcripts must be prepared to the specifications listed in this table. Each audio file must be matched with a transcript.

Property | Value
File format | Plain text (.txt), grouped into a .zip
File name | Use the same name as the matching audio file
Encoding format | UTF-8-BOM only
# of utterances per line | No limit
Maximum file size | 2048 MB

All transcript files in this data type should be grouped into a zip file. For example, suppose you have uploaded a zip file containing a 45-second audio file named 'kingstory.wav' and a 200-second audio file named 'queenstory.mp3'. You will need to upload another zip file containing two transcripts, one named 'kingstory.txt' and the other 'queenstory.txt'. Within each plain text file, provide the full, correct transcription of the matching audio.
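
Because each long audio file is paired with its transcript by file name, it can help to compare the two archives before uploading them. The sketch below uses Python's standard `zipfile` module; `unmatched_names` is a hypothetical helper, and matching on the file name without its extension is an assumption based on the kingstory/queenstory example above.

```python
import zipfile
from pathlib import PurePosixPath


def unmatched_names(audio_zip, transcript_zip):
    """Compare an audio archive with a transcript archive by base file name.

    Returns two sorted lists: audio files that have no transcript, and
    transcripts that have no audio file.
    """

    def stems(zip_path, extensions):
        # Collect base names (no folder, no extension) of the matching entries.
        with zipfile.ZipFile(zip_path) as archive:
            return {
                PurePosixPath(name).stem
                for name in archive.namelist()
                if PurePosixPath(name).suffix.lower() in extensions
            }

    audio = stems(audio_zip, {".wav", ".mp3"})
    transcripts = stems(transcript_zip, {".txt"})
    return sorted(audio - transcripts), sorted(transcripts - audio)
```

For the example above, an audio archive holding kingstory.wav and queenstory.mp3 and a transcript archive holding kingstory.txt and queenstory.txt would produce two empty lists. When writing the transcript files from Python, `open(path, "w", encoding="utf-8-sig")` produces the UTF-8-BOM encoding that this data type requires.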

After your dataset is successfully uploaded, we will help you segment the audio files into utterances based on the transcripts provided. You can check the segmented utterances and the matching transcripts by downloading the dataset. Unique IDs are assigned to the segmented utterances automatically. Make sure the transcripts you provide are 100% accurate. Errors in the transcripts can reduce accuracy during audio segmentation and introduce further quality loss in the later training phase.

Audio only (beta)

If you don't have transcriptions for your audio recordings, use the Audio only option to upload your data. Our system can help you segment and transcribe your audio files. Keep in mind that this service counts toward your speech-to-text subscription usage.

Follow these guidelines when preparing audio.

Note

The long-audio segmentation service uses the batch transcription feature of speech-to-text, which supports only standard subscription (S0) users.

Property | Value
File format | RIFF (.wav) with a sampling rate of at least 16 kHz, 16-bit PCM, or .mp3 with a bit rate of at least 256 kbps, grouped into a .zip file
File name | ASCII and Unicode characters supported. No duplicate names allowed.
Audio length | Longer than 20 seconds
Archive format | .zip
Maximum archive size | 2048 MB

All audio files should be grouped into a zip file. Once your dataset is successfully uploaded, we will help you segment the audio files into utterances using our speech batch transcription service. Unique IDs are assigned to the segmented utterances automatically. Matching transcripts are generated through speech recognition. All .mp3 files will be transformed into the .wav format after processing. You can check the segmented utterances and the matching transcripts by downloading the dataset.

Next steps