Speech Services for telephony data

Telephony data that is generated through landlines, mobile phones, and radios is typically low quality and narrowband (in the range of 8 kHz), which creates challenges when converting speech to text. The latest speech recognition models from Azure Speech Services excel at transcribing this telephony data, even in cases where the data is difficult for a human to understand. These models are trained with large volumes of telephony data, and have best-in-market recognition accuracy, even in noisy environments.

A common scenario for speech-to-text is transcribing large volumes of telephony data that may come from various systems, such as Interactive Voice Response (IVR). The audio these systems provide can be stereo or mono, and raw, with little to no post-processing done on the signal. Using Speech Services and the Unified speech model, a business can get high-quality transcriptions, whatever systems are used to capture the audio.

Telephony data can be used to better understand your customers' needs, identify new marketing opportunities, or evaluate the performance of call center agents. After the data is transcribed, a business can use the output for improved telemetry, identifying key phrases, or analyzing customer sentiment.

The technologies outlined in this page are used by Microsoft internally for various support call processing services, both in real-time and batch mode.

Let's review some of the technology and related features Azure Speech Services offer.

Important

The Speech Services Unified model is trained with diverse data and offers a single-model solution to a number of scenarios, from dictation to telephony analytics.

Azure Technology for Call Centers

Beyond the functional aspects of the Speech Services, their primary purpose, when applied to the call center, is to improve the customer experience. Three clear domains exist in this regard:

  • Post-call analytics, that is, batch processing of call recordings
  • Real-time analytics, processing of the audio signal to extract various insights as the call is taking place (with sentiment being a prominent use case), and
  • Virtual assistants (bots), either driving the dialogue between the customer and the bot in an attempt to solve the customer's issue with no agent participation, or applying AI protocols to assist the agent.

A typical architecture diagram of the implementation of a batch scenario is depicted in the picture below.

Call center transcription architecture

Speech Analytics Technology Components

Whether the domain is post-call or real-time, Azure offers a set of mature and emerging technologies to improve the customer experience.

Speech to text (STT)

Speech-to-text is the most sought-after feature in any call center solution. Because many of the downstream analytics processes rely on transcribed text, the word error rate (WER) is of utmost importance. Key challenges in call center transcription include the noise that's prevalent in the call center (for example, other agents speaking in the background), the rich variety of language locales and dialects, and the low quality of the actual telephone signal. WER is highly correlated with how well the acoustic and language models are trained for a given locale, so being able to customize the model for your locale is important. Our latest Unified version 4.x models are the solution to both transcription accuracy and latency. Trained on tens of thousands of hours of acoustic data and billions of lexical items, Unified models are the most accurate models in the market for transcribing call center data.
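To make the WER metric concrete, the sketch below implements the standard edit-distance formulation: substitutions, deletions, and insertions, divided by the number of reference words. This is a generic illustration, not the scoring code used by the service.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as (S + D + I) / N via word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming table: d[i][j] = edit distance between the
    # first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word ("i") and one substitution ("your" -> "you") over 7 words.
print(word_error_rate("please hold while i transfer your call",
                      "please hold while transfer you call"))
```

Lower WER generally means less manual correction downstream, which is why locale-specific model customization pays off.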

Sentiment

Gauging whether the customer had a good experience is one of the most important areas of speech analytics when applied to the call center space. Our Batch Transcription API offers sentiment analysis per utterance. You can aggregate the set of values obtained as part of a call transcript to determine the sentiment of the call for both your agents and your customers.
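As a sketch of that aggregation step, the snippet below averages per-utterance sentiment scores by speaker channel. The field names and score range are illustrative assumptions, not the exact Batch Transcription output schema.

```python
from statistics import mean

def call_sentiment(utterances):
    """Average per-utterance sentiment (0.0 negative .. 1.0 positive)
    separately for each speaker channel."""
    by_speaker = {}
    for u in utterances:
        by_speaker.setdefault(u["speaker"], []).append(u["sentiment"])
    return {speaker: mean(scores) for speaker, scores in by_speaker.items()}

# Hypothetical per-utterance output for a two-channel call recording.
transcript = [
    {"speaker": "agent",    "sentiment": 0.80},
    {"speaker": "customer", "sentiment": 0.20},
    {"speaker": "customer", "sentiment": 0.60},
    {"speaker": "agent",    "sentiment": 0.90},
]
print(call_sentiment(transcript))
```

Aggregating per channel lets you trend agent tone and customer satisfaction independently across a day's calls.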

Silence (non-talk)

It is not uncommon for 35 percent of a support call to be what we call non-talk time. Some scenarios in which non-talk occurs are: agents looking up a customer's prior case history, agents using tools that allow them to access the customer's desktop and perform functions, customers sitting on hold waiting for a transfer, and so on. It is extremely important to gauge when silence is occurring in a call, as there are a number of important customer sensitivities around these types of scenarios and where they occur in the call.
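One way to quantify this: given utterance timestamps from a transcript, the non-talk share is the portion of the call not covered by any utterance. A minimal sketch, assuming timestamps expressed in seconds:

```python
def non_talk_ratio(call_duration, utterances):
    """Fraction of the call not covered by any utterance.
    `utterances` is a list of (start, end) pairs in seconds; overlapping
    speech (both parties talking) is merged before measuring coverage."""
    merged = []
    for start, end in sorted(utterances):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend overlapping span
        else:
            merged.append([start, end])
    talk = sum(end - start for start, end in merged)
    return (call_duration - talk) / call_duration

# A 600-second call: two stretches of speech with a long hold in between.
print(non_talk_ratio(600, [(0, 120), (100, 180), (400, 600)]))
```

Bucketing where the silent spans fall (start, mid-call, pre-transfer) is usually the next step, since the sensitivities mentioned above depend on position in the call.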

Translation

Some companies are experimenting with providing translated transcripts of foreign-language support calls so that delivery managers can understand the worldwide experience of their customers. Our translation capabilities are unsurpassed. We can translate audio to audio, or audio to text, for a large number of locales.

Text to Speech

Text-to-speech is another important area in implementing bots that interact with customers. The typical pathway is that the customer speaks, their voice is transcribed to text, the text is analyzed for intent, a response is synthesized based on the recognized intent, and then an asset is either surfaced to the customer or a synthesized voice response is generated. Of course, all of this has to happen quickly, so latency is an important component in the success of these systems.

Our end-to-end latency is quite low considering the various technologies involved, such as Speech-to-text, LUIS, Bot Framework, and Text-to-Speech.

Our new voices are also nearly indistinguishable from human voices. You can use our voices to give your bot its unique personality.

Search

Another staple of analytics is identifying interactions where a specific event or experience has occurred. This is typically done with one of two approaches: either an ad hoc search, where the user simply types a phrase and the system responds, or a more structured query, where an analyst creates a set of logical statements that identify a scenario in a call, and then each call is indexed against that set of queries. A good search example is the ubiquitous compliance statement "this call shall be recorded for quality purposes...", as many companies want to make sure that their agents are providing this disclaimer to customers before the call is actually recorded. Most analytics systems have the ability to trend the behaviors found by query/search algorithms, as this reporting of trends is ultimately one of the most important functions of an analytics system. Through the Cognitive Services directory, your end-to-end solution can be significantly enhanced with indexing and search capabilities.
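A minimal sketch of the indexing approach: mapping each compliance query to the calls whose transcript contains it. Plain substring matching is used here for illustration; production systems typically use fuzzier, phonetics-aware matching.

```python
def index_calls(transcripts, queries):
    """Map each query phrase to the IDs of calls whose transcript contains it."""
    hits = {q: [] for q in queries}
    for call_id, text in transcripts.items():
        lowered = text.lower()
        for q in queries:
            if q.lower() in lowered:
                hits[q].append(call_id)
    return hits

# Hypothetical transcripts keyed by call ID.
calls = {
    "call-001": "This call shall be recorded for quality purposes. How can I help?",
    "call-002": "Hello, thanks for calling support. What seems to be the issue?",
}
queries = ["this call shall be recorded for quality purposes"]
print(index_calls(calls, queries))
```

Trending the hit rate of such queries over time is what surfaces, for example, agents who skip the recording disclaimer.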

Key Phrase Extraction

This area is one of the more challenging analytics applications, and one that is benefiting from the application of AI and ML. The primary scenario here is to infer customer intent. Why is the customer calling? What is the customer's problem? Why did the customer have a negative experience? Our Text Analytics service provides a set of analytics out of the box for quickly upgrading your end-to-end solution to extract those important keywords or phrases.

Let's now have a look at the batch processing and real-time pipelines for speech recognition in a bit more detail.

Batch transcription of call center data

To transcribe audio in bulk, we developed the Batch Transcription API, which transcribes large amounts of audio data asynchronously. With regard to transcribing call center data, our solution is based on these pillars:

  • Accuracy: With fourth-generation Unified models, we offer unsurpassed transcription quality.
  • Latency: We understand that when doing bulk transcriptions, the transcriptions are needed quickly. Transcription jobs initiated via the Batch Transcription API are queued immediately, and once a job starts running, it's performed faster than real-time transcription.
  • Security: We understand that calls may contain sensitive data. Rest assured that security is one of our highest priorities. Our service has obtained ISO, SOC, HIPAA, and PCI certifications.

Call centers generate large volumes of audio data on a daily basis. If your business stores telephony data in a central location, such as Azure Storage, you can use the Batch Transcription API to asynchronously request and receive transcriptions.

A typical solution uses these services:

  • Azure Speech Services are used to transcribe speech to text. A standard subscription (S0) for the Speech Services is required to use the Batch Transcription API. Free subscriptions (F0) will not work.
  • Azure Storage is used to store the telephony data and the transcripts returned by the Batch Transcription API. This storage account should use notifications, specifically for when new files are added. These notifications are used to trigger the transcription process.
  • Azure Functions is used to create the shared access signature (SAS) URI for each recording, and to trigger the HTTP POST request that starts a transcription. Additionally, Azure Functions is used to create requests to retrieve and delete transcriptions using the Batch Transcription API.
  • WebHooks are used to get notifications when transcriptions are completed.
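To make the pipeline above concrete, here is a sketch of the request body an Azure Function might assemble before POSTing to the Batch Transcription API. The field names and property flags are illustrative assumptions, not the exact API contract; consult the current API reference for the real schema.

```python
import json

def build_transcription_request(sas_uri, locale, job_name):
    """Assemble a hypothetical batch-transcription job request body."""
    return {
        "recordingsUrl": sas_uri,        # SAS URI produced by the Function
        "locale": locale,                # e.g. "en-US"
        "name": job_name,
        "properties": {
            "AddSentiment": "True",          # request per-utterance sentiment
            "AddWordLevelTimestamps": "True",
        },
    }

# Hypothetical SAS URI for a newly uploaded call recording.
body = build_transcription_request(
    "https://myaccount.blob.core.windows.net/calls/rec-42.wav?sv=REDACTED",
    "en-US",
    "call-center-batch-42",
)
print(json.dumps(body, indent=2))
```

The Function would then POST this body to the service endpoint, and a WebHook would later deliver the completion notification with a link to the transcript.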

Internally, we use the above technologies to support Microsoft customer calls in batch mode.

Batch architecture

Real-time transcription for call center data

Some businesses are required to transcribe conversations in real time. Real-time transcription can be used to identify keywords and trigger searches for content and resources relevant to the conversation, to monitor sentiment, to improve accessibility, or to provide translations for customers and agents who aren't native speakers.
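As an illustration of the keyword-trigger idea, the sketch below scans interim transcript text as it streams in and fires a callback the first time each configured keyword appears. It is a stand-in for the real event-driven Speech SDK flow, which delivers recognition results via events rather than a list.

```python
def watch_keywords(chunks, keywords, on_hit):
    """Scan streaming transcript chunks; call on_hit(keyword, chunk)
    the first time each configured keyword appears."""
    seen = set()
    for chunk in chunks:
        lowered = chunk.lower()
        for kw in keywords:
            if kw in lowered and kw not in seen:
                seen.add(kw)
                on_hit(kw, chunk)

# Simulated interim recognition results arriving during a call.
hits = []
stream = [
    "hi i'd like to cancel",
    "i want to cancel my subscription today",
    "also my invoice looks wrong",
]
watch_keywords(stream, ["cancel", "invoice"], lambda kw, chunk: hits.append(kw))
print(hits)
```

In a production setup, the callback would launch the relevant knowledge-base search or surface a retention offer to the agent while the call is still in progress.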

For scenarios that require real-time transcription, we recommend using the Speech SDK. Currently, speech-to-text is available in more than 20 languages, and the SDK is available in C++, C#, Java, Python, Node.js, Objective-C, and JavaScript. Samples are available in each language on GitHub. For the latest news and updates, see the Release notes.

Internally, we use the above technologies to analyze Microsoft customer calls in real time, as they happen.


A word on IVRs

使用語音 SDKREST API,即可將語音服務輕鬆地整合在任何解決方案中。Speech Services can be easily integrated in any solution by using either the Speech SDK or the REST API. 不過,話務中心轉譯可能需要額外的技術。However, call center transcription may require additional technologies. 通常,IVR 系統與 Azure 之間需要連線。Typically, a connection between an IVR system and Azure is required. 雖然我們不提供這類元件,但我們想要描述 IVR 連線的需求。Although we do not offer such components, we would like to describe what a connection to an IVR entails.

Several IVR or telephony service products (such as Genesys or AudioCodes) offer integration capabilities that can be leveraged to enable inbound and outbound audio pass-through to an Azure service. Basically, a custom Azure service might provide a specific interface for defining phone call sessions (such as call start or call end), and expose a WebSocket API to receive inbound streaming audio that is used with the Speech Services. Outbound responses, such as conversation transcriptions or connections with the Bot Framework, can be synthesized with Microsoft's text-to-speech service and returned to the IVR for playback.

Another scenario is direct SIP integration. An Azure service connects to a SIP server, thus getting an inbound stream and an outbound stream, which are used for the speech-to-text and text-to-speech phases. To connect to a SIP server, there are commercial software offerings, such as the Ozeki SDK, or the Teams calling and meetings API (currently in beta), that are designed to support this type of scenario for audio calls.

Customize existing experiences

Azure Speech Services work well with built-in models; however, you may want to further customize and tune the experience for your product or environment. Customization options range from acoustic model tuning to unique voice fonts for your brand. After you've built a custom model, you can use it with any of the Azure Speech Services, in both real-time and batch mode.

Speech-to-text models:
  • Acoustic model: Create a custom acoustic model for applications, tools, or devices that are used in particular environments, like in a car or on a factory floor, each with specific recording conditions. Examples include accented speech, specific background noises, or using a specific microphone for recording.
  • Language model: Create a custom language model to improve transcription of industry-specific vocabulary and grammar, such as medical terminology or IT jargon.
  • Pronunciation model: With a custom pronunciation model, you can define the phonetic form and display of a word or term. It's useful for handling customized terms, such as product names or acronyms. All you need to get started is a pronunciation file: a simple .txt file.

Text-to-speech models:
  • Voice font: Custom voice fonts allow you to create a recognizable, one-of-a-kind voice for your brand. It only takes a small amount of data to get started. The more data you provide, the more natural and human-like your voice font will sound.

Sample code

Sample code is available on GitHub for each of the Azure Speech Services. These samples cover common scenarios, such as reading audio from a file or stream, continuous and single-shot recognition, and working with custom models. Use these links to view SDK and REST samples:

Reference docs

Next steps