What is Speaker Recognition (Preview)?

The Speaker Recognition service provides algorithms that verify and identify speakers by their unique voice characteristics using voice biometry. Speaker Recognition is used to answer the question "who is speaking?". You provide audio training data for a single speaker, which creates an enrollment profile based on the unique characteristics of the speaker's voice. You can then cross-check audio voice samples against this profile to verify that the speaker is the same person (speaker verification), or cross-check audio voice samples against a group of enrolled speaker profiles to see whether they match any profile in the group (speaker identification). In contrast, Speaker Diarization groups segments of audio by speaker in a batch operation.

Speaker Verification

Speaker Verification streamlines the process of verifying an enrolled speaker identity with either passphrases or free-form voice input. It can be used to verify individuals for secure, frictionless customer engagements in a wide range of solutions, from customer identity verification in call centers to contactless facility access.

How does Speaker Verification work?

Figure: Speaker Verification flowchart.

Speaker verification can be either text-dependent or text-independent. Text-dependent verification means speakers need to choose the same passphrase to use during both the enrollment and verification phases. Text-independent verification means speakers can speak in everyday language during both the enrollment and verification phases.

For text-dependent verification, the speaker's voice is enrolled by saying a passphrase from a set of predefined phrases. Voice features are extracted from the audio recording to form a unique voice signature, while the chosen passphrase is also recognized. Together, the voice signature and the passphrase are used to verify the speaker.

Text-independent verification has no restrictions on what the speaker says during enrollment or in the audio sample to be verified, as it only extracts voice features to score similarity.
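
To make "score similarity" concrete, the following is a minimal sketch of comparing two voice feature vectors. It is an illustration only: the service does not document its feature representation or its scoring function here, so the plain lists of floats and the use of cosine similarity are assumptions, and `cosine_similarity` is a hypothetical helper, not part of any Speaker Recognition API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length feature vectors.

    Assumption: voice features are represented as plain numeric vectors.
    The real service's representation and scoring are not documented here.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical enrolled signature and a new sample's features.
enrolled_signature = [0.2, 0.7, 0.1]
sample_features = [0.25, 0.65, 0.05]
score = cosine_similarity(enrolled_signature, sample_features)
```

A score near 1 would indicate that the two vectors point in nearly the same direction. Because nothing in the comparison depends on the words spoken, this kind of scoring fits the text-independent case.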

The APIs are not intended to determine whether the audio is from a live person or an imitation/recording of an enrolled speaker.

Speaker Identification

Speaker Identification is used to determine an unknown speaker's identity within a group of enrolled speakers. Speaker Identification enables you to attribute speech to individual speakers, and unlock value from scenarios with multiple speakers, such as:

  • Support solutions for remote meeting productivity
  • Build multi-user device personalization

How does Speaker Identification work?

Enrollment for speaker identification is text-independent, which means that there are no restrictions on what the speaker says in the audio. Similar to Speaker Verification, in the enrollment phase the speaker's voice is recorded, and voice features are extracted to form a unique voice signature. In the identification phase, the input voice sample is compared to a specified list of enrolled voices (up to 50 in each request).
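
If an application has more than 50 enrolled profiles, the per-request limit implies splitting the candidate list into batches. The sketch below shows one way to do that. Only the limit of 50 comes from the text above; `identify`, `chunk`, and the `score_batch` callable are hypothetical names standing in for whatever client call you use, and picking the highest score across batches assumes that scores from separate requests are comparable.

```python
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple

MAX_PROFILES_PER_REQUEST = 50  # limit stated in the documentation


def chunk(ids: Sequence[str], size: int = MAX_PROFILES_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` profile IDs."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def identify(
    sample: bytes,
    profile_ids: Sequence[str],
    score_batch: Callable[[bytes, List[str]], Dict[str, float]],
) -> Optional[Tuple[str, float]]:
    """Return (profile_id, score) for the best-scoring enrolled profile.

    `score_batch` is a placeholder for one identification request; it is
    assumed to return a score per candidate profile ID in the batch.
    Returns None when there are no candidate profiles.
    """
    best: Optional[Tuple[str, float]] = None
    for batch in chunk(profile_ids):
        for profile_id, score in score_batch(sample, batch).items():
            if best is None or score > best[1]:
                best = (profile_id, score)
    return best
```

In practice you would likely also apply a minimum score before attributing speech to the best candidate, since the highest-scoring profile is not necessarily a match.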

Data security and privacy

Speaker enrollment data is stored in a secured system, including the speech audio for enrollment and the voice signature features. The speech audio for enrollment is only used when the algorithm is upgraded and the features need to be extracted again. The service does not retain the speech recording or the extracted voice features that are sent to the service during the recognition phase.

You control how long data should be retained. You can create, update, and delete enrollment data for individual speakers through API calls. When the subscription is deleted, all the speaker enrollment data associated with the subscription will also be deleted.

As with all of the Cognitive Services resources, developers who use the Speaker Recognition service must be aware of Microsoft's policies on customer data. You should ensure that you have received the appropriate permissions from the users for Speaker Recognition. For more information, see the Cognitive Services page on the Microsoft Trust Center.

Common questions and solutions

Question | Solution
--- | ---
What scenarios can Speaker Recognition be used for? | Call center customer verification, voice-based patient check-in, meeting transcription, multi-user device personalization.
What is the difference between Identification and Verification? | Identification is the process of detecting which member from a group of speakers is speaking. Verification is the act of confirming that a speaker matches a known, or enrolled, voice.
What's the difference between text-dependent and text-independent verification? | Text-dependent verification requires a specific passphrase for both enrollment and recognition. Text-independent verification requires a longer voice sample for enrollment, but anything can be spoken, both during enrollment and during recognition.
What languages are supported? | English, French, Spanish, Chinese, German, Italian, Japanese, and Portuguese.
What Azure regions are supported? | Speaker Recognition is a preview service, and is currently only available in the West US region.
What audio formats are supported? | Mono, 16-bit, 16 kHz PCM-encoded WAV.
Accept and Reject responses aren't accurate, how do you tune the threshold? | Since the optimal threshold varies highly with scenarios, the API decides whether to "Accept" or "Reject" simply based on a default threshold of 0.5. Advanced users are advised to override the default decision and fine-tune the result based on their own scenario.
Can you enroll one speaker multiple times? | Yes, for text-dependent verification, you can enroll a speaker up to 50 times. For text-independent verification or speaker identification, you can enroll with up to 300 seconds of audio.
What data is stored in Azure? | Enrollment audio is stored in the service until the voice profile is deleted. Recognition audio samples are not retained or stored.
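
Two of the answers above lend themselves to a small client-side sketch: checking that a file is mono, 16-bit, 16 kHz PCM WAV before sending it, and overriding the default 0.5 decision with your own threshold. Both helpers are hypothetical and use only the Python standard library; they do not call the service. Treating a score exactly equal to the threshold as "Accept" is an assumption, as is the example threshold of 0.8.

```python
import wave


def is_supported_wav(path: str) -> bool:
    """Return True if the file is mono, 16-bit, 16 kHz, uncompressed PCM WAV."""
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getnchannels() == 1            # mono
                and wav.getsampwidth() == 2        # 2 bytes per sample = 16 bit
                and wav.getframerate() == 16000    # 16 kHz
                and wav.getcomptype() == "NONE"    # uncompressed PCM
            )
    except (wave.Error, EOFError):
        # Not a readable PCM WAV file.
        return False


def decide(score: float, threshold: float = 0.5) -> str:
    """Map a verification score to "Accept" or "Reject".

    The default of 0.5 mirrors the documented default threshold; whether the
    service accepts a score exactly at the threshold is an assumption here.
    """
    return "Accept" if score >= threshold else "Reject"


# A stricter, scenario-specific threshold (0.8 is an arbitrary example).
strict_result = decide(0.65, threshold=0.8)
```

Raising the threshold trades more false rejections for fewer false acceptances, so the right value depends on how costly each kind of error is in your scenario.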

Next steps

  • Complete the Speaker Recognition basics article for a run-through of common design patterns you can use in your applications.
  • See the video tutorial for text-independent speaker verification.