What is a BLEU score?

BLEU (Bilingual Evaluation Understudy) measures the differences between an automatic translation and one or more human-created reference translations of the same source sentence.

Scoring process

The BLEU algorithm compares consecutive phrases of the automatic translation with the consecutive phrases it finds in the reference translation, and counts the number of matches in a weighted fashion. These matches are position independent. A higher degree of matching indicates greater similarity to the reference translation and yields a higher score. Intelligibility and grammatical correctness are not taken into account.
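The matching described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the commonly published BLEU formulation (clipped n-gram counts for n = 1 to 4, a geometric mean of the precisions, and a brevity penalty); the text above does not specify the exact weighting used by the service, and the function names `ngrams`, `clipped_precision`, and `sentence_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every consecutive n-token phrase in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Return (matched, total) n-gram counts for one candidate translation.

    Matching ignores position: only the phrase itself has to occur in a
    reference. Each phrase is credited at most as many times as it occurs
    in any single reference (clipping).
    """
    cand_counts = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Assumed standard BLEU: brevity penalty times geometric mean of precisions."""
    precisions = []
    for n in range(1, max_n + 1):
        matched, total = clipped_precision(candidate, references, n)
        if matched == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        precisions.append(matched / total)
    cand_len = len(candidate)
    # Use the reference whose length is closest to the candidate's.
    ref_len = min((abs(len(ref) - cand_len), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, with the reference `"the cat sat on the mat".split()`, the candidate `"the the the".split()` gets unigram counts of 2 matched out of 3, because the reference contains "the" only twice. Nothing in the computation checks whether the candidate is grammatical or intelligible.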

How does BLEU work?

BLEU's strength is that it correlates well with human judgment by averaging out individual sentence judgment errors over a test corpus, rather than attempting to devise the exact human judgment for every sentence.
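One way to picture the corpus-level averaging is the sketch below. It assumes the commonly published corpus BLEU formulation, in which matched and total n-gram counts are pooled over every sentence in the test set before a single score is computed; the text above does not confirm that the service aggregates exactly this way. The names `ngram_counts` and `corpus_bleu` are hypothetical, and the sketch takes one reference per sentence for brevity.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every consecutive n-token phrase in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Score a whole test set; each candidate is paired with one reference."""
    matched = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            ref_counts = ngram_counts(ref, n)
            # Counter & Counter keeps the smaller count per phrase (clipping).
            matched[n - 1] += sum((cand_counts & ref_counts).values())
            total[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(matched) == 0:
        return 0.0
    # Precisions are computed once, from the pooled counts.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)
```

Because the counts are pooled, a very short sentence that contains no four-word phrase at all does not force the result to zero; it simply contributes fewer counts. This is one reason a score over a test corpus is more stable than a score for any single sentence.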

A more extensive discussion of BLEU scores is here.

BLEU results depend strongly on the breadth of your domain, the consistency of the test data with the training and tuning data, and how much data you have available for training. If your models have been trained on a narrow domain, and your training data is consistent with your test data, you can expect a high BLEU score.

Note

A comparison between BLEU scores is only justifiable when the results come from the same test set, the same language pair, and the same MT engine. A BLEU score from a different test set is bound to be different.