I am trying to evaluate BLEU scores for Chinese sentences using NLTK's sentence_bleu() function. The code is as follows:
import nltk
import jieba
from transformers import AutoTokenizer, BertTokenizer, BartForConditionalGeneration
src = '樓上漏水耍花招不處理可以怎麼做'
ref = '上層漏水耍手段不去處理可以怎麼做'
checkpoint = 'fnlp/bart-base-chinese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)
hypothesis_translations = []
for sentence in [src]:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=100, return_token_type_ids=False)
    outputs = model.generate(**inputs)
    translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    hypothesis_translations.append(translated_sentence)
# for Reference tokenization
inputs_ref = tokenizer(ref, return_tensors="pt", truncation=True, max_length=100, return_token_type_ids=False)
outputs_ref = model.generate(**inputs_ref)
tokenized_ref = tokenizer.decode(outputs_ref[0], skip_special_tokens=True)
nltk_bleu = nltk.translate.bleu_score.sentence_bleu(tokenized_ref, hypothesis_translations)
print(nltk_bleu)
Printing nltk_bleu outputs 0.
But when I use the SacreBLEU library's corpus_score(), it returns a normal, expected result:
import evaluate
from sacrebleu.metrics import BLEU
bleu = BLEU()
bleu_score = bleu.corpus_score(references=tokenized_ref, hypotheses=hypothesis_translations)
print(bleu_score)
It returns:
BLEU = 4.79 73.3/3.6/1.9/1.0 (BP = 1.000 ratio = 15.000 hyp_len = 15 ref_len = 1)
How can I get NLTK's sentence_bleu() to return a correct result?
Clearly, SacreBLEU applies some kind of smoothing while NLTK does not. I downloaded the SacreBLEU source and looked at the default settings for BLEU:
def __init__(self, lowercase: bool = False,
             force: bool = False,
             tokenize: Optional[str] = None,
             smooth_method: str = 'exp',
             smooth_value: Optional[float] = None,
             max_ngram_order: int = MAX_NGRAM_ORDER,
             effective_order: bool = False,
             trg_lang: str = '',
             references: Optional[Sequence[Sequence[str]]] = None):
    ...

@staticmethod
def compute_bleu(correct: List[int],
                 total: List[int],
                 sys_len: int,
                 ref_len: int,
                 smooth_method: str = 'none',
                 smooth_value=None,
                 effective_order: bool = False,
                 max_ngram_order: int = MAX_NGRAM_ORDER) -> BLEUScore:
    """Computes BLEU score from its sufficient statistics with smoothing.

    Smoothing methods (citing "A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU",
    Boxing Chen and Colin Cherry, WMT 2014: http://aclweb.org/anthology/W14-3346)

    - none: No smoothing.
    - floor: Method 1 (requires small positive value (0.1 in the paper) to be set)
    - add-k: Method 2 (Generalizing Lin and Och, 2004)
    - exp: Method 3 (NIST smoothing method i.e. in use with mteval-v13a.pl)
From this we can see that SacreBLEU uses "Method 3" smoothing by default.
Now let's look at NLTK's version:
help(nltk.translate.bleu_score.sentence_bleu)
...
To avoid this harsh behaviour when no ngram overlaps are found a smoothing
function can be used.
>>> chencherry = SmoothingFunction()
>>> sentence_bleu([reference1, reference2, reference3], hypothesis2,
... smoothing_function=chencherry.method1) # doctest: +ELLIPSIS
0.0370...
...
This SmoothingFunction object implements all of the smoothing methods from the paper cited above. As we saw, you will want method3:
help(nltk.translate.bleu_score.SmoothingFunction.method3)
Help on function method3 in module nltk.translate.bleu_score:
method3(self, p_n, *args, **kwargs)
Smoothing method 3: NIST geometric sequence smoothing
The smoothing is computed by taking 1 / ( 2^k ), instead of 0, for each
precision score whose matching n-gram count is null.
k is 1 for the first 'n' value for which the n-gram match count is null.
For example, if the text contains:
- one 2-gram match
- and (consequently) two 1-gram matches
the n-gram count for each individual precision score would be:
- n=1 => prec_count = 2 (two unigrams)
- n=2 => prec_count = 1 (one bigram)
- n=3 => prec_count = 1/2 (no trigram, taking 'smoothed' value of 1 / ( 2^k ), with k=1)
- n=4 => prec_count = 1/4 (no fourgram, taking 'smoothed' value of 1 / ( 2^k ), with k=2)
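Putting this together, here is a minimal sketch of the fix using the sentences from the question. Note that sentence_bleu() expects a list of tokenized references and a tokenized hypothesis, not raw strings; the sentences are split into characters here, a common choice for Chinese (jieba word segmentation would also work):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hyp = '樓上漏水耍花招不處理可以怎麼做'
ref = '上層漏水耍手段不去處理可以怎麼做'

# Character-level tokenization; NLTK expects token lists, not strings.
hyp_tokens = list(hyp)
ref_tokens = list(ref)

chencherry = SmoothingFunction()
score = sentence_bleu([ref_tokens], hyp_tokens,
                      smoothing_function=chencherry.method3)
print(score)  # non-zero, unlike the unsmoothed call on raw strings
```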