Regex vocabulary does not work with sklearn TfidfVectorizer

Question

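I am trying to compute tf-idf for a selected set of words in a corpus, but it does not work when I use a regular expression for one of the selected words.

Below is an example I copied from another Stack Overflow question, with small changes to reflect my problem.

The code is pasted below. It works if I list 'chocolate' and 'chocolates' as separate entries, but it does not work if I write 'chocolate|chocolates'.

Can someone help me understand why, and suggest a possible way to solve this?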

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

keywords = ['tim tam', 'jam', 'fresh milk', 'chocolate|chocolates', 'biscuit pudding']
corpus = {1: "making chocolate biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
tfidf = TfidfVectorizer(vocabulary = keywords, stop_words = 'english', ngram_range=(1,3))
tfs = tfidf.fit_transform(corpus.values())
# On scikit-learn 1.0+ use get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_names = tfidf.get_feature_names()
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
# Print every non-zero (keyword, document id) score.
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])
tfidf_results = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index).T

I expect the result to look like this:

('biscuit pudding', 1) 0.652490884512534
('chocolates', 1) 0.3853716274664007
('chocolate', 1) 0.652490884512534
('chocolates', 2) 0.5085423203783267
('tim tam', 2) 0.8610369959439764
('chocolates', 3) 0.5085423203783267
('fresh milk', 3) 0.8610369959439764

However, it currently returns:

('biscuit pudding', 1) 1.0
('tim tam', 2) 1.0
('fresh milk', 3) 1.0

Answer 1

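I assume you are using TfidfVectorizer from scikit-learn. If you read the documentation carefully, nowhere does it say that you can use regular expressions in the vocabulary. Could you point to the question you copied this from?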
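To illustrate why the pattern string never matches: vocabulary entries are compared literally against the n-grams that the analyzer produces. The following is a minimal sketch, assuming the same stop_words and ngram_range settings as in the question, that inspects those n-grams for the third document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analysis settings as in the question, but without a fixed vocabulary.
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3))
analyze = vectorizer.build_analyzer()

# The analyzer returns the list of n-grams extracted from one document.
ngrams = analyze("making chocolates drink different way using fresh milk egg")

print('chocolates' in ngrams)            # True: a literal token from the text
print('chocolate|chocolates' in ngrams)  # False: the pattern is never produced as an n-gram
```

Since 'chocolate|chocolates' is treated as a plain string rather than a pattern, its column stays empty for every document.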

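If you want to group several terms together manually, you can pass the vocabulary as a mapping instead of an iterable. For example:

keywords = {'tim tam':0, 'jam':1, 'fresh milk':2, 'chocolate':3, 'chocolates':3, 'biscuit pudding':4}

Note how chocolate and chocolates map to the same index.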
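One caveat: as far as I know, the scikit-learn releases I have checked validate a mapping vocabulary and raise a ValueError when two terms share an index, so the mapping may be rejected on your version. A different way to get the same grouping is to normalize the variants before vectorizing. The sketch below is not part of the original answer; `variant_patterns` and `normalize` are hypothetical names, and it relies on the `preprocessor` parameter of TfidfVectorizer, which replaces the default lowercasing step.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical table: regex pattern -> the single vocabulary term it stands for.
variant_patterns = {r"\bchocolates?\b": "chocolate"}

def normalize(text):
    """Lowercase the text and rewrite every variant to its canonical term."""
    text = text.lower()
    for pattern, canonical in variant_patterns.items():
        text = re.sub(pattern, canonical, text)
    return text

keywords = ['tim tam', 'jam', 'fresh milk', 'chocolate', 'biscuit pudding']
corpus = {
    1: "making chocolate biscuit pudding easy first get your favourite biscuit chocolates",
    2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates",
    3: "making chocolates drink different way using fresh milk egg",
}

# A custom preprocessor runs on each raw document before tokenization.
tfidf = TfidfVectorizer(vocabulary=keywords, preprocessor=normalize,
                        stop_words='english', ngram_range=(1, 3))
tfs = tfidf.fit_transform(corpus.values())

# With a list vocabulary, column order follows the order of `keywords`.
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((keywords[col], list(corpus)[row]), tfs[row, col])
```

With this approach both spellings are counted in the single 'chocolate' column, so the scores will differ from the ones expected in the question, where the two forms were scored separately.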
