Understanding the top n tf-idf features in TfidfVectorizer

Question

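I am trying to get a better understanding of scikit-learn's TfidfVectorizer. The following code uses two documents, doc1 = The car is driven on the road and doc2 = The truck is driven on the highway. Calling fit_transform produces a vectorized matrix of tf-idf weights.

According to the tf-idf matrix, shouldn't the top words be highway, truck, car rather than highway, truck, driven, given that highway = truck = car = 0.63 and driven = 0.44?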

#testing tfidfvectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(tokenizer= lambda x:x.split(),stop_words = 'english')
response = vectorizer.fit_transform(tn)

feature_array = np.array(vectorizer.get_feature_names_out()) #list of features (get_feature_names() was removed in scikit-learn 1.2)
print(feature_array)
print(response.toarray())

sorted_features = np.argsort(response.toarray()).flatten()[:-1] #index of highest valued features
print(sorted_features)

#printing top 3 weighted features
n = 3
top_n = feature_array[sorted_features][:n]
print(top_n)

['car' 'driven' 'highway' 'road' 'truck']
[[0.6316672  0.44943642 0.         0.6316672  0.        ]
 [0.         0.44943642 0.6316672  0.         0.6316672 ]]
[2 4 1 0 3 0 3 1 2]
['highway' 'truck' 'driven']
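
A likely explanation for the output above: np.argsort on a 2-D array sorts each row separately and in ascending order, and the slice [:-1] only drops the last element, whereas [::-1] would reverse the order. The first three indices (2, 4, 1) are therefore the three lowest-weighted features of doc1 (highway = 0, truck = 0, driven = 0.44), not the highest. The sketch below reuses the feature names and matrix printed in the question, so it needs only NumPy. It shows one way to rank per document and one possible corpus-level ranking by maximum weight. The names top_per_doc, max_scores and top_overall are made up for illustration, and scoring by the maximum is an assumption (a sum or mean across documents would be other reasonable choices).

```python
import numpy as np

# Feature names and tf-idf matrix copied from the output printed in the question.
feature_array = np.array(['car', 'driven', 'highway', 'road', 'truck'])
tfidf = np.array([
    [0.6316672, 0.44943642, 0.0, 0.6316672, 0.0],  # doc1
    [0.0, 0.44943642, 0.6316672, 0.0, 0.6316672],  # doc2
])

n = 3

# Per-document ranking: argsort one row at a time (ascending),
# reverse with [::-1] so the largest weight comes first, then keep n.
top_per_doc = []
for row in tfidf:
    order = np.argsort(row, kind='stable')[::-1][:n]
    top_per_doc.append(feature_array[order].tolist())

# Corpus-level ranking (assumed scoring rule): score each feature by its
# maximum weight over all documents, then sort descending.
max_scores = tfidf.max(axis=0)
top_overall = feature_array[np.argsort(max_scores, kind='stable')[::-1][:n]].tolist()

print(top_per_doc)
print(top_overall)
```

Within each document, driven (0.44) still ranks third because the document contains only three non-stop-word terms. In the corpus-level ranking, car, highway, road and truck all tie at 0.63, so which three of them appear depends on how ties are broken, but driven is no longer among them.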
