I'm a bit confused about the usage of TfidfTransformer vs. TfidfVectorizer, because they look very similar. One transforms raw text into a matrix (TfidfVectorizer), while the other takes text that has already been transformed into a count matrix (with CountVectorizer).
Can anyone explain the difference between the two?
CountVectorizer + TfidfTransformer = TfidfVectorizer
This is the simple, practical way to understand it: TfidfVectorizer performs CountVectorizer and TfidfTransformer in a single step.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
# transformer a
a = Pipeline(steps=[
    ('count_vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
# transformer b
b = TfidfVectorizer()
The a and b transformers will perform exactly the same transformation.
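A quick way to convince yourself of this equivalence is to fit both on the same corpus and compare the matrices. The corpus below is made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are friends",
]

# Pipeline a: counts first, then TF-IDF weighting.
a = Pipeline(steps=[
    ('count_vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
# Transformer b: both steps at once.
b = TfidfVectorizer()

X_a = a.fit_transform(corpus)
X_b = b.fit_transform(corpus)

# With default parameters, the two produce identical TF-IDF matrices.
print(np.allclose(X_a.toarray(), X_b.toarray()))  # True
```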
If the preprocessing before feeding features to the model consists only of TF-IDF, then b is the best choice. But sometimes we want to keep the preprocessing steps separate, for example to keep only the best terms before applying the inverse document frequency. In that case we would choose a, because we can run CountVectorizer and then do additional preprocessing before the IDF step. For example:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
# do counter terms and allow max 150k terms with 1-2 Ngrams
# select the best 10K (reducing the size of our features)
# do the IDF and the pass to our model
hisia = Pipeline(steps=[
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2),
                                         max_features=150000)),
    ('feature_selector', SelectKBest(chi2, k=10000)),
    ('tfidf', TfidfTransformer(sublinear_tf=True)),
    ('logistic_regression', LogisticRegressionCV(cv=5,
                                                 solver='saga',
                                                 scoring='accuracy',
                                                 max_iter=200,
                                                 n_jobs=-1,
                                                 random_state=42,
                                                 verbose=0)),
])
In this example, we perform feature selection on the term counts before passing them to IDF. This is possible because we split TF-IDF into its two steps, CountVectorizer followed by TfidfTransformer.
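To see the counts-then-select-then-IDF idea in isolation, here is a minimal sketch of the same pipeline without the classifier. The texts, labels, and the small k are made up to fit a tiny vocabulary; note that chi2-based selection needs labels, so the pipeline is fitted with both X and y:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

texts = [
    "great movie loved it",
    "terrible movie hated it",
    "wonderful acting great plot",
    "awful plot boring acting",
]
labels = [1, 0, 1, 0]  # made-up sentiment labels

pre = Pipeline(steps=[
    ('count_vectorizer', CountVectorizer()),
    # chi2 scores each term's counts against the labels and keeps the top k.
    ('feature_selector', SelectKBest(chi2, k=4)),
    # IDF weighting is applied only to the surviving terms.
    ('tfidf', TfidfTransformer()),
])

X = pre.fit_transform(texts, labels)
# Only the 4 selected terms reach the IDF step.
print(X.shape)  # (4, 4)
```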