How does a tf-idf model handle unseen words at test time?


I have read many blog posts but am not satisfied with the answers. Suppose I train a tf-idf model on a few example documents:

   " John like horror movie."
   " Ryan watches dramatic movies"
    ------------so on ----------

I use this code:

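   from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

   # twenty_train is assumed to be loaded already (e.g. the 20 newsgroups training set)
   count_vect = CountVectorizer()
   tfidf_transformer = TfidfTransformer()
   X_train_counts = count_vect.fit_transform(twenty_train.data)
   X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
   print(X_train_counts.todense())
   # Gives the count of each word in each document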

But this doesn't tell me which word each column stands for. How can I get the words as column headers in the X_train_counts output, and likewise for X_train_tfidf?

So the X_train_tfidf output would be a matrix of tf-idf scores:

      Horror  watch  movie  drama
doc1  score1  ...    ...    ...
doc2  ...     ...    ...    ...

Is that correct?

What does fit do, and what does transform do? The sklearn documentation says:

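The fit(..) method is used first to fit our estimator to the data, and then the transform(..) method to transform our count matrix into a tf-idf representation.

What does "fit our estimator to the data" mean?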

Now suppose a new test document comes in:

    " Ron likes thriller movies"

How do I convert this document to tf-idf? We can't, can we? How is the word thriller, which does not appear in the training documents, handled?

python-3.x scikit-learn tf-idf
1 Answer

Take two texts as input:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
# fit_transform learns the vocabulary (then the idf weights) from the training texts
X_train_counts = count_vect.fit_transform(text)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(X_train_tfidf.toarray(), columns=count_vect.get_feature_names_out())

Output:

   dramatic  horror  john  like  movie  movies  ryan  watches
0       0.0     0.5   0.5   0.5    0.5     0.0   0.0      0.0
1       0.5     0.0   0.0   0.0    0.0     0.5   0.5      0.5
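
As for what fit and transform each do, one way to see it is to look at what the fitted objects store. The following is only a minimal sketch on the same two texts; the names vocabulary and idf are mine, for illustration. fit learns state from the training data (the vocabulary for CountVectorizer, one idf weight per term for TfidfTransformer), while transform only applies that stored state and learns nothing new.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# fit: learn the vocabulary; transform: turn the texts into count vectors
X_train_counts = count_vect.fit_transform(text)
# fit: learn one idf weight per vocabulary term; transform: rescale and L2-normalize
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# State learned by fit: term -> column index, and an idf weight per column
vocabulary = {term: int(idx) for term, idx in count_vect.vocabulary_.items()}
idf = tfidf_transformer.idf_

print(sorted(vocabulary, key=vocabulary.get))  # terms in column order
print(idf)                                     # one weight per column
print("thriller" in vocabulary)                # never seen during fit
```

Every word here occurs in exactly one of the two training texts, so with scikit-learn's default smoothing all eight idf weights come out equal, at ln(3/2) + 1.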

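Now, to test it on a new comment, we need to use the transform function (not fit_transform). Out-of-vocabulary words are simply ignored during vectorization.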

new_comment = ["ron don't like dramatic movie"]

pd.DataFrame(tfidf_transformer.transform(count_vect.transform(new_comment)).toarray(), columns=count_vect.get_feature_names_out())


    dramatic    horror  john    like    movie   movies  ryan    watches
0   0.57735      0.0    0.0    0.57735  0.57735   0.0   0.0      0.0
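
To see which words get dropped, one option is to run the vectorizer's own analyzer on the new text and compare the tokens against vocabulary_. This is a rough sketch on the same two training texts; tokens, unseen and row are illustrative names. It also runs the test sentence from the question, where only movies is in the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(count_vect.fit_transform(text))

# build_analyzer() returns the preprocessing + tokenizing function that transform() uses.
# The default token pattern drops single-character tokens, so "don't" becomes "don".
analyze = count_vect.build_analyzer()
tokens = analyze("ron don't like dramatic movie")
unseen = [t for t in tokens if t not in count_vect.vocabulary_]
print(tokens)
print(unseen)

# The question's test document: only "movies" was seen during training
row = tfidf_transformer.transform(count_vect.transform(["Ron likes thriller movies"]))
print(row.shape)      # same number of columns as the training matrix
print(row.toarray())  # only the "movies" column is non-zero
```

So thriller contributes nothing: the vector keeps the same 8 columns as the training matrix, and only words already in the vocabulary receive a weight.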