Using TF-IDF and CountVectorizer in a pipeline / grid search

Question

I am trying to use both TF-IDF and CountVectorizer in one pipeline. This is what I did:

pipe = Pipeline([
    ('tfic', TfidfVectorizer()),
    ('cvec',  CountVectorizer()),
    ('lr' ,LogisticRegression())
])

with these parameters:

pipe_parms = {
    'cvec__max_features' : [100,500],
    'cvec__ngram_range' : [(1,1),(1,2)],
    'cvec__stop_words' : [ 'english', None]
}

and this grid search:

gs = GridSearchCV(pipe, param_grid= pipe_parms, cv=3)

Fitting it gives me this error:

lower not found

Using either CountVectorizer or TfidfVectorizer on its own works, but not both together.

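I read other questions on Stack Overflow, and they say that if I want both to work in one pipeline I should use TfidfTransformer(). When I do that, I get the error "could not convert string to float".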

Is there a way to use both vectorizers in one pipeline? If not, what other approach would you suggest?

Thanks

Edit: I found a solution that uses FeatureUnion to combine two parallel transformers (in this case the count and Tfidf vectorizers). I wrote a short blog post about it here: https://link.medium.com/OPzIU0T3N0
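For readers who cannot open the blog post, here is a minimal sketch of what a FeatureUnion-based pipeline could look like. It is not taken from the post. The toy texts, the labels, and the step name "features" are assumptions made up for illustration. Note that the grid-search keys gain an extra prefix, because the vectorizers are now nested inside the union step.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy data, invented purely for illustration.
texts = [
    "I am a bird", "a crow is a bird", "bird fly high in the sky",
    "bird bird bird", "black bird in the dead of night", "crow is black bird",
    "the cat sat on the mat", "a cat is a pet", "cat and dog play in the yard",
    "cat cat cat", "black cat in the dead of night", "dog is a loyal pet",
]
labels = [0] * 6 + [1] * 6

# Both vectorizers receive the raw strings in parallel; FeatureUnion
# concatenates their output columns side by side.
pipe = Pipeline([
    ("features", FeatureUnion([
        ("cvec", CountVectorizer()),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("lr", LogisticRegression()),
])

# Parameter names follow <pipeline step>__<union member>__<parameter>.
pipe_params = {
    "features__cvec__max_features": [100, 500],
    "features__cvec__ngram_range": [(1, 1), (1, 2)],
    "features__cvec__stop_words": ["english", None],
    "features__tfidf__ngram_range": [(1, 1), (1, 2)],
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
gs.fit(texts, labels)
print(gs.best_params_)
```

Whether concatenating counts and TF-IDF weights actually helps a given model is an empirical question; the sketch only shows that the combination can be expressed and tuned in one pipeline.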

Answer 1

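I hope my explanation makes it clearer what is happening here.

You first apply the TfidfVectorizer transform. This turns the collection of texts into a TF-IDF matrix made up of numbers. Suppose you have this list of texts: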

texts = [ 'I am a bird', 'a crow is a bird', 'bird fly high in the sky', 'bird bird bird', 'black bird in the dead of night', 'crow is black bird' ]

Running TfidfVectorizer().fit_transform(texts).todense()

produces

matrix([[0.91399636, 0.40572238, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ], [0. , 0.35748457, 0. , 0.66038049, 0. , 0. , 0. , 0. , 0.66038049, 0. , 0. , 0. , 0. ], ...])

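Then you try to apply CountVectorizer to this matrix of numbers, which I don't think is what you want. Without Pipeline, your code would look something like this:

CountVectorizer().fit_transform(
    TfidfVectorizer().fit_transform(texts).todense()
)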

According to the scikit-learn documentation, CountVectorizer accepts a sequence of strings or bytes, not numbers.
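The mismatch can be reproduced outside a pipeline. The following is a small sketch that reuses the example texts and feeds the sparse TF-IDF output (which is what a Pipeline would pass along) into a second vectorizer. The exact exception type and message may vary between scikit-learn and SciPy versions, so the code catches a generic exception and only prints it.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "I am a bird", "a crow is a bird", "bird fly high in the sky",
    "bird bird bird", "black bird in the dead of night", "crow is black bird",
]

# Step 1 works: strings in, sparse numeric matrix out (6 documents x 13 terms).
tfidf_matrix = TfidfVectorizer().fit_transform(texts)
print(tfidf_matrix.shape)

# Step 2 fails: CountVectorizer tries to lowercase and tokenize each "document",
# but each row is now a numeric vector, not a string.
error = None
try:
    CountVectorizer().fit_transform(tfidf_matrix)
except Exception as exc:  # typically an AttributeError about a missing 'lower'
    error = exc
    print(type(exc).__name__, exc)

# Applied directly to the strings, CountVectorizer is fine.
counts = CountVectorizer().fit_transform(texts)
print(counts.shape)
```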

"Is there a way to use both vectorizers in one pipeline? If not, what other approach would you suggest?"

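I suggest you use either CountVectorizer or TfidfVectorizer, not both in one pipeline. In layman's terms, CountVectorizer outputs the frequency of each word in the collection of strings you pass in, while TfidfVectorizer additionally reweights and normalizes those frequencies. That said, both serve the same purpose: turning a collection of texts into numbers based on word frequencies. So you should use only one of them.

If you explain why you want to use both vectorizers in one pipeline, I will be happy to extend my answer.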
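To make the relationship between the two concrete, here is a short sketch on the same example texts. It relies on the documented fact that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer. This also suggests why TfidfTransformer fails with "could not convert string to float" when it is given raw strings: it expects a count matrix as input, so it has to come after a CountVectorizer step rather than next to a TfidfVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

texts = [
    "I am a bird", "a crow is a bird", "bird fly high in the sky",
    "bird bird bird", "black bird in the dead of night", "crow is black bird",
]

cvec = CountVectorizer()
counts = cvec.fit_transform(texts)                  # raw term counts
tfidf = TfidfVectorizer().fit_transform(texts)      # reweighted, L2-normalized

# Valid chaining order: strings -> counts -> TF-IDF weights.
chained = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(texts)

bird = cvec.vocabulary_["bird"]                     # column index of the term "bird"
print(counts.toarray()[3, bird])                    # count of "bird" in "bird bird bird"
print(tfidf.toarray()[3, bird])                     # its normalized TF-IDF weight
print(np.allclose(tfidf.toarray(), chained.toarray()))
```

With default settings the last line should print True, which is why stacking a TfidfVectorizer on top of a CountVectorizer adds nothing that a single TfidfVectorizer does not already do.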
