Combining CountVectorizer and ngrams in Python

Question

I have a task to classify male and female names using ngrams. The data is in a data frame like this:

name        is_male
Dorian      1
Jerzy       1
Deane       1
Doti        0
Betteann    0
Donella     0

A specific requirement is to use

from nltk.util import ngrams

for this task, creating ngrams with n = 2, 3, 4.

I made a list of names and then applied ngrams to it:

from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

test_ngrams = []
for name in name_list:
    test_ngrams.append(list(ngrams(name,3)))

Now I need to somehow vectorize all of this for classification, so I tried

X_train = count_vect.fit_transform(test_ngrams)

and got:

AttributeError: 'list' object has no attribute 'lower'

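I understand that a list is the wrong input type. Could someone explain how I should do this so that I can use MultinomialNB afterwards? Am I going about it the right way at all? Thanks in advance!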

Answer 1

You are passing a sequence of lists to the vectorizer, which is why you get the AttributeError. Instead, you should pass an iterable of strings. From the CountVectorizer documentation:

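fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return the term-document matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters: raw_documents : iterable

An iterable which yields either str, unicode or file objects.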
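To make the cause of the error concrete, here is a minimal sketch in plain Python. It assumes that the vectorizer's default preprocessing lowercases each document by calling its `.lower()` method, which is what the traceback suggests. The trigram list is built by hand so that it has the same shape as `list(ngrams(name, 3))` from the question.

```python
# Each "document" built in the question is a list of character tuples,
# not a string.
name = "Dorian"
trigrams = [tuple(name[i:i + 3]) for i in range(len(name) - 2)]

try:
    # What the default preprocessing attempts on every document.
    trigrams.lower()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# One possible workaround (an illustration, not the only option): join each
# n-gram into a token and the tokens into a single string per name.
as_string = " ".join("".join(gram) for gram in trigrams)
print(as_string)
```

A string built this way is something a default CountVectorizer can accept, although the approaches described in the rest of this answer avoid the manual joining.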

To answer your question, CountVectorizer can create N-grams by itself through ngram_range (the following produces bigrams):

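from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) keeps only word bigrams
count_vect = CountVectorizer(ngram_range=(2,2))

corpus = [
    'This is the first document.',
    'This is the second second document.',
]
X = count_vect.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count_vect.get_feature_names_out().tolist())
['first document', 'is the', 'second document', 'second second', 'the first', 'the second', 'this is']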
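The example above builds word bigrams, while the names in the question call for character n-grams. If NLTK were not a requirement, a sketch along the following lines might be enough. It uses the documented `analyzer="char"` option together with `ngram_range=(2, 4)`; the `names` list simply copies the six rows shown in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The six example names from the question.
names = ["Dorian", "Jerzy", "Deane", "Doti", "Betteann", "Donella"]

# analyzer="char" works on characters instead of words;
# ngram_range=(2, 4) keeps every run of 2, 3 and 4 characters.
char_vect = CountVectorizer(analyzer="char", ngram_range=(2, 4))
X_chars = char_vect.fit_transform(names)

# One row per name, one column per distinct character n-gram.
print(X_chars.shape)

# Names are lowercased by default, so "Dorian" contributes "do", "dor", "dori".
print("dor" in char_vect.vocabulary_)
```

The resulting sparse count matrix can be passed directly to a classifier such as MultinomialNB.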

Update:

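Since you mention that you have to generate the ngrams with NLTK, we need to override part of CountVectorizer's default behavior, namely the analyzer, which converts raw strings into features: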

analyzer : string, {'word', 'char', 'char_wb'} or callable

[...]

If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input.

Since we already supply the ngrams ourselves, an identity function is sufficient:

count_vect = CountVectorizer(
    analyzer=lambda x:x
)

A complete example combining NLTK ngrams and CountVectorizer:

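import nltk
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]

def build_ngrams(text, n=2):
    # Split on whitespace and return the word n-grams as a list of tuples
    tokens = text.lower().split()
    return list(nltk.ngrams(tokens, n))

corpus = [build_ngrams(document) for document in corpus]

# The documents are already lists of n-grams, so the analyzer returns them unchanged
count_vect = CountVectorizer(
    analyzer=lambda x:x
)

X = count_vect.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; list the vocabulary in column order instead
print(sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get))
[('first', 'document.'), ('is', 'the'), ('second', 'document.'), ('second', 'second'), ('the', 'first'), ('the', 'second'), ('this', 'is')]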
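Applied to the names from the question, the same idea could look roughly like the sketch below. Instead of precomputing the ngrams and using an identity analyzer, it passes a helper as the analyzer; `name_ngrams` is a hypothetical name introduced here, and the six training rows are only the sample shown in the question, far too few for a meaningful model.

```python
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The six rows shown in the question; a real model needs far more data.
names = ["Dorian", "Jerzy", "Deane", "Doti", "Betteann", "Donella"]
is_male = [1, 1, 1, 0, 0, 0]


def name_ngrams(name, sizes=(2, 3, 4)):
    """Hypothetical helper: character n-grams of a name as strings."""
    chars = name.lower()
    return ["".join(gram) for n in sizes for gram in ngrams(chars, n)]


# A callable analyzer receives each raw name and returns its features,
# so the vectorizer does no lowercasing or tokenizing of its own.
count_vect = CountVectorizer(analyzer=name_ngrams)
X_train = count_vect.fit_transform(names)

clf = MultinomialNB()
clf.fit(X_train, is_male)

# New names go through transform (not fit_transform) so that they reuse
# the vocabulary learned from the training names.
X_new = count_vect.transform(["Dorian", "Donella"])
predictions = clf.predict(X_new)
print(predictions)
```

Joining each tuple into a string keeps the vocabulary keys as plain strings, which makes the learned features easier to inspect than tuples of characters.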
