I have a DataFrame (called corpus) with one column (tweet) and 2 rows:
['check, tihs, out, this, bear, love, jumping, on, this, plant']
['i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']
I have a list (called vocab) of the unique words:
['check',
'tihs',
'out',
'this',
'bear',
'love',
'jumping',
'on',
'plant',
'i',
'can',
't',
'the',
'noise',
'from',
'that',
'power',
'it',
'make',
'me',
'jump']
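I want to add a new column for each word in vocab. Every value in a new column should be zero, except when the tweet contains the word, in which case the value in that word's column should be 1.
So I tried running the code below: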
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1
...which then raised the following error:
"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"
How can I check whether the tweet contains the word and, if it does, set the new column for that word to 1?
Your corpus['tweet'] column holds lists, each containing a single string, so .str.contains returns NaN for every row. You may want to do this:
# turn the single-element lists into plain strings
corpus["tweet"] = [x[0] for x in corpus["tweet"]]
# one-hot encode: one 0/1 column per vocabulary word
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1
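To see why the original attempt produced NaN, here is a minimal sketch that rebuilds the frame from the question. It assumes each cell is a one-element list, as the printed rows suggest; the names mask_lists and mask_strings are only illustrative.

```python
import pandas as pd

# Rebuild the question's frame: each cell is a one-element list, not a string.
corpus = pd.DataFrame({
    "tweet": [
        ["check, tihs, out, this, bear, love, jumping, on, this, plant"],
        ["i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump"],
    ]
})

# .str.contains cannot search inside a list, so every row comes back as NaN.
mask_lists = corpus["tweet"].str.contains("bear")
print(mask_lists)

# After unwrapping the single element, the same call yields a boolean mask.
tweets = pd.Series([x[0] for x in corpus["tweet"]])
mask_strings = tweets.str.contains("bear")
print(mask_strings)
```

A NaN mask is what .loc then rejects, which matches the Float64Index([nan, nan]) in the error message.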
However, the loop above may not be what you want, because str.contains matches any substring: for example, "this girl goes to school" would get a 1 in both the is and this columns.
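If whole-word matching is what you need, one possible workaround (a sketch, not the only option) is to wrap each word in regex word boundaries. The re.escape call is an added precaution in case a vocabulary word contains regex metacharacters.

```python
import re

import pandas as pd

tweets = pd.Series(["this girl goes to school"])

# Plain substring search: "is" is found inside "this".
substring_hit = tweets.str.contains("is", regex=False)

# Word-boundary search: "is" must stand alone as a token.
pattern = r"\b" + re.escape("is") + r"\b"
whole_word_hit = tweets.str.contains(pattern, regex=True)

print(substring_hit.tolist(), whole_word_hit.tolist())
```

The same pattern could replace the bare word inside the loop, at the cost of one regex search per vocabulary word.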
Depending on your data, you could do:
corpus["tweet"] = [x[0] for x in corpus["tweet"]]
corpus = corpus.join(
    corpus["tweet"].str.get_dummies(", ")
    .reindex(vocab, axis=1, fill_value=0)
)
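As a rough check of the get_dummies approach, the sketch below runs it on the two tweets from the question (already unwrapped into strings) and compares it with a substring search. It assumes the tokens are always separated by a comma and a space.

```python
import pandas as pd

vocab = ["check", "tihs", "out", "this", "bear", "love", "jumping", "on",
         "plant", "i", "can", "t", "the", "noise", "from", "that", "power",
         "it", "make", "me", "jump"]

tweets = pd.Series([
    "check, tihs, out, this, bear, love, jumping, on, this, plant",
    "i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump",
], name="tweet")

# Split each tweet on ", " and build one 0/1 column per distinct token,
# then order the columns like vocab (words absent everywhere become 0).
dummies = tweets.str.get_dummies(", ").reindex(vocab, axis=1, fill_value=0)

# "jump" is a token only in the second tweet; "jumping" does not count.
jump_flags = dummies["jump"].tolist()

# A substring search, by contrast, also flags the first tweet via "jumping".
jump_substring = tweets.str.contains("jump", regex=False).tolist()

print(jump_flags, jump_substring)
```

Because get_dummies compares whole tokens, repeated words such as this in the first tweet still produce a single 1 rather than a count.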
This will do it:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

l = ['check, this, out, this, bear, love, jumping, on, this, plant',
     'i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']
vect = CountVectorizer()
X = pd.DataFrame(vect.fit_transform(l).toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X.columns = vect.get_feature_names_out()
Output:
bear can check from it jump ... out plant power that the this
0 1 0 1 0 0 0 ... 1 1 0 0 0 3
1 1 1 0 1 1 1 ... 0 1 1 1 1 0