"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]" when setting the value of col A if col B contains a string

Question

I have a dataframe (called corpus) with one column (tweet) and 2 rows:

['check, tihs, out, this, bear, love, jumping, on, this, plant']
['i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']

I also have a list (called vocab) of the unique words:

['check',
 'tihs',
 'out',
 'this',
 'bear',
 'love',
 'jumping',
 'on',
 'plant',
 'i',
 'can',
 't',
 'the',
 'noise',
 'from',
 'that',
 'power',
 'it',
 'make',
 'me',
 'jump']

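I want to add a new column for each word in vocab. Every value in a new column should be zero, except where the tweet contains the word, in which case the value in that word's column should be 1.

So I tried running the following code: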

for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1

...which raises the following error:

"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"

How can I check whether the tweet contains the word and, if it does, set the value of that word's new column to 1?

Answer 1

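The cells of corpus['tweet'] are lists, each holding a single string, rather than strings themselves. So .str.contains returns NaN for every row, and .loc then tries to look up those NaN values as index labels, which raises the error. You may want to do this instead: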

# turn tweets into strings
corpus["tweet"] = [x[0] for x in corpus['tweet']]

# one-hot-encode
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1

However, this may not be what you want, because contains matches any substring: for example, this girl goes to school would get a 1 in both the is and this columns.
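To make the substring issue concrete, here is a minimal sketch (not part of the original answer). The sample series `tweets` is made up for illustration, and wrapping the word in `\b` word boundaries with `re.escape` is just one possible way to restrict the match to whole words.

```python
import re
import pandas as pd

# Hypothetical sample data, used only to show the difference.
tweets = pd.Series(["this girl goes to school", "is it raining"])

# Plain substring search: "is" is also found inside "this".
substring_hit = tweets.str.contains("is", regex=False)

# Whole-word search: \b only matches at a word boundary,
# and re.escape protects words containing regex metacharacters.
pattern = r"\b" + re.escape("is") + r"\b"
word_hit = tweets.str.contains(pattern, regex=True)

print(substring_hit.tolist())
print(word_hit.tolist())
```

In the loop above, the same pattern could be built once per word and passed to contains in place of the bare word.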

Depending on your data, you can do this instead:

corpus["tweet"] = [x[0] for x in corpus['tweet']]

corpus = corpus.join(corpus['tweet'].str.get_dummies(', ')
                         .reindex(vocab, axis=1, fill_value=0)
                    )
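As a rough end-to-end check of the points in this answer, the sketch below rebuilds the question's dataframe, shows that .str.contains yields NaN on list cells, and then applies the get_dummies approach. The shortened `vocab` and the names `nan_mask`, `text` and `result` are assumptions made for illustration, and `.str[0]` is used here as an alternative to the list comprehension for unwrapping the lists.

```python
import pandas as pd

# Each cell is a one-element list, as in the question.
corpus = pd.DataFrame({"tweet": [
    ["check, tihs, out, this, bear, love, jumping, on, this, plant"],
    ["i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump"],
]})
vocab = ["bear", "plant", "jump", "jumping", "noise"]  # shortened for illustration

# On list cells, .str.contains cannot match anything and returns NaN.
nan_mask = corpus["tweet"].str.contains("bear")

# Unwrap the lists, split on ", " and one-hot encode whole tokens.
text = corpus["tweet"].str[0]
dummies = text.str.get_dummies(sep=", ").reindex(vocab, axis=1, fill_value=0)
result = pd.DataFrame({"tweet": text}).join(dummies)

print(nan_mask.isna().all())
print(result[vocab])
```

Because get_dummies splits on the separator, jump and jumping end up in separate columns and do not trigger each other, unlike the substring search.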

Answer 2

This will do it:

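import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

l = ['check, this, out, this, bear, love, jumping, on, this, plant',
     'i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']
vect = CountVectorizer()
# fit_transform returns a sparse matrix of per-tweet word counts
X = pd.DataFrame(vect.fit_transform(l).toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X.columns = vect.get_feature_names_out()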

Output:

   bear  can  check  from  it  jump  ...  out  plant  power  that  the  this
0     1    0      1     0   0     0  ...    1      1      0     0    0     3
1     1    1      0     1   1     1  ...    0      1      1     1    1     0