ValueError: Length of values does not match length of index in nested loop


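I'm trying to remove stop words from every row of a column. I have already tokenized the column with nltk's word_tokenize, so each row now holds a list of tokens. I'm trying to remove the stop words with the nested list comprehension below, but it fails with ValueError: Length of values does not match length of index. How can I fix this?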

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                    encoding = "latin-1")

data = data[['v1','v2']]

data = data.rename(columns = {'v1': 'label', 'v2': 'text'})

stopwords = set(stopwords.words('english'))

data['text'] = data['text'].str.lower()
data['new'] = [word_tokenize(row) for row in data['text']]
data['new'] = [word for new in data['new'] for word in new if word not in stopwords]

My text data

data['text'].head(5)
Out[92]: 
0    go until jurong point, crazy.. available only ...
1                        ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object

After tokenizing with nltk's word_tokenize

data['new'].head(5)
Out[89]: 
0    [go, until, jurong, point, ,, crazy.., availab...
1             [ok, lar, ..., joking, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: new, dtype: object

Traceback

runfile('D:/python projects/NLP_nltk_first.py', wdir='D:/python projects')
Traceback (most recent call last):

  File "D:\python projects\NLP_nltk_first.py", line 36, in <module>
    data['new'] = [new for new in data['new'] for word in new if word not in stopwords]

  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3487, in __setitem__
    self._set_item(key, value)

  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3564, in _set_item
    value = self._sanitize_column(key, value)

  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3749, in _sanitize_column
    value = sanitize_index(value, self.index, copy=False)

  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 612, in sanitize_index
    raise ValueError("Length of values does not match length of index")

ValueError: Length of values does not match length of index
python pandas for-loop nltk list-comprehension
1 Answer

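Read the error message carefully:

ValueError: Length of values does not match length of index

Here, the "values" are whatever is on the right-hand side of the =: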

values = [word for new in data['new'] for word in new if word not in stopwords]

And the "index" is the row index of the DataFrame:

index = data.index

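The index always has exactly as many entries as the DataFrame has rows.

The problem is that values is too long for index, that is, it has more entries than the DataFrame has rows. The nested comprehension flattens every row's tokens into one list, so you get one entry per kept word instead of one entry per row. If that is not obvious from reading the code, try the following: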

data['text_tokenized'] = [word_tokenize(row) for row in data['text']]

values = [word for new in data['text_tokenized'] for word in new if word not in stopwords]

print('N rows:', data.shape[0])
print('N new values:', len(values))
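
The same check can be reproduced without the original file. This is only a minimal sketch: the two-row DataFrame, the tiny stop_words set, and the use of str.split in place of word_tokenize are all assumptions made so the snippet runs on its own.

```python
import pandas as pd

# Toy stand-in for spam.csv (assumption: two short messages)
data = pd.DataFrame({
    "label": ["ham", "spam"],
    "text": ["ok i am joking", "free entry to win the cup"],
})

# Hypothetical, hand-written stop word set instead of nltk's English list
stop_words = {"i", "am", "to", "the"}

# str.split stands in for word_tokenize so no nltk data download is needed
data["text_tokenized"] = data["text"].str.split()

# Flattening comprehension: one entry per kept token, not one per row
values = [word
          for tokens in data["text_tokenized"]
          for word in tokens
          if word not in stop_words]

print("N rows:", len(data.index))
print("N new values:", len(values))
```

With two rows and six surviving tokens, assigning values to a new column cannot line up with the index, which is exactly what the ValueError reports.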

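As for how to fix it: that depends entirely on what you want to end up with. One option is to "explode" the data into one row per token (note also the use of .map instead of a list comprehension):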

data['text_tokenized'] = data['text'].map(word_tokenize)

tokens_flat = data['text_tokenized'].explode()
data_flat = data[['label']].join(tokens_flat)
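
If the goal is instead to keep one filtered token list per row, a possible alternative is to nest the comprehension so the outer list has one element per row. The sketch below shows both shapes side by side on toy data; the sample rows, the stop_words set, the column name filtered, and str.split as a stand-in for word_tokenize are illustrative assumptions, not part of the original code.

```python
import pandas as pd

# Toy stand-in for the real data (assumption)
data = pd.DataFrame({
    "label": ["ham", "spam"],
    "text": ["ok i am joking", "free entry to win the cup"],
})
stop_words = {"i", "am", "to", "the"}  # hypothetical stop word set

# str.split stands in for word_tokenize
data["text_tokenized"] = data["text"].str.split()

# Option A: one filtered list per row, so the length matches the index
data["filtered"] = [
    [word for word in tokens if word not in stop_words]
    for tokens in data["text_tokenized"]
]

# Option B: one row per token, then drop stop words with a boolean mask
tokens_flat = data["text_tokenized"].explode()
tokens_flat = tokens_flat[~tokens_flat.isin(stop_words)]
data_flat = data[["label"]].join(tokens_flat)

print(data[["label", "filtered"]])
print(data_flat)
```

Option A keeps the original row count, while option B repeats each label once per surviving token; which one fits depends on what the later processing expects.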

As an unrelated tip, you can make the CSV loading more efficient by reading only the columns you need, like so:

data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                    encoding="latin-1",
                    usecols=["v1", "v2"])
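
To see the effect of usecols without the original file, here is a small sketch that reads from an in-memory string. The CSV contents and the extra column are made up for illustration; only v1 and v2 come from the question.

```python
import io

import pandas as pd

# Made-up CSV with one column we do not need (assumption)
csv_text = "v1,v2,extra\nham,ok lar,x\nspam,free entry,y\n"

# Only v1 and v2 are parsed; the extra column is skipped at read time
data = pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])
data = data.rename(columns={"v1": "label", "v2": "text"})

print(data.columns.tolist())
```

This replaces the separate data[['v1','v2']] selection step from the question.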