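I have been trying to do text classification. The data has two columns, Action and Category. I have split the dataset into training and test sets, but I get the error "np.nan is an invalid document, expected byte or unicode string."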
import numpy as np
import pandas as pd
data1 = pd.read_excel("Kumar_doc_prady.xlsx")
Category1=data1['Category'].unique()
data1.head(10)
Out[138]:
Action Category
0 1.Excel based macro would be designed which w... Automation
1 Add a checkpoint in the Audit checklist to ch... Checklist
2 An excel based macro would be created which w... Automation
3 Add a checkpoint in the Audit checklist to ch... Checklist
4 Update the existing automation to delete the u... Checklist
5 Add checkpoints in the existing Audit checklis... Checklist
6 Implement a Peer Audit checklist to verify tha... Checklist
7 Checklist audits would be introduced for sele... CHecklist
8 Add a checkpoint in the Audit checklist to che... Checklist
9 Create an Automation to extract SKU related da... Checklist
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data1["labels1"] = label.fit_transform(data1["Category"])
#data1["Category1"] = label.fit_transform(data1["Category1"])
data1[["Category", "labels1"]].head()
Out[114]:
Category labels1
0 Automation 3
1 Checklist 6
2 Automation 3
3 Checklist 6
4 Checklist 6
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(data1['Action'], data1['labels1'],
random_state=1)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b',
lowercase=True, stop_words='english')
X_train1_cv = cv.fit_transform(X_train1)
I get the error on the last line above.
Traceback (most recent call last):
  File "<ipython-input-142-b8096b8dc028>", line 1, in <module>
    X_train1_cv = cv.fit_transform(X_train1)
  File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1220, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1131, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 98, in _analyze
    doc = decoder(doc)
  File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 218, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.
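This looks like an object-dtype issue: the column contains np.nan values, which are not strings.
Convert the object dtype to a unicode string dtype: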
X_train1_cv = cv.fit_transform(X_train1.values.astype('U'))
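One caveat with the cast: astype('U') does not remove the missing rows, it turns each NaN into the literal string 'nan', which the vectorizer then treats as an ordinary word. A minimal sketch of the difference, using a small made-up DataFrame in place of the spreadsheet (the rows and the names df, clean, vocab_cast and vocab_clean are illustrative assumptions, not taken from the original data):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the spreadsheet; one Action is missing.
df = pd.DataFrame({
    "Action": ["Add a checkpoint in the audit checklist",
               np.nan,
               "Create an automation to extract data"],
    "Category": ["Checklist", "Checklist", "Automation"],
})

# Count the rows that have no text at all.
n_missing = int(df["Action"].isna().sum())

# astype('U') avoids the ValueError, but NaN becomes the string 'nan'.
as_unicode = df["Action"].values.astype("U")
vocab_cast = CountVectorizer().fit(as_unicode).vocabulary_

# Alternative: drop incomplete rows before splitting, so that the text
# and the labels stay aligned and no placeholder token is introduced.
clean = df.dropna(subset=["Action", "Category"])
vocab_clean = CountVectorizer().fit(clean["Action"]).vocabulary_

print(n_missing, "nan" in vocab_cast, "nan" in vocab_clean)
```

If only a few rows are affected, dropping them (or replacing them with an empty string via fillna('')) before calling train_test_split is probably the cleaner option. The cast is a quick workaround when a stray 'nan' token is acceptable.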