Expected bytes or unicode string


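I have been trying to do text classification. The dataset has two columns, Action and Category, and I have split it into training and test sets. When I fit the vectorizer I get the error "np.nan is an invalid document, expected byte or unicode string."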

    import numpy as np
    import pandas as pd

    data1 = pd.read_excel("Kumar_doc_prady.xlsx")
    Category1=data1['Category'].unique()

    data1.head(10)
    Out[138]: 
                                                  Action    Category  
    0  1.​Excel based macro would be designed which w...  Automation        
    1  ​Add a checkpoint in the Audit checklist to ch...   Checklist        
    2  ​An excel based macro would be created which w...  Automation        
    3  ​Add a checkpoint in the Audit checklist to ch...   Checklist        
    4  Update the existing automation to delete the u...   Checklist       
    5  Add checkpoints in the existing Audit checklis...   Checklist        
    6  Implement a Peer Audit checklist to verify tha...   Checklist        
    7  ​Checklist audits would be introduced for sele...   CHecklist        
    8  Add a checkpoint in the Audit checklist to che...   Checklist        
    9  Create an Automation to extract SKU related da...   Checklist        


    from sklearn.preprocessing import LabelEncoder
    label = LabelEncoder()
    data1["labels1"] = label.fit_transform(data1["Category"])
    #data1["Category1"] = label.fit_transform(data1["Category1"])
    data1[["Category", "labels1"]].head()
    Out[114]: 
         Category  labels1
    0  Automation        3
    1   Checklist        6
    2  Automation        3
    3   Checklist        6
    4   Checklist        6



    from sklearn.model_selection import train_test_split
    X_train1, X_test1, y_train1, y_test1 = train_test_split(data1['Action'], data1['labels1'], 
    random_state=1)



    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', 
              lowercase=True, stop_words='english')
    X_train1_cv = cv.fit_transform(X_train1)  

I get the error on the last line above.

    Traceback (most recent call last):

      File "<ipython-input-142-b8096b8dc028>", line 1, in <module>
        X_train1_cv = cv.fit_transform(X_train1)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1220, in fit_transform
        self.fixed_vocabulary_)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1131, in _count_vocab
        for feature in analyze(doc):

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 98, in _analyze
        doc = decoder(doc)

      File "C:\Users\bcpuser\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 218, in decode
        raise ValueError("np.nan is an invalid document, expected byte or "

    ValueError: np.nan is an invalid document, expected byte or unicode string.

This looks like some kind of object-type error.
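
The message names np.nan, which suggests that at least one cell in the Action column is empty rather than text. A minimal sketch for checking this is below. It uses a small hypothetical frame called `sample` in place of data1, and the rows are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data1; the real frame comes from pd.read_excel.
sample = pd.DataFrame({
    "Action": ["Add a checkpoint in the Audit checklist", np.nan, "Create an Automation"],
    "Category": ["Checklist", "Checklist", "Automation"],
})

# Count the missing values in the text column.
missing_count = int(sample["Action"].isna().sum())

# Index labels of the rows with a missing Action, so they can be inspected.
missing_rows = sample.index[sample["Action"].isna()].tolist()

print(missing_count, missing_rows)
```

If the count is nonzero on the real data, those rows are the likely source of the ValueError.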

python machine-learning scikit-learn text-classification
1 Answer

Convert the values from object dtype to a unicode string dtype before passing them to the vectorizer:

    X_train1_cv = cv.fit_transform(X_train1.values.astype('U'))
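
One thing to keep in mind is that `astype('U')` turns a missing value into the literal string "nan", so the vectorizer may end up learning a token `nan`. If that is not wanted, a possible alternative is to blank out or drop the missing rows first. Here is a minimal sketch, assuming a made-up Series `texts` in place of X_train1 and default CountVectorizer settings rather than the ones from the question:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for X_train1, with one missing document.
texts = pd.Series(["add a checkpoint", np.nan, "create an automation"], dtype=object)

# The cast from the answer: NaN becomes the string "nan".
as_unicode = texts.values.astype("U")

# Alternative 1: replace missing documents with an empty string.
filled = texts.fillna("")

# Alternative 2: drop missing documents. The labels would need to be
# filtered with the same mask so rows stay aligned.
mask = texts.notna()
dropped = texts[mask]

# Vocabulary learned after filling with empty strings.
cv_filled = CountVectorizer()
X_filled = cv_filled.fit_transform(filled)
vocab_filled = set(cv_filled.vocabulary_)

# Vocabulary learned after the unicode cast, for comparison.
cv_unicode = CountVectorizer()
cv_unicode.fit(as_unicode)
vocab_unicode = set(cv_unicode.vocabulary_)

print(sorted(vocab_unicode - vocab_filled))
```

Filling keeps the row count unchanged, which is convenient when the labels have already been split. Dropping is cleaner when an empty document carries no information.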