LabelEncoder cannot inverse_transform after imputing missing values (unseen labels)


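I am at a beginner-to-intermediate level in data science. I want to impute the missing values in a DataFrame using KNN.

Since the DataFrame contains both strings and floats, I need to encode and decode the values with LabelEncoder.

My approach is as follows:

  1. Replace the NaN values so that encoding is possible
  2. Encode the text values and store the encoders in a dictionary
  3. Restore the NaN values (replaced in step 1) so that KNN can impute them
  4. Impute the values with KNN
  5. Decode the values using the encoders in the dictionary

Unfortunately, in the last step, the imputation has added new values that cannot be decoded ("unseen labels" error message).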

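Can you tell me what I am doing wrong and, ideally, help me correct it? Before I finish, I should say that I know there are other tools such as OneHotEncoder, but I don't know them well enough, and I find LabelEncoder more intuitive because you can see the result directly in the DataFrame (whereas OneHotEncoder returns an array).

Please find an example of my approach below. Thank you very much for your help.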

[1]

# Import libraries. 
import pandas as pd 
import numpy as np

# Initialise the data as a dict of lists. 
data = {'Name':['Jack', np.nan, 'Victoria', 'Nicolas', 'Victor', 'Brad'], 'Age':[59, np.nan, 29, np.nan, 65, 50], 'Car color':['Blue', 'Black', np.nan, 'Black', 'Grey', np.nan], 'Height ':[177, 150, np.nan, 180, 175, 190]} 

# Make a DataFrame 
df = pd.DataFrame(data) 

# Print the output. 
df 

Output : 
    Name    Age     Car color   Height
0   Jack    59.0    Blue    177.0
1   NaN     NaN     Black   150.0
2   Victoria    29.0    NaN     NaN
3   Nicolas     NaN     Black   180.0
4   Victor  65.0    Grey    175.0
5   Brad    50.0    NaN     190.0

[2]

# LabelEncoder does not work with NaN values, so I replace them with value '1000' : 
df = df.replace(np.nan, 1000)

# And to avoid errors, str columns must be set as strings (even '1000' value) : 
df[['Name','Car color']] = df[['Name','Car color']].astype(str)

df

Output 
    Name    Age     Car color   Height
0   Jack    59.0    Blue    177.0
1   1000    1000.0  Black   150.0
2   Victoria    29.0    1000    1000.0
3   Nicolas     1000.0  Black   180.0
4   Victor  65.0    Grey    175.0
5   Brad    50.0    1000    190.0

[3]

# Import LabelEncoder library : 
from sklearn.preprocessing import LabelEncoder

# define labelencoder :
le = LabelEncoder()

# Import defaultdict library to make a dict of labelencoder :
from collections import defaultdict

# Initiate a dict of LabelEncoder values :
encoder_dict = defaultdict(LabelEncoder)

# Make a new dataframe of LabelEncoder values :
df[['Name','Car color']] = df[['Name','Car color']].apply(lambda x: encoder_dict[x.name].fit_transform(x))

# Show output :
df

Output 
    Name    Age     Car color   Height
0   2   59.0    2   177.0
1   0   1000.0  1   150.0
2   5   29.0    0   1000.0
3   3   1000.0  1   180.0
4   4   65.0    3   175.0
5   1   50.0    0   190.0

[4]

#Reverse back 1000 to missing values in order to impute them : 
df = df.replace(1000, np.nan)
df

Output 

    Name    Age     Car color   Height
0   2   59.0    2   177.0
1   0   NaN     1   150.0
2   5   29.0    0   NaN
3   3   NaN     1   180.0
4   4   65.0    3   175.0
5   1   50.0    0   190.0

[5]

# Import the KNN imputer to impute missing values : 
from sklearn.impute import KNNImputer

# Define imputer : 
imputer = KNNImputer(n_neighbors=2)

# Impute, round, and reassign the column names : 
df = pd.DataFrame(np.round(imputer.fit_transform(df)),columns = df.columns)
df

Output 

    Name    Age     Car color   Height
0   2.0     59.0    2.0     177.0
1   0.0     47.0    1.0     150.0
2   5.0     29.0    0.0     165.0
3   3.0     44.0    1.0     180.0
4   4.0     65.0    3.0     175.0
5   1.0     50.0    0.0     190.0

[6]

# Decode data : 
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)

# Apply it to df -> THIS IS WHERE ERROR OCCURS :
df[['Name','Car color']].apply(inverse_transform_lambda)

Error message:

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-55-8a5e369215f6> in <module>()
----> 1 df[['Name','Car color']].apply(inverse_transform_lambda)

5 frames

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6926             kwds=kwds,
   6927         )
-> 6928         return op.get_result()
   6929 
   6930     def applymap(self, func):

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_standard(self)
    290 
    291         # compute the result using the series generator
--> 292         self.apply_series_generator()
    293 
    294         # wrap results

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_series_generator(self)
    319             try:
    320                 for i, v in enumerate(series_gen):
--> 321                     results[i] = self.f(v)
    322                     keys.append(v.name)
    323             except Exception as e:

<ipython-input-54-f16f4965b2c4> in <lambda>(x)
----> 1 inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)

/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
    297                     "y contains previously unseen labels: %s" % str(diff))
    298         y = np.asarray(y)
--> 299         return self.classes_[y]
    300 
    301     def _more_tags(self):

IndexError: ('arrays used as indices must be of integer (or boolean) type', 'occurred at index Name')
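
The last frame of the traceback indexes `classes_` with the column passed to `inverse_transform`. After `KNNImputer` and `np.round`, that column still has a float dtype, which matches the "must be of integer" message. The following is a minimal sketch of that situation, using made-up labels and codes rather than the DataFrame above; the exact exception type may differ between scikit-learn versions, so it is caught broadly here.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, chosen only for illustration.
le = LabelEncoder().fit(["Black", "Blue", "Grey"])

# np.round returns floats, just like the rounded KNNImputer output.
float_codes = np.round(np.array([0.0, 1.4, 2.0]))

try:
    le.inverse_transform(float_codes)
    float_call_failed = False
except (IndexError, TypeError, ValueError) as exc:
    # Float codes cannot be used to index the array of classes.
    float_call_failed = True
    print(type(exc).__name__, exc)

# Casting the rounded codes to integers makes the lookup valid.
labels = le.inverse_transform(float_codes.astype(int))
print(list(labels))
```
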
python scikit-learn imputation label-encoding
1 Answer

As per my comment, you should do this:

# Decode data : 
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x.astype(int)) # or x[].astype(int)
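
One further point, visible in the question's own output: the placeholder `'1000'` is itself encoded as label 0 in the text columns, so `df.replace(1000, np.nan)` restores NaN only in the numeric columns and the missing names and colors are never imputed. Below is a hedged sketch of one possible variant, not the asker's or the answerer's exact method: it fits each encoder on the non-missing values only, leaves NaN in place for `KNNImputer`, then rounds, clips, and casts to int before decoding. The names `text_cols`, `encoders`, `encoded`, `imputed`, and `decoded` are assumptions introduced for illustration, and the `Height` column is written without the trailing space.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Name": ["Jack", np.nan, "Victoria", "Nicolas", "Victor", "Brad"],
    "Age": [59, np.nan, 29, np.nan, 65, 50],
    "Car color": ["Blue", "Black", np.nan, "Black", "Grey", np.nan],
    "Height": [177, 150, np.nan, 180, 175, 190],
})
text_cols = ["Name", "Car color"]

# Encode only the non-missing text values; missing entries stay NaN.
encoders = {}
encoded = df.copy()
for col in text_cols:
    le = LabelEncoder()
    mask = df[col].notna()
    codes = pd.Series(np.nan, index=df.index, dtype=float)
    codes[mask] = le.fit_transform(df.loc[mask, col])
    encoded[col] = codes
    encoders[col] = le

# Impute every column with KNN (all columns are numeric at this point).
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)

# Round to the nearest code, keep it in range, cast to int, then decode.
decoded = imputed.copy()
for col in text_cols:
    le = encoders[col]
    int_codes = imputed[col].round().clip(0, len(le.classes_) - 1).astype(int)
    decoded[col] = le.inverse_transform(int_codes)

print(decoded)
```

Keep in mind that label codes have no meaningful order, so averaging them across neighbors is only a rough heuristic for categorical columns; the decoded value is simply whichever class sits nearest to the averaged code.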