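I am trying to preprocess my data. I have already filled in the missing values. However, when I try to encode the categorical data as integers, the x dataset is encoded correctly, but I get an error on the y column. I have found very few articles on this topic so far. Please help.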
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
```
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

# Encoding categorical data
# Encoding the Independent Variable
#from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ct = ColumnTransformer(
    [('one_hot_encoder', ohe, [0])],
    remainder='passthrough'
)
print(dataset)
x = np.array(ct.fit_transform(x), dtype=np.int)
y = np.array(ct.fit_transform(y), dtype=np.int)
```
[error image][1]
[1]: https://i.stack.imgur.com/YPR66.png
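`y` is your target variable, i.e. the variable you want to predict. It is a one-dimensional array, so if you check `y.shape` you get

    >>> y.shape
    (10,)

That is probably why you get an index error: `y.shape[1]` is out of bounds.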
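To check this yourself, here is a minimal sketch. It assumes the `Purchased` values shown in the question, retyped by hand rather than loaded from `Data.csv`:

```python
import numpy as np

# Purchased column from the question, typed in by hand for illustration
y = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

print(y.ndim)   # number of dimensions of the target array
print(y.shape)  # a 1-tuple, so it has no element at index 1

try:
    y.shape[1]
except IndexError as exc:
    # Any code that expects a 2-D feature matrix and reads shape[1] ends up here
    print('IndexError:', exc)

# A 2-D column view can be made with reshape, but a target usually should stay 1-D
y_2d = y.reshape(-1, 1)
print(y_2d.shape)
```

Transformers that work on columns expect a 2-D feature matrix, which is why the same call works on `x` and not on `y`.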
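You should not one-hot encode the target variable; encode it as integer labels instead (label encoding). That is, replace the last line with:

    y = pd.Categorical(y).codes

Then `y` will be

    array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int8)

where `0` corresponds to "No" (not purchased) and `1` corresponds to "Yes" (purchased).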
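If you want to confirm which integer stands for which label, or would rather stay within scikit-learn, the following sketch may help. It again assumes the `Purchased` values from the question, retyped by hand; the names `y_raw`, `mapping` and `le` are only illustrative. `LabelEncoder` is offered as an alternative here, not as something the original code used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Purchased column from the question, typed in by hand for illustration
y_raw = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# pandas: categories are sorted, and codes are positions in that sorted list
cat = pd.Categorical(y_raw)
y_codes = cat.codes
mapping = dict(enumerate(cat.categories))
print(mapping)

# scikit-learn alternative: LabelEncoder is meant for 1-D target arrays
le = LabelEncoder()
y_le = le.fit_transform(y_raw)
print(le.classes_)                     # labels in code order
print(le.inverse_transform(y_le[:3]))  # back from integers to the original labels
```

Both approaches sort the labels, so they should give the same integers for this column. `LabelEncoder` has the small advantage that `inverse_transform` can turn predictions back into the original labels later.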