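This is my first time using the sklearn library and, honestly, my head is a mess because I have found countless "ways of doing things" on the internet. I have a cleaned dataset that looks like this: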
<class 'pandas.core.frame.DataFrame'>
Int64Index: 246858 entries, 2 to 371527
Data columns (total 11 columns):
name 246858 non-null object
price 246858 non-null int64
vehicleType 246858 non-null object
yearOfRegistration 246858 non-null int64
gearbox 246858 non-null object
powerPS 246858 non-null int64
model 246858 non-null object
kilometer 246858 non-null int64
fuelType 246858 non-null object
brand 246858 non-null object
notRepairedDamage 246858 non-null object
dtypes: int64(4), object(7)
memory usage: 22.6+ MB
I want to go on to run a classification on the price variable. Obviously, I have to encode the following categorical columns:
categorical = ['name', 'vehicleType', 'gearbox', 'model', 'fuelType', 'brand', 'notRepairedDamage']
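Since one-hot encoding creates one output column per distinct value, it may help to count the distinct values in each of these columns before encoding. The sketch below uses a few made-up rows in place of the real `data` frame (the sample values are hypothetical) and estimates how large a dense float64 one-hot matrix would be:

```python
import pandas as pd

# Hypothetical toy rows standing in for the real 246,858-row frame.
data = pd.DataFrame({
    "name": ["Golf_1.6", "A4_Avant", "Polo_klein", "Golf_1.6_TDI"],
    "gearbox": ["manuell", "automatik", "manuell", "manuell"],
    "brand": ["volkswagen", "audi", "volkswagen", "volkswagen"],
})
categorical = ["name", "gearbox", "brand"]

# Distinct values per column = number of one-hot columns that column would produce.
cardinality = data[categorical].nunique()
print(cardinality.sort_values(ascending=False))

# Rough size in bytes of a dense float64 one-hot matrix for these columns.
dense_bytes = len(data) * int(cardinality.sum()) * 8
print(dense_bytes)
```

On the real data, a free-text column such as `name` is likely to have a very large number of distinct values, which would dominate this estimate.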
Here is the problem: I always run into a memory error. I tried using DataFrameMapper:
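from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper

# One-hot encode the categorical columns; pass the numeric ones through unchanged.
encoding = DataFrameMapper([
    (['name', 'vehicleType', 'gearbox', 'model', 'fuelType', 'brand', 'notRepairedDamage'],
     OneHotEncoder(handle_unknown='ignore')),
    (["price", "yearOfRegistration", "powerPS", "kilometer"], None)
])
# Pass the target column through unchanged.
encoding_target = DataFrameMapper([
    (['price'], None)
])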
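So far, so good. Suppose I want to try a classification tree: I have to create training and test sets, but before that I have to apply the transformation: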
X = encoding.transform(data.loc[:, data.columns != "price"])
At this point I get the memory error. I don't know what to do next.
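One plausible cause (an assumption, since no traceback is shown) is that the one-hot result is materialized as a dense array: as far as I know, `DataFrameMapper` has a `sparse` option that defaults to `False`, and a near-unique column such as `name` can produce a very large number of one-hot columns. The sketch below swaps in scikit-learn's own `ColumnTransformer` in place of `DataFrameMapper` and keeps the result sparse. The toy rows and the `numeric` list are hypothetical stand-ins for the real data:

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; the real frame has 246,858 rows and 11 columns.
data = pd.DataFrame({
    "name": ["Golf_1.6", "A4_Avant", "Polo_klein", "Golf_1.6_TDI"],
    "gearbox": ["manuell", "automatik", "manuell", "manuell"],
    "powerPS": [75, 140, 60, 105],
    "kilometer": [150000, 90000, 125000, 70000],
    "price": [1500, 9800, 2200, 6400],
})
categorical = ["name", "gearbox"]
numeric = ["powerPS", "kilometer"]

encoder = ColumnTransformer(
    [
        # OneHotEncoder returns a sparse matrix by default.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        # Numeric columns are passed through unchanged.
        ("keep", "passthrough", numeric),
    ],
    # Keep the stacked result sparse unless it is completely dense.
    sparse_threshold=1.0,
)

# The target column is left out of the features and kept separately.
X = encoder.fit_transform(data.drop(columns="price"))
y = data["price"]
print(sparse.issparse(X), X.shape)
```

Tree estimators in scikit-learn accept sparse input, so `X` can be split into training and test sets and passed on without converting it to a dense array. Whether this is enough on the full data depends on how many distinct values `name` has; dropping that column may also be worth considering.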
Try this:
for cat in categorical:
    data[cat] = data[cat].astype('category')
    data[cat] = data[cat].cat.codes
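To see what the loop above does, here is a small sketch on made-up rows (the values are hypothetical). It also stores the code-to-label mapping in a `mappings` dict, which is an addition for illustration, so the integers can be translated back later:

```python
import pandas as pd

# Hypothetical toy rows standing in for the real frame.
data = pd.DataFrame({
    "gearbox": ["manuell", "automatik", "manuell"],
    "brand": ["volkswagen", "audi", "bmw"],
})
categorical = ["gearbox", "brand"]

mappings = {}
for cat in categorical:
    as_category = data[cat].astype("category")
    # Remember which integer code stands for which original label.
    mappings[cat] = dict(enumerate(as_category.cat.categories))
    # Replace each label by its integer code (one column in, one column out).
    data[cat] = as_category.cat.codes

print(data)
print(mappings)
```

Each column stays a single integer column, so memory use remains small. The trade-off is that the codes follow the sorted order of the labels, which is an arbitrary ordering; tree-based models can often work with that, but models that treat the numbers as magnitudes may be misled.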