I want to one-hot encode the categorical features of a Pandas DataFrame. Before encoding, each column's values had shape (60,). See the code below:
ohe_features = ["Gender", "Married", "Self_Employed"]
num_features = ["Dependents"]
df = pd.get_dummies(df, columns=ohe_features, dtype=int)
After calling get_dummies, df now has columns with the following shapes:
Column 'Gender_Female' has shape (60, 2)
Column 'Gender_Male' has shape (60, 2)
Column 'Married_No' has shape (60, 2)
Column 'Married_Yes' has shape (60, 2)
Column 'Self_Employed_No' has shape (60, 2)
Column 'Self_Employed_Yes' has shape (60, 2)
How can I encode the categorical variables without changing the original dimensions of the features?
Reproducible example:
Dependents Gender Married Self_Employed
0 Female Yes No
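If you want to keep the original dimensions of the features, you need LabelEncoder() from sklearn.preprocessing. However, you should know the difference between LabelEncoder() and pandas get_dummies: LabelEncoder() maps each category to an integer code in a single column, while get_dummies creates one 0/1 column per category.
LabelEncoder() example: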
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
labelencoder = LabelEncoder()
df['Bridge_Types_Cat'] = labelencoder.fit_transform(df['Bridge_Types'])
df
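To make the difference concrete for the columns in the question, here is a minimal sketch that applies both encodings to the same data. The four sample rows are made up for illustration (only the column names come from the question), and the loop over ohe_features is just one possible way to apply LabelEncoder() per column. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels, so treat this as an illustration rather than a recommended pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows; only the column names come from the question.
df = pd.DataFrame({
    "Dependents": [0, 1, 2, 0],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Married": ["Yes", "No", "Yes", "Yes"],
    "Self_Employed": ["No", "No", "Yes", "No"],
})
ohe_features = ["Gender", "Married", "Self_Employed"]

# Label encoding: each feature stays a single 1-D column of integer codes.
label_encoded = df.copy()
for col in ohe_features:
    label_encoded[col] = LabelEncoder().fit_transform(label_encoded[col])

# One-hot encoding: each category becomes its own 0/1 column.
one_hot = pd.get_dummies(df, columns=ohe_features, dtype=int)

print(label_encoded.shape)            # same column count as df
print(one_hot.shape)                  # one column per category, plus Dependents
print(label_encoded["Gender"].shape)  # still 1-D
```

The integer codes follow the sorted order of the categories, so they imply an ordering that the original values may not have. Whether that matters depends on the model you feed the data into.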
More information: link