I have a DataFrame df that contains the categorical feature columns 'temp_of_extremities', 'peripheral_pulse', and 'mucous_membrane'.
I want to encode the categorical features as follows:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')
df = ct.fit_transform(df)
However, I want to do this without the DataFrame being converted to an array.
I tried the following approach:
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
for feature in categorical_features:
    df = pd.concat([df, pd.get_dummies(df[feature], prefix=feature, dtype='int')], axis=1)
    df = df.drop([feature], axis=1)
However, this is not the right solution, because applying it to another DataFrame with the same features produces a different encoding.
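A minimal illustration of the problem, using made-up toy data (the frames `train` and `other` are hypothetical, not my real data). When a category is absent from the second DataFrame, `pd.get_dummies` simply produces fewer columns:

```python
import pandas as pd

# Hypothetical toy frames: `other` happens to lack the "neutral" category.
train = pd.DataFrame({"temp_of_extremities": ["high", "low", "neutral"]})
other = pd.DataFrame({"temp_of_extremities": ["low", "high"]})

# get_dummies derives the columns from the values present in each frame,
# so the two results do not share the same layout.
train_encoded = pd.get_dummies(train, columns=["temp_of_extremities"], dtype="int")
other_encoded = pd.get_dummies(other, columns=["temp_of_extremities"], dtype="int")

print(list(train_encoded.columns))
print(list(other_encoded.columns))
```

The first frame gets three dummy columns and the second only two, so the encodings cannot be used interchangeably.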
If you have scikit-learn version 1.2.0 or later, you can use the set_output method to return a pandas.DataFrame instead of an array.
from functools import partial
import numpy as np # 1.26.2
import pandas as pd # 2.1.4
import sklearn # 1.3.2
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# For repeatability
np.random.seed(0)
# Setting some temporary defaults so I don't have to type out the parameters each time.
choice = partial(np.random.choice, size=(10,), replace=True)
# Make some fake data.
df = pd.DataFrame(
    data={
        "temp_of_extremities": choice(a=["high", "low", "neutral"]),
        "peripheral_pulse": choice(a=[True, False]),
        "mucous_membrane": choice(a=[True, False]),
    }
)
# Initial setup from your question.
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(
    # I set the `sparse_output` to False otherwise this will raise a ValueError.
    transformers=[('encoder', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough',
# Use the `set_output` method here to return a `pd.DataFrame` instead of a `np.ndarray`.
).set_output(transform="pandas")
out = ct.fit_transform(df)
print(out)
encoder__temp_of_extremities_high ... encoder__mucous_membrane_True
0 1.0 ... 0.0
1 0.0 ... 0.0
2 1.0 ... 1.0
3 0.0 ... 0.0
4 0.0 ... 1.0
5 0.0 ... 0.0
6 1.0 ... 1.0
7 0.0 ... 0.0
8 1.0 ... 0.0
9 1.0 ... 1.0
To clean up the column names a bit, we can set verbose_feature_names_out=False in the ColumnTransformer.
# Initial setup from your question.
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(
    # I set the `sparse_output` to False otherwise this will raise a ValueError.
    transformers=[('encoder', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough',
    # Set `verbose_feature_names_out=False` to keep original names + their encoded value.
    verbose_feature_names_out=False,
# Use the `set_output` method here to return a `pd.DataFrame` instead of a `np.ndarray`.
).set_output(transform="pandas")
out = ct.fit_transform(df)
print(out)
temp_of_extremities_high ... mucous_membrane_True
0 1.0 ... 0.0
1 0.0 ... 0.0
2 1.0 ... 1.0
3 0.0 ... 0.0
4 0.0 ... 1.0
5 0.0 ... 0.0
6 1.0 ... 1.0
7 0.0 ... 0.0
8 1.0 ... 0.0
9 1.0 ... 1.0