How can I one-hot encode a DataFrame without converting it to an array?


I have a DataFrame `df` that contains the categorical feature columns 'temp_of_extremities', 'peripheral_pulse', and 'mucous_membrane'. I want to encode the categorical features like this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')
df = ct.fit_transform(df)

However, I want to do this without the DataFrame being converted to an array.

I tried the following approach:

categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
for feature in categorical_features:
    df = pd.concat([df, pd.get_dummies(df[feature], prefix=feature, dtype='int')], axis=1)
    df = df.drop([feature], axis=1)

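However, this is not a correct solution: when the same approach is applied to another DataFrame with the same features, the resulting encoding is different.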
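To make the problem concrete, here is a minimal sketch with two made-up frames. The names `train` and `test` and the values "cool", "normal", and "warm" are assumptions for illustration only, not taken from the real data. Because `pd.get_dummies` only creates columns for the values it actually sees, the two results end up with different columns:

```python
import pandas as pd

# Two hypothetical frames that share a column but not the same observed values.
train = pd.DataFrame({"temp_of_extremities": ["cool", "normal", "warm"]})
test = pd.DataFrame({"temp_of_extremities": ["cool", "cool", "normal"]})

# Encode each frame independently, as the loop above effectively does.
train_encoded = pd.get_dummies(train, columns=["temp_of_extremities"], dtype="int")
test_encoded = pd.get_dummies(test, columns=["temp_of_extremities"], dtype="int")

train_columns = list(train_encoded.columns)
test_columns = list(test_encoded.columns)

# "warm" never occurs in `test`, so its dummy column is missing there.
print(train_columns)
print(test_columns)
```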

python dataframe one-hot-encoding
1 Answer

If you have scikit-learn version 1.2.0 or later, you can use the `set_output` method to return a `pandas.DataFrame` instead of an array.


Example

from functools import partial

import numpy as np  # 1.26.2
import pandas as pd  # 2.1.4
import sklearn  # 1.3.2
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


# For repeatability
np.random.seed(0)

# Setting some temporary defaults so I don't have to type out the parameters each time.
choice = partial(np.random.choice, size=(10,), replace=True)

# Make some fake data.
df = pd.DataFrame(
    data={
        "temp_of_extremities": choice(a=["high", "low", "neutral"]),
        "peripheral_pulse": choice(a=[True, False]),
        "mucous_membrane": choice(a=[True, False]),
    }
)

# Initial setup from your question.
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(
    # I set the `sparse_output` to False otherwise this will raise a ValueError.
    transformers=[('encoder', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough',
    # Use the `set_output` method here to return a `pd.DataFrame` instead of a `np.ndarray`.
).set_output(transform="pandas")
out = ct.fit_transform(df)

print(out)
   encoder__temp_of_extremities_high  ...  encoder__mucous_membrane_True
0                                1.0  ...                            0.0
1                                0.0  ...                            0.0
2                                1.0  ...                            1.0
3                                0.0  ...                            0.0
4                                0.0  ...                            1.0
5                                0.0  ...                            0.0
6                                1.0  ...                            1.0
7                                0.0  ...                            0.0
8                                1.0  ...                            0.0
9                                1.0  ...                            1.0

To clean up the column names a bit, we can set `verbose_feature_names_out=False` in the `ColumnTransformer`:

# Initial setup from your question.
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(
    # I set the `sparse_output` to False otherwise this will raise a ValueError.
    transformers=[('encoder', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough',
    # Set `verbose_feature_names_out=False` to keep original names + their encoded value.
    verbose_feature_names_out=False,
    # Use the `set_output` method here to return a `pd.DataFrame` instead of a `np.ndarray`.
).set_output(transform="pandas")
out = ct.fit_transform(df)

print(out)
   temp_of_extremities_high  ...  mucous_membrane_True
0                       1.0  ...                   0.0
1                       0.0  ...                   0.0
2                       1.0  ...                   1.0
3                       0.0  ...                   0.0
4                       0.0  ...                   1.0
5                       0.0  ...                   0.0
6                       1.0  ...                   1.0
7                       0.0  ...                   0.0
8                       1.0  ...                   0.0
9                       1.0  ...                   1.0

