How can I perform one-hot encoding without converting the DataFrame to an array?

Question

I have a DataFrame df that contains the categorical feature columns 'temp_of_extremities', 'peripheral_pulse', and 'mucous_membrane'. I want to encode the categorical features like this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')
df = ct.fit_transform(df)

but without the DataFrame being converted to an array.

I tried this approach:

import pandas as pd

categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
for feature in categorical_features:
    df = pd.concat([df, pd.get_dummies(df[feature], prefix=feature, dtype='int')], axis=1)
    df = df.drop([feature], axis=1)

However, this is not the right solution, because applying it to another DataFrame with the same features produces a different encoding.

Answer 1

If you have scikit-learn version 1.2.0 or later, you can use the set_output method to return a pandas.DataFrame instead of an array.

Example

from functools import partial

import numpy as np  # 1.26.2
import pandas as pd  # 2.1.4
import sklearn  # 1.3.2
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


# For repeatability
np.random.seed(0)

# Setting some temporary defaults so I don't have to type out the parameters each time.
choice = partial(np.random.choice, size=(10,), replace=True)

# Make some fake data.
df = pd.DataFrame(
    data={
        "temp_of_extremities": choice(a=["high", "low", "neutral"]),
        "peripheral_pulse": choice(a=[True, False]),
        "mucous_membrane": choice(a=[True, False]),
    }
)

# Initial setup from your question.
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(
    # I set the `sparse_output` to False otherwise this will raise a ValueError.
    transformers=[('encoder', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough',
    # Use the `set_output` method here to return a `pd.DataFrame` instead of a `np.ndarray`.
).set_output(transform="pandas")
out = ct.fit_transform(df)

print(out)

   encoder__temp_of_extremities_high  ...  encoder__mucous_membrane_True
0                                1.0  ...                            0.0
1                                0.0  ...                            0.0
2                                1.0  ...                            1.0
3                                0.0  ...                            0.0
4                                0.0  ...                            1.0
5                                0.0  ...                            0.0
6                                1.0  ...                            1.0
7                                0.0  ...                            0.0
8                                1.0  ...                            0.0
9                                1.0  ...                            1.0

To tidy up the column names a bit, we can set verbose_feature_names_out=False in the ColumnTransformer.

# Initial setup from your question.
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(
    # I set the `sparse_output` to False otherwise this will raise a ValueError.
    transformers=[('encoder', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough',
    # Set `verbose_feature_names_out=False` to keep original names + their encoded value.
    verbose_feature_names_out=False,
    # Use the `set_output` method here to return a `pd.DataFrame` instead of a `np.ndarray`.
).set_output(transform="pandas")
out = ct.fit_transform(df)

print(out)

   temp_of_extremities_high  ...  mucous_membrane_True
0                       1.0  ...                   0.0
1                       0.0  ...                   0.0
2                       1.0  ...                   1.0
3                       0.0  ...                   0.0
4                       0.0  ...                   1.0
5                       0.0  ...                   0.0
6                       1.0  ...                   1.0
7                       0.0  ...                   0.0
8                       1.0  ...                   0.0
9                       1.0  ...                   1.0
