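DataFrame with multiple values in each column: how do I one-hot encode them under a main header?

Question (votes: 0, answers: 2)

I have a DataFrame in which every cell holds a comma-separated list of values. I can't figure out how to one-hot encode the data in each column.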

In:

import pandas as pd

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'],
       ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'],
       ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]

# The cells are strings, so no float dtype is requested here
df = pd.DataFrame(lst, columns=['A', 'B', 'C'], index=['Ella', 'Mike', 'Dave'])
Out:

        A                     B                        C
Ella    Red, Blue, Yellow     Blue, Green, Yellow      Green, Red, Blue
Mike    Yellow, Red, Blue     Blue, Red, Green         Yellow, Blue, Red
Dave    Yellow, Red, Green    Red, Yellow, Blue        Green, Blue, Red

I would like to build it with multi-level column headers, like this:

       A                                 B                               C
       Red    Blue   Green   Yellow      Red    Blue   Green   Yellow    ....
Ella   1      1      0       1           0      1      1       1         ....
Mike   1      1      0       1           1      1      1       0         ....   
Dave   1      0      1       1           1      1      0       1         ....                                                                                                                                                     

I'd really appreciate some guidance, as I have been stuck on this for a while!

python pandas dataframe multi-index one-hot-encoding
2 Answers

1 vote

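There is a very good answer here. In your case you have to apply the same approach to each of the columns, so something like this (it can be optimized further):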

import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'],
       ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'],
       ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]

# The cells are strings, so no float dtype is requested here
df = pd.DataFrame(lst, columns=['A', 'B', 'C'])

mlb = {}
res = {}
for column in df.columns:
    # One binarizer per column, fitted on the split and stripped labels
    mlb[column] = MultiLabelBinarizer()
    res[column] = pd.DataFrame(
        mlb[column].fit_transform(
            df[column].apply(lambda x: [j.strip() for j in x.split(",")])
        ),
        columns=mlb[column].classes_,
        index=df[column].index,
    )

# Two-level column labels: original column on top, class label underneath
arrays = [
    np.concatenate([np.array([column] * len(mlb[column].classes_)) for column in df.columns]),
    np.concatenate([mlb[column].classes_ for column in df.columns]),
]

df_end = pd.DataFrame(columns=arrays, index=[0, 1, 2])
for column in df.columns:
    df_end[column] = res[column]

df_end

Out:

      A                         B                         C
   Blue  Green  Red  Yellow  Blue  Green  Red  Yellow  Blue  Green  Red  Yellow
0     1      0    1       1     1      1    0       1     1      1    1       0
1     1      0    1       1     1      1    1       0     1      0    1       1
2     0      1    1       1     1      0    1       1     1      1    1       0
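
One direction the "can be optimized further" remark could be taken is sketched below. This is not the answer's own method: it swaps MultiLabelBinarizer for pandas' Series.str.get_dummies and lets pd.concat build the two-level header from a dict. The names `parts` and `encoded` are illustrative, and the row labels are taken from the question's printed output.

```python
import pandas as pd

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'],
       ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'],
       ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
df = pd.DataFrame(lst, columns=['A', 'B', 'C'], index=['Ella', 'Mike', 'Dave'])

# One indicator frame per original column: get_dummies splits each string
# on ", " and returns one 0/1 column per distinct label.
parts = {col: df[col].str.get_dummies(sep=', ') for col in df.columns}

# Passing a dict to concat uses its keys as the outer column level.
encoded = pd.concat(parts, axis=1)
print(encoded)
```

Selecting a top-level key, for example `encoded['A']`, then returns the indicator columns for that original column only.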
    

1 vote

Here is one way:

df.stack().str.get_dummies(sep = ', ').unstack().swaplevel(axis=1).sort_index(level=0,axis=1)
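
To see what each link in that chain contributes, it can be split into steps. This is only a sketch on the question's data, with the names used as row labels; the intermediate variable names are illustrative.

```python
import pandas as pd

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'],
       ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'],
       ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
df = pd.DataFrame(lst, columns=['A', 'B', 'C'], index=['Ella', 'Mike', 'Dave'])

# 1. Long format: one string per (person, column) pair.
stacked = df.stack()

# 2. One 0/1 column per colour, still indexed by (person, column).
dummies = stacked.str.get_dummies(sep=', ')

# 3. Move the original column label from the index into the columns,
#    which gives column labels of the form (colour, column).
wide = dummies.unstack()

# 4. Put the original column on the outer level, then sort so that
#    all colours of one column sit next to each other.
result = wide.swaplevel(axis=1).sort_index(level=0, axis=1)
print(result)
```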

df = df.stack().str.get_dummies(sep=',')
df.columns = df.columns.str.strip()
df = df.stack().groupby(level=[0,1,2]).sum().unstack(level=[1,2])
    