I have a DataFrame in which each column holds a list of values. I can't figure out how to one-hot encode the data in each column.
In:
lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'], ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'], ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
df = pd.DataFrame(lst, columns=['A', 'B', 'C'], index=['Ella', 'Mike', 'Dave'])
Out:
                       A                    B                  C
Ella   Red, Blue, Yellow  Blue, Green, Yellow   Green, Red, Blue
Mike   Yellow, Red, Blue     Blue, Red, Green  Yellow, Blue, Red
Dave  Yellow, Red, Green    Red, Yellow, Blue   Green, Blue, Red
I would like to build it with multi-level column headers, like this:
     A                     B                     C
     Red Blue Green Yellow Red Blue Green Yellow ....
Ella   1    1     0      1   0    1     1      1 ....
Mike   1    1     0      1   1    1     1      0 ....
Dave   1    0     1      1   1    1     0      1 ....
I'd really appreciate some guidance, as I've been stuck on this for a while!
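There is a very good answer here. In your case you need to apply the same approach to each column separately, so something like this (it can be optimized further):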
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'], ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'], ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
# The cells are strings, so let pandas infer the dtype (dtype=float cannot convert them).
df = pd.DataFrame(lst, columns=['A', 'B', 'C'])

mlb = {}  # one fitted binarizer per original column
res = {}  # one one-hot encoded frame per original column
for column in df.columns:
    mlb[column] = MultiLabelBinarizer()
    # Split each cell on commas and strip whitespace to get a list of labels.
    labels = df[column].apply(lambda x: [j.strip() for j in x.split(",")])
    res[column] = pd.DataFrame(mlb[column].fit_transform(labels),
                               columns=mlb[column].classes_,
                               index=df[column].index)

# Two-level column header: original column name on top, class label underneath.
arrays = [np.concatenate([np.array([column] * len(mlb[column].classes_)) for column in df.columns]),
          np.concatenate([mlb[column].classes_ for column in df.columns])]
df_end = pd.DataFrame(columns=arrays, index=df.index)
for column in df.columns:
    df_end[column] = res[column]
df_end
  A                     B                     C
  Blue Green Red Yellow Blue Green Red Yellow Blue Green Red Yellow
0    1     0   1      1    1     1   0      1    1     1   1      0
1    1     0   1      1    1     1   1      0    1     0   1      1
2    0     1   1      1    1     0   1      1    1     1   1      0
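Since the answer notes it can be optimized further, here is one possible shortening, offered as a sketch rather than the answerer's own method. It swaps MultiLabelBinarizer for pandas' Series.str.get_dummies and lets pd.concat build the two-level header from the dict keys. It assumes every cell uses exactly ', ' as the separator; the names encoded and df_concat are made up for illustration.

```python
import pandas as pd

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'],
       ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'],
       ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
df = pd.DataFrame(lst, columns=['A', 'B', 'C'])

# One indicator frame per original column.
# Assumption: the separator is always exactly ', ' (comma plus one space).
encoded = {column: df[column].str.get_dummies(sep=', ') for column in df.columns}

# Passing a dict to pd.concat uses its keys as the outer column level,
# so the MultiIndex header does not have to be assembled by hand.
df_concat = pd.concat(encoded, axis=1)
print(df_concat)
```

If the separators are inconsistent (for example 'Red,Blue' mixed with 'Red, Blue'), the split-and-strip step from the loop above is the safer choice.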
df.stack().str.get_dummies(sep = ', ').unstack().swaplevel(axis=1).sort_index(level=0,axis=1)
or
df = df.stack().str.get_dummies(sep=',')
df.columns = df.columns.str.strip()
df = df.stack().groupby(level=[0,1,2]).sum().unstack(level=[1,2])
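To make the chained stack / get_dummies / unstack version easier to follow, here it is broken into named steps. This is only an unpacking of that one-liner on the sample data, assuming the separator is always ', '; the intermediate names (stacked, dummies, wide, result) are made up for illustration.

```python
import pandas as pd

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'],
       ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'],
       ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
df = pd.DataFrame(lst, columns=['A', 'B', 'C'])

# Step 1: flatten to a single Series indexed by (row label, column label).
stacked = df.stack()

# Step 2: one indicator column per distinct colour.
dummies = stacked.str.get_dummies(sep=', ')

# Step 3: move the original column labels back onto the column axis;
# the header is now (colour, original column).
wide = dummies.unstack()

# Step 4: swap the two header levels and sort, giving (original column, colour).
result = wide.swaplevel(axis=1).sort_index(level=0, axis=1)
print(result)
```

Stacking first means get_dummies runs once over all cells, so every original column gets the same set of colour columns even if a colour never appears in it.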