Creating dummies when the categories are single characters within multi-character strings

Question  Votes: 3  Answers: 2

Consider the following data in a pandas Series:

s = pd.Series('1az wb58 jsui ne3'.split())

s

0     1az
1    wb58
2    jsui
3     ne3
dtype: object

I need it to look like this:

   1  3  5  8  a  b  e  i  j  n  s  u  w  z
0  1  0  0  0  1  0  0  0  0  0  0  0  0  1
1  0  0  1  1  0  1  0  0  0  0  0  0  1  0
2  0  0  0  0  0  0  0  1  1  0  1  1  0  0
3  0  1  0  0  0  0  1  0  0  1  0  0  0  0

However, when I try the following, each whole string is treated as a single category:

pd.get_dummies(s)

   1az  jsui  ne3  wb58
0    1     0    0     0
1    0     0    0     1
2    0     1    0     0
3    0     0    1     0

What is the most concise way to do this?

python pandas
2 Answers
2
votes

Maybe apply list to split each string into its characters:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
pd.get_dummies(s.apply(list).apply(pd.Series).stack()).groupby(level=0).sum()
Out[222]: 
   1  3  5  8  a  b  e  i  j  n  s  u  w  z
0  1  0  0  0  1  0  0  0  0  0  0  0  0  1
1  0  0  1  1  0  1  0  0  0  0  0  0  1  0
2  0  0  0  0  0  0  0  1  1  0  1  1  0  0
3  0  1  0  0  0  0  1  0  0  1  0  0  0  0

Or

s.apply(list).str.join(',').str.get_dummies(',')
Out[224]: 
   1  3  5  8  a  b  e  i  j  n  s  u  w  z
0  1  0  0  0  1  0  0  0  0  0  0  0  0  1
1  0  0  1  1  0  1  0  0  0  0  0  0  1  0
2  0  0  0  0  0  0  0  1  1  0  1  1  0  0
3  0  1  0  0  0  0  1  0  0  1  0  0  0  0
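A variant of the first snippet, offered only as a sketch: it assumes a pandas version that has Series.explode (0.25 or later) and replaces the apply(pd.Series).stack() step with it. It also aggregates with max instead of sum, which matters only when a string repeats a character: max keeps a 0/1 indicator, while sum would count the repeats.

```python
import pandas as pd

s = pd.Series('1az wb58 jsui ne3'.split())

# One row per character; the original row label is repeated in the index.
chars = s.apply(list).explode()

# Indicator columns per character, then collapse back to one row per string.
# max() yields 0/1 presence; sum() would count repeated characters instead.
dummies = pd.get_dummies(chars, dtype=int).groupby(level=0).max()
print(dummies)
```

For the sample data no string repeats a character, so max and sum give the same table as above.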

2
votes

A solution using MultiLabelBinarizer with the DataFrame constructor (each string is treated as an iterable of single-character labels):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_)
print (df)
   1  3  5  8  a  b  e  i  j  n  s  u  w  z
0  1  0  0  0  1  0  0  0  0  0  0  0  0  1
1  0  0  1  1  0  1  0  0  0  0  0  0  1  0
2  0  0  0  0  0  0  0  1  1  0  1  1  0  0
3  0  1  0  0  0  0  1  0  0  1  0  0  0  0

Another solution: DataFrame.from_records + get_dummies, but a final max is needed to aggregate the duplicated columns:

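# max(level=0, axis=1) was removed in pandas 2.0; transpose and group the duplicated column labels instead
df = (pd.get_dummies(pd.DataFrame.from_records(s), prefix_sep='', prefix='', dtype=int)
        .T.groupby(level=0).max().T)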
print (df)
   1  3  5  8  a  b  e  i  j  n  s  u  w  z
0  1  0  0  0  1  0  0  0  0  0  0  0  0  1
1  0  0  1  1  0  1  0  0  0  0  0  0  1  0
2  0  0  0  0  0  0  0  1  1  0  1  1  0  0
3  0  1  0  0  0  0  1  0  0  1  0  0  0  0
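For comparison, the same table can also be built without get_dummies or scikit-learn. This is a minimal sketch, not something from the answers above: it builds one dict per string and lets the DataFrame constructor align the keys into columns. The names rows and df are just illustrative.

```python
import pandas as pd

s = pd.Series('1az wb58 jsui ne3'.split())

# One dict per string, mapping each distinct character to 1.
rows = [dict.fromkeys(word, 1) for word in s]

# Characters absent from a string come out as NaN; fill them with 0,
# cast back to int, and sort the columns alphabetically.
df = (pd.DataFrame(rows, index=s.index)
        .fillna(0)
        .astype(int)
        .sort_index(axis=1))
print(df)
```

Because dict keys are unique, a repeated character still produces a single 1, so this gives indicators rather than counts.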