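I want to split a DataFrame into two DataFrames, one for each of the two kinds of data it contains (brand owners and products).
Original DataFrame: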
>>> products
   product_id brand_owner product_name
0      344606     Cargill            A
1      344607    Red Gold            B
2      344608      FooBar            C
3      344609    Red Gold            D
4      344610     Cargill            E
I want to extract brand_owner into a separate DataFrame, as you would when normalizing a database:
>>> brand_owners = pd.DataFrame(branded_foods['brand_owner'].unique())
>>> brand_owners
                    0
0             Cargill
1  Kellogg Company Us
2            Kashi Us
3            Red Gold
4      Conagra Brands
...               ...
I give its rows an ID (again, like a database primary key):
>>> brand_owners.index += 1
>>> brand_owners['id'] = brand_owners.index
>>> brand_owners
                    0   id
1             Cargill    1
2  Kellogg Company Us    2
3            Kashi Us    3
4            Red Gold    4
5      Conagra Brands    5
...               ...  ...
[25202 rows x 2 columns]
>>> brand_owners.columns = ['name', 'id']
>>> brand_owners
                 name   id
1             Cargill    1
2  Kellogg Company Us    2
3            Kashi Us    3
4            Red Gold    4
5      Conagra Brands    5
...               ...  ...
Now I want to put this ID back into the original DataFrame, so that it looks like this:
   product_id  brand_owner product_name
0      344606            1            A
1      344607            4            B
2      344608           45            C
3      344609            4            D
4      344610            1            E
How can I do this update in pandas? In SQL terms: UPDATE products p SET p.brand_owner = (SELECT id FROM brand_owners b WHERE b.name = p.brand_owner)
You can encode the categories in brand_owner directly with pd.factorize (the codes it returns start at 0, in order of first appearance):
df['brand_owner'] = pd.factorize(df.brand_owner)[0]
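If you also want to keep the lookup table and have IDs that start at 1, as in the question, something along these lines may work. This is only a sketch on the five sample rows from the question: the names `codes`, `uniques`, `name_to_id`, `by_codes` and `by_map` are made up for illustration, and the resulting IDs differ from the ones in the question (4, 45) because the sample contains only three distinct brands.

```python
import pandas as pd

# Sample data copied from the question.
products = pd.DataFrame({
    "product_id": [344606, 344607, 344608, 344609, 344610],
    "brand_owner": ["Cargill", "Red Gold", "FooBar", "Red Gold", "Cargill"],
    "product_name": ["A", "B", "C", "D", "E"],
})

# factorize returns (codes, uniques): codes are 0-based integers,
# uniques holds the distinct values in order of first appearance.
codes, uniques = pd.factorize(products["brand_owner"])

# Lookup table with 1-based ids, playing the role of the primary key.
brand_owners = pd.DataFrame({"id": range(1, len(uniques) + 1), "name": uniques})

# Variant 1: reuse the factorize codes, shifted so they match the 1-based ids.
by_codes = products.assign(brand_owner=codes + 1)

# Variant 2: look each name up in an existing brand_owners table,
# which is roughly what the SQL UPDATE with a subquery does.
name_to_id = brand_owners.set_index("name")["id"]
by_map = products.assign(brand_owner=products["brand_owner"].map(name_to_id))

print(brand_owners)
print(by_map)
```

The second variant is probably the closer match if brand_owners already exists with its own IDs, since Series.map looks up whatever IDs the table contains rather than generating new ones. Names missing from the table would come back as NaN.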