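I have an .XLSX file with 2 columns, <Name> and <Item To Package>. The <Item To Package> column stores multiple values linked to each <Name>, separated by semicolons. I need to manipulate this dataset on input, and I'm having a hard time deciding which approach is best.
I have considered replacing the semicolons with commas and then packing the data into a dictionary, where <Name> is the key and <Item To Package> becomes a list value.
However, I'm not sure this is the most efficient way.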
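import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}
df = pd.DataFrame(data=d)
df.head()
sku = df['<Name>']
upc = df['<Item To Package>']
# A whole Series cannot be used as a dict key, so pair each name with its list of packages.
# Note: a repeated name such as FIL4056 keeps only the packages from its last row.
PartToUPC = dict(zip(sku, upc.str.split(';')))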
Below is how I need the file to look:
all SKUs in column A and their respective package codes in column B.
FIL9791 | package_113572195
FIL9791 | package_113594355
FIL9799 | package_113572197
FIL9799 | package_113594357
etc.
I think this is what you need:
import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}
df = pd.DataFrame(data=d)
# Split each cell into a list, spread the lists across columns, then stack the columns into rows
new_df = pd.DataFrame(df["<Item To Package>"].str.split(';').tolist(), index=df["<Name>"]).stack()
# After reset_index() the stacked values sit in a column labeled 0; keep only the name and that column
new_df = new_df.reset_index()[['<Name>', 0]]
new_df.columns = ['<Name>', '<Item To Package>']
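To apply the same reshaping to the workbook itself rather than to an inline dictionary, the steps above could be wrapped in a function. This is only a sketch: split_packages and convert_workbook are hypothetical helper names, the file paths are placeholders, and reading or writing .xlsx with pandas assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd


def split_packages(df, name_col="<Name>", items_col="<Item To Package>", sep=";"):
    """Return one row per (name, package) pair using the split/stack approach."""
    # Split every cell into a list and spread the lists across columns,
    # keeping the names as the row index.
    wide = pd.DataFrame(df[items_col].str.split(sep).tolist(), index=df[name_col])
    # Stack the columns into rows; the stacked values land in a column labeled 0.
    long = wide.stack().reset_index()[[name_col, 0]]
    long.columns = [name_col, items_col]
    return long


def convert_workbook(src_path, dst_path):
    """Hypothetical wrapper: read the two-column sheet, reshape it, write it back."""
    df = pd.read_excel(src_path)
    split_packages(df).to_excel(dst_path, index=False)


# Example call with placeholder file names:
# convert_workbook("input.xlsx", "output.xlsx")
```

Keeping the reshaping separate from the file handling means split_packages can be checked on a small in-memory DataFrame before it touches the real file.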
You can solve this by calling .explode() after splitting the column into lists of values.
import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}
df = pd.DataFrame(data=d)
# Turn each semicolon-separated string into a list of package codes
df['<Item To Package>'] = df['<Item To Package>'].str.split(';')
# Emit one row per list element, repeating the <Name> value
df = df.explode('<Item To Package>')
print(df)
Output:
<Name> <Item To Package>
0 FIL9791 package_113572195
0 FIL9791 package_113594355
1 FIL9799 package_113572197
1 FIL9799 package_113594357
2 FIL4056 package_113566689
2 FIL4056 package_113591417
3 FIL4056 package_113566688
3 FIL4056 package_113591416
4 FIL4057 package_113566690
4 FIL4057 package_113591418
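Keep in mind that explode() preserves the original index for the rows it expands, so every row produced from the same source row shares that row's index label. If you don't need the rows to match the original index, you can reset it by adding: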
df = df.reset_index(drop=True)
print(df)
Output:
<Name> <Item To Package>
0 FIL9791 package_113572195
1 FIL9791 package_113594355
2 FIL9799 package_113572197
3 FIL9799 package_113594357
4 FIL4056 package_113566689
5 FIL4056 package_113591417
6 FIL4056 package_113566688
7 FIL4056 package_113591416
8 FIL4057 package_113566690
9 FIL4057 package_113591418