Merging two dataframes to produce repeated values in one file

Question  Votes: 1  Answers: 2

I have an .XLSX file with 2 columns,

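<Name> and <Item To Package>. The <Item To Package> column stores multiple data links, separated by semicolons. I need to manipulate this dataset on import, and I am having a hard time deciding which approach is best.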

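I considered replacing the semicolons with commas and then packing the data into a dictionary, where the <Name> is the 'key' and the <Item To Package> becomes a list 'value'.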

However, I am not sure this is the most efficient approach.

import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}

df = pd.DataFrame(data=d)
df.head()

sku = df['<Name>']
upc = df['<Item To Package>']

# My attempt: this raises TypeError, because a Series cannot be used as a dictionary key
PartToUPC = {}
PartToUPC[sku] = upc
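
For comparison, the dictionary idea described above could be sketched roughly as follows. This is only a minimal illustration on a shortened copy of the sample data; the names `part_to_upc`, `rows`, and `result` are assumptions introduced here, not part of the original attempt. Because a name such as FIL4056 appears on more than one row, the sketch extends each list rather than overwriting it.

```python
import pandas as pd

# Shortened copy of the sample data (FIL4056 appears twice on purpose).
df = pd.DataFrame({
    "<Name>": ["FIL9791", "FIL4056", "FIL4056"],
    "<Item To Package>": [
        "package_113572195;package_113594355",
        "package_113566689;package_113591417",
        "package_113566688;package_113591416",
    ],
})

# Map each name to a list of package codes. Lists are extended so that
# repeated names accumulate their codes instead of being overwritten.
part_to_upc = {}
for name, packages in zip(df["<Name>"], df["<Item To Package>"]):
    part_to_upc.setdefault(name, []).extend(packages.split(";"))

# Flatten the dictionary back into (name, package) pairs, one per row.
rows = [(name, pkg) for name, pkgs in part_to_upc.items() for pkg in pkgs]
result = pd.DataFrame(rows, columns=["<Name>", "<Item To Package>"])
print(result)
```

This works, but it loops in plain Python, so a pandas-native reshaping is usually the shorter route.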

Below is what I need the file to look like:

All SKUs in column A and their respective package codes in column B

FIL9791 | package_113572195

FIL9791 | package_113594355

FIL9799 | package_113572197

FIL9799 | package_113594357

python pandas
2 Answers
1
vote

I think this is what you need:

import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}
df = pd.DataFrame(data=d)
# Split each string on ';' into a wide frame indexed by name, then stack the columns into rows
new_df = pd.DataFrame(df["<Item To Package>"].str.split(';').tolist(), index=df["<Name>"]).stack()
# Drop the inner index level (position within the row) and turn '<Name>' back into a column
new_df = new_df.reset_index(level=1, drop=True).reset_index()
new_df.columns = ['<Name>', '<Item To Package>']
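
To make the reshaping in this answer easier to follow, here is a small sketch of the intermediate steps on a shortened copy of the sample data. The variable names `parts`, `wide`, and `tall` are illustrative assumptions and do not appear in the answer itself.

```python
import pandas as pd

df = pd.DataFrame({
    "<Name>": ["FIL9791", "FIL9799"],
    "<Item To Package>": [
        "package_113572195;package_113594355",
        "package_113572197;package_113594357",
    ],
})

# Step 1: split each string into a list, one list per original row.
parts = df["<Item To Package>"].str.split(";").tolist()

# Step 2: build a wide frame (one column per package), indexed by name.
wide = pd.DataFrame(parts, index=df["<Name>"])

# Step 3: stack() moves the columns into rows, giving a Series whose
# index has two levels: the name and the position within the row.
tall = wide.stack()

print(wide.shape)  # (2, 2)
print(len(tall))   # 4
```

The two-level index on `tall` is why an index reset is needed before the final column names can be assigned.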

0
votes

You can solve this with .explode() after splitting the column into lists of values.

import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}

df = pd.DataFrame(data=d)
# Turn each semicolon-separated string into a list of package codes
df['<Item To Package>'] = df['<Item To Package>'].str.split(';')
# Give every list element its own row
df = df.explode('<Item To Package>')
print(df)

Output:

    <Name>  <Item To Package>
0  FIL9791  package_113572195
0  FIL9791  package_113594355
1  FIL9799  package_113572197
1  FIL9799  package_113594357
2  FIL4056  package_113566689
2  FIL4056  package_113591417
3  FIL4056  package_113566688
3  FIL4056  package_113591416
4  FIL4057  package_113566690
4  FIL4057  package_113591418

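Keep in mind that explode() preserves the original index for the rows it expands. So if you do not need the index to match the original rows and would rather reset it, you can add: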

df = df.reset_index(drop=True)
print(df)

Output:

    <Name>  <Item To Package>
0  FIL9791  package_113572195
1  FIL9791  package_113594355
2  FIL9799  package_113572197
3  FIL9799  package_113594357
4  FIL4056  package_113566689
5  FIL4056  package_113591417
6  FIL4056  package_113566688
7  FIL4056  package_113591416
8  FIL4057  package_113566690
9  FIL4057  package_113591418
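
As an alternative to the separate reset_index call above, the split, explode, and index reset can be folded into one small helper. This is a sketch, not part of the original answer: the function name `split_to_rows` is hypothetical, and it assumes pandas 1.1 or later, where `explode` accepts `ignore_index`.

```python
import pandas as pd

def split_to_rows(frame, column, sep=";"):
    """Return a new frame with one row per sep-delimited value in column.

    Hypothetical helper for illustration; the input frame is not modified.
    """
    return (
        frame.assign(**{column: frame[column].str.split(sep)})
        # ignore_index=True renumbers the rows 0..n-1 (assumes pandas >= 1.1)
        .explode(column, ignore_index=True)
    )

df = pd.DataFrame({
    "<Name>": ["FIL9791", "FIL9799"],
    "<Item To Package>": [
        "package_113572195;package_113594355",
        "package_113572197;package_113594357",
    ],
})

result = split_to_rows(df, "<Item To Package>")
print(result)
```

Because `assign` returns a copy, the original `df` keeps its semicolon-separated strings, which can be handy if the raw values are still needed later.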