I have the following sample DataFrame:
import pandas as pd

df = pd.DataFrame(data={
    'RecordID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
    'DisplayLabel': ['Source', 'Test', 'Value 1', 'Value 2', 'Value 3', 'Source', 'Test', 'Value 1', 'Value 2', 'Source', 'Test', 'Value 1', 'Value 2', 'Source', 'Test', 'Value 1', 'Value 2', 'Source', 'Test', 'Value 1', 'Value 2'],
    'Value': ['Web', 'Logic', 'S', 'I', 'Complete', 'Person', 'Voice', '>20', 'P', 'Mail', 'OCR', 'A', 'I', 'Dictation', 'Understandable', 'S', 'I', 'Web', 'Logic', 'R', 'S'],
})
which creates this DataFrame:
+-------+----------+---------------+----------------+
| Index | RecordID | DisplayLabel | Value |
+-------+----------+---------------+----------------+
| 0 | 1 | Source | Web |
| 1 | 1 | Test | Logic |
| 2 | 1 | Value 1 | S |
| 3 | 1 | Value 2 | I |
| 4 | 1 | Value 3 | Complete |
| 5 | 2 | Source | Person |
| 6 | 2 | Test | Voice |
| 7 | 2 | Value 1 | >20 |
| 8 | 2 | Value 2 | P |
| 9 | 3 | Source | Mail |
| 10 | 3 | Test | OCR |
| 11 | 3 | Value 1 | A |
| 12 | 3 | Value 2 | I |
| 13 | 4 | Source | Dictation |
| 14 | 4 | Test | Understandable |
| 15 | 4 | Value 1 | S |
| 16 | 4 | Value 2 | I |
| 17 | 5 | Source | Web |
| 18 | 5 | Test | Logic |
| 19 | 5 | Value 1 | R |
| 20 | 5 | Value 2 | S |
+-------+----------+---------------+----------------+
I am trying to "unmelt" (though not exactly) the Source and Test rows into new DataFrame columns, so that the result looks like this:
+-------+----------+-----------+----------------+---------------+----------+
| Index | RecordID | Source | Test | Result | Value |
+-------+----------+-----------+----------------+---------------+----------+
| 0 | 1 | Web | Logic | Value 1 | S |
| 1 | 1 | Web | Logic | Value 2 | I |
| 2 | 1 | Web | Logic | Value 3 | Complete |
| 3 | 2 | Person | Voice | Value 1 | >20 |
| 4 | 2 | Person | Voice | Value 2 | P |
| 5 | 3 | Mail | OCR | Value 1 | A |
| 6 | 3 | Mail | OCR | Value 2 | I |
| 7 | 4 | Dictation | Understandable | Value 1 | S |
| 8 | 4 | Dictation | Understandable | Value 2 | I |
| 9 | 5 | Web | Logic | Value 1 | R |
| 10 | 5 | Web | Logic | Value 2 | S |
+-------+----------+-----------+----------------+---------------+----------+
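My understanding is that pivot and melt operate on the entire DisplayLabel column, not just on some of its values.
Any help would be greatly appreciated. I have read up on Pandas melt and Pandas pivot, as well as a few references on Stack Overflow, but I can't seem to find a quick way to do this.
Thanks!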
set_index, unstack, then melt
df.set_index(['RecordID', 'DisplayLabel']).Value.unstack().reset_index() \
    .melt(['RecordID', 'Source', 'Test'], var_name='Result', value_name='Value') \
    .sort_values('RecordID').dropna(subset=['Value'])
RecordID Source Test Result Value
0 1 Web Logic Value 1 S
5 1 Web Logic Value 2 I
10 1 Web Logic Value 3 Complete
1 2 Person Voice Value 1 >20
6 2 Person Voice Value 2 P
2 3 Mail OCR Value 1 A
7 3 Mail OCR Value 2 I
3 4 Dictation Understandable Value 1 S
8 4 Dictation Understandable Value 2 I
4 5 Web Logic Value 1 R
9 5 Web Logic Value 2 S
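Custom function with groupby
def f(t):
    # t is a (RecordID, sub-DataFrame) pair produced by iterating over groupby
    name, df = t
    d = dict(zip(df['DisplayLabel'], df['Value']))
    # Pull out the two attributes; whatever remains are the Value rows
    source = d.pop('Source')
    test = d.pop('Test')
    result, value = zip(*d.items())
    return pd.DataFrame(
        dict(RecordID=name, Source=source, Test=test, Result=result, Value=value)
    )
pd.concat(map(f, df.groupby('RecordID')))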
RecordID Source Test Result Value
0 1 Web Logic Value 1 S
1 1 Web Logic Value 2 I
2 1 Web Logic Value 3 Complete
0 2 Person Voice Value 1 >20
1 2 Person Voice Value 2 P
0 3 Mail OCR Value 1 A
1 3 Mail OCR Value 2 I
0 4 Dictation Understandable Value 1 S
1 4 Dictation Understandable Value 2 I
0 5 Web Logic Value 1 R
1 5 Web Logic Value 2 S
df = pd.DataFrame(data={
'RecordID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
'DisplayLabel': [
'Source', 'Test', 'Value 1', 'Value 2', 'Value 3',
'Source', 'Test', 'Value 1', 'Value 2',
'Source', 'Test', 'Value 1', 'Value 2',
'Source', 'Test', 'Value 1', 'Value 2',
'Source', 'Test', 'Value 1', 'Value 2'
],
'Value': [
'Web', 'Logic', 'S', 'I', 'Complete',
'Person', 'Voice', '>20', 'P',
'Mail', 'OCR', 'A', 'I',
'Dictation', 'Understandable', 'S', 'I',
'Web', 'Logic', 'R', 'S'
]
})
We can get your result by applying some logic and a pivot: we split your data by checking whether DisplayLabel contains Value, and then we join the two pieces back together.
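The code for this answer is not shown above, so the following is only a minimal sketch of the approach as described, not necessarily what the answerer wrote. It assumes that every measurement label contains the substring "Value" and that Source and Test appear exactly once per RecordID. The names `is_value`, `attrs`, `values` and `result` are made up for illustration.

```python
import pandas as pd

# Same sample data as in the question (with 'Value 3' spelled consistently).
df = pd.DataFrame({
    'RecordID': [1] * 5 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4,
    'DisplayLabel': ['Source', 'Test', 'Value 1', 'Value 2', 'Value 3']
                    + ['Source', 'Test', 'Value 1', 'Value 2'] * 4,
    'Value': ['Web', 'Logic', 'S', 'I', 'Complete', 'Person', 'Voice', '>20', 'P',
              'Mail', 'OCR', 'A', 'I', 'Dictation', 'Understandable', 'S', 'I',
              'Web', 'Logic', 'R', 'S'],
})

# Assumption: labels containing "Value" are measurements, the rest are attributes.
is_value = df['DisplayLabel'].str.contains('Value')

# Pivot only the attribute rows (Source, Test): one row per RecordID.
attrs = df[~is_value].pivot(index='RecordID', columns='DisplayLabel', values='Value')

# Leave the measurement rows in long form; the label column becomes Result.
values = df[is_value].rename(columns={'DisplayLabel': 'Result'}).set_index('RecordID')

# Join on the shared RecordID index, then restore RecordID as a column.
result = attrs.join(values).reset_index()
result.columns.name = None  # drop any leftover 'DisplayLabel' axis name

print(result)
```

Because only the Source and Test rows are pivoted, no NaN padding is created for records that lack a Value 3, so no dropna step is needed here.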
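I tried a different approach: first using pivot to unstack the data, then partially reshaping it with wide_to_long (sorry if this is not efficient, but it seems to produce the desired output).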
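The code for this answer is missing as well, so here is a hedged sketch of what a pivot followed by `pd.wide_to_long` could look like. It assumes the measurement labels all follow the pattern "Value N" with a numeric suffix, which is what lets `wide_to_long` treat `Value` as the stub name. The names `wide` and `tidy` are illustrative only.

```python
import pandas as pd

# Same sample data as in the question (with 'Value 3' spelled consistently).
df = pd.DataFrame({
    'RecordID': [1] * 5 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4,
    'DisplayLabel': ['Source', 'Test', 'Value 1', 'Value 2', 'Value 3']
                    + ['Source', 'Test', 'Value 1', 'Value 2'] * 4,
    'Value': ['Web', 'Logic', 'S', 'I', 'Complete', 'Person', 'Voice', '>20', 'P',
              'Mail', 'OCR', 'A', 'I', 'Dictation', 'Understandable', 'S', 'I',
              'Web', 'Logic', 'R', 'S'],
})

# Step 1: pivot everything wide, giving columns Source, Test, Value 1, Value 2, Value 3.
wide = df.pivot(index='RecordID', columns='DisplayLabel', values='Value').reset_index()
wide.columns.name = None

# Step 2: reshape only the "Value N" columns back to long form.
# The numeric suffix after the space separator ends up in the Result index level.
tidy = pd.wide_to_long(
    wide, stubnames='Value', i=['RecordID', 'Source', 'Test'], j='Result', sep=' '
)

# Step 3: drop the padding rows (records without a Value 3) and tidy up.
tidy = (
    tidy.dropna(subset=['Value'])
        .reset_index()
        .sort_values(['RecordID', 'Result'])
        .reset_index(drop=True)
)

# Rebuild the original label text from the numeric suffix.
tidy['Result'] = 'Value ' + tidy['Result'].astype(str)

print(tidy)
```

Compared with the set_index/unstack/melt version, this does the same wide-then-long round trip; the main difference is that `wide_to_long` selects the columns to melt by stub name rather than by listing the id columns to keep.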