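I am looking to merge and reshape data from 3 tables. I have 3 tables with approximately 250,000 rows and 30 columns. The data needs to be reshaped to fit a machine learning model.
Here is my original post on Stack Overflow, which details the requirements.
Here is a GitHub repository containing the 3 tables and the code from my failed merge attempts:
I tried to merge the tables using the following code: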
# Option 1 from Stack Overflow
import pandas as pd

tables = [Table_1, Table_2, Table_3]
out = (pd
    # align the three tables side by side on the shared key columns
    .concat([t.set_index(['unique_ID', 'patient_ID', 'Week']) for t in tables], axis=1)
    # go long, then spread patient_ID into columns
    .stack().unstack(level='patient_ID').add_prefix('Patient ')
    # build the row labels from the remaining index levels
    .pipe(lambda d: d.set_axis('Week' + d.index.get_level_values('Week').astype(str)
                               + ' ' + d.index.get_level_values(1))
    )
    .rename_axis(index='Clinical Data', columns=None).reset_index()
)
Output:
ValueError: Index contains duplicate entries, cannot reshape
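One way to narrow down where the duplicate entries come from is to check two things: whether any table repeats a (unique_ID, patient_ID, Week) combination, and whether any non-key column name appears in more than one table. The sketch below does this on toy data; `report_duplicates`, `t1`, and `t2` are hypothetical names and values used for illustration, not the real tables.

```python
import pandas as pd

KEYS = ["unique_ID", "patient_ID", "Week"]

def report_duplicates(tables, keys=KEYS):
    """Return (rows with a repeated key combination per table,
    non-key column names that occur in more than one table)."""
    # keep=False flags every row that belongs to a repeated key combination
    dup_rows = [int(t.duplicated(subset=keys, keep=False).sum()) for t in tables]
    seen, shared = set(), set()
    for t in tables:
        cols = set(t.columns) - set(keys)
        shared |= seen & cols   # names already seen in an earlier table
        seen |= cols
    return dup_rows, sorted(shared)

# Toy stand-ins for the real tables (assumed values)
t1 = pd.DataFrame({"unique_ID": [1, 2, 2], "patient_ID": [1, 1, 1],
                   "Week": [0, 1, 1], "VISITID": [10, 11, 12]})
t2 = pd.DataFrame({"unique_ID": [1, 2], "patient_ID": [1, 1],
                   "Week": [0, 1], "VISITID": [10, 11],
                   "medication": [2.0, 2.0]})

dup_rows, shared = report_duplicates([t1, t2])
print(dup_rows, shared)
```

If either result is non-empty for the real tables, the reshape step has more than one value competing for the same cell, which is the situation the error message describes.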
# Option 2 from Stack Overflow
from functools import reduce

tables = [Table_1, Table_2, Table_3]
out = (reduce(lambda a, b: a.merge(b, on=['unique_ID', 'patient_ID', 'Week']), tables)
    .melt(['unique_ID', 'patient_ID', 'Week'])
    .assign(**{'Clinical Data': lambda d: 'Week' + d.pop('Week').astype(str)
                                          + ' ' + d.pop('variable')})
    .pivot(index='Clinical Data', columns='patient_ID', values='value')
    .rename_axis(columns=None).reset_index()
)
The output is incomplete; only a small percentage of the data is collected and reshaped.
Out:
    Clinical Data               1
0   Week0 VISITID_x             15031
1   Week0 VISITID_y             15031
2   Week0 admin_location        1.0
3   Week0 alc_qty               NaN
4   Week0 alc_result            0.0
5   Week0 alc_test              1.0
6   Week0 dose_received         8.0
7   Week0 medication            2.0
8   Week0 no_reason             NaN
9   Week0 other_reason          NaN
10  Week0 sr_alcohol            0.0
11  Week0 sr_amphetamine        0.0
12  Week0 sr_benzodiazepine     0.0
13  Week0 sr_cannabis           0.0
14  Week0 sr_cocaine            0.0
15  Week0 sr_methadone          0.0
16  Week0 sr_methanphetamine    0.0
17  Week0 sr_opiates            1.0
18  Week0 sr_other              0.0
19  Week0 sr_oxycodone          0.0
20  Week0 sr_propoxyphene       0.0
21  Week0 supervised            0.0
22  Week0 test_amphetamine      0.0
23  Week0 test_benzodiazepine   0.0
24  Week0 test_cannabis         0.0
25  Week0 test_cocaine          0.0
26  Week0 test_methadone        0.0
27  Week0 test_methamphetamine  0.0
28  Week0 test_opiate300        1.0
29  Week0 test_oxycodone        0.0
30  Week0 test_performed        1.0
31  Week0 test_propoxyphene     0.0
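`merge` defaults to an inner join (`how='inner'`), so only key combinations present in every table survive, and the `_x`/`_y` suffixes on VISITID show that this column exists in more than one table without being a join key. A possible check, sketched on toy data, is to run an outer merge on the keys alone with `indicator=True` and count how many key combinations actually match. The helper `match_counts` and the frames `left` and `right` are hypothetical stand-ins, not the real tables.

```python
import pandas as pd

KEYS = ["unique_ID", "patient_ID", "Week"]

def match_counts(left, right, keys=KEYS):
    """Count unique key combinations found in both tables or in only one."""
    merged = (left[keys].drop_duplicates()
              .merge(right[keys].drop_duplicates(),
                     on=keys, how="outer", indicator=True))
    # the _merge column holds 'both', 'left_only', or 'right_only'
    return merged["_merge"].value_counts().to_dict()

# Toy stand-ins (assumed values)
left = pd.DataFrame({"unique_ID": [1, 2, 3], "patient_ID": [1, 1, 2],
                     "Week": [0, 1, 0], "VISITID": [10, 11, 12]})
right = pd.DataFrame({"unique_ID": [1, 4], "patient_ID": [1, 2],
                      "Week": [0, 1], "VISITID": [10, 13]})

counts = match_counts(left, right)
inner = left.merge(right, on=KEYS)   # default inner join, as in option 2
print(counts)
print(len(inner), list(inner.columns))
```

A large `left_only` or `right_only` count on the real tables would indicate that the inner join is discarding most rows before the reshape even starts. If `unique_ID` is unique to each table, joining on it would leave almost nothing in common, though whether that applies here depends on the actual data.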