How can I merge dataframes A and B to get dataframe C?
DF A:
X      Y
0-10   10-25
10-20  25-75
20-30  75-150
DF B:
Binned  Name  Value
0-10    X     20
10-20   X     100
20-30   X     200
10-25   Y     90
25-75   Y     25
75-150  Y     90
DF C:
X      X_Val  Y       Y_Val
0-10   20     10-25   90
10-20  100    25-75   25
20-30  200    75-150  90
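For reference, the two input frames can be reproduced with the following minimal setup (the variable names `df_a` and `df_b` are my own choice, not from the question):

```python
import pandas as pd

# DF A: each column holds the bin labels for one name
df_a = pd.DataFrame({'X': ['0-10', '10-20', '20-30'],
                     'Y': ['10-25', '25-75', '75-150']})

# DF B: long format, one (bin, name, value) triple per row
df_b = pd.DataFrame({'Binned': ['0-10', '10-20', '20-30', '10-25', '25-75', '75-150'],
                     'Name':   ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                     'Value':  [20, 100, 200, 90, 25, 90]})
```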
This should work:
import pandas as pd

# pivot B to make columns X & Y
df_b = df_b.pivot_table(values=['Value'], index=['Binned'], columns=['Name']).reset_index()
df_b.columns = ['Binned', 'X', 'Y']
# merge X & Y cols sequentially
df_c = pd.merge(df_a, df_b[['Binned', 'X']], how='left', left_on=['X'], right_on=['Binned'], suffixes=('', '_Val'))
df_c = pd.merge(df_c, df_b[['Binned', 'Y']], how='left', left_on=['Y'], right_on=['Binned'], suffixes=('', '_Val'))
df_c = df_c[['X', 'X_Val', 'Y', 'Y_Val']]
# X X_Val Y Y_Val
# 0 0-10 20.0 10-25 90.0
# 1 10-20 100.0 25-75 25.0
# 2 20-30 200.0 75-150 90.0
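Since each (Binned, Name) pair occurs exactly once in B, plain `pivot` also works and skips `pivot_table`'s aggregation step. A sketch, where the `df_b` literal simply mirrors the question's data:

```python
import pandas as pd

df_b = pd.DataFrame({'Binned': ['0-10', '10-20', '20-30', '10-25', '25-75', '75-150'],
                     'Name':   ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                     'Value':  [20, 100, 200, 90, 25, 90]})

# one row per bin, one column per Name; bins that belong to the
# other name come out as NaN in that column
wide = df_b.pivot(index='Binned', columns='Name', values='Value').reset_index()
```

`pivot` raises if a (index, columns) pair is duplicated, which is a useful sanity check here, whereas `pivot_table` would silently aggregate.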
I think you need:
#reshape dfA for inner merge with dfB
df1 = dfA.melt(var_name='Name', value_name='Binned')
df = dfB.merge(df1)
#reshape for multiple columns by groups
df = (df.set_index([df.groupby('Name').cumcount(), 'Name'])
.unstack()
.sort_index(axis=1, level=1)
.rename(columns={'Binned':'','Value':'_Val'})
.swaplevel(0,1,axis=1))
df.columns = df.columns.map(''.join)
print(df)
X X_Val Y Y_Val
0 0-10 20 10-25 90
1 10-20 100 25-75 25
2 20-30 200 75-150 90
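An alternative sketch that avoids reshaping altogether: build one lookup Series per name and `map` the bin labels onto values. This assumes each bin label is unique within a name, as in the question's data (the `df_a`/`df_b` literals below just reproduce it):

```python
import pandas as pd

df_a = pd.DataFrame({'X': ['0-10', '10-20', '20-30'],
                     'Y': ['10-25', '25-75', '75-150']})
df_b = pd.DataFrame({'Binned': ['0-10', '10-20', '20-30', '10-25', '25-75', '75-150'],
                     'Name':   ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                     'Value':  [20, 100, 200, 90, 25, 90]})

df_c = df_a.copy()
for name in ['X', 'Y']:
    # Series mapping bin label -> value for this name only
    lookup = df_b.loc[df_b['Name'] == name].set_index('Binned')['Value']
    df_c[f'{name}_Val'] = df_c[name].map(lookup)
df_c = df_c[['X', 'X_Val', 'Y', 'Y_Val']]
```

Unmatched bin labels would simply come out as NaN, which makes missing entries in B easy to spot.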
Write a SQL query for each name (`spark.sqlContext` is not callable; the SparkSession method is `spark.sql`):
df1 = spark.sql("select * from DF_B where Name = 'X'")
df2 = spark.sql("select * from DF_B where Name = 'Y'")
Create a row-id column for each dataframe:
from pyspark.sql.functions import monotonically_increasing_id

df1_id = df1.withColumn("id", monotonically_increasing_id())
df2_id = df2.withColumn("id", monotonically_increasing_id())
Now we can join df1_id and df2_id on that id. Note that monotonically_increasing_id produces values that are unique and increasing but not consecutive, so the ids of two separately built dataframes only line up when their partitioning matches; a row_number window is safer if exact row alignment matters.
df1_id.join(df2_id, "id").show()