A problem with two left joins in PySpark

Problem description · Votes: 0 · Answers: 1

I currently have three datasets:

table 1:
| day | spend | id |
| -------- | -------- | -------- |
| 2023-02-03   | 2.4   | 1 |

table 2:
| day | name | id |
| -------- | -------- | -------- |
| 2023-02-03   | Apple   | 1 |
| 2023-02-06   | Banana   | 2 |

table 3:
| prev_day | prev_name | prev_id |
| -------- | -------- | -------- |
| 2023-02-02   | Apple   | 1 |
| 2023-02-05   | Banana   | 2 |

```scala
table1
  .join(table2, table1("id") === table2("id") && table1("day") === table2("day"), "left")
  .join(table3, table1("id") === table3("prev_id") && table1("day") === table3("prev_day"), "left")
  .select(table1("day"), table1("spend"), table1("id"), table2("name").as("name1"), table3("prev_name").as("name2"))
```

result:
| day | spend | id | name1 | name2 |
| -------- | -------- | -------- | -------- | -------- |
| 2023-02-03   | 2.4   | 1 | null | null |

But when I remove the first join (the one with table2), I get the correct result:

```scala
table1
  .join(table3, table1("id") === table3("prev_id") && table1("day") === table3("prev_day"), "left")
  .select(table1("day"), table1("spend"), table1("id"), table3("prev_name").as("name2"))
```

result:
| day | spend | id | name2 |
| -------- | -------- | -------- | -------- |
| 2023-02-03   | 2.4   | 1 | Apple |

Does anyone know what is going on here?

I don't know how to fix it :(
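
For reference, here is a minimal PySpark reproduction of the three tables and the double join (the snippets above use the Scala API even though the question is tagged pyspark; the DataFrame names are assumed from the question):

```python
# A sketch reproducing the question's setup in PySpark; DataFrame
# and column names are taken from the tables in the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table1 = spark.createDataFrame([("2023-02-03", 2.4, 1)],
                               ["day", "spend", "id"])
table2 = spark.createDataFrame([("2023-02-03", "Apple", 1),
                                ("2023-02-06", "Banana", 2)],
                               ["day", "name", "id"])
table3 = spark.createDataFrame([("2023-02-02", "Apple", 1),
                                ("2023-02-05", "Banana", 2)],
                               ["prev_day", "prev_name", "prev_id"])

# Same double left join as the Scala snippet above.
result = (table1
    .join(table2, (table1["id"] == table2["id"])
                  & (table1["day"] == table2["day"]), "left")
    .join(table3, (table1["id"] == table3["prev_id"])
                  & (table1["day"] == table3["prev_day"]), "left")
    .select(table1["day"], table1["spend"], table1["id"],
            table2["name"].alias("name1"),
            table3["prev_name"].alias("name2")))
result.show()
```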

sql pyspark apache-spark-sql left-join
1 Answer

Votes: 0

Your problem is that the day in table 1 is one day later than prev_day in table 3, so the join condition never matches. Subtract one day from table 1's day and it works:

```scala
import org.apache.spark.sql.functions.date_sub

table1
  .join(table2, table1("id") === table2("id") && table1("day") === table2("day"), "left")
  .join(table3, table1("id") === table3("prev_id") && date_sub(table1("day"), 1) === table3("prev_day"), "left")
  .select(table1("day"), table1("spend"), table1("id"), table2("name").as("name1"), table3("prev_name").as("name2"))
```
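
Since the question is tagged pyspark, here is a sketch of the same fix in the Python API (reusing the table1/table2/table3 DataFrames from the reproduction above; to_date is added here because the sample days are strings):

```python
# PySpark sketch of the fix: shift table1's day back one day before
# matching it against table3's prev_day.
from pyspark.sql import functions as F

fixed = (table1
    .join(table2, (table1["id"] == table2["id"])
                  & (table1["day"] == table2["day"]), "left")
    .join(table3, (table1["id"] == table3["prev_id"])
                  & (F.date_sub(F.to_date(table1["day"]), 1)
                     == F.to_date(table3["prev_day"])), "left")
    .select(table1["day"], table1["spend"], table1["id"],
            table2["name"].alias("name1"),
            table3["prev_name"].alias("name2")))
fixed.show()
```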