I'm trying to write a SQL query to use in PySpark that scrubs personal information from a PySpark df. The df I want to modify looks like this:
hashed_customer  firstname    lastname    email    order_id  status    timestamp
eater 1_uuid     1_firstname  1_lastname  1_email  12345     OPTED_IN  2020-05-14 20:45:15
eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
eater 3_uuid     3_firstname  3_lastname  3_email  34567     OPTED_IN  2020-05-14 19:31:55
eater 4_uuid     4_firstname  4_lastname  4_email  45678     OPTED_IN  2020-05-14 17:49:27
I also have another PySpark df listing the customers whose data needs to be removed from the customer_temp_tb table. It looks like this:
hashed_customer  eaterstatus
eater 1_uuid     OPTED_OUT
eater 3_uuid     OPTED_OUT
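I'm trying to write a SQL query for PySpark that clears the first name, last name, and email in the first table whenever the customer appears in the second table. Something like: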
UPDATE customer_temp_tb
SET firstname="", lastname="", email=""
WHERE hashed_customer IN
    (SELECT hashed_customer FROM opt_out_temp_tb)
so that the final result would look like this:
hashed_customer  firstname    lastname    email    order_id  status    timestamp
eater 1_uuid     NaN          NaN         NaN      12345     OPTED_IN  2020-05-14 20:45:15
eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
eater 3_uuid     NaN          NaN         NaN      34567     OPTED_IN  2020-05-14 19:31:55
eater 4_uuid     4_firstname  4_lastname  4_email  45678     OPTED_IN  2020-05-14 17:49:27
The problem I seem to be running into is that PySpark doesn't support UPDATE. Are there any other options?
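I think you can set those columns to null or an empty string "" instead of deleting anything. Since Spark DataFrames are immutable, that means building a new DataFrame in which the three columns are conditionally replaced, rather than running an UPDATE against the existing one.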