Spark job for a PySpark DataFrame union never finishes

Problem description · Votes: 3 · Answers: 2

I am trying to union all the DataFrames I have using the code below, and then sort the resulting DataFrame in descending order on the timestamp column.

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

dfs = [df1, df2, df3]
# Union all DataFrames pairwise, then sort by timestamp, newest first.
df_final = reduce(DataFrame.union, dfs).sort(col('timestamp').desc())

The Spark job doesn't finish; it just hangs here. What could be the problem? I ran the same code about three days ago and it worked fine. Now for some reason it never completes, and no error is shown either. I also tried unionByName(), and even that has the same problem. What should I do?
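For reference, the unionByName() variant would look roughly like this (a sketch; unionByName matches columns by name rather than by position):

from functools import reduce
from pyspark.sql.functions import col

dfs = [df1, df2, df3]
# unionByName aligns columns by name instead of by position.
df_final = reduce(lambda a, b: a.unionByName(b), dfs) \
    .sort(col('timestamp').desc())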

This is what the DataFrames look like:

+---------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params                                                                                                                                               |timestamp          |
+---------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile  |UPDATE   |[member_id -> cqhi6k5lby43pr3iethfmcp8sjq7_STG, easy_id -> 993270334, field -> password_hash, member_uuid -> 027130fe-584d-4d8e-9fb0-b87c984a0c20]   |2020-02-11 19:15:32|
|profile  |UPDATE   |[member_id -> cqhi6k5lby43pr3iethfmcp8sjq7_STG, easy_id -> 993270334, field -> password_hash, member_uuid -> 027130fe-584d-4d8e-9fb0-b87c984a0c20]   |2020-02-11 19:07:34|

+---------+---------+--------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params                                                                                                                    |timestamp          |
+---------+---------+--------------------------------------------------------------------------------------------------------------------------+-------------------+
|member   |CREATE   |[member_id -> h4m015wf1qxwrogj6d9l2uc5bsa9_STG, easy_id -> 993270331, member_uuid -> ea8e7e39-4a0a-4d41-b47e-70c8e56a2bca]|2020-01-02 09:51:32|
|member   |CREATE   |[member_id -> oeip31lpid9cexl9o5asip92idh7_STG, easy_id -> 993270336, member_uuid -> 9e65124b-cb16-4803-b74d-c0b6a3cb083a]|2020-01-01 10:31:32|

+---------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation     |params                                                                                                                                                                                               |timestamp          |
+---------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile  |CREATE_CARD   |[member_id -> 1s9miu7t6an50fplvvhybow6edx9_STG, easy_id -> 993270335, created_by -> kobo, card_token -> 8236961209881953, member_uuid -> 50d966f2-2820-441a-afbe-851e45eeb13e]                       |2020-02-24 03:07:04|
|profile  |CREATE_CARD   |[member_id -> ajuypjtnlzmk4na047cgav27jma6_STG, easy_id -> 993270327, created_by -> beats, card_token -> 9000141161458480, member_uuid -> 2dec548e-681d-11ea-bc55-0242ac130003]                      |2020-01-11 02:01:53|
python apache-spark pyspark apache-spark-sql pyspark-dataframes
2 Answers
0 votes

Try changing your code to this:

import pandas as pd

# Convert each Spark DataFrame to pandas on the driver, then concatenate and sort.
pdfs = [df.toPandas() for df in [df1, df2, df3]]
df_final = pd.concat(pdfs).sort_values('timestamp', ascending=False)
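Bear in mind that toPandas() collects each DataFrame to the driver, so this only works when the combined data fits in driver memory; for anything larger, a Spark-side union (as in the question, or the answer below) is the safer route.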

0 votes

I'm not sure which reduce you're referring to in your example, but this is how I would write the same code:

from functools import reduce

dfs = [df1, df2, df3]

final_df = reduce(lambda a, b: a.union(b), dfs)
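Note that the union by itself is lazy, so nothing runs until an action forces the plan; the hang described in the question would surface at the action that triggers the sort, which requires a full shuffle. A minimal sketch, building on the final_df above, for checking the plan cheaply before kicking off the job:

from pyspark.sql.functions import col

sorted_df = final_df.sort(col('timestamp').desc())

# explain() only prints the physical plan on the driver; it does not start
# the job, so it is a cheap sanity check before an action such as show().
sorted_df.explain()

sorted_df.show(truncate=False)  # this action actually triggers the work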