Add missing categories to a column in a DataFrame


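I have the following Spark DataFrame. The country column has 10 distinct values. I want a new DataFrame like the one shown under the expected result below.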

DataFrame
+-------------+--------------+------------------+
|         Code|       country|                t1|
+-------------+--------------+------------------+
|            A|        Canada| 6218.400000000001|
|            A|       Central|              30.4|
|            A|        France|24540.629999999965|
|            A|       Germany|27688.029999999966|
|            A|     Northeast|             51.41|
|            A|     Northwest| 56261.31000000015|
|            A|     Southeast|             55.71|
|            A|     Southwest| 92640.42999999833|
|            A|United Kingdom|              0.64|
|            B|     Australia|145856.31999999806|
|            C|        Canada| 28223.26999999983|
|            C|     Northwest|              0.87|
|            C|     Southwest|              0.44|
+-------------+--------------+------------------+

The distinct values of the country column are:
+--------------+
|       country|
+--------------+
|     Australia|
|        Canada|
|       Central|
|        France|
|       Germany|
|     Northeast|
|     Northwest|
|     Southeast|
|     Southwest|
|United Kingdom|
+--------------+

Expected result:

+-------------+--------------+------------------+
|         Code|       country|                t1|
+-------------+--------------+------------------+
|            A|     Australia|              null|
|            A|        Canada| 6218.400000000001|
|            A|       Central|              30.4|
|            A|        France|24540.629999999965|
|            A|       Germany|27688.029999999966|
|            A|     Northeast|             51.41|
|            A|     Northwest| 56261.31000000015|
|            A|     Southeast|             55.71|
|            A|     Southwest| 92640.42999999833|
|            A|United Kingdom|              0.64|
|            B|     Australia|145856.31999999806|
|            B|        Canada|              null|
|            B|       Central|              null|
|            B|        France|              null|
|            B|       Germany|              null|
|            B|     Northeast|              null|
|            B|     Northwest|              null|
|            B|     Southeast|              null|
|            B|     Southwest|              null|
|            B|United Kingdom|              null|
|            C|     Australia|              null|
|            C|        Canada| 28223.26999999983|
|            C|       Central|              null|
|            C|        France|              null|
|            C|       Germany|              null|
|            C|     Northeast|              null|
|            C|     Northwest|              0.87|
|            C|     Southeast|              null|
|            C|     Southwest|              0.44|
|            C|United Kingdom|              null|
+-------------+--------------+------------------+

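How can I produce this expected output in Scala? I have looked through the Dataset functions and methods, but I could not find any clue about where to start.

Note that there may be more than one column. The logic is the same in that case: for every category, I want to insert the categories that are missing, across all columns.

I am a beginner with Spark and Scala. Thanks in advance :)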

scala apache-spark apache-spark-dataset
1 Answer

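Cross join the distinct codes with the distinct countries, then left-outer join that set of combinations back to the original table, something like this: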

// Assumes `import spark.implicits._` is in scope for the $"..." syntax.
val codes = data.select($"Code").distinct
val countries = data.select($"country").distinct
val combinations = codes.crossJoin(countries)
val result = combinations.join(data, Seq("Code", "country"), "left_outer")
  .orderBy($"Code", $"country")
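
For reference, here is a minimal self-contained sketch of the same cross-join-then-left-join idea, extended to an arbitrary list of category columns. The object name `FillMissingCategories` and the helper `completeCategories` are hypothetical names introduced only for this illustration, and the local SparkSession and the small subset of the question's rows are assumptions made so the example can run on its own. It is one possible reading of the "multiple columns" requirement, not a definitive implementation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FillMissingCategories {

  // Hypothetical helper: builds every combination of the distinct values of
  // the given key columns, then left-joins the original rows onto that grid.
  // Combinations that do not occur in `df` get null in all non-key columns.
  def completeCategories(df: DataFrame, keyCols: Seq[String]): DataFrame = {
    require(keyCols.nonEmpty, "at least one key column is required")
    val grid = keyCols
      .map(c => df.select(c).distinct())   // distinct values per key column
      .reduce(_ crossJoin _)               // Cartesian product of those values
    grid.join(df, keyCols, "left_outer")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the sketch out.
    val spark = SparkSession.builder()
      .appName("fill-missing-categories")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A small subset of the rows from the question.
    val data = Seq(
      ("A", "Canada", 6218.400000000001),
      ("A", "Central", 30.4),
      ("B", "Australia", 145856.31999999806),
      ("C", "Canada", 28223.26999999983)
    ).toDF("Code", "country", "t1")

    val result = completeCategories(data, Seq("Code", "country"))
      .orderBy("Code", "country")

    result.show()
    spark.stop()
  }
}
```

With this subset there are 3 distinct codes and 3 distinct countries, so the result has 9 rows, and t1 is null wherever a code/country pair was absent from the input. Any additional value columns in the original DataFrame come through the left join unchanged. Keep in mind that the cross join grows with the product of the distinct counts, so this approach is best suited to low-cardinality category columns.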