How to merge multiple rows sharing an ID into a single row (PySpark)

Question

I have this DataFrame in PySpark. I want a single col3 value for each col1. In SQL, I would group by col1 and take max(col3) as col3.

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   0|   1|   0|
|   0|   1|   0|
|   0|   1|   0|
|   1|   1|   0|
|   1|   1|   1|
|   1|   1|   1|
|   2|   1|   0|
|   2|   1|   1|
|   2|   1|   0|
+----+----+----+

This is the expected output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   0|   1|   0|
|   1|   1|   1|
|   2|   1|   1|
+----+----+----+

Answer 1

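You can apply the same logic in PySpark: call .groupBy on col1 and col2, then use agg to get the maximum col3 value.

Alternatively, use the row_number window function: partition by col1 and col2, order by col3 descending, and keep only the rows where the row number equals 1.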

Example:

df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   0|   1|   0|
#|   0|   1|   0|
#|   0|   1|   0|
#|   1|   1|   0|
#|   1|   1|   1|
#|   1|   1|   1|
#|   2|   1|   0|
#|   2|   1|   1|
#|   2|   1|   0|
#+----+----+----+

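# Import Spark's column aggregate under an alias so it does not shadow Python's built-in max
from pyspark.sql.functions import max as spark_max

df.groupBy("col1","col2").agg(spark_max("col3").alias("col3")).orderBy("col3").show()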
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   0|   1|   0|
#|   1|   1|   1|
#|   2|   1|   1|
#+----+----+----+

Using row_number():

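from pyspark.sql.functions import col, desc, row_number
from pyspark.sql.window import Window

# Rank rows within each (col1, col2) group, highest col3 first
w = Window.partitionBy("col1","col2").orderBy(desc("col3"))

# Keep only the top-ranked row of each group, then drop the helper column
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn").orderBy("col3").show()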
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   0|   1|   0|
#|   1|   1|   1|
#|   2|   1|   1|
#+----+----+----+
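
For intuition about why the two approaches give the same result on this data, here is a minimal plain-Python sketch (no Spark required) that mimics both on the sample rows from the question. The helper names group_max and first_row_per_group are hypothetical and exist only for this illustration; they are not part of PySpark.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question, as (col1, col2, col3) tuples
rows = [
    (0, 1, 0), (0, 1, 0), (0, 1, 0),
    (1, 1, 0), (1, 1, 1), (1, 1, 1),
    (2, 1, 0), (2, 1, 1), (2, 1, 0),
]

def group_max(rows):
    """Mimic groupBy(col1, col2).agg(max(col3)): keep the largest col3 per key."""
    best = {}
    for c1, c2, c3 in rows:
        key = (c1, c2)
        if key not in best or c3 > best[key]:
            best[key] = c3
    return sorted((c1, c2, c3) for (c1, c2), c3 in best.items())

def first_row_per_group(rows):
    """Mimic row_number() == 1 over (partition by col1, col2 order by col3 desc)."""
    # Sort by the key ascending and col3 descending, then take the first row of each key
    ordered = sorted(rows, key=lambda r: (r[0], r[1], -r[2]))
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0, 1))]

print(group_max(rows))
print(first_row_per_group(rows))
```

One practical difference: the row_number approach keeps whole rows, so it also works when the DataFrame has extra columns that are not part of the grouping key, whereas groupBy returns only the grouping columns and the aggregates.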
