PySpark DataFrame: get the second-lowest value in each row

Question

I would like to ask how to get the second-lowest value in each row of a PySpark DataFrame, in case anyone has an idea.

For example:

Input DataFrame:

Col1  Col2  Col3  Col4 
83    32    14    62   
63    32    74    55   
13    88     6    46

Expected output:

Col1  Col2  Col3  Col4 Res
83    32    14    62   32   
63    32    74    55   55   
13    88     6    46   13

Thanks

Answer 1

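We can use the concat_ws function to join all the columns of a row into one string, then use split to turn that string into an array.

Then use the array_sort function to sort the array and extract its second element, [1].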

Example:

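from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('83','32','14','62'), ('63','32','74','55'), ('13','88','6','46')],
    ['Col1','Col2','Col3','Col4'])

# split() yields strings, so cast the array to array<int> before sorting;
# sorted as text, '46' would come before '6'.
df.selectExpr("array_sort(cast(split(concat_ws(',',Col1,Col2,Col3,Col4),',') as array<int>))[1] Res").show()

#+---+
#|Res|
#+---+
#| 32|
#| 55|
#| 13|
#+---+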

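A more dynamic way:

# '*' expands to every column, so the names need not be listed.
# The cast makes the sort numeric, and the alias names the result column Res.
df.selectExpr("array_sort(cast(split(concat_ws(',',*),',') as array<int>))[1] Res").show()

#+---+
#|Res|
#+---+
#| 32|
#| 55|
#| 13|
#+---+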

EDIT:

# Add the Res column to the DataFrame.
# The cast to array<int> makes the sort numeric rather than lexicographic.
df1 = df.selectExpr("*", "array_sort(cast(split(concat_ws(',',*),',') as array<int>))[1] Res")
df1.show()

#+----+----+----+----+---+
#|Col1|Col2|Col3|Col4|Res|
#+----+----+----+----+---+
#|  83|  32|  14|  62| 32|
#|  63|  32|  74|  55| 55|
#|  13|  88|   6|  46| 13|
#+----+----+----+----+---+
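
A point worth checking with this approach is the element type: split returns strings, and strings sort character by character rather than by numeric value. The plain-Python sketch below (no Spark involved; the variable names are made up for illustration) mimics the join, split, and sort steps on the third input row to show how the two orderings differ.

```python
# Third input row, with every value stored as text.
row = ["13", "88", "6", "46"]

# Mimic concat_ws(',', ...) followed by split(..., ',').
parts = ",".join(row).split(",")

# Sorting text compares character by character, so "46" comes before "6".
sorted_as_text = sorted(parts)

# Converting to integers first gives the numeric order.
sorted_as_numbers = sorted(int(p) for p in parts)

print(sorted_as_text[1])     # 46
print(sorted_as_numbers[1])  # 13
```

If the columns hold strings, casting them (or the array produced by split) to a numeric type before sorting avoids this mismatch.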

Answer 2

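You can create an array column with the array function and then sort it with array_sort. Finally, use element_at to get the second element. These last two functions are available in Spark 2.4+. This assumes the columns are numeric; string columns would be sorted lexicographically.

from pyspark.sql.functions import array, array_sort, col, element_at

df.withColumn("res", element_at(array_sort(array(*[col(c) for c in df.columns])), 2))\
  .show()

#+----+----+----+----+---+
#|Col1|Col2|Col3|Col4|res|
#+----+----+----+----+---+
#|83  |32  |14  |62  |32 |
#|63  |32  |74  |55  |55 |
#|13  |88  |6   |46  |13 |
#+----+----+----+----+---+

Another way is to use the least function. First compute the minimum across all columns, then apply least a second time to only those values that are greater than that minimum, selected with a conditional expression.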
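
The two approaches can give different answers when a row contains its minimum more than once. The following plain-Python sketch (no Spark involved; both function names are hypothetical, and the last sample row is an assumption added to show a tie) models each one: taking the second element of the sorted row, and taking the smallest value strictly greater than the row minimum.

```python
from typing import Optional, Sequence


def second_by_position(values: Sequence[int]) -> Optional[int]:
    """Model of sort-then-index: the second element of the sorted row."""
    ordered = sorted(values)
    return ordered[1] if len(ordered) > 1 else None


def second_distinct(values: Sequence[int]) -> Optional[int]:
    """Model of min-then-min-again: smallest value strictly above the minimum."""
    if not values:
        return None
    lowest = min(values)
    above = [v for v in values if v > lowest]
    # When every value equals the minimum, nothing is left to pick from.
    return min(above) if above else None


# The three rows from the question plus one hypothetical row with a tie.
rows = [[83, 32, 14, 62], [63, 32, 74, 55], [13, 88, 6, 46], [5, 5, 9, 7]]
for r in rows:
    print(r, second_by_position(r), second_distinct(r))
```

On the question's sample data both functions return 32, 55, and 13. They differ only on the tied row, where the positional version returns 5 and the strictly-greater version returns 7, so the choice depends on which meaning of "second lowest" is wanted.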
