I want to explode columns in Spark Scala.
reference_month M M+1 M+2
2020-01-01 10 12 10
2020-02-01 10 12 10
The output should look like:
reference_month Month reference_date_id
2020-01-01 10 2020-01
2020-01-01 12 2020-02
2020-01-01 10 2020-03
2020-02-01 10 2020-02
2020-02-01 12 2020-03
2020-02-01 10 2020-04
where reference_date_id = reference_month + x (where x comes from M, M+1, M+2).
Is there any way to get output in this format in Spark Scala?
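Before reaching for Spark, the requested reshape can be pinned down in plain Python. This is a minimal stdlib-only sketch (the helper name `explode_months` is hypothetical, not from the question): each wide row (M, M+1, M+2) becomes three long rows, and the month offset x is just the position of the value in the list.

```python
from datetime import date

def explode_months(reference_month, values):
    """Turn one wide row (M, M+1, M+2) into three long rows.

    reference_month: a datetime.date for the first of the month
    values: the three monthly values [M, M+1, M+2]
    Each output row carries reference_date_id = reference_month + x months,
    where x is the position of the value in `values`.
    """
    rows = []
    for x, v in enumerate(values):
        # shift reference_month by x months via total-month arithmetic
        total = reference_month.year * 12 + (reference_month.month - 1) + x
        rows.append((reference_month.isoformat(), v,
                     f"{total // 12:04d}-{total % 12 + 1:02d}"))
    return rows

print(explode_months(date(2020, 1, 1), [10, 12, 10]))
# → [('2020-01-01', 10, '2020-01'), ('2020-01-01', 12, '2020-02'), ('2020-01-01', 10, '2020-03')]
```

The Spark answers below implement the same reshape on a DataFrame.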
We can create an array from M, M+1, M+2, then explode the array to get the required dataframe.
Example:
df.selectExpr("reference_month", "array(M,`M+1`,`M+2`) as arr").
  selectExpr("reference_month", "explode(arr) as Month").show()
+---------------+-----+
|reference_month|Month|
+---------------+-----+
| 202001| 10|
| 202001| 12|
| 202001| 10|
| 202002| 10|
| 202002| 12|
| 202002| 10|
+---------------+-----+
//or
import org.apache.spark.sql.functions.array

val cols = Seq("M", "M+1", "M+2")
df.withColumn("arr", array(cols.head, cols.tail: _*)).drop(cols: _*).
  selectExpr("reference_month", "explode(arr) as Month").show()
You can unpivot with Apache Spark's stack technique:
import org.apache.spark.sql.functions.expr
data.select($"reference_month", expr("stack(3,`M`,`M+1`,`M+2`) as (Month)")).show()
You can use the **stack** function:
from pyspark.sql.functions import expr
exp = expr("""stack(3,`M`,`M+1`,`M+2`) as (Values)""")
from pyspark.sql.functions import when, concat_ws, lpad, row_number, length, substring
from pyspark.sql.window import Window
w = Window().partitionBy("reference_month").orderBy("reference_month")
df.select("reference_month", exp)\
  .withColumn("reference_date_id", concat_ws('-', substring("reference_month", 1, 4),\
      when(length(row_number().over(w)) < 2, lpad(row_number().over(w), 2, '0'))\
      .otherwise(row_number().over(w)))).show()
+---------------+------+-----------------+
|reference_month|Values|reference_date_id|
+---------------+------+-----------------+
|         202022|    10|          2020-01|
|         202022|    12|          2020-02|
|         202022|    10|          2020-03|
|         202001|    10|          2020-01|
|         202001|    12|          2020-02|
|         202001|    10|          2020-03|
+---------------+------+-----------------+