循环遍历列列表并丰富数据集

问题描述 投票:0回答:1

我有一个数据集,我想迭代列列表并使用两个新列丰富数据集 -

status
message

如果任何行、任何列为空,则状态应设置为

Failure
,并且消息应包含为空的列的列表:

Col : list[String] = df.coulmns.toList

df = df.withColumn("status" ,functions.lit("Success"))
       .withColumn("message",functions.lit(" "))
col.forEach(f=>{

val SuccessDf = df.filter(col(f).isNotNull)
var failureDf = df.filter(col(f).isNull)
                 
failureDf = failureDf.withColumn("status" ,functions.lit("Failure"))
.withColumn("message", concat(col("message"),f))

df = SuccessDf.union(failureDf) 
}
)

上述代码的问题是,运行大约 40K、大约 18 列的数据集需要永远运行

示例消息:

" Column : " + columnName + " is null"
dataframe scala apache-spark bigdata
1个回答
0
投票

请检查以下解决方案。

scala> df.show(30, false) // sample data
+----------------------------+----------+------+---+----------+----------+------+-----------+
|birthDate                   |firstName |gender|id |lastName  |middleName|salary|ssn        |
+----------------------------+----------+------+---+----------+----------+------+-----------+
|1955-07-02T04:00:00.000+0000|Pennie    |null  |1  |Hirschmann|Carry     |56172 |981-43-9345|
|1992-02-08T05:00:00.000+0000|An        |Female|2  |Cowper    |Amira     |40203 |null       |
|null                        |Quyen     |Female|3  |Dome      |Marlen    |53417 |957-57-8246|
|1990-04-11T04:00:00.000+0000|Coralie   |Female|4  |null      |null      |94727 |963-39-4885|
|1980-01-16T05:00:00.000+0000|Terrie    |Female|5  |Bonar     |Wava      |79908 |964-49-8051|
|1990-11-24T05:00:00.000+0000|null      |Female|6  |null      |null      |64652 |954-59-9172|
|1970-12-19T05:00:00.000+0000|Geri      |Female|7  |Mosby     |Tambra    |38195 |968-16-4020|
|null                        |Patria    |null  |8  |null      |Nancy     |null  |null       |
|1967-11-17T05:00:00.000+0000|Terese    |Female|9  |Tocque    |Alfredia  |91294 |967-48-7309|
|null                        |Wava      |null  |10 |null      |null      |56521 |null       |
|1979-09-17T04:00:00.000+0000|Sophie    |Female|11 |Hearn     |Emerita   |90920 |977-66-4483|
|1959-01-31T05:00:00.000+0000|Jodie     |Female|12 |Laneham   |Tabetha   |90634 |923-24-9769|
|1974-02-19T04:00:00.000+0000|Marietta  |Female|13 |Yansons   |Mandi     |93162 |900-34-8083|
|1960-09-26T04:00:00.000+0000|Caridad   |Female|14 |Snelle    |Maire     |38859 |992-11-7062|
|1960-01-29T05:00:00.000+0000|Yasmine   |Female|15 |Edworthye |Meg       |76220 |922-12-9862|
|1986-12-05T05:00:00.000+0000|Chan      |Female|16 |Hartas    |Jani      |75050 |995-51-3115|
|1961-09-29T04:00:00.000+0000|Evangeline|Female|17 |Casserley |Wanetta   |62814 |926-61-3526|
|1980-02-14T05:00:00.000+0000|Elnora    |Female|18 |Lipman    |Kecia     |71350 |950-23-9739|
|1978-11-14T05:00:00.000+0000|Adelle    |Female|19 |Grigoriev |Kathyrn   |60600 |923-23-5984|
|1973-11-24T05:00:00.000+0000|Mica      |Female|20 |Challens  |Zandra    |51071 |918-66-1232|
+----------------------------+----------+------+---+----------+----------+------+-----------+
scala> val statusExpr = when(
        df.columns.map(c => col(c).isNull).reduce(_ or _), 
        lit("Failure")
    )
    .otherwise(lit(""))
scala> val messageExpr = concat_ws(
        ", ", 
        df.columns.map(c => when(col(c).isNull, lit(s"Column: ${c} isNull"))):_*
    )
scala> df
       .withColumn("status", statusExpr)
       .withColumn("message", messageExpr)
       .show(30, false)

最终输出

+----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+
|birthDate                   |firstName |gender|id |lastName  |middleName|salary|ssn        |status |message                                                                                                                |
+----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+
|1955-07-02T04:00:00.000+0000|Pennie    |null  |1  |Hirschmann|Carry     |56172 |981-43-9345|Failure|Column: gender isNull                                                                                                  |
|1992-02-08T05:00:00.000+0000|An        |Female|2  |Cowper    |Amira     |40203 |null       |Failure|Column: ssn isNull                                                                                                     |
|null                        |Quyen     |Female|3  |Dome      |Marlen    |53417 |957-57-8246|Failure|Column: birthDate isNull                                                                                               |
|1990-04-11T04:00:00.000+0000|Coralie   |Female|4  |null      |null      |94727 |963-39-4885|Failure|Column: lastName isNull, Column: middleName isNull                                                                     |
|1980-01-16T05:00:00.000+0000|Terrie    |Female|5  |Bonar     |Wava      |79908 |964-49-8051|       |                                                                                                                       |
|1990-11-24T05:00:00.000+0000|null      |Female|6  |null      |null      |64652 |954-59-9172|Failure|Column: firstName isNull, Column: lastName isNull, Column: middleName isNull                                           |
|1970-12-19T05:00:00.000+0000|Geri      |Female|7  |Mosby     |Tambra    |38195 |968-16-4020|       |                                                                                                                       |
|null                        |Patria    |null  |8  |null      |Nancy     |null  |null       |Failure|Column: birthDate isNull, Column: gender isNull, Column: lastName isNull, Column: salary isNull, Column: ssn isNull    |
|1967-11-17T05:00:00.000+0000|Terese    |Female|9  |Tocque    |Alfredia  |91294 |967-48-7309|       |                                                                                                                       |
|null                        |Wava      |null  |10 |null      |null      |56521 |null       |Failure|Column: birthDate isNull, Column: gender isNull, Column: lastName isNull, Column: middleName isNull, Column: ssn isNull|
|1979-09-17T04:00:00.000+0000|Sophie    |Female|11 |Hearn     |Emerita   |90920 |977-66-4483|       |                                                                                                                       |
|1959-01-31T05:00:00.000+0000|Jodie     |Female|12 |Laneham   |Tabetha   |90634 |923-24-9769|       |                                                                                                                       |
|1974-02-19T04:00:00.000+0000|Marietta  |Female|13 |Yansons   |Mandi     |93162 |900-34-8083|       |                                                                                                                       |
|1960-09-26T04:00:00.000+0000|Caridad   |Female|14 |Snelle    |Maire     |38859 |992-11-7062|       |                                                                                                                       |
|1960-01-29T05:00:00.000+0000|Yasmine   |Female|15 |Edworthye |Meg       |76220 |922-12-9862|       |                                                                                                                       |
|1986-12-05T05:00:00.000+0000|Chan      |Female|16 |Hartas    |Jani      |75050 |995-51-3115|       |                                                                                                                       |
|1961-09-29T04:00:00.000+0000|Evangeline|Female|17 |Casserley |Wanetta   |62814 |926-61-3526|       |                                                                                                                       |
|1980-02-14T05:00:00.000+0000|Elnora    |Female|18 |Lipman    |Kecia     |71350 |950-23-9739|       |                                                                                                                       |
|1978-11-14T05:00:00.000+0000|Adelle    |Female|19 |Grigoriev |Kathyrn   |60600 |923-23-5984|       |                                                                                                                       |
|1973-11-24T05:00:00.000+0000|Mica      |Female|20 |Challens  |Zandra    |51071 |918-66-1232|       |                                                                                                                       |
+----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+
© www.soinside.com 2019 - 2024. All rights reserved.