Remove elements from a PySpark array based on the value of another column

Question (votes: 2, answers: 1)

I want to check whether an array contains a string in PySpark (Spark < 2.4).

Sample DataFrame:

column_1 <Array>           |    column_2 <String>
--------------------------------------------
["2345","98756","8794"]    |       8794
--------------------------------------------
["8756","45678","987563"]  |       1234
--------------------------------------------
["3475","8956","45678"]    |       3475
--------------------------------------------

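I want to compare the two columns column_1 and column_2. If column_1 contains the value of column_2, that value should be removed from column_1. I wrote a UDF to subtract column_2 from column_1, but it does not work: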

    def contains(x, y):
        try:
            sx, sy = set(x), set(y)
            if len(sx) == 0:
                return sx
            elif len(sy) == 0:
                return sx
            else:
                return sx - sy
        # in exception, for example `x` or `y` is None (not a list)
        except:
            return sx
    udf_contains = udf(contains, 'string')
    new_df = my_df.withColumn('column_1', udf_contains(my_df.column_1, my_df.column_2))

Expected result:

column_1 <Array>           |    column_2 <String>
--------------------------------------------------
["2345","98756"]           |       8794
--------------------------------------------------
["8756","45678","987563"]  |       1234
--------------------------------------------------
["8956","45678"]           |       3475
--------------------------------------------------

How can I do this, given that in some cases column_1 is [] and column_2 is null? Thanks.

apache-spark pyspark apache-spark-sql
1 Answer
2 votes

Spark 2.4.0+
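
On Spark 2.4.0 and later, the built-in SQL function array_remove should be able to do this without a UDF. On older versions, as in the question, a UDF that returns an array is one option. The following is a minimal sketch of both, not a definitive implementation. The helper names remove_value, array_remove_sql, remove_with_builtin and remove_with_udf are made up for illustration, and the column names are taken from the question. The CASE WHEN guard assumes that array_remove yields null when the element to remove is null, so the array is kept unchanged in that case.

```python
def remove_value(values, value):
    """Return `values` without the entries equal to `value`.

    A None array stays None; a None value removes nothing.
    List order is preserved (a set would lose it).
    """
    if values is None:
        return None
    if value is None:
        return list(values)
    return [v for v in values if v != value]


def array_remove_sql(array_col="column_1", value_col="column_2"):
    """Build a Spark SQL expression for Spark 2.4.0+ (hypothetical helper).

    The guard keeps the array as-is when value_col is null.
    """
    return (
        f"CASE WHEN {value_col} IS NULL THEN {array_col} "
        f"ELSE array_remove({array_col}, {value_col}) END"
    )


def remove_with_builtin(df, array_col="column_1", value_col="column_2"):
    """Spark 2.4.0+: no UDF, the work is done by array_remove."""
    from pyspark.sql.functions import expr  # imported lazily

    return df.withColumn(array_col, expr(array_remove_sql(array_col, value_col)))


def remove_with_udf(df, array_col="column_1", value_col="column_2"):
    """Spark < 2.4: wrap remove_value in a UDF that returns array<string>."""
    from pyspark.sql.functions import udf  # imported lazily
    from pyspark.sql.types import ArrayType, StringType

    remove_udf = udf(remove_value, ArrayType(StringType()))
    return df.withColumn(array_col, remove_udf(df[array_col], df[value_col]))
```

With the DataFrame from the question, this would be called as new_df = remove_with_udf(my_df) on Spark < 2.4, or new_df = remove_with_builtin(my_df) on 2.4.0+. Two things differ from the UDF in the question: set(y) on a string produces a set of its individual characters rather than a one-element set, so the subtraction compares array elements against single characters, and the UDF was declared with a 'string' return type although the desired result is an array.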
