I want to check whether an array contains a string in PySpark (Spark < 2.4).
Example dataframe:
column_1 <Array> | column_2 <String>
--------------------------------------------
["2345","98756","8794"] | 8794
--------------------------------------------
["8756","45678","987563"] | 1234
--------------------------------------------
["3475","8956","45678"] | 3475
--------------------------------------------
I want to compare the two columns column_1 and column_2. If column_1 contains the value of column_2, that value should be removed from column_1. I wrote a UDF to remove column_2 from column_1, but it doesn't work:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def contains(x, y):
    # x is an array of strings, y is a single string;
    # keep every element of x that differs from y, preserving
    # order (a set difference like sx - sy would not, and
    # set(y) would split the string y into characters)
    try:
        return [e for e in x if e != y]
    # in exception, for example `x` is None (not a list)
    except TypeError:
        return x

# declare the return type as array<string>, not 'string'
udf_contains = udf(contains, ArrayType(StringType()))
new_df = my_df.withColumn('column_1', udf_contains(my_df.column_1, my_df.column_2))
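To pin down the behavior I'm after independently of Spark, here is a plain-Python sketch of the same element-removal logic (the `remove_value` name is mine, just for illustration), applied to the sample rows and to the edge cases mentioned below:

```python
def remove_value(arr, value):
    """Drop every element of arr equal to value; pass a None array through."""
    if arr is None:
        return None
    return [e for e in arr if e != value]

# the three sample rows
print(remove_value(["2345", "98756", "8794"], "8794"))    # ['2345', '98756']
print(remove_value(["8756", "45678", "987563"], "1234"))  # ['8756', '45678', '987563']
print(remove_value(["3475", "8956", "45678"], "3475"))    # ['8956', '45678']

# the edge cases: column_1 is [] / column_2 is null
print(remove_value([], None))   # []
print(remove_value(None, "1"))  # None
```

Note that comparing against a `None` value is harmless here: no string element equals `None`, so the array passes through unchanged.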
Expected result:
column_1 <Array> | column_2 <String>
--------------------------------------------------
["2345","98756"] | 8794
--------------------------------------------------
["8756","45678","987563"] | 1234
--------------------------------------------------
["8956","45678"] | 3475
--------------------------------------------------
How should I handle the cases where column_1 is [] and column_2 is null? Thanks.
(For reference: Spark 2.4.0+ has a built-in `array_remove` function for exactly this, but I'm stuck on an older version.)