Pyspark -- 过滤包含空值的 ArrayType 行

Question

我是 PySpark 的初学者。假设我有一个像这样的 Spark 数据框：

test_df = spark.createDataFrame(pd.DataFrame({"a":[[1,2,3], [None,2,3], [None, None, None]]}))

现在我希望过滤数组不包含

None

值的行（在我的例子中只保留第一行）。

我尝试过使用：

test_df.filter(array_contains(test_df.a, None))

但是它不起作用并抛出错误：

AnalysisException：“无法解析 'array_contains(
a
, NULL)'，因为数据类型不匹配：空类型值不能用作论据；； '过滤 array_contains(a#166, null) +- 逻辑RDD [a#166]，错误

如何正确过滤？非常感谢！

Answer 1

您可以使用

exists

功能：

test_df.filter("!exists(a, x -> x is null)").show() 

#+---------+
#|        a|
#+---------+
#|[1, 2, 3]|
#+---------+

Answer 2

您需要使用

forall

功能。

df = test_df.filter(F.expr('forall(a, x -> x is not null)'))
df.show(truncate=False)

Answer 3

您可以使用

aggregate

高阶函数来计算空值的数量并过滤计数 = 0 的行。这将使您能够删除数组中至少具有 1

None

的所有行。

data_ls = [
    (1, ["A", "B"]),
    (2, [None, "D"]),
    (3, [None, None])
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['a', 'b'])

data_sdf.show()

+---+------+
|  a|     b|
+---+------+
|  1|[A, B]|
|  2| [, D]|
|  3|   [,]|
+---+------+

# count the number of nulls within an array
data_sdf. \
    withColumn('c', func.expr('aggregate(b, 0, (x, y) -> x + int(y is null))')). \
    show()

+---+------+---+
|  a|     b|  c|
+---+------+---+
|  1|[A, B]|  0|
|  2| [, D]|  1|
|  3|   [,]|  2|
+---+------+---+

创建列后，您可以将过滤器应用为

filter(func.col('c')==0)

。

Answer 4

您可以使用array_compact并且它可用

spark 3.4+ version

df.withColumn("a", array_compact("a"))

Pyspark -- 过滤包含空值的 ArrayType 行

问题描述投票：0回答：4

4个回答

最新问题

Pyspark -- 过滤包含空值的 ArrayType 行

问题描述 投票：0回答：4

4个回答

最新问题

问题描述投票：0回答：4