Current dataframe:
+-----------------+--------------------+
|__index_level_0__| Text_obj_col|
+-----------------+--------------------+
| 1| [ ,entrepreneurs]|
| 2|[eat, , human, poop]|
| 3| [Manafort, case]|
| 4| [Sunar, Khatris, ]|
| 5|[become, arrogant, ]|
| 6| [GPS, get, name, ]|
| 7|[exactly, reality, ]|
+-----------------+--------------------+
I want to remove the empty strings from those lists. This is test data; the real data is quite large. How can I do this in PySpark?
You can use a udf to accomplish this:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def filter_empty(l):
    # Use a list comprehension rather than filter(): in Python 3,
    # filter() returns an iterator, not the list the UDF must return
    return [x for x in l if x is not None and len(x) > 0]

filter_empty_udf = udf(filter_empty, ArrayType(StringType()))

df.select(filter_empty_udf("Text_obj_col").alias("Text_obj_col")).show(10, False)
Tested on a couple of the sample rows:
+------------------+
|Text_obj_col |
+------------------+
|[entrepreneurs] |
|[eat, human, poop]|
+------------------+