How can I optimize PySpark code by replacing a for loop?

Question

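I have to apply the function below to every column of a DataFrame. Using a for loop hurts Spark performance, because each iteration triggers its own count job. How can I avoid the loop while keeping the same logic and output? The function takes a DataFrame and returns a DataFrame containing each column name and the required statistics for that column.

Here is the function: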

# Note: this import shadows Python's built-in round with Spark's column function.
from pyspark.sql.functions import col, round

def get_null_count_and_percentage(df):
    # Assumes an active SparkSession is available as `spark`.
    columnList = df.columns
    total_count = df.count()
    null_counts = []
    for column_to_check in columnList:
        # One Spark job per column: this is the part that scales badly.
        null_count = df.filter(col(column_to_check).isNull()).count()
        null_percentage = (null_count / total_count) * 100
        null_counts.append((column_to_check, null_count, null_percentage))
    result_df_count = (
        spark.createDataFrame(null_counts, ["column_name", "null_counts", "null_percentage"])
        .withColumn("null_percentage", round(col("null_percentage"), 3))
    )
    return result_df_count

I searched but could not find an exact solution to my problem. I also tried map, reduce, and similar approaches, but none of them solved it.

Answer 1

The code below gets the null count for every column in a single select:

from pyspark.sql.functions import when, count, col
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
