Pyspark 性能改进

Question

我在 PySpark df 中使用以下代码：

for col in df.columns:
    df = df.withColumn(col, F.rank()over(Window.orderBy((col))))

由于我的 df 有 2k 列，因此时间效率非常低。我怎样才能改进它以减少时间？尝试了一些 UDF 但没有成功。

Answer 1

请注意，尝试不带

partitionBy

子句的窗口函数会导致大数据性能下降。尝试对数据进行建模，以便在某些列上可以应用分区。

据说 PySpark 会抛出此警告来表明问题：

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

但是，作为

for

循环的替代方案，您可以使用

selectExpr

方法，该方法可以使用单遍，并且还利用 Catalyst 优化器来使其运行速度更快。

# Create a list of SQL expressions that apply the rank function to each column
exprs = [f"rank() OVER (ORDER BY {col}) as {col}" for col in df.columns]

# Apply all expressions at once using selectExpr
df = df.selectExpr(*exprs)

或

# Create a list of expressions that apply the rank function to each column
exprs = [F.rank().over(Window.orderBy(col)).alias(col) for col in df.columns]

# Apply all expressions at once using selectExpr
df = df.select(*exprs)

看看是否有可能在任何列上应用

PARTITION BY

或

partitionBy

来提高大型表的性能。

Pyspark 性能改进

问题描述投票：0回答：1

1个回答

最新问题

Pyspark 性能改进

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1