How do I sort PySpark DataFrame rows in the order given by a list?


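I have a PySpark DataFrame with several columns, and a list containing the values of one of those columns. I want to sort the rows so that they follow the order of the given list.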

col_A col_B col_c
a1 b1 c1
a2 b2 c2
a3 b3 c3

col_A_itm_order = ['a2', 'a3', 'a1']

Expected output

col_A col_B col_c
a2 b2 c2
a3 b3 c3
a1 b1 c1

I found similar questions for Pandas DataFrames, but none for PySpark.

python dataframe apache-spark pyspark apache-spark-sql
1 Answer
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName("Sortdf").getOrCreate()

# DataFrame
data = [("a1", "b1", "c1"),
        ("a2", "b2", "c2"),
        ("a3", "b3", "c3")]

df = spark.createDataFrame(data, ["col_A", "col_B", "col_C"])

def sortdf(cl, order):

    # Map each value in `order` to its position in the list.
    # when() without otherwise() yields NULL when the value does not match.
    positions = [F.when(F.col(cl) == v, F.lit(i)) for i, v in enumerate(order)]

    # coalesce() picks the matching position; values not in `order` sort last
    sort_key = F.coalesce(*positions, F.lit(len(order)))

    # Sort by list position rather than by the column's own values
    res = df.orderBy(sort_key)

    return res

# Example
r = sortdf("col_A", ["a2", "a3", "a1"])
r.show()