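I have a PySpark DataFrame with several columns, and a list containing the values of one of those columns. I want to sort the rows in the order given by the list.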
col_A | col_B | col_c |
---|---|---|
a1 | b1 | c1 |
a2 | b2 | c2 |
a3 | b3 | c3 |
col_A_itm_order = ['a2', 'a3', 'a1']
Expected output:
col_A | col_B | col_c |
---|---|---|
a2 | b2 | c2 |
a3 | b3 | c3 |
a1 | b1 | c1 |
I found similar questions for Pandas DataFrames, but none for PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName("Sortdf").getOrCreate()

# DataFrame
data = [("a1", "b1", "c1"),
        ("a2", "b2", "c2"),
        ("a3", "b3", "c3")]
df = spark.createDataFrame(data, ["col_A", "col_B", "col_C"])

def sortdf(df, cl, order):
    # Map each value of `cl` to its position in `order`.
    # A `when` without `otherwise` yields NULL when it does not match,
    # so `coalesce` returns the index of the matching list item.
    rank = F.coalesce(*[F.when(F.col(cl) == v, F.lit(i))
                        for i, v in enumerate(order)])
    # Sort by that position; values missing from `order` (NULL rank) go last
    return df.orderBy(rank.asc_nulls_last())

# Example
r = sortdf(df, "col_A", ["a2", "a3", "a1"])
r.show()
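To check the ordering logic without a Spark session, the same idea can be sketched in plain Python: turn the list into a value-to-position lookup and sort on that position. This is only an illustration of the principle, not PySpark code. The extra row `a9` and the helper name `sort_by_list` are made up for the example, to show that values missing from the list end up last.

```python
# Plain-Python sketch of the ranking idea; no Spark session needed.
# The "a9" row is a hypothetical extra row that is not in the order list.
rows = [("a1", "b1", "c1"), ("a2", "b2", "c2"),
        ("a3", "b3", "c3"), ("a9", "b9", "c9")]
order = ["a2", "a3", "a1"]

# Position of each listed value: {'a2': 0, 'a3': 1, 'a1': 2}
rank = {value: i for i, value in enumerate(order)}

def sort_by_list(rows, rank, key_index=0):
    # Unlisted values get a rank past the end of the list, so they sort last.
    return sorted(rows, key=lambda r: rank.get(r[key_index], len(rank)))

sorted_rows = sort_by_list(rows, rank)
print(sorted_rows)
```

For a very long order list, a chain of conditions in Spark can get unwieldy; in that case it may be worth putting the list and its positions into a small DataFrame and joining on the column instead.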