How do I iterate over multiple columns of a dataframe in pyspark?


So, suppose I have a dataframe df with only a single column, where df.show() gives rows like | a,b,c,d,.... |. I would like a df1 where df1.show() gives | a | b | c | ..... In short, I want to split a single-column dataframe into a dataframe with many columns. So, I came up with:

import pyspark.sql.functions

# Split the single string column into an array, then pull the items out one by one.
split_col = pyspark.sql.functions.split(df['x'], ' ')
df = df.withColumn('0', split_col.getItem(0))
df = df.withColumn('1', split_col.getItem(1))

, and so on. But if there is now a large number of columns, how do I iterate df = df.withColumn('i', split_col.getItem(i))? I tried the Pythonic version:

for i in range(100):
    df = df.withColumn(str(i), split_col.getItem(i))

but it didn't work. Is there any way to do this in pyspark? Thanks.
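
For reference, a minimal DataFrame matching this setup can be built as follows. This is just a hypothetical reproduction: the column name x and the space-separated values are assumptions taken from the split(df['x'], ' ') call above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One string column "x" holding space-separated tokens.
df = spark.createDataFrame([("a b c d",), ("e f g h",)], ["x"])
df.show()
# +-------+
# |      x|
# +-------+
# |a b c d|
# |e f g h|
# +-------+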

python dataframe pyspark
1 Answer

You can iterate and set the column names inside a single select clause, as shown below.

With the withColumn loop from the question, every pass of the loop hits split again, so it is less efficient.

from pyspark.sql import functions as F

df.select(*[(F.split("x", ' ')[i]).alias(str(i)) for i in range(100)]).explain()

# == Physical Plan ==
# *(1) Project [split(x#200, )[0] AS 0#1708, split(x#200, )[1] AS 1#1709,
#   split(x#200, )[2] AS 2#1710, split(x#200, )[3] AS 3#1711,
#   split(x#200, )[4] AS 4#1712, split(x#200, )[5] AS 5#1713,
#   split(x#200, )[6] AS 6#1714, split(x#200, )[7] AS 7#1715,
#   split(x#200, )[8] AS 8#1716, split(x#200, )[9] AS 9#1717,
#   split(x#200, )[10] AS 10#1718, split(x#200, )[11] AS 11#1719,
#   split(x#200, )[12] AS 12#1720, split(x#200, )[13] AS 13#1721,
#   split(x#200, )[14] AS 14#1722, split(x#200, )[15] AS 15#1723,
#   split(x#200, )[16] AS 16#1724, split(x#200, )[17] AS 17#1725,
#   split(x#200, )[18] AS 18#1726, split(x#200, )[19] AS 19#1727,
#   split(x#200, )[20] AS 20#1728, split(x#200, )[21] AS 21#1729,
#   split(x#200, )[22] AS 22#1730, split(x#200, )[23] AS 23#1731,
#   ... 76 more fields]
# +- *(1) Scan ExistingRDD[x#200]
Instead, this splits the column once and only lets Spark project the items: one split operation as opposed to many.
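
A minimal sketch of that idea: build the split expression once and reuse it inside a single select, so the whole thing compiles to one projection. The column name x and the count of 100 items are carried over from the question as assumptions.

from pyspark.sql import functions as F

# Split once, then project every item out of the resulting array column.
split_col = F.split("x", ' ')
df1 = df.select(*[split_col.getItem(i).alias(str(i)) for i in range(100)])
df1.show()

Unlike 100 chained withColumn calls, this adds a single Project node over the scan.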