So suppose I have a DataFrame df with only one column, where df.show() gives

| a,b,c,d,.... |
| a,b,c,d,.... |

I want to get a df1 where df1.show() gives

| a | b | c | ... |

In short, I want to split a single-column DataFrame into a multi-column one. So I thought of:

split_col = pyspark.sql.functions.split(df['x'], ' ')
df = df.withColumn('0', split_col.getItem(0))
df = df.withColumn('1', split_col.getItem(1))

and so on. But if there is now a large number of columns, how do I loop over df = df.withColumn(str(i), split_col.getItem(i))? I tried the Pythonic version

for i in range(100):
    df = df.withColumn(str(i), split_col.getItem(i))

but it doesn't work. Is there any way to do this in pyspark? Thanks.
You can iterate and set the names in a select clause, as shown below. However, in this approach every projected column hits split again, so it is less efficient:
from pyspark.sql import functions as F
df.select(*[(F.split("x",' ')[i]).alias(str(i)) for i in range(100)]).explain()
#== Physical Plan ==
#*(1) Project [split(x#200, )[0] AS 0#1708, split(x#200, )[1]
# AS 1#1709, split(x#200, )[2] AS 2#1710, split(x#200, )[3] AS
# 3#1711, split(x#200, )[4] AS 4#1712, split(x#200, )[5] AS
# 5#1713, split(x#200, )[6] AS 6#1714, split(x#200, )[7] AS
# 7#1715, split(x#200, )[8] AS 8#1716, split(x#200, )[9] AS
# 9#1717, split(x#200, )[10] AS 10#1718, split(x#200, )[11] AS
# 11#1719, split(x#200, )[12] AS 12#1720, split(x#200, )[13] AS
# 13#1721, split(x#200, )[14] AS 14#1722, split(x#200, )[15] AS
# 15#1723, split(x#200, )[16] AS 16#1724, split(x#200, )[17] AS
# 17#1725, split(x#200, )[18] AS 18#1726, split(x#200, )[19] AS
# 19#1727, split(x#200, )[20] AS 20#1728, split(x#200, )[21] AS
# 21#1729, split(x#200, )[22] AS 22#1730, split(x#200, )[23] AS
# 23#1731, ... 76 more fields]
#+- *(1) Scan ExistingRDD[x#200]
Instead, you could split it once and only let Spark project the resulting array elements: one split operation as opposed to many.