How to use "select" and "withColumn" together - Pyspark

Question · votes: 0 · answers: 1

I have two dataframes, df1 and df2.

I need to join the two dataframes with an inner join on df1.col1 == df2.col1 and create a new dataframe from the result. My question is: can I use both "select" and "withColumn" in the same statement?

For example:

df3 = (df1.join(df2, df1.col1 == df2.col1, 'inner')
          .select(df1.col4, df2.col4)
          .withColumn("col2", (df1.col1 + df2.col2))
          .withColumn("col3", (df1.col1 / df2.col2)))

How can I achieve this, i.e., use select and withColumn together?

[Dataframe_example (image)]

apache-spark select pyspark pyspark-sql pyspark-dataframes
1 Answer

0 votes

You need to include all the required columns in .select, and then use only those columns in .withColumn.

Example:

df1 = spark.createDataFrame([("a","1","4","t"),("b","2","5","v"),("c","3","6","v")],
                            ["col1","col2","col3","col4"])
df2 = spark.createDataFrame([("a","1","4","ord2"),("b","2","5","ord1"),("c","3","6","ord3")],
                            ["col1","col2","col3","col4"])

df1.join(df2, df1.col1 == df2.col1, 'inner') \
   .select(df1.col1, df2.col2, df1.col3, df1.col2, df2.col4) \
   .withColumn("col3", (df1.col3 / df2.col2).cast("double")) \
   .withColumn("col2", (df1.col2 + df2.col2).cast("int")) \
   .show()
#+----+----+----+----+----+
#|col1|col2|col3|col2|col4|
#+----+----+----+----+----+
#|   a|   2| 4.0|   2|ord2|
#|   b|   4| 2.5|   4|ord1|
#|   c|   6| 2.0|   6|ord3|
#+----+----+----+----+----+
