I have two dataframes, df1 and df2.
I need to join them on df1.col1 = df2.col1 (inner join) and create a new dataframe. My question: can I use both select and withColumn in the same statement?
For example:
df3 = df1.join(df2, df1.col1 == df2.col1, 'inner') \
         .select(df1.col4, df2.col4) \
         .withColumn("col2", (df1.col1 + df2.col2)) \
         .withColumn("col3", (df1.col1 / df2.col2))
How can I achieve this, combining select and withColumn?
You need to select all the required columns in .select, and then use only those columns in .withColumn.

Example:
df1=spark.createDataFrame([("a","1","4","t"),("b","2","5","v"),("c","3","6","v")],["col1","col2","col3","col4"])
df2=spark.createDataFrame([("a","1","4","ord2"),("b","2","5","ord1"),("c","3","6","ord3")],["col1","col2","col3","col4"])
df1.join(df2, df1.col1 == df2.col1, 'inner') \
   .select(df1.col1, df2.col2, df1.col3, df1.col2, df2.col4) \
   .withColumn("col3", (df1.col3 / df2.col2).cast("double")) \
   .withColumn("col2", (df1.col2 + df2.col2).cast("int")) \
   .show()
#+----+----+----+----+----+
#|col1|col2|col3|col2|col4|
#+----+----+----+----+----+
#| a| 2| 4.0| 2|ord2|
#| b| 4| 2.5| 4|ord1|
#| c| 6| 2.0| 6|ord3|
#+----+----+----+----+----+