I have this super simple DataFrame:
rc1.show(5)
rc1.printSchema()
+--------+-----------+
| ID|Case number|
+--------+-----------+
|11034701| JA366925|
|11227287| JB147188|
|11227583| JB147595|
|11227293| JB147230|
|11227634| JB147599|
+--------+-----------+
only showing top 5 rows
root
|-- ID: string (nullable = true)
|-- Case number: string (nullable = true)
I want to add a new column that is just the "Case number" column concatenated with "aaa", so this is what I'm using:
rc2 = rc1.withColumn("Case numberxx", col("Case number") + "aaa")
rc2.show(5)
But for the life of me, I can't understand why my new column is full of nulls:
+--------+-----------+-------------+
| ID|Case number|Case numberxx|
+--------+-----------+-------------+
|11034701| JA366925| null|
|11227287| JB147188| null|
|11227583| JB147595| null|
|11227293| JB147230| null|
|11227634| JB147599| null|
+--------+-----------+-------------+
only showing top 5 rows
Why is this happening? Thanks!
OK, this works:
from pyspark.sql.functions import col, concat, lit
rc2 = rc1.withColumn("Case numberxx", concat(col("Case number"), lit("aaa")))
rc2.show(5)
+--------+-----------+-------------+
| ID|Case number|Case numberxx|
+--------+-----------+-------------+
|11034701| JA366925| JA366925aaa|
|11227287| JB147188| JB147188aaa|
|11227583| JB147595| JB147595aaa|
|11227293| JB147230| JB147230aaa|
|11227634| JB147599| JB147599aaa|
+--------+-----------+-------------+
However, I still don't quite understand why this gives null:
col("Case number") + lit("aaa")
while this works fine:
concat(col("Case number"), lit("aaa"))
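A likely explanation: on a Spark Column, + is arithmetic addition, not string concatenation. Spark casts both string operands to a number, and when ANSI mode is off, text that does not parse as a number becomes null, so the sum is null as well. The sketch below is a plain-Python model of that behavior, not Spark itself. The helper names cast_to_double, numeric_add and concat_strings are made up for illustration, and the sketch assumes the lenient (non-ANSI) cast. With ANSI mode on, the same expression would be expected to raise an error instead of returning null.

```python
from typing import Optional


def cast_to_double(value: Optional[str]) -> Optional[float]:
    """Hypothetical model of a lenient string-to-double cast:
    text that does not parse as a number becomes None (null)."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def numeric_add(left: Optional[str], right: Optional[str]) -> Optional[float]:
    """Model of `+` on two string columns: cast both sides, then add.
    If either side is null, the result is null."""
    a = cast_to_double(left)
    b = cast_to_double(right)
    if a is None or b is None:
        return None
    return a + b


def concat_strings(*parts: Optional[str]) -> Optional[str]:
    """Model of concat(): joins strings, null if any input is null."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


# "JA366925" and "aaa" are not numbers, so the modeled sum is null.
print(numeric_add("JA366925", "aaa"))
# Numeric-looking strings do add up.
print(numeric_add("1", "2"))
# Concatenation never goes through a numeric cast.
print(concat_strings("JA366925", "aaa"))
```

In this model the first call prints None, which mirrors the null column in the question, and the last call prints JA366925aaa, which mirrors the concat result.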