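I'm using Spark from Python to create a DataFrame from an XML file. What I want to do is turn the values in each row into new columns and create dummy variables.
Here is an example.
Input: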
id | classes |
-----+--------------------------+
132 | economics,engineering |
201 | engineering |
123 | sociology,philosophy |
222 | philosophy |
--------------------------------
Output:
id | economics | engineering | sociology | philosophy
-----+-----------+-------------+-----------+-----------
132 | 1 | 1 | 0 | 0
201 | 0 | 1 | 0 | 0
123 | 0 | 0 | 1 | 1
222 | 0 | 0 | 0 | 1
--------------------------------------------------------
Explode the column into multiple rows (ref: Explode in PySpark), then group by id and pivot on the exploded values.
import pyspark.sql.functions as F
df = spark.createDataFrame([(132, "economics,engineering"),(201, "engineering"),(123, "sociology,philosophy"),(222, "philosophy")], ["id", "classes"])
df.show()
+---+--------------------+
| id|             classes|
+---+--------------------+
|132|economics,enginee...|
|201|         engineering|
|123|sociology,philosophy|
|222|          philosophy|
+---+--------------------+
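# split the comma-separated string into an array, then emit one row per element
explodeCol = df.select(F.col("id"), F.explode(F.split(F.col("classes"), ",")).alias("branch"))
explodeCol.show()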
+---+-----------+
| id|     branch|
+---+-----------+
|132|  economics|
|132|engineering|
|201|engineering|
|123|  sociology|
|123| philosophy|
|222| philosophy|
+---+-----------+
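For intuition, the split-and-explode step above corresponds roughly to the following plain Python. This is only a sketch of the semantics on a local list, not how Spark executes it; the names `rows` and `exploded` are illustrative and not part of the answer.

```python
# Same sample data as the DataFrame above, held in a local list.
rows = [
    (132, "economics,engineering"),
    (201, "engineering"),
    (123, "sociology,philosophy"),
    (222, "philosophy"),
]

# One output pair per (id, class name): split on commas, then flatten.
exploded = [
    (row_id, branch)
    for row_id, classes in rows
    for branch in classes.split(",")
]

for pair in exploded:
    print(pair)
```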
# one column per distinct branch; ids without a given branch get null, which is filled with 0
explodeCol.groupBy("id").pivot("branch").agg(F.sum(F.lit(1))).na.fill(0).show()
+---+---------+-----------+----------+---------+
| id|economics|engineering|philosophy|sociology|
+---+---------+-----------+----------+---------+
|222|        0|          0|         1|        0|
|201|        0|          1|         0|        0|
|132|        1|          1|         0|        0|
|123|        0|          0|         1|        1|
+---+---------+-----------+----------+---------+
For more detailed Spark documentation, see http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html