My data looks like this:
{
"Id": "01d3050e",
"Properties": "{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",
"LastUpdated": 1581530000000,
"LastUpdatedBy": "System"
}
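Using AWS Glue, I want to relationalize the "Properties" column, but because its data type is string, that cannot be done. Converting it to a struct might make it possible, based on what I read in this blog: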
>>> df.show
<bound method DataFrame.show of DataFrame[Id: string, LastUpdated: bigint, LastUpdatedBy: string, Properties: string]>
>>> df.show()
+--------+-------------+-------------+--------------------+
| Id| LastUpdated|LastUpdatedBy| Properties|
+--------+-------------+-------------+--------------------+
|01d3050e|1581530000000| System|{"choices":null,"...|
+--------+-------------+-------------+--------------------+
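How can I use the Relationalize transformer, or any UDF in PySpark, to unnest the "Properties" column into separate "choices", "object", "database", and "timestamp" columns?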
from pyspark.sql import functions as F
# Sample row from the question; named `data` so the built-in `list` is not shadowed
data=[["01d3050e","{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",1581530000000,"System"]]
df=spark.createDataFrame(data, ['Id','Properties','LastUpdated','LastUpdatedBy'])
df.show(truncate=False)
+--------+----------------------------------------------------------------------------+-------------+-------------+
|Id |Properties |LastUpdated |LastUpdatedBy|
+--------+----------------------------------------------------------------------------+-------------+-------------+
|01d3050e|{"choices":null,"object":"demo","database":"pg","timestamp":"1581534117303"}|1581530000000|System |
+--------+----------------------------------------------------------------------------+-------------+-------------+
There is no need for a UDF. The built-in functions are sufficient, and they are well optimized for big data workloads.
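# Strip the braces, turn ':' into ',', drop the double quotes, then split on ','.
# The array alternates key, value, key, value, so the values sit at the even positions (element_at is 1-based).
# Caveats: this assumes no key or value contains ',' or ':', and "choices" comes back as the string "null", not a SQL NULL.
df.withColumn("Properties", F.split(F.regexp_replace(F.regexp_replace(F.regexp_replace("Properties", r'[{}]', ""), ':', ','), '"', ""), ','))\
.withColumn("choices", F.element_at("Properties",2))\
.withColumn("object", F.element_at("Properties",4))\
.withColumn("database",F.element_at("Properties",6))\
.withColumn("timestamp",F.element_at("Properties",8)).drop("Properties").show()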
+--------+-------------+-------------+-------+------+--------+-------------+
| Id| LastUpdated|LastUpdatedBy|choices|object|database| timestamp|
+--------+-------------+-------------+-------+------+--------+-------------+
|01d3050e|1581530000000| System| null| demo| pg|1581534117303|
+--------+-------------+-------------+-------+------+--------+-------------+