Remove the id field from a nested array of JSON objects with key-value pairs (empval) using PySpark
Input
+----------+--------+----------------------------------------------------------------------------------------------------------+
| empno | empcode| empval |
+----------+--------+----------------------------------------------------------------------------------------------------------+
| employee1| 100DRE | [{"id": "123", "key1": "value1", "key2": "value2"}, {"id": "234", "key1": "te", "key2": "value2"}, {"id": "345", "key1": "grtregert", "key2": "value2"}] |
+----------+--------+----------------------------------------------------------------------------------------------------------+
Expected output
+----------+--------+---------------------------------------------------------------------------------------------------------------------+
| empno | empcode| newColumn |
+----------+--------+---------------------------------------------------------------------------------------------------------------------+
| employee1| 100DRE | [{"key1": "value1", "key2": "value2"}, {"key1": "te", "key2": "value2"}, {"key1": "grtregert", "key2": "value2"}]|
+----------+--------+---------------------------------------------------------------------------------------------------------------------+
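Simple: use the from_json function and pass it the schema you want, array<struct<key1: string, key2: string>>. Because that schema does not list id, the field is dropped when empval is parsed, and to_json then serializes the result back into a JSON string.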
df.show(truncate=False)
+---------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|empno |empcode|empval |
+---------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|employee1|100DRE |[{"id": "123", "key1": "value1", "key2": "value2"}, {"id": "234", "key1": "te", "key2": "value2"}, {"id": "345", "key1": "grtregert", "key2": "value2"}]|
+---------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
# Parse empval with a schema that omits id, then serialize it back to JSON.
(
    df
    .selectExpr(
        "empno",
        "empcode",
        "to_json(from_json(empval, 'array<struct<key1: string, key2: string>>')) AS newColumn"
    )
    .show(truncate=False)
)
+---------+-------+------------------------------------------------------------------------------------------------------+
|empno |empcode|newColumn |
+---------+-------+------------------------------------------------------------------------------------------------------+
|employee1|100DRE |[{"key1":"value1","key2":"value2"},{"key1":"te","key2":"value2"},{"key1":"grtregert","key2":"value2"}]|
+---------+-------+------------------------------------------------------------------------------------------------------+