从看起来像 JSON 的字符串列创建新列

问题描述 投票:0回答:1

我得到了一个有1列的表,它是字符串类型,但内部看起来像json类型。

值看起来像这样

'old_id' 具有值的列

[{"name":"Entitas Penugasan","id":"6415","value":"HIJRA"},
 {"name":"Function","id":"10594","value":"People & Culture"},
 {"name":"Unit","id":"10595","value":"Organization Development"},
 {"name":"Tribe","id":"10602","value":"Shared Service"}
]

'new_id' 具有值的列

[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
 {"name":"Function","id":"10594","value":"Finance"},
 {"name":"Unit","id":"10595","value":"Finance Operations"},
 {"name":"Tribe","id":"10602","value":"Commercial"}
]

我需要 sql athena 查询来从这些 json 列中创建列 old_name、old_id、old_value、new_name、new_id、new_value

我尝试过使用

REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value

但它只生成 1 行,即使在列中,它显示 4 个“数组”

查询应生成 4 行,如下所示

id 旧名 新名称 旧_id 新_id 旧值 新值
A Entitas Penugasan Entitas Penugasan 6415 6415 海吉拉 AFS
A 功能 功能 10594 10594 人文与文化 金融
A 单位 单位 10595 10595 组织发展 财务运营
A 部落 部落 10602 10602 共享服务 商业

有没有办法在 SQL Athena 中做到这一点?


编辑:我在下面的查询中取得了一些进展

with raw_data as(
select id, user_id, old_custom_fields, new_custom_fields
    from my_table
    where 
    -- new_custom_fields <> '' and new_custom_fields<> 'None' and new_custom_fields is not null and
    id in (A)
),
splitted_data as (
    SELECT id, user_id,
    split(old_custom_fields, '},{') AS old_custom_field_id,
    split(new_custom_fields, '},{') AS new_custom_field_id
  FROM my_table
),
old_custom_field_id_unnest as (
SELECT
  *
 from splitted_data
CROSS JOIN UNNEST(old_custom_field_id) AS t (_old_custom_fields)
),
new_custom_field_id_unnest as (
SELECT
  *
 from splitted_data
CROSS JOIN UNNEST(new_custom_field_id) AS t (_new_custom_fields)
),
old_custom_field_cleaned as (
    select id, old_custom_field_id,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value
    from old_custom_field_id_unnest
),
new_custom_field_cleaned as (
    select id, new_custom_field_id,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
    from new_custom_field_id_unnest
)
select oc.id, old_name, new_name,
    old_id,new_id,
    old_value,new_value
    from old_custom_field_cleaned oc 
    join new_custom_field_cleaned nc on oc.id = nc.id

但这会导致重复的行,现在由于连接我有 16 行,仍然需要帮助来删除不需要的行

sql arrays json amazon-web-services amazon-athena
1个回答
0
投票

您的数据不只是看起来像 JSON,示例数据也是 JSON,因此请将其作为一个数据来处理。根据所使用的 Presto/Trino 版本,实际处理可能会有所不同,但共同的部分始终是 - 解析 JSON 并将其转换为某种类型的

array
,然后取消嵌套。例如,您可以使用
ROW(name varchar, id varchar, value varchar)
(替代方案可以是
JSON
MAP(varchar, varchar)
MAP(varchar, JSON)
,具体取决于情况):

-- sample data
WITH dataset(old_id, new_id) AS (
    VALUES
        ('[{"id":"6415","value":"HIJRA", "name":"Entitas Penugasan"},
             {"name":"Function","id":"10594","value":"People & Culture"},
             {"name":"Unit","id":"10595","value":"Organization Development"},
             {"name":"Tribe","id":"10602","value":"Shared Service"}
          ]',
          '[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
             {"name":"Function","id":"10594","value":"Finance"},
             {"name":"Unit","id":"10595","value":"Finance Operations"},
             {"name":"Tribe","id":"10602","value":"Commercial"}
            ]'
         )
)

-- query
SELECT t.*
FROM dataset,
unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar))),
    cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(old_name, old_id, old_value, new_name, new_id, new_value); -- maybe as t(old, new) depending on engine and select t.old.name as old_name, ...

输出:

旧名 旧_id 旧值 新名称 新_id 新值
Entitas Penugasan 6415 海吉拉 Entitas Penugasan 6415 海吉拉
功能 10594 人文与文化 功能 10594 人文与文化
单位 10595 组织发展 单位 10595 组织发展
部落 10602 共享服务 部落 10602 共享服务

上面的代码来自假设数组具有正确顺序的“相同”数据。就我个人而言,我会考虑根据 id 加入:

-- sample data
WITH dataset(old_id, new_id) AS (
   -- ..
),

-- query 
old_values as (
    SELECT t.*
    FROM dataset,
    unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
),
new_values as (
    SELECT t.*
    FROM dataset,
    unnest(cast(json_parse(new_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
)

SELECT o.id,
       o.name old_name,
       o.value old_value,
       n.name new_name,
       n.value new_value
FROM old_values as o
full outer join new_values as n on o.id = n.id;

输出:

id 旧名 旧值 新名称 新值
6415 Entitas Penugasan 海吉拉 Entitas Penugasan AFS
10594 功能 人文与文化 功能 金融
10595 单位 组织发展 单位 财务运营
10602 部落 共享服务 部落 商业
© www.soinside.com 2019 - 2024. All rights reserved.