Convert a JSON string column with a dynamic schema to Array<String> in SparkSQL


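We have a column of type StringType that holds JSON data with a dynamic schema. We want to convert each string into an array of JSON objects. We can't use array(), because it just puts the whole string at index 0 of the array. from_json also needs a fixed schema, so it doesn't seem to fit this case. Does anyone know how we can do this? Here is a sample JSON string. Keep in mind that this is the string from just one row, and we have many rows like it: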

[
    {
        "name": "additional_product_information",
        "value":
        [
            {
                "valueType": "str",
                "text": "Faux-fur booties. Memory foam insoles. Soft and comfy, like walking on air. ...",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[2]/div[1]/section[2]/article[1]/section[1]/div[6]/div[2]",
                    "attributes":
                    []
                }
            },
            {
                "valueType": "str",
                "text": "Faux-fur booties. Memory foam insoles. Soft and comfy, like walking on air.",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[2]/div[1]/section[2]/article[1]/section[1]/div[6]/div[3]",
                    "attributes":
                    []
                }
            },
            {
                "valueType": "str",
                "text": "100% Polyester.",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[2]/div[1]/section[2]/article[1]/section[1]/div[6]/div[4]",
                    "attributes":
                    []
                }
            }
        ],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "brand_name",
        "value":
        [
            {
                "valueType": "str",
                "text": "White Stuff",
                "element":
                {
                    "location": "/html[0]",
                    "attributes":
                    []
                }
            }
        ],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "bread_crumb1",
        "value":
        [
            {
                "valueType": "str",
                "text": "Womenswear",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[2]/div[1]/section[2]/article[1]",
                    "attributes":
                    []
                }
            }
        ],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "bread_crumb2",
        "value":
        [
            {
                "valueType": "str",
                "text": "Slippers",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[2]/div[1]/section[2]/article[1]",
                    "attributes":
                    []
                }
            }
        ],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "bread_crumb3",
        "value":
        [],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "color_name",
        "value":
        [],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "customer_star_ratings",
        "value":
        [],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "has_retail_offer",
        "value":
        [
            {
                "valueType": "str",
                "text": "D"
            }
        ],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "ASSISTED_SCRAPE"
    },
    {
        "name": "image_url",
        "value":
        [
            {
                "valueType": "str",
                "text": "https://xcdn.next.co.uk/COMMON/Items/Default/Default/ItemImages/AltItemZoom/M71395s.jpg",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[2]/div[1]/section[1]/section[1]/div[1]/div[3]/div[1]/div[2]/ul[1]/li[1]",
                    "attributes":
                    []
                }
            },
            {
                "valueType": "str",
                "text": "https://xcdn.next.co.uk/COMMON/Items/Default/Default/ItemImages/AltItemZoom/M71395s2.jpg",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[2]/div[1]/section[1]/section[1]/div[1]/div[3]/div[1]/div[2]/ul[1]/li[2]",
                    "attributes":
                    []
                }
            },
            {
                "valueType": "str",
                "text": "https://xcdn.next.co.uk/COMMON/Items/Default/Default/ItemImages/AltItemZoom/M71395s3.jpg",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[2]/div[1]/section[1]/section[1]/div[1]/div[3]/div[1]/div[2]/ul[1]/li[3]",
                    "attributes":
                    []
                }
            }
        ],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "is_refurbished",
        "value":
        [
            {
                "valueType": "str",
                "text": "N"
            }
        ],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "ASSISTED_SCRAPE"
    },
    {
        "name": "item_length_width_height_weight",
        "value":
        [],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    },
    {
        "name": "item_number",
        "value":
        [
            {
                "valueType": "str",
                "text": "M71395",
                "element":
                {
                    "location": "/html[0]/body[1]/section[1]/section[1]/div[1]/div[2]/div[5]",
                    "attributes":
                    []
                }
            }
        ],
        "sourceValue":
        [],
        "exactSourceValue": true,
        "inferredFrom":
        [],
        "inferredFromSource":
        [],
        "strategyId": "Config"
    }
]

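Here are the two possible solutions I have thought of:

  1. Parse the JSON, then encode each element of the array back into a string (so I end up with an array<string> column).

  2. Define the schema as name and value[text:], since those are the only two data items I need to extract, then use the from_json function. [Currently the preferred solution]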
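To make the two options concrete, here is a rough sketch of both in plain Python, independent of Spark. It is only an illustration under some assumptions: the function names split_json_array and extract_name_and_texts are made up, the sample string is a shortened version of the one above, and every row is assumed to hold a top-level JSON array (anything else comes back as None).

```python
import json
from typing import List, Optional, Tuple


def _load_array(raw: Optional[str]) -> Optional[list]:
    """Parse raw as JSON; return the list, or None if it is null, malformed, or not an array."""
    if raw is None:
        return None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON: yield null instead of failing the whole job
    return parsed if isinstance(parsed, list) else None


def split_json_array(raw: Optional[str]) -> Optional[List[str]]:
    """Option 1: re-encode every element of the array as its own JSON string."""
    parsed = _load_array(raw)
    if parsed is None:
        return None
    return [json.dumps(item, ensure_ascii=False) for item in parsed]


def extract_name_and_texts(raw: Optional[str]) -> Optional[List[Tuple[Optional[str], List[Optional[str]]]]]:
    """Option 2 (by hand): keep only name and the text of each value entry."""
    parsed = _load_array(raw)
    if parsed is None:
        return None
    result = []
    for item in parsed:
        if not isinstance(item, dict):
            continue  # skip elements that are not JSON objects
        texts = [v.get("text") for v in (item.get("value") or []) if isinstance(v, dict)]
        result.append((item.get("name"), texts))
    return result


# Shortened, hypothetical version of the sample row shown in the question.
sample = (
    '[{"name": "brand_name", "value": [{"valueType": "str", "text": "White Stuff"}],'
    ' "strategyId": "Config"},'
    ' {"name": "color_name", "value": [], "strategyId": "Config"}]'
)

elements = split_json_array(sample)
pairs = extract_name_and_texts(sample)
print(elements)
print(pairs)
```

In PySpark, split_json_array could presumably be wrapped with pyspark.sql.functions.udf and a return type of ArrayType(StringType()) to get the array<string> column. For option 2, from_json with a schema that lists only name and value.text (for example the DDL string array<struct<name:string,value:array<struct<text:string>>>>) should leave out the fields that are not listed, but that is worth checking against the Spark version in use.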

If you have a better approach, or ideas for implementing either of the solutions above more cleanly, please let me know. Thanks a lot :)

sql apache-spark apache-spark-sql aws-glue database-schema