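I have some huge JSON files that I need to profile so that I can convert them into tables. I've found jq really useful for inspecting them, but there will be hundreds of them, and I'm pretty new to jq.
I already have some really handy functions in my ~/.jq (many thanks to @mikehwang):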
# Map each key of an object to the type of its value, sorted by key.
def profile_object:
  to_entries
  | map({"key": .key, "value": (.value | type)})
  | sort_by(.key)
  | from_entries;
# Merge the key/type profiles of all objects in an array.
def profile_array_objects:
  map(profile_object) | map(to_entries)
  | reduce .[] as $item ([]; . + $item) | sort_by(.key) | from_entries;
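I'm sure I'll have to modify them once I've described my problem.
I'd like a jq one-liner that profiles a single object. If a key maps to an array of objects, collect the unique keys across those objects, and keep profiling if there are nested arrays of objects inside. If a value is an object, profile that object.
Sorry for the long example, but imagine this at a few GB: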
{
"name": "XYZ Company",
"type": "Contractors",
"reporting": [
{
"group_id": "660",
"groups": [
{
"ids": [
987654321,
987654321,
987654321
],
"market": {
"name": "Austin, TX",
"value": "873275"
}
},
{
"ids": [
987654321,
987654321,
987654321
],
"market": {
"name": "Nashville, TN",
"value": "2393287"
}
}
]
}
],
"product_agreements": [
{
"negotiation_arrangement": "FFVII",
"code": "84144",
"type": "DJ",
"type_version": "V10",
"description": "DJ in a mask",
"name": "Claptone",
"negotiated_rates": [
{
"company_references": [
1,
5,
458
],
"negotiated_prices": [
{
"type": "negotiated",
"rate": 17.73,
"expiration_date": "9999-12-31",
"code": [
"11"
],
"billing_modifier_code": [
"124"
],
"billing_class": "professional"
}
]
},
{
"company_references": [
747
],
"negotiated_prices": [
{
"type": "fee",
"rate": 28.42,
"expiration_date": "9999-12-31",
"code": [
"11"
],
"billing_class": "professional"
}
]
}
]
},
{
"negotiation_arrangement": "MGS3",
"name": "David Byrne",
"type": "Producer",
"type_version": "V10",
"code": "654321",
"description": "Frontman from Talking Heads",
"negotiated_rates": [
{
"company_references": [
1,
9,
2344,
8456
],
"negotiated_prices": [
{
"type": "negotiated",
"rate": 68.73,
"expiration_date": "9999-12-31",
"code": [
"11"
],
"billing_class": "professional"
}
]
},
{
"company_references": [
679
],
"negotiated_prices": [
{
"type": "fee",
"rate": 89.25,
"expiration_date": "9999-12-31",
"code": [
"11"
],
"billing_class": "professional"
}
]
}
]
}
],
"version": "1.3.1",
"last_updated_on": "2023-02-01"
}
Desired output:
{
"name": "string",
"type": "string",
"reporting": [
{
"group_id": "string",
"groups": [
{
"ids": [
"number"
],
"market": {
"name": "string",
"value": "string"
}
}
]
}
],
"product_agreements": [
{
"negotiation_arrangement": "string",
"code": "string",
"type": "string",
"type_version": "string",
"description": "string",
"name": "string",
"negotiated_rates": [
{
"company_references": [
"number"
],
"negotiated_prices": [
{
"type": "string",
"rate": "number",
"expiration_date": "string",
"code": [
"string"
],
"billing_modifier_code": [
"string"
],
"billing_class": "string"
}
]
}
]
}
],
"version": "string",
"last_updated_on": "string"
}
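I'm really sorry if there are any mistakes in there, but I tried to keep it consistent and as simple as possible.
To restate the requirement: recursively profile each key in a JSON object wherever the value is an object or an array. The solution needs to be independent of key names. Happy to clarify further if needed.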
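The jq module schema.jq at https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed was designed to produce the kind of structural schema you describe.
For very large inputs it can be quite slow, so if the JSON is sufficiently regular, a hybrid strategy may be feasible: profile enough of the data to arrive at a comprehensive structural schema, then check that it actually applies.
For conformance testing of structural schemas such as those produced by schema.jq, see https://github.com/pkoppstein/JESS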
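As a rough illustration of that hybrid strategy, the sketch below infers a schema from only the first few items of every array and then checks the whole document against it. This is not schema.jq or JESS: `schema($n)`, `conforms($s)` and the shell helper `check_sample` are hypothetical names made up for this example, the input is invented, and jq still parses the entire input here, so it only shows the idea.

```sh
# check_sample N: infer a schema from the first N items of each array,
# then report whether the full document on stdin conforms to it.
check_sample() {
  jq -c --argjson n "$1" '
    # Schema inferred from a sample: at most $n items per array.
    def schema($n):
      if type == "object" then .[] |= schema($n)
      elif type == "array" then (.[:$n] | map(schema($n)) | unique)
      else type
      end;
    # True if the input value fits schema $s.
    def conforms($s):
      . as $v
      | if ($s | type) == "string" then ($v | type) == $s
        elif ($s | type) == "array" then
          ($v | type) == "array"
          and all($v[]; . as $item | any($s[]; . as $t | $item | conforms($t)))
        elif ($s | type) == "object" then
          ($v | type) == "object"
          and all($v | keys[];
                  . as $k | ($s | has($k)) and ($v[$k] | conforms($s[$k])))
        else false
        end;
    schema($n) as $s | {schema: $s, ok: conforms($s)}
  '
}

doc='{"a":[{"b":1},{"b":2,"c":"x"},{"b":3}]}'
echo "$doc" | check_sample 1
echo "$doc" | check_sample 2
```

With a sample size of 1 the key `c` is never seen, so `ok` is false; with a sample size of 2 the inferred schema covers every element and `ok` is true.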
Given your input.json, here is a solution:
jq '
def schema:
if type == "object" then .[] |= schema
elif type == "array" then map(schema)|unique
| if (first | type) == "object" then [add] else . end
else type
end;
schema
' input.json
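To see where this approach can lose information, here is a small run on a made-up input (not taken from the question) in which the same key holds a number in one array element and a string in another. Since `add` merges objects key by key and the later value wins, only one of the two types survives.

```sh
# Same schema definition as above, applied to a small hypothetical input.
result=$(echo '{"a":[{"c":1},{"c":"x"}]}' | jq -c '
  def schema:
    if type == "object" then .[] |= schema
    elif type == "array" then map(schema) | unique
      | if (first | type) == "object" then [add] else . end
    else type
    end;
  schema
')
echo "$result"
```

This prints `{"a":[{"c":"string"}]}`: the "number" seen in the first element is dropped.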
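Here is a variant of @Philippe's solution: it coalesces the objects in map(schema) for an array in a principled, though lossy, way. (All of these half-solutions give up some precision in exchange for speed.)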
Note that keys_unsorted is used below; if you are using gojq, you will have to change it to keys, or provide a def of keys_unsorted.
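One possible def, as a minimal sketch, simply aliases it to `keys`. That should be adequate for the program below, since the order in which keys are visited only influences key order in the output. The example runs under plain jq, where the user-level def shadows the builtin; the input is made up.

```sh
result=$(echo '{"b":1,"a":2}' | jq -c '
  # Fallback for implementations without keys_unsorted: use sorted keys.
  def keys_unsorted: keys;
  keys_unsorted
')
echo "$result"
```

This prints `["a","b"]`. Placing the same `def` line ahead of the other definitions makes the rest of the program usable unchanged.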
# Use "JSON" as the union of two distinct types
# except combine([]; [ $x ]) => [ $x ]
def combine($a;$b):
if $a == $b then $a elif $a == null then $b elif $b == null then $a
elif ($a == []) and ($b|type) == "array" then $b
elif ($b == []) and ($a|type) == "array" then $a
else "JSON"
end;
# Profile an array by calling mergeTypes(.[] | schema)
# in order to coalesce objects
def mergeTypes(s):
reduce s as $t (null;
if ($t|type) != "object" then .types = (.types + [$t] | unique)
else .object as $o
| .object = reduce ($t | keys_unsorted[]) as $k ($o;
.[$k] = combine( $t[$k]; $o[$k] )
)
end)
| (if .object then [.object] else null end ) + .types ;
def schema:
if type == "object" then .[] |= schema
elif type == "array"
then if . == [] then [] else mergeTypes(.[] | schema) end
else type
end;
schema
Example input:
{"a": [{"b":[1]}, {"c":[2]}, {"c": []}] }
Output:
{
"a": [
{
"b": [
"number"
],
"c": [
"number"
]
}
]
}
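One way to try the program is to pass it to jq inline. The sketch below repeats the three definitions unchanged and feeds them a made-up input whose `c` key is a number in one element and a string in another, which is the case where `combine` falls back to "JSON".

```sh
result=$(echo '{"a":[{"c":1},{"c":"x"}]}' | jq -c '
  def combine($a;$b):
    if $a == $b then $a elif $a == null then $b elif $b == null then $a
    elif ($a == []) and ($b|type) == "array" then $b
    elif ($b == []) and ($a|type) == "array" then $a
    else "JSON"
    end;
  def mergeTypes(s):
    reduce s as $t (null;
      if ($t|type) != "object" then .types = (.types + [$t] | unique)
      else .object as $o
        | .object = reduce ($t | keys_unsorted[]) as $k ($o;
            .[$k] = combine( $t[$k]; $o[$k] )
          )
      end)
    | (if .object then [.object] else null end ) + .types ;
  def schema:
    if type == "object" then .[] |= schema
    elif type == "array"
    then if . == [] then [] else mergeTypes(.[] | schema) end
    else type
    end;
  schema
')
echo "$result"
```

This prints `{"a":[{"c":"JSON"}]}`: the conflicting types are reported as "JSON" rather than one of them being silently kept.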