Streaming large GeoJSON files with jq


I have some large GeoJSON files (for an example, see California.geojson at https://github.com/microsoft/USBuildingFootprints?tab=readme-ov-file#download-links). I want to convert them to .csv, because importing with \copy in Postgres is faster. I pipe the result to STDIN.

The features are "type": "Polygon", but for my needs this is essentially point data: I don't need the full geometry (or the properties). jq is a great fit for this task.

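Sadly, some of the largest files seem to be too big to hold in memory (the process ends with a "Killed" message). I tried the --stream option, but either I did not understand it or the process is just very slow (still "running" after more than 3 hours).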

A sample can be produced as follows (a copy of it is at the bottom of this post):

jq '.features = .features[:5]' data/Alabama.geojson > sample.geojson 

This works well for "smaller" GeoJSON files (< 1.4 GB):

jq '.features | map(.geometry.coordinates) | map(.[]) | map(first) | .[] | {"long": first, "lat": last} | [.long, .lat] | @csv' small.geojson

With the larger files, however, I get a "Killed" message (I assume I am running out of memory).

So I tried --stream, though I am not sure I understand it correctly (this post and this issue helped a lot).

Here is my version using --stream (lots of "hacks"):

cat sample.geojson | jq --stream "fromstream(1|truncate_stream(inputs))" | jq ' map(.geometry.coordinates) | map(.[]) | map(first) | .[] | {"long": first, "lat": last} | [.long, .lat] | @csv'

It works on sample.geojson but fails on large GeoJSON files (e.g. "Ohio.geojson"). Any ideas?

I also tried writing the output to a file, without success.

GeoJSON sample:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.959634,
              32.421887
            ],
            [
              -84.95982,
              32.421889
            ],
            [
              -84.959822,
              32.421797
            ],
            [
              -84.959767,
              32.421796
            ],
            [
              -84.959767,
              32.421771
            ],
            [
              -84.959636,
              32.421769
            ],
            [
              -84.959634,
              32.421887
            ]
          ]
        ]
      },
      "properties": {
        "release": 2,
        "capture_dates_range": "3/26/2020-7/22/2020"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.959636,
              32.42095
            ],
            [
              -84.959715,
              32.42095
            ],
            [
              -84.959714,
              32.420984
            ],
            [
              -84.959816,
              32.420985
            ],
            [
              -84.959818,
              32.420849
            ],
            [
              -84.959637,
              32.420848
            ],
            [
              -84.959636,
              32.42095
            ]
          ]
        ]
      },
      "properties": {
        "release": 2,
        "capture_dates_range": "3/26/2020-7/22/2020"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.959998,
              32.235231
            ],
            [
              -84.959877,
              32.235231
            ],
            [
              -84.959877,
              32.235288
            ],
            [
              -84.959998,
              32.235288
            ],
            [
              -84.959998,
              32.235231
            ]
          ]
        ]
      },
      "properties": {
        "release": 1,
        "capture_dates_range": ""
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.960253,
              32.422248
            ],
            [
              -84.960069,
              32.422245
            ],
            [
              -84.960067,
              32.422321
            ],
            [
              -84.960165,
              32.422323
            ],
            [
              -84.960164,
              32.422364
            ],
            [
              -84.96025,
              32.422365
            ],
            [
              -84.960253,
              32.422248
            ]
          ]
        ]
      },
      "properties": {
        "release": 2,
        "capture_dates_range": "3/26/2020-7/22/2020"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              -84.961602,
              32.419206
            ],
            [
              -84.961599,
              32.419354
            ],
            [
              -84.961707,
              32.419355
            ],
            [
              -84.961708,
              32.419291
            ],
            [
              -84.961794,
              32.419292
            ],
            [
              -84.961796,
              32.419208
            ],
            [
              -84.961602,
              32.419206
            ]
          ]
        ]
      },
      "properties": {
        "release": 2,
        "capture_dates_range": "3/26/2020-7/22/2020"
      }
    }
  ]
}

1 Answer

Your original filter can be simplified as follows:

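  • .features stays as is
  • map(.geometry.coordinates) | map(.[]) | map(first) is really just map(.geometry.coordinates | .[] | first)
  • map(…) | .[] can then be reduced to .[] | …
  • {"long": first, "lat": last} | [.long, .lat] builds an object and immediately converts it into an array. This could be simplified to [first,last], but since each array has only two items to begin with, this part really just returns its input and can be dropped entirely
  • @csv stays as is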

Altogether, you can achieve the same result with:

.features[].geometry.coordinates[][0] | @csv
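To check that the simplification preserves the output, both filters can be run against the same input and compared. The following is only a sketch: tiny.geojson is a hypothetical two-feature file made up for illustration (trimmed from the sample in the question), and -r is added so that jq prints raw CSV rather than JSON-quoted strings.

```shell
#!/bin/sh
# Hypothetical miniature input; the file name and the trimmed rings are
# assumptions for illustration, not part of the original data set.
cat > tiny.geojson <<'EOF'
{"type":"FeatureCollection","features":[
 {"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[-84.959634,32.421887],[-84.95982,32.421889],[-84.959634,32.421887]]]},"properties":{"release":2}},
 {"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[-84.959636,32.42095],[-84.959715,32.42095],[-84.959636,32.42095]]]},"properties":{"release":2}}
]}
EOF

# Filter from the question, unchanged apart from -r (raw output).
original=$(jq -r '.features | map(.geometry.coordinates) | map(.[]) | map(first) | .[] | {"long": first, "lat": last} | [.long, .lat] | @csv' tiny.geojson)

# Simplified filter: first point of every ring of every feature.
simplified=$(jq -r '.features[].geometry.coordinates[][0] | @csv' tiny.geojson)

# Print the CSV lines and report whether the two filters agree.
printf '%s\n' "$simplified"
if [ "$original" = "$simplified" ]; then
  echo "filters agree"
else
  echo "filters differ"
fi
```

Note that both filters still load the whole document before producing anything, so this only verifies the rewrite; it does not address the memory problem by itself.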

This descends five levels, picks the first item, and converts it to CSV output. So it can be translated into a --stream version as follows:

fromstream(5|truncate_stream(inputs))[0] | @csv
-114.127454,34.265674
-114.127694,34.260939
-114.127988,34.264977
-114.129007,34.260229
-114.129611,34.261105
-114.130311,34.263922
-114.131834,34.284069
-114.132183,34.28509
-114.132634,34.281492
-114.133764,34.282816
:
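For completeness, here is a sketch of how the streamed filter might be invoked from the shell, assuming jq 1.6 or later. Because the filter reads its events through inputs, it should be run with -n (otherwise jq consumes the first event as the regular input), and -r prints raw CSV lines. The file names tiny.geojson and points.csv are hypothetical. The first command also illustrates what appears to be the reason the depth-1 attempt in the question runs out of memory: at depth 1, fromstream only emits once the whole features array has been rebuilt, whereas at depth 5 it emits one ring at a time.

```shell
#!/bin/sh
# Hypothetical miniature input (an assumption for illustration only).
cat > tiny.geojson <<'EOF'
{"type":"FeatureCollection","features":[
 {"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[-84.959634,32.421887],[-84.95982,32.421889],[-84.959634,32.421887]]]},"properties":{"release":2}},
 {"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[-84.959636,32.42095],[-84.959715,32.42095],[-84.959636,32.42095]]]},"properties":{"release":2}}
]}
EOF

# Depth 1: a single value is emitted, namely the complete "features" array.
# "length" is applied to it, so the result is the number of features.
whole=$(jq --stream -n 'fromstream(1|truncate_stream(inputs)) | length' tiny.geojson)
echo "depth 1 emitted one array with $whole features"

# Depth 5: one small value (a ring, i.e. an array of points) per emission.
# [0] keeps the first point of each ring, @csv formats it as "long,lat".
jq --stream -n -r 'fromstream(5|truncate_stream(inputs))[0] | @csv' tiny.geojson > points.csv
cat points.csv

# Reference result from the non-streaming filter, for comparison.
reference=$(jq -r '.features[].geometry.coordinates[][0] | @csv' tiny.geojson)
```

For a large file, the same command can read from standard input instead of a file name (for example jq --stream -n -r '…' < Ohio.geojson), with the output redirected to the .csv file that \copy will load.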