Loading and cleaning a very large JSON file

Problem description

I'm working on an image classification project using the Snapshot Serengeti dataset. The dataset ships with a very large JSON file (5 GB+) containing several top-level keys. For training I specifically need the values contained in the "images": [{...}, {...}, ...] array. The file is too large for me to open and read directly or store in a dictionary.

The image entries in the file look like this:

{
"id": "S1/B04/B04_R1/S1_B04_R1_PICT0003",
"file_name": "S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG",
"frame_num": 1,
"seq_id": "SER_S1#B04#1#3",
"width": 2048,
"height": 1536,
"corrupt": false,
"location": "B04",
"seq_num_frames": 1,
"datetime": "2010-07-20 06:14:06"
},

I tried looping over the file in 100 MB chunks, but the file also has formatting problems (single quotes, NaN values) that need to be cleaned up first, otherwise errors are thrown. The code I tried is below:

with open(labels_json) as f:
    # read the file in 100 MB chunks
    for chunk in iter(lambda: f.read(100 * 1024 * 1024), ""):
        data = json.loads(chunk)  # fails: an arbitrary chunk is not a complete JSON document

Since the images are organized into 11 seasons, I tried writing the data out to 11 separate files that could then be loaded individually using the script below, but my cloud storage was eaten up before even one season had been written. I'm not familiar with data storage problems like this, so there must be something in my script that makes the file writing inefficient. Any help is greatly appreciated.

import json

labels_json = annotations_directory + "SS_Labels.json"

get_filename = lambda n : f"SS_labels_S{n}.json"

# Define the 11 output files
seasons = {}
started = {}
for i in range(1, 12):
    filename = get_filename(i)
    seasons[i] = open(filename, "w")
    seasons[i].write('[')
    started[i] = False

def seperate_seasons(dir):
    line_num = 0
    decoder = json.JSONDecoder()
    with open(dir, 'r') as labels:
        begin_writing = False
        buffer = []
        id = 1
        for line in labels:
            if not begin_writing: # Begin writing for the line after "images"
                if 'images' in line:
                    begin_writing = True
            else:
                line = line.replace('NaN', 'null') # clean NaN values (str.replace returns a new string)
                line = line.replace("'", '"')      # clean incorrect quote characters

                buffer.append(line.strip()) # add line to buffer

                getID = lambda l: int(l.split('"')[3].split('/')[0][1:])  # season number from the "id" value, e.g. "S1/..." -> 1
                if '"id"' in line or "'id'" in line:
                    previous_id = id
                    id = getID(line)        # get season id of object

                if line.strip() == '},' or line.strip() == '}': # when the object has finished, write it to the appropriate image folder
                    label = ','.join(buffer)
                    if label[-1] != ',':
                        label += ','

                    if started[id] == False:
                        print(f'Beginning Season {id}')
                        started[id] = True

                        if id != 1:
                            seasons[previous_id].write(']')
                            seasons[previous_id].close()
                            del seasons[previous_id]


                    seasons[id].write(label)                    # add label entry to file
                    buffer.clear()                              # reset the buffer for the next object

seperate_seasons(labels_json)

# Close all remaining label files
for season in seasons.values():
    season.write(']')
    season.close()
Tags: python, json, data-cleaning, large-files
1 Answer

If you don't have the RAM to load the file into memory (and I wouldn't blame you if you don't), you can split the data into more manageable files with the help of a couple of extra libraries.

This uses json-stream for streaming the JSON, orjson as a faster JSON encoder, and tqdm for a progress bar.

The input is the original single JSON file; the output folder out/ will end up containing the info and category data from the JSON, plus JSONL (a.k.a. JSON Lines, a.k.a. ND-JSON) files, i.e. one JSON object per line, à la:

{"id":"S1/B04/B04_R1/S1_B04_R1_PICT0001","file_name":"S1/B04/B04_R1/S1_B04_R1_PICT0001.JPG","frame_num":1,"seq_id":"SER_S1#B04#1#1","width":2048,"height":1536,"corrupt":false,"location":"B04","seq_num_frames":1,"datetime":"2010-07-18 16:26:14"}
{"id":"S1/B04/B04_R1/S1_B04_R1_PICT0002","file_name":"S1/B04/B04_R1/S1_B04_R1_PICT0002.JPG","frame_num":1,"seq_id":"SER_S1#B04#1#2","width":2048,"height":1536,"corrupt":false,"location":"B04","seq_num_frames":1,"datetime":"2010-07-18 16:26:30"}
{"id":"S1/B04/B04_R1/S1_B04_R1_PICT0003","file_name":"S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG","frame_num":1,"seq_id":"SER_S1#B04#1#3","width":2048,"height":1536,"corrupt":false,"location":"B04","seq_num_frames":1,"datetime":"2010-07-20 06:14:06"}
{"id":"S1/B04/B04_R1/S1_B04_R1_PICT0004","file_name":"S1/B04/B04_R1/S1_B04_R1_PICT0004.JPG","frame_num":1,"seq_id":"SER_S1#B04#1#4","width":2048,"height":1536,"corrupt":false,"location":"B04","seq_num_frames":1,"datetime":"2010-07-22 08:56:06"}

JSONL files are easy to process with many tools, and they can also be parsed in Python with a simple for loop, as in the sketch below. If you like, you can replace open with gzip.open to compress the JSONL files on the fly.
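
For example, here is a minimal sketch of reading one of the split files back (the path out/S1_B04.jsonl is only an illustration of the naming scheme used by the splitting script below; swap open for gzip.open if you compressed the output):

import json

# Illustrative path; the splitting script below writes files named
# out/<season>_<location>.jsonl, e.g. out/S1_B04.jsonl.
with open("out/S1_B04.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        image = json.loads(line)  # one JSON object per line
        print(image["file_name"], image["datetime"])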

The json_stream API is a bit finicky, but here you go: works on my machine (json-stream==2.3.0).

On my laptop, tqdm reports 29594 images being processed per second.

import os

import json_stream
import orjson
import tqdm


def read_images(value):
    jsonl_files = {}

    with tqdm.tqdm(value, unit="image") as pbar:
        for image in pbar:
            image = dict(image)
            prefix = "/".join(image["id"].split("/")[:2])

            filename = f"out/{prefix.replace('/', '_')}.jsonl"

            if filename not in jsonl_files:
                if len(jsonl_files) >= 50:
                    # cap the number of open handles: close the least recently
                    # opened file (it is reopened in append mode if needed again)
                    oldest = next(iter(jsonl_files))
                    jsonl_files.pop(oldest).close()
                jsonl_files[filename] = open(filename, "ab")
                pbar.set_description(f"Writing {filename}")

            jsonl_files[filename].write(orjson.dumps(image))
            jsonl_files[filename].write(b"\n")

    # close any output files that are still open
    for jsonl_file in jsonl_files.values():
        jsonl_file.close()


def main():
    os.makedirs("out", exist_ok=True)  # make sure the output folder exists
    with open("/Users/akx/Downloads/SnapshotSerengeti_S1-11_v2.1.json", "rb") as f:
        data = json_stream.load(f)
        for key, value in data.items():
            if key == "info":
                value = dict(value.persistent().items())
                with open("out/info.json", "wb") as info_f:
                    info_f.write(orjson.dumps(value))
            elif key == "categories":
                value = [dict(d) for d in value.persistent()]
                with open("out/categories.json", "wb") as categories_f:
                    categories_f.write(orjson.dumps(value))
            elif key == "images":
                read_images(value.persistent())


if __name__ == "__main__":
    main()
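
Once the split has run, the per-season label data is small enough to load normally. Here is a minimal sketch of reading the results back for training, assuming the out/ layout produced above (the glob pattern and variable names are only illustrative):

import glob
import json

# Load the category list written by main() above.
with open("out/categories.json", "r", encoding="utf-8") as f:
    categories = json.load(f)

# Gather all Season 1 image records; files are named out/S1_<location>.jsonl.
season1_images = []
for path in sorted(glob.glob("out/S1_*.jsonl")):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            season1_images.append(json.loads(line))

print(f"{len(categories)} categories, {len(season1_images)} Season 1 images")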