Converting a very large (250GB+) JSON file to CSV in Python using ijson


I am trying to convert a very large (over 250 GB) JSON file to CSV. The JSON file looks like this:

{
"BuildingSiteList":[
    {
    "ID": "00001",
    (34 more attributes)
    },
    {
    "ID": "00002",
    (34 more attributes)
    }
],
"BuildingSiteOwnershipList":[
    {
    "Owner": "xyz",
    (14 more attributes)
    },
    {
    "Owner": "abc",
    (14 more attributes)
    }
],
"BuildingsList":[
    {
    "ID": "000003",
    "LocalityCode": "01",
    (89 more attributes)
    },
    {
    "ID": "000004",
    "LocalityCode": "03",
    (89 more attributes)
    },
    {
    etc.
    }
],
(and so on)
}

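I am only interested in part of the data under the "BuildingsList" branch, so I am currently using this Python code to find the desired attributes and put them into a CSV file: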

import ijson
import csv

input_file_path = 'path to json file'
output_file_path = 'path to csv file'

# List the fieldnames you want to include in the CSV file
desired_fieldnames = ["ID", "localityCode", "Coordinates", "BuildingType", "RentalStatus"]

# Buffer to store rows before writing to CSV
buffer_size = 1000000
chunky = 100000
rows_buffer = []

def write_buffer(writer, buffer):
    for row in buffer:
        writer.writerow(row)

with open(input_file_path, 'rb') as input_file, open(output_file_path, 'w', newline='', encoding='utf-8') as output_file:
    objects = ijson.items(input_file, "BuildingsList")
    writer = csv.DictWriter(output_file, fieldnames=desired_fieldnames)
    writer.writeheader()

    for item in objects:
        # Create a new dictionary with only the desired fields
        for x in range(len(item)):
            filtered_item = {field: item[x].get(field, '') for field in desired_fieldnames}
            rows_buffer.append(filtered_item)

        if len(rows_buffer) >= buffer_size:
            write_buffer(writer, rows_buffer)
            rows_buffer = []

    # Write any remaining rows in the buffer
    if rows_buffer:
        write_buffer(writer, rows_buffer)

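This works fine on a smaller version of the dataset (60 KB), but when I try it on the large file, it crashes my computer. I think this is because ijson.items tries to process the entire dataset.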

I tried using pandas to split the JSON file into manageable chunks, but couldn't get it to work.

python json csv large-data ijson
1 Answer

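I don't know ijson at all, but if it reads the JSON as a stream, so that you don't have to load the whole payload, then I would hope something like this works: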

import ijson
import csv

input_file_path = 'path to json file'
output_file_path = 'path to csv file'
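# Field names are case-sensitive and must match the JSON keys exactly
# ("LocalityCode" is spelled as in the sample JSON above).
desired_fieldnames = ["ID", "LocalityCode", "Coordinates", "BuildingType", "RentalStatus"]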

with open(input_file_path, 'rb') as input_file, open(output_file_path, 'w', newline='', encoding='utf-8') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=desired_fieldnames, extrasaction="ignore")
    writer.writeheader()
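    # "BuildingsList.item" yields one building dict at a time; the bare prefix
    # "BuildingsList" would yield the whole array as a single list.
    for item in ijson.items(input_file, "BuildingsList.item"):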
        writer.writerow(item)
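
The writer above depends on how csv.DictWriter treats keys that do not line up with `fieldnames`. As a small standard-library sketch (the rows and the in-memory `out` buffer are invented for illustration): `extrasaction="ignore"` drops keys that are not listed, and keys that are listed but missing from a row are filled with `restval`, which defaults to an empty string.

```python
import csv
import io

fieldnames = ["ID", "LocalityCode"]

# An in-memory text buffer stands in for the real output file.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore", restval="")
writer.writeheader()

# "Unwanted" is not in fieldnames, so it is silently dropped.
writer.writerow({"ID": "000003", "LocalityCode": "01", "Unwanted": "x"})

# "LocalityCode" is missing here, so the column is filled with restval.
writer.writerow({"ID": "000004"})

csv_text = out.getvalue()
print(csv_text)
```

Without `extrasaction="ignore"`, DictWriter raises a ValueError for rows with unlisted keys, so each building dict would have to be filtered by hand first.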