I am trying to convert a very large (over 250 GB) JSON file to CSV. The JSON file looks like this:
{
"BuildingSiteList":[
{
"ID": "00001"
(34 more attributes)
},
{
"ID": "00002"
(34 more attributes)
}
},
{
"BuildingSiteOwnershipList":[
{
"Owner": "xyz"
(14 more attributes)
},
{
"Owner": "abc"
(14 more attributes)
}
},
{
"BuildingsList":[
{
"ID":"000003"
"LocalityCode":"01"
(89 more attributes)
},
{
"ID":"000004"
"LocalityCode":"03"
(89 more attributes)
},
{
etc.
}
},
and so on
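I am only interested in part of the data under the "BuildingsList" branch, so I am currently using this Python code to pick out the attributes I need and write them to a CSV file: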
import ijson
import csv

input_file_path = 'path to json file'
output_file_path = 'path to csv file'

# List the fieldnames you want to include in the CSV file
desired_fieldnames = ["ID", "localityCode", "Coordinates", "BuildingType", "RentalStatus"]

# Buffer to store rows before writing to CSV
buffer_size = 1000000
chunky = 100000
rows_buffer = []

def write_buffer(writer, buffer):
    for row in buffer:
        writer.writerow(row)

with open(input_file_path, 'rb') as input_file, open(output_file_path, 'w', newline='', encoding='utf-8') as output_file:
    objects = ijson.items(input_file, "BuildingsList")
    writer = csv.DictWriter(output_file, fieldnames=desired_fieldnames)
    writer.writeheader()
    for item in objects:
        # Create a new dictionary with only the desired fields
        for x in range(len(item)):
            filtered_item = {field: item[x].get(field, '') for field in desired_fieldnames}
            rows_buffer.append(filtered_item)
            if len(rows_buffer) >= buffer_size:
                write_buffer(writer, rows_buffer)
                rows_buffer = []
    # Write any remaining rows in the buffer
    if rows_buffer:
        write_buffer(writer, rows_buffer)
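This works fine on a smaller version of the dataset (60 KB), but when I try it on the large file it crashes my computer. I think this is because ijson.items tries to process the entire dataset.
I tried using pandas to split the JSON file into manageable chunks, but couldn't get it to work.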
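I don't know ijson at all, but if it stream-reads the JSON so that you don't have to load the whole payload, then I would expect something like this to work: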
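import ijson
import csv

input_file_path = 'path to json file'
output_file_path = 'path to csv file'

# Key names are case-sensitive and must match the JSON exactly
# (the sample in the question shows "LocalityCode").
desired_fieldnames = ["ID", "localityCode", "Coordinates", "BuildingType", "RentalStatus"]

with open(input_file_path, 'rb') as input_file, open(output_file_path, 'w', newline='', encoding='utf-8') as output_file:
    # extrasaction="ignore" drops every key that is not in desired_fieldnames
    writer = csv.DictWriter(output_file, fieldnames=desired_fieldnames, extrasaction="ignore")
    writer.writeheader()
    # The ".item" suffix yields one array element at a time; the bare
    # "BuildingsList" prefix would build the entire list in memory first.
    for item in ijson.items(input_file, "BuildingsList.item"):
        writer.writerow(item)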