Hello, I have two jsonl files like these:
one.jsonl
{"name": "one", "description": "testDescription...", "comment": "1"}
{"name": "two", "description": "testDescription2...", "comment": "2"}
second.jsonl
{"name": "eleven", "description": "testDescription11...", "comment": "11"}
{"name": "twelve", "description": "testDescription12...", "comment": "12"}
{"name": "thirteen", "description": "testDescription13...", "comment": "13"}
My goal is to write a new jsonl file named merged_file.jsonl that looks like this:
{"name": "one", "description": "testDescription...", "comment": "1"}
{"name": "two", "description": "testDescription2...", "comment": "2"}
{"name": "eleven", "description": "testDescription11...", "comment": "11"}
{"name": "twelve", "description": "testDescription12...", "comment": "12"}
{"name": "thirteen", "description": "testDescription13...", "comment": "13"}
My approach is this:
import json
import glob
result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        try:
            result.append(extract_json(infile))  # tried json.loads(infile) too
        except ValueError:
            print(f)
# write the file with a BOM to preserve the emojis and special characters
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    json.dump(result, outfile)
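But I am running into this error: TypeError: Object of type generator is not JSON serializable
I would appreciate any hints or help. Thank you! I have looked at other SO posts, but they all write normal json files. That should work in my case too, but it keeps failing.
Reading a single file like this works: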
data_json = io.open('one.jsonl', mode='r', encoding='utf-8-sig') # Opens in the JSONL file
data_python = extract_json(data_json)
for line in data_python:
    print(line)
####outputs####
#{'name': 'one', 'description': 'testDescription...', 'comment': '1'}
#{'name': 'two', 'description': 'testDescription2...', 'comment': '2'}
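It is likely that extract_json returns a generator rather than a list or dict, and a generator is not JSON serializable. Since the input is jsonl, every line is valid json on its own, so you only need to tweak your existing code a little: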
import json
import glob
result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile.readlines():
            try:
                result.append(json.loads(line))  # read each line of the file
            except ValueError:
                print(f)
# This would output jsonl
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    # json.dump(result, outfile)
    # write each line as a json
    outfile.write("\n".join(map(json.dumps, result)))
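To illustrate why the original json.dump call failed, here is a minimal sketch. The question does not show extract_json, so `extract_json_sketch` below is a hypothetical stand-in that assumes it yields one parsed record per line; the real function may differ.

```python
import json


def extract_json_sketch(lines):
    """Hypothetical stand-in for extract_json: lazily parse one record per line."""
    for line in lines:
        yield json.loads(line)


lines = ['{"name": "one", "comment": "1"}', '{"name": "two", "comment": "2"}']

# Passing the generator object itself to json raises TypeError,
# which is the kind of failure reported in the question.
try:
    json.dumps(extract_json_sketch(lines))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Materializing the generator into a list gives json plain dicts to serialize.
records = list(extract_json_sketch(lines))
jsonl_text = "\n".join(json.dumps(record) for record in records)
```

Under that assumption, using result.extend(extract_json(infile)) instead of result.append(...) would also collect plain dicts, but writing one json.dumps per line is still needed to get jsonl rather than a single JSON array.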
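Now that I think about it, you do not even need to load the lines with json, other than to catch any badly formatted JSON lines.
You can collect all the lines in one shot like this: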
outfile = open('merged_file.jsonl', 'w', encoding='utf-8-sig')
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile.readlines():
            outfile.write(line)
outfile.close()
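One caveat with copying lines verbatim: if an input file does not end with a newline, its last record and the first record of the next file land on the same line. The sketch below is one possible way to guard against that. The function name `merge_jsonl_raw` is made up for illustration, and sorting the paths and skipping blank lines are choices of this sketch, not part of the code above.

```python
import glob
import os


def merge_jsonl_raw(pattern, out_path):
    """Copy every non-blank line from the matching files into out_path."""
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", encoding="utf-8-sig") as outfile:
        # glob returns paths in arbitrary order; sort for a stable result.
        for path in sorted(glob.glob(pattern)):
            if os.path.abspath(path) == out_abs:
                continue  # do not read the file that is being written
            with open(path, "r", encoding="utf-8-sig") as infile:
                for line in infile:
                    line = line.rstrip("\r\n")
                    if line:  # skip blank lines
                        outfile.write(line + "\n")
```

A call such as merge_jsonl_raw("folder_with_all_jsonl/*.jsonl", "merged_file.jsonl") would then write exactly one record per line, each terminated by a newline.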
If you do not care about json validation, there is also a super simple way to do this:
cat folder_with_all_jsonl/*.jsonl > merged_file.jsonl
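On a system without cat, roughly the same byte-for-byte concatenation can be sketched in Python with shutil.copyfileobj. The function name `concat_files` is hypothetical. Like cat, it copies the bytes unchanged, so if the inputs begin with a BOM (the question opens them as utf-8-sig) those BOM bytes end up in the middle of the merged file, and a missing trailing newline in one file is not repaired.

```python
import glob
import shutil


def concat_files(pattern, out_path):
    """Concatenate the matching files byte for byte, similar to `cat files > out`."""
    # Resolve the list of inputs before the output file is created.
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as outfile:
        for path in paths:
            with open(path, "rb") as infile:
                shutil.copyfileobj(infile, outfile)
```

For example, concat_files("folder_with_all_jsonl/*.jsonl", "merged_file.jsonl") mirrors the shell command above. If the BOM or newline issues matter, a line-by-line copy in text mode is the safer route.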