Python: safe string compression without newline characters

Question (0 votes, 1 answer)

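I am trying to save a large number of JSON files by appending each JSON string as a new line to one large text file. My storage space is limited, so I don't want to store the JSON strings as-is. Instead, I compress each JSON string with the zlib library and append the compressed string as a new line to the large file.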

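Compression works well, but the compressed output often contains the newline character "\n", which breaks decompression when the file is read line by line. I tried to work around this by base64-encoding the zlib output, since base64 contains no newlines, but that makes the final string longer, so the compression is worse (in fact, for shorter strings, the zlib/base64 result is longer than the original string).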

import zlib, base64
item_dict={}
item_dict["a"]="ما هذا الذي قاله اليوم بشأن الأخبارية التي فلتها متعمدا؟"
item_dict["b"]="She’s allowed to not want someone else’s kids in her picture. Y’all are weird for the way youre acting over this. I don’t want any pics of myself with my ex’s children, because they aren’t my children and I’m not in their lives anymore. It’s weird to post pics of someone else’s kids… so asking for them to be removed so I can still enjoy my picture from my holiday isn’t as bad as y’all are making it seem."
item_dict["c"]='''
{"symbol": "A/RES/74/1", "resolution_number": "74/1.", "title": "Scale of assessments for the apportionment of the expenses of the United Nations: requests under Article 19 of the Charter", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/483", "report_paragraph": "6", "committee": "Fifth Committee", "agenda_item": "Agenda item 139", "agenda_item_name": "Scale of assessments for the apportionment of the expenses of the United Nations", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE CHAIR OF THE COMMITTEE"], "additional_sponsors": [], "SDGs": [], "subjects": [["Comoros", "UNBIS Thesaurus"], ["Sao Tome And Principe", "UNBIS Thesaurus"], ["Somalia", "UNBIS Thesaurus"]]}
{"symbol": "A/RES/74/2", "resolution_number": "74/2.", "title": "Political declaration of the high-level meeting on universal health coverage", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/L.4", "report_paragraph": "N.A.", "committee": "Without reference to a Main Committee", "agenda_item": "Agenda item 126", "agenda_item_name": "Global health and foreign policy", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE PRESIDENT OF THE GENERAL ASSEMBLY"], "additional_sponsors": [], "SDGs": ["3"], "subjects": [["Health Policy", "UNBIS Thesaurus"], ["Public Health", "UNBIS Thesaurus"], ["Health Services", "UNBIS Thesaurus"], ["Health Insurance", "UNBIS Thesaurus"], ["Declarations (Text)", "UNBIS Thesaurus"]]}
{"symbol": "A/RES/74/3", "resolution_number": "74/3.", "title": "Political declaration of the high-level meeting to review progress made in addressing the priorities of small island developing States through the implementation of the SIDS Accelerated Modalities of Action (SAMOA) Pathway", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/L.3", "report_paragraph": "N.A.", "committee": "Without reference to a Main Committee", "agenda_item": "Agenda item 19 (b)", "agenda_item_name": "Sustainable development: follow-up to and implementation of the SIDS Accelerated Modalities of Action (SAMOA) Pathway and the Mauritius Strategy for the Further Implementation of the Programme of Action for the Sustainable Development of Small Island Developing States of the SIDS Accelerated Modalities of Action (SAMOA) Pathway and the Mauritius Strategy for the Further Implementation of the Programme of Action for the Sustainable Development of Small Island Developing States", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE PRESIDENT OF THE GENERAL ASSEMBLY"], "additional_sponsors": [], "SDGs": ["16", "17", "3"], "subjects": [["Sustainable Development", "UNBIS Thesaurus"], ["Developing Island Countries", "UNBIS Thesaurus"], ["Development Assistance", "UNBIS Thesaurus"], ["Programme Implementation", "UNBIS Thesaurus"], ["Programme Evaluation", "UNBIS Thesaurus"], ["Declarations (Text)", "UNBIS Thesaurus"]]}
'''
item_dict["d"]='{"url": "http://agribank.ngan-hang.net", "final_url": "http://ww7.ngan-hang.net/?usid=18&utid=23776691570", "lang": "", "title": "", "description": "", "keywords": "", "phone_numbers": [], "links": [], "social_links": [], "emails": [], "addresses": [], "logos": [], "text": "", "last": 41, "n_items": 1}'
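for key, val in item_dict.items():
    # Compress the UTF-8 bytes, then base64-encode the compressed bytes
    zlib_compressed = zlib.compress(val.encode())
    base64_compressed = base64.b64encode(zlib_compressed)
    # Count newline bytes in each representation
    zlib_n_line_breaks = zlib_compressed.count(b'\n')
    base64_line_breaks = base64_compressed.count(b'\n')
    # Note: len(val) counts characters, while the other lengths count bytes
    print("original size:", len(val), " | zlib:", len(zlib_compressed), "base64", len(base64_compressed), "| zlib_n_line_breaks", zlib_n_line_breaks, base64_line_breaks)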

Output:

original size: 56  | zlib: 84 base64 112 | zlib_n_line_breaks 0 0
original size: 407  | zlib: 254 base64 340 | zlib_n_line_breaks 0 0
original size: 3655  | zlib: 941 base64 1256 | zlib_n_line_breaks 1 0
original size: 303  | zlib: 184 base64 248 | zlib_n_line_breaks 1 0
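
The base64 sizes in the output above are consistent with base64's fixed expansion: every 3 input bytes become 4 output characters, with the last group padded. The following is a small sketch of that arithmetic; `base64_length` is a hypothetical helper name, not part of the `base64` module.

```python
import base64
import zlib


def base64_length(n_bytes: int) -> int:
    """Length of standard padded base64 output for n_bytes of input."""
    # Each group of up to 3 input bytes yields exactly 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


# zlib sizes taken from the output above
for n in (84, 254, 941, 184):
    print(n, "->", base64_length(n))
# prints 84 -> 112, 254 -> 340, 941 -> 1256, 184 -> 248

# Cross-check the formula against the real encoder on some compressed data
sample = zlib.compress(b'{"symbol": "A/RES/74/1"}' * 20)
assert len(base64.b64encode(sample)) == base64_length(len(sample))
```

So base64 always costs about 33% on top of the compressed size, which is why short inputs can end up longer than the original.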

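As a workaround, I wrote custom compress/decompress functions that replace newlines in the compressed output with an arbitrary string (e.g. 00000) and do the reverse on decompression. This reduces the chance of a decompression error but does not eliminate it, because the original compressed string may happen to contain that arbitrary string.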
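
The failure mode described above can be reproduced with a hand-made payload. This is only an illustration: `naive_encode` and `naive_decode` are hypothetical names for the workaround, and the bytes are constructed by hand rather than taken from real zlib output.

```python
PLACEHOLDER = b"00000"


def naive_encode(data: bytes) -> bytes:
    # Replace each newline byte with the placeholder
    return data.replace(b"\n", PLACEHOLDER)


def naive_decode(data: bytes) -> bytes:
    # Reverse step: cannot tell a placeholder from bytes that were already there
    return data.replace(PLACEHOLDER, b"\n")


# A payload that already contains the placeholder bytes and also a newline
original = b"ab00000cd\nef"
restored = naive_decode(naive_encode(original))

print(original)  # b'ab00000cd\nef'
print(restored)  # b'ab\ncd\nef'
assert restored != original
```

The round trip is lossy because the placeholder has no escape of its own, which is the gap a proper escaping scheme has to close.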

I am aware of an existing question on this, but its answers are not satisfactory.

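So the question is: is there a compression algorithm that compresses strings without producing newline characters? Or is there a reliable way to post-process the output of zlib (or any other compression algorithm) so that it contains no newlines?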

python compression zlib
1 answer (0 votes)

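Replace every newline with a backslash followed by n, and replace every backslash with two backslashes. On the other end, when you see a backslash, look at the next byte. If it is n, output a newline. Otherwise, output a backslash. In either case, consume both bytes.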
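
A minimal sketch of this escaping scheme in Python follows. The function names `escape` and `unescape` are my own, and the sketch assumes the file is written and read in binary mode, where lines are split on `b"\n"` only. An in-memory `io.BytesIO` stands in for the large file.

```python
import io
import zlib


def escape(data: bytes) -> bytes:
    # Double the backslashes first, so the backslashes introduced
    # for newlines in the second step are not doubled again.
    return data.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")


def unescape(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x5C and i + 1 < len(data):  # 0x5C is a backslash
            # Backslash + "n" means newline; backslash + backslash means backslash
            out.append(0x0A if data[i + 1] == ord("n") else 0x5C)
            i += 2  # consume both bytes of the escape pair
        else:
            out.append(byte)
            i += 1
    return bytes(out)


# Round trip: one compressed, escaped record per line
records = ["first record", "second\nrecord", '{"k": "v\\n"}']
buf = io.BytesIO()
for text in records:
    buf.write(escape(zlib.compress(text.encode())) + b"\n")

buf.seek(0)
restored = [zlib.decompress(unescape(line.rstrip(b"\n"))).decode() for line in buf]
assert restored == records
```

Each newline or backslash byte costs one extra byte, so on compressed data, whose bytes are roughly evenly distributed, the overhead should be on the order of 1%, compared with about 33% for base64. If the file were read in text mode instead, carriage returns would likely need the same treatment.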
