Imagine samples like these:
["And he said "Hello world" and went away"]
or:
{"key": "value " with quote in the middle"}
or:
["invalid " string", "valid string"]
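These are invalid JSON, and it is obvious how to fix them by escaping the quotes. I get this JSON from a buggy system that I can neither control nor change, so I have to fix it on my side. In theory it should be quite simple: if you are inside a string and you find a quote character, it should be immediately followed by one of: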
,"
]
}
In all other cases the quote can be treated as part of the string and escaped.
Before I start implementing it.
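The rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not an existing library function: the name `escape_inner_quotes_naive` and the `TERMINATORS` set are made up here, and `:` is added to the list of terminators so that object keys still close.

```python
import json

# Characters that may follow a closing quote. The question lists ",", "]"
# and "}"; ":" is an added assumption so that object keys still close.
TERMINATORS = {",", "]", "}", ":"}


def escape_inner_quotes_naive(raw: str) -> str:
    """Escape quotes inside strings using only a one-character lookahead."""
    out = []
    in_string = False
    for i, ch in enumerate(raw):
        already_escaped = i > 0 and raw[i - 1] == "\\"
        if ch == '"' and not already_escaped:
            if not in_string:
                in_string = True
            else:
                # Text after the quote with leading whitespace removed.
                rest = raw[i + 1:].lstrip()
                if rest == "" or rest[0] in TERMINATORS:
                    in_string = False  # looks like a real closing quote
                else:
                    out.append("\\")  # treat the quote as string content
        out.append(ch)
    return "".join(out)


fixed = escape_inner_quotes_naive('["And he said "Hello world" and went away"]')
print(json.loads(fixed))
```

Because the lookahead has no idea whether it is inside an array or an object, an input such as `["invalid ": string", "valid string"]` still comes out unparseable: the `":` is mistaken for the end of a key.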
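In the end I came up with this. A regular expression alone is not enough, because I wanted to keep track of what is currently happening in the JSON so that the fix is more restrictive. It could be improved by using a full stack and knowing the current array/object nesting, but this is simple enough for my use case.
And yes, since this only modifies the string, I can drop it once the original source gets fixed.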
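import json
import re
import unittest
from json import JSONDecodeError
from typing import Any

# For the last structural character seen before a string, the characters
# that may follow that string's closing quote.
expected_characters_by_prestring_value = {
    "[": (",", "]"),
    "]": ("[", ","),
    "{": (":",),
    "}": (",", "{", "]"),
    ":": (",", "}"),
}


def fix_unescaped_quotes(raw: str) -> str:
    in_string = False
    output = ""
    # Non-whitespace characters (except commas) seen outside strings;
    # only the most recent one is consulted.
    nesting_stack = []
    for index, character in enumerate(raw):
        # The index check keeps raw[index - 1] from wrapping around to raw[-1].
        if character == '"' and (index == 0 or raw[index - 1] != "\\"):
            if in_string:
                # Look ahead to the next non-whitespace character, if any.
                match = re.search(r"\S", raw[index + 1:])
                first_nonwhite_character_ahead = match.group() if match else ""
                context = nesting_stack[-1] if nesting_stack else ""
                expected = expected_characters_by_prestring_value.get(context, ())
                if first_nonwhite_character_ahead in expected:
                    in_string = False  # the quote closes the string
                else:
                    output += "\\"  # the quote is part of the string: escape it
            else:
                in_string = True
        elif not in_string:
            if character.strip() != "" and character != ",":
                nesting_stack.append(character)
        output += character
    return output


def parse_and_fix(raw: str) -> Any:
    # Only attempt the repair when the input does not parse as-is.
    try:
        return json.loads(raw)
    except JSONDecodeError:
        return json.loads(fix_unescaped_quotes(raw=raw))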


class JsonFixUnescapedQuotesTest(unittest.TestCase):
    def test_completely_invalid(self):
        with self.assertRaises(JSONDecodeError):
            parse_and_fix("invalid_json")

    def test_valid(self):
        self.assertEqual({}, parse_and_fix("{}"))

    def test_invalid_single_array(self):
        self.assertEqual(
            ['he said "hello world" and left'],
            parse_and_fix("""["he said "hello world" and left"]"""),
        )

    def test_invalid_object(self):
        self.assertEqual(
            {"key": 'value " with quote in the middle'},
            parse_and_fix("""{"key": "value " with quote in the middle"}"""),
        )

    def test_invalid_2_item_array(self):
        self.assertEqual(
            ['invalid " string', "valid string"],
            parse_and_fix("""["invalid " string", "valid string"]"""),
        )

    def test_wont_get_fooled_by_colon(self):
        self.assertEqual(
            ['invalid ": string', "valid string"],
            parse_and_fix("""["invalid ": string", "valid string"]"""),
        )

    def test_wont_get_fooled_by_colon_after_object(self):
        self.assertEqual(
            {"key": "value\":"},
            parse_and_fix("""{"key": "value":"}"""),
        )

    def test_wont_get_fooled_by_comma_in_key(self):
        self.assertEqual(
            {"key\",": "value"},
            parse_and_fix("""{"key",": "value"}"""),
        )
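
The answer mentions that tracking the real array/object nesting would be an improvement. Below is a minimal sketch of that idea, not the answer's own code: the function name `fix_unescaped_quotes_with_stack` and its bookkeeping variables are hypothetical. It keeps a stack of open containers and remembers whether the next string in an object is a key, so a closing quote is accepted only if the character after it fits that position.

```python
import json


def fix_unescaped_quotes_with_stack(raw: str) -> str:
    out = []
    stack = []             # currently open containers: "[" or "{"
    expecting_key = False  # True when the next string in an object is a key
    in_string = False
    string_is_key = False
    for i, ch in enumerate(raw):
        if in_string:
            if ch == '"' and raw[i - 1] != "\\":
                # First non-whitespace character after the quote ("" at the end).
                nxt = raw[i + 1:].lstrip()[:1]
                if string_is_key:
                    allowed = (":",)
                elif stack and stack[-1] == "{":
                    allowed = (",", "}")
                elif stack:
                    allowed = (",", "]")
                else:
                    allowed = ("",)  # top-level string: only end of input
                if nxt in allowed:
                    in_string = False
                else:
                    out.append("\\")  # quote is content, so escape it
        elif ch == '"':
            in_string = True
            string_is_key = bool(stack) and stack[-1] == "{" and expecting_key
        elif ch in "[{":
            stack.append(ch)
            expecting_key = ch == "{"
        elif ch in "]}":
            if stack:
                stack.pop()
            expecting_key = False
        elif ch == ":":
            expecting_key = False
        elif ch == ",":
            # After a comma inside an object, the next string is a key again.
            expecting_key = bool(stack) and stack[-1] == "{"
        out.append(ch)
    return "".join(out)


print(json.loads(fix_unescaped_quotes_with_stack('{"a": "x " y", "b": "z"}')))
```

The extra state lets the sketch handle objects with more than one key, where the string after a comma has to be followed by `:` rather than `,` or `}`. It is still a heuristic: a value that really contains `",` or `"}` remains ambiguous and cannot be repaired reliably.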
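Sometimes we need a quick and dirty solution, and I believe this question is completely legitimate. In a similar situation, my PHP solution looked like the one below. I did not take the time to convert it to Python, but feel free to do so.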
function fixJsonQuotes($json) {
    $insideString = false;
    $newJson = '';
    for ($i = 0; $i < mb_strlen($json); $i++) {
        if (mb_substr($json, $i, 1) == '"' && mb_substr($json, $i - 1, 1) != '\\') {
            // We've found an unescaped double quote character.
            if (! $insideString && preg_match('~[\[\{,:]\s*"$~', mb_substr($json, 0, $i + 1))) {
                $insideString = true;
            } else if ($insideString && preg_match('~^"\s*[\]\},:]~', mb_substr($json, $i))) {
                $insideString = false;
            } else {
                // We've found out that our double quote character must be escaped.
                $newJson .= '\\';
            }
        }
        $newJson .= mb_substr($json, $i, 1);
    }
    return $newJson;
}
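
Since the answer invites a Python conversion, here is one possible port, offered as a sketch rather than a verified equivalent. The names `fix_json_quotes`, `OPENING` and `CLOSING` are chosen here for illustration. One small deviation from the PHP: the first character is never compared against a "previous" character, whereas `mb_substr($json, -1, 1)` would read the last one.

```python
import json
import re

# A quote opens a string if it follows "[", "{", "," or ":" (whitespace allowed).
OPENING = re.compile(r'[\[{,:]\s*"$')
# A quote closes a string if "]", "}", "," or ":" follows (whitespace allowed).
CLOSING = re.compile(r'"\s*[\]},:]')


def fix_json_quotes(raw: str) -> str:
    inside_string = False
    out = []
    for i, ch in enumerate(raw):
        if ch == '"' and (i == 0 or raw[i - 1] != "\\"):
            # An unescaped double quote: decide whether it is structural.
            if not inside_string and OPENING.search(raw[: i + 1]):
                inside_string = True
            elif inside_string and CLOSING.match(raw, i):
                inside_string = False
            else:
                # Neither a plausible opening nor closing quote: escape it.
                out.append("\\")
        out.append(ch)
    return "".join(out)


print(json.loads(fix_json_quotes('{"key": "value " with quote in the middle"}')))
```

Like the PHP original, this looks only at the characters right next to the quote, so a string that contains `":` or `",` can still fool it. Slicing the input on every quote also makes it quadratic in the worst case, which should be acceptable for a quick fix on small payloads.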