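I am currently using BeautifulSoup to scrape listings from a job website and output the data to JSON, working from the site's HTML.
I have fixed some errors with regular expressions, but I am stuck on this one. Essentially, I convert the BeautifulSoup result to a string, clean out the leftover HTML, and then convert the string to JSON. However, I run into trouble because some values contain text with quotation marks. Since the actual data is large, I will use a stand-in.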
example_string = '{"Category_A" : "Words typed describing stuff",
"Category_B" : "Other words speaking more irrelevant stuff",
"Category_X" : "Here is where the "PROBLEM" lies"}'
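The code above will not run in Python as written, but the string I extract from the job listing's HTML has essentially this format. When it is passed to json.loads(), it raises the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035
I am not at all sure how to solve this.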
EDIT: Here is the actual code that causes the error:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import json, re
uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()
listing_soup = BeautifulSoup(page_html, "lxml")
extracted_json_str = ''.join(json_script)
## Clean up the string some
extracted_json_str_CLEAN1 = re.sub(pattern = r"\r+|\n+|\t+|\\l+| | |amp;|\u2013|</?.{,6}>", # last is to get rid of </p> and </strong>
repl='',
string = extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern = r"\\u2019",
repl = r"'",
string = extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',
repl=r" -",
string = extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',
repl="",
string = extracted_json_str_CLEAN3)
## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)
If you want a double quote (") inside a value, you need to escape it with a backslash (\). The string you pass to json.loads() should therefore look like this:
example_string = '{"Category_A": "Words typed describing stuff", "Category_B": "Other words speaking more irrelevant stuff", "Category_X": "Here is where the \\"PROBLEM\\" lies"}'
json.loads can parse this.
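As a quick check of the point above, here is a minimal sketch using only the standard json module. The shortened one-key example and the variable names parsed and serialized are illustrative, not taken from the question's real data. It uses a raw string so the backslashes reach json.loads unchanged, and it also shows that json.dumps produces the same escaping by itself when you serialize from Python values.

```python
import json

# Raw string literal: the JSON text really contains \" around PROBLEM.
example_string = r'{"Category_X": "Here is where the \"PROBLEM\" lies"}'

# The escaped quotes are decoded back to plain double quotes in the value.
parsed = json.loads(example_string)
print(parsed["Category_X"])

# Going the other way, json.dumps adds the backslashes for you.
serialized = json.dumps({"Category_X": 'Here is where the "PROBLEM" lies'})
print(serialized)
```

If you can get at the individual values before they are joined into one string, building a dict and calling json.dumps is usually less fragile than escaping by hand.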
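# When you extract the text, I think you should check it for double quotes.
# Example:
extraction = 'Here is where the "PROBLEM" lies'  # sample value from the question
if "\"" in extraction:
    # replace every double quote with a single quote
    extraction = extraction.replace("\"", "\'")
print(extraction)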
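In this case you replace the " characters in the extracted text, which means the value itself gets changed. Something has to be converted, because a bare " inside a double-quoted string ends the string. Python does give you a way to use both kinds of quote in one string: delimit the string with the opposite kind of quote, or escape the quote with a backslash.
Example: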
"this is a 'test'"
'this was a "test"'
"this is not a \"test\""
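# in case the condition is met
if "\"" in item:
    # either replace the double quotes with single quotes
    item = item.replace("\"", "\'")
    # or escape them with a backslash instead (use only one of the two lines)
    # item = item.replace("\"", "\\\"")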
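Note that the replacements above are meant for a single value. If the whole JSON text has already been assembled with bare quotes, as in the question, replacing every " would also hit the quotes that delimit keys and values. Below is a rough sketch of one workaround. escape_inner_quotes is a hypothetical helper (it is not part of json or BeautifulSoup), and it rests on an assumption: a quote counts as a delimiter only if the nearest non-space character before it is one of { [ , : or the nearest one after it is one of : , } ]. Text such as `said "hi", then` inside a value would fool it.

```python
import json


def escape_inner_quotes(text):
    """Heuristically escape double quotes that do not look like JSON delimiters.

    Hypothetical helper: assumes delimiter quotes sit next to JSON
    punctuation (ignoring whitespace) and that inner quotes do not.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == '"' and not (i > 0 and text[i - 1] == "\\"):
            before = text[:i].rstrip()[-1:]     # nearest non-space char to the left
            after = text[i + 1:].lstrip()[:1]   # nearest non-space char to the right
            opens = before in ("{", "[", ",", ":")
            closes = after in (":", ",", "}", "]")
            if not (opens or closes):
                out.append('\\"')  # looks like a quote inside a value: escape it
                continue
        out.append(ch)
    return "".join(out)


# The stand-in string from the question, written so that it is valid Python.
broken = (
    '{"Category_A" : "Words typed describing stuff",\n'
    '"Category_B" : "Other words speaking more irrelevant stuff",\n'
    '"Category_X" : "Here is where the "PROBLEM" lies"}'
)

repaired = escape_inner_quotes(broken)
data = json.loads(repaired)
print(data["Category_X"])
```

This is only a patch for text that is already damaged. The slicing makes it quadratic in the worst case, which should be acceptable for a single page but not for bulk processing.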