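I want to use Scrapy to extract information from this website. The information I need is in a JSON block, and that JSON contains unwanted escape characters, but only in the description field. This is an example page, and the JSON element I want to scrape is this: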
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Product",
"description": "Hamster ve Guinea Pig için tasarlanmış temizliği kolay mama kabıdır.
Hamster motifleriyle süslü ve son derece sevimlidir.
Ürün seramikten yapılmıştır
Ürün ölçüleri
Hacim: 100 ml
Çap: 8 cm",
"name": "Karlie Seramik Hamster ve Guinea Pigler İçin Yemlik 100ml 8cm",
"image": "https://www.petlebi.com/up/ecommerce/product/lg_karlie-hamster-mama-kaplari-359657192.jpg",
"brand": {
"@type": "Brand",
"name": "Karlie"
},
"category": "Guinea Pig Yemlikleri",
"sku": "4016598440834",
"gtin13": "4016598440834",
"offers": {
"@type": "Offer",
"availability": "http://schema.org/InStock",
"price": "149.00",
"priceCurrency": "TRY",
"itemCondition": "http://schema.org/NewCondition",
"url": "https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html"
},
"review": [
]
}
</script>
As you can see, there are unwanted characters in the description field. This is the code I tried, but it did not work:
import scrapy
import json
import re
class JsonSpider(scrapy.Spider):
    name = 'json_spider'
    start_urls = ['https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html']
    def parse(self, response):
        # Extract the script content containing the JSON data
        script_content = response.xpath('/html/body/script[12]').get()
        if not script_content:
            self.logger.warning("Script content not found.")
            return
        json_data_match = re.search(r'<script type="application/ld\+json">(.*?)<\/script>', script_content, re.DOTALL)
        if json_data_match:
            json_data_str = json_data_match.group(1)
            try:
                json_obj = json.loads(json_data_str)
                product_info = {
                    "name": json_obj.get("name"),
                    "description": json_obj.get("description"),
                    "image": json_obj.get("image"),
                    "brand": json_obj.get("brand", {}).get("name"),
                    "category": json_obj.get("category"),
                    "sku": json_obj.get("sku"),
                    "price": json_obj.get("offers", {}).get("price"),
                    "url": json_obj.get("offers", {}).get("url")
                }
                self.logger.info("Extracted Product Information: %s", product_info)
                with open('product_info.json', 'w', encoding='utf-8') as json_file:
                    json.dump(product_info, json_file, ensure_ascii=False, indent=2)
            except json.JSONDecodeError as e:
                self.logger.error("Error decoding JSON: %s", e)
    def start_requests(self):
        yield scrapy.Request(
            url='https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html',
            callback=self.parse,
        )
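I want this code to be dynamic, so that it works for every product.
I used https://jsonlint.com/ to find the unwanted characters, and when I remove the escape characters from the description, it reports the JSON as valid. I tried
html.unescape
but it did not work. The code stops working at this line:
json_obj = json.loads(json_data_str)
What should I do?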
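Please post the actual JSON response so that other developers can copy it and take a look. In the meantime, treat the description key separately: use Python's split(" ") method, which produces a list of substrings without the " " separators, then use the .join() method to join the list back together.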
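The split-and-join idea above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the unwanted characters are taken to be raw line breaks inside the description string (as the sample in the question suggests), `SAMPLE_HTML` is a shortened, made-up stand-in for the real page, and `extract_ld_json` and `clean_description` are hypothetical helper names. It uses `split()` with no argument, a slight variation that splits on any run of whitespace, and it passes `strict=False` to `json.loads`, which tells the standard-library decoder to accept control characters inside strings.

```python
import json
import re

# Made-up, shortened stand-in for the page: the description contains a raw
# carriage return + line feed, which strict JSON parsing rejects.
SAMPLE_HTML = '''<script type="application/ld+json">
{
  "@type": "Product",
  "description": "Hacim: 100 ml\r\nÇap: 8 cm",
  "name": "Karlie"
}
</script>'''

# Matches the body of the ld+json script block, wherever it sits in the page.
LD_JSON_RE = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)


def extract_ld_json(html: str) -> dict:
    """Return the first ld+json block in `html` as a dict (hypothetical helper)."""
    match = LD_JSON_RE.search(html)
    if match is None:
        raise ValueError("no ld+json script block found")
    # strict=False allows raw control characters (such as newlines) in strings.
    return json.loads(match.group(1), strict=False)


def clean_description(text: str) -> str:
    """Collapse every run of whitespace, line breaks included, into one space."""
    return " ".join(text.split())


if __name__ == "__main__":
    data = extract_ld_json(SAMPLE_HTML)
    data["description"] = clean_description(data["description"])
    print(json.dumps(data, ensure_ascii=False, indent=2))
```

Inside the spider, the raw JSON text could probably be selected by its type attribute instead of by position, for example with `response.xpath('//script[@type="application/ld+json"]/text()').get()`, which should make the code less dependent on one page's layout. Whether every product page carries exactly one such block is an assumption that would need checking.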