Unwanted line breaks in JSON when web scraping

Question (votes: 0, answers: 1)

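I want to use Scrapy to extract information from this website. The information I need is in a JSON block, and that JSON contains unwanted literal line breaks, but only in the description field.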

Here is an example page; the JSON element I want to scrape is this one:

<script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Product",
            "description": "Hamster ve Guinea Pig için tasarlanmış temizliği kolay mama kabıdır.

Hamster motifleriyle süslü ve son derece sevimlidir.

Ürün seramikten yapılmıştır 

Ürün ölçüleri 


    Hacim: 100 ml
    Çap: 8 cm",
      "name": "Karlie Seramik Hamster ve Guinea Pigler İçin Yemlik 100ml 8cm",
      "image": "https://www.petlebi.com/up/ecommerce/product/lg_karlie-hamster-mama-kaplari-359657192.jpg",
      "brand": {
        "@type": "Brand",
        "name": "Karlie"
      },
      "category": "Guinea Pig Yemlikleri",
      "sku": "4016598440834",
      "gtin13": "4016598440834",
      "offers": {
        "@type": "Offer",
         "availability": "http://schema.org/InStock",
         "price": "149.00",
        "priceCurrency": "TRY",
        "itemCondition": "http://schema.org/NewCondition",
        "url": "https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html"
      },
      "review": [
            ]
    }
    </script>

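As you can see, the description contains literal line breaks, which JSON does not allow inside a string. This is the code I tried, without success: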

import scrapy
import json
import re

class JsonSpider(scrapy.Spider):
    name = 'json_spider'
    start_urls = ['https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html']

    def parse(self, response):
        # Extract the script content containing the JSON data
        script_content = response.xpath('/html/body/script[12]').get()

        if not script_content:
            self.logger.warning("Script content not found.")
            return

        json_data_match = re.search(r'<script type="application/ld\+json">(.*?)<\/script>', script_content, re.DOTALL)
        if json_data_match:
            json_data_str = json_data_match.group(1)
            try:
                json_obj = json.loads(json_data_str)

                product_info = {
                    "name": json_obj.get("name"),
                    "description": json_obj.get("description"),
                    "image": json_obj.get("image"),
                    "brand": json_obj.get("brand", {}).get("name"),
                    "category": json_obj.get("category"),
                    "sku": json_obj.get("sku"),
                    "price": json_obj.get("offers", {}).get("price"),
                    "url": json_obj.get("offers", {}).get("url")
                }

                self.logger.info("Extracted Product Information: %s", product_info)

                with open('product_info.json', 'w', encoding='utf-8') as json_file:
                    json.dump(product_info, json_file, ensure_ascii=False, indent=2)

            except json.JSONDecodeError as e:
                self.logger.error("Error decoding JSON: %s", e)

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html',
            callback=self.parse,
        )

I would like the code to be dynamic, so that it works for every product.

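I used https://jsonlint.com/ to look for the unwanted characters; when I delete the line breaks in the description, it reports the JSON as valid. I tried html.unescape, but it did not help. The code fails at this line:

json_obj = json.loads(json_data_str)

What should I do?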

python json web-scraping scrapy web-crawler
1 Answer
0 votes

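Just remove the offending characters from the response text before converting it to a JSON object. Note that str.replace returns a new string, so assign the result back:

json_data_str = json_data_str.replace("\n", "").replace("\r", "").replace("\t", "")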
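As a minimal sketch of this first approach, the snippet below cleans a string and then parses it. The sample string and the helper name strip_control_chars are made up for illustration; they are not taken from the site. The sketch substitutes a space instead of an empty string, which is a small deviation chosen so that words on adjacent lines do not get glued together.

```python
import json

# Hypothetical sample mimicking the page's JSON-LD: the description
# holds raw line breaks and a tab, which strict JSON parsing rejects.
raw = '{"name": "Bowl", "description": "Line one.\n\nLine two.\r\n\tVolume: 100 ml"}'


def strip_control_chars(text: str) -> str:
    """Replace raw newlines, carriage returns and tabs with a space."""
    for ch in ("\r\n", "\n", "\r", "\t"):
        text = text.replace(ch, " ")
    return text


# Parsing the untouched string fails, which reproduces the problem.
try:
    json.loads(raw)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False

# After cleaning, the same text parses normally.
data = json.loads(strip_control_chars(raw))
print(data["description"])
```

The replacement also touches whitespace between JSON tokens, which is harmless because JSON ignores whitespace there. The original line breaks in the description are lost, though.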

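Alternatively, you can pass the strict parameter to json.loads. With strict=False, control characters such as raw line breaks are allowed inside strings:

json_obj = json.loads(json_data_str, strict=False)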
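To make this work for every product rather than one hard-coded page, one option is to pick the script by its type attribute instead of its position in the page. The following is only a sketch under assumptions: extract_products, LD_JSON_RE and sample_html are hypothetical names, and the sample HTML is a shortened stand-in for a real product page. It uses only the standard library so it can be run without Scrapy.

```python
import json
import re

# Matches the body of any <script type="application/ld+json"> element.
LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_products(html: str) -> list:
    """Return all JSON-LD objects of @type Product found in the HTML."""
    products = []
    for body in LD_JSON_RE.findall(html):
        try:
            # strict=False lets the parser accept raw control characters
            # (such as newlines) inside string values.
            obj = json.loads(body, strict=False)
        except json.JSONDecodeError:
            continue  # skip blocks that are broken in some other way
        if isinstance(obj, dict) and obj.get("@type") == "Product":
            products.append(obj)
    return products


# Hypothetical, shortened stand-in for a product page.
sample_html = """<html><body>
<script type="application/ld+json">
{"@type": "Product", "name": "Bowl", "description": "Line one.

Line two.", "offers": {"price": "149.00"}}
</script>
</body></html>"""

products = extract_products(sample_html)
print(products[0]["description"])
```

Inside a spider's parse method, response.text could be passed to such a helper. Another possibility is to select the script bodies with response.css('script[type="application/ld+json"]::text').getall() and call json.loads(..., strict=False) on each. Unlike stripping characters, this keeps the line breaks in the description.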