Locating a script tag on a page

Question

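I need some help locating a script tag on otto.de, a site I am scraping. I can find the XPath with XPath Helper, but when I use it in my code it returns nothing. Below is the specific script tag, which contains all the information I need: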

<script id="product_data_json" type="application/ld+json">{"@context": "https://schema.org/", "@type": "Product", "gtin13": "4061519667180", "sku": "27443719", "name": "SOCCX Rundhalspullover mit längerer Rückenpartie", "description": "Ein absoluter Wohlfühlpulli ist dieser oversized geschnittene Pullover von SOCCX. Seine angeraute Oberfläche sorgt für eine mega weiche Haptik und der Logo Print mit Glitter für einen Hauch Glamour im Alltags-Outfit. Er ist perfekt für den Moment – schnell mal etwas gemütliches Überziehen!", "image": ["https://i.otto.de/i/otto/4f835f0b-ed8c-5159-a56c-a22acd7980fb/soccx-rundhalspullover-mit-laengerer-rueckenpartie.jpg?$formatz$", "https://i.otto.de/i/otto/733b0bf0-3e1e-5a45-931e-16950bca7653/soccx-rundhalspullover-mit-laengerer-rueckenpartie.jpg?$formatz$"], "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4", "reviewCount": "3"}, "offers": {"@type": "Offer", "url": "https://www.otto.de/p/soccx-rundhalspullover-mit-laengerer-rueckenpartie-1759049837/", "priceCurrency": "EUR", "price": "39.95", "itemCondition": "http://schema.org/NewCondition"}, "brand": {"@type": "Brand", "name": "SOCCX"}}</script>

I tried using XPath:

import json
from bs4 import BeautifulSoup

# Runs inside a Scrapy callback, where `response` is available
script_tag = response.xpath("//script[@id='product_data_json']/text()").extract_first()

if script_tag:
    print(script_tag)
    soup = BeautifulSoup(script_tag, 'html.parser')

    # Parse the JSON
    product_data = json.loads(soup.string)

    # Extract the product URL
    product_url = product_data['offers']['url']
    print("Product URL:", product_url)
else:
    print("Script tag not found or JSON data missing.")

But this returns nothing. I also tried script_tag = response.css('script#product_data_json::text').get(), but that did not work either.

Can anyone explain what is going on here? I am new to Scrapy, and this is my first big project.

Answer 1

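I checked the page with JavaScript disabled, and the script tag is still there, so the content is not loaded dynamically. That suggests the site may be blocking Scrapy's default user agent. Try setting a browser-like user agent on the spider: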

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://otto.de']
    # Override the default Scrapy user agent for this spider only
    custom_settings = {'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'}

    def parse(self, response):
        ...
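
Once the request gets through, the text of the script tag is plain JSON, so it can go straight to json.loads without BeautifulSoup. Below is a minimal sketch of that step, assuming the tag keeps the structure shown in the question. The helper name extract_offer_url is hypothetical, and the sample string is a shortened copy of the JSON from the question, not a live response.

```python
import json


def extract_offer_url(script_text):
    """Return offers.url from a JSON-LD product blob, or None if unavailable.

    `extract_offer_url` is a hypothetical helper, not part of Scrapy.
    """
    if not script_text:
        # The selector matched nothing (for example, the request was blocked)
        return None
    try:
        product_data = json.loads(script_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(product_data, dict):
        return None
    offers = product_data.get("offers")
    if not isinstance(offers, dict):
        return None
    return offers.get("url")


# Inside a Scrapy callback this might be called as follows (not run here):
#     script_text = response.xpath("//script[@id='product_data_json']/text()").get()
#     product_url = extract_offer_url(script_text)

# Shortened sample based on the script tag in the question
sample = (
    '{"@type": "Product", "sku": "27443719", '
    '"offers": {"@type": "Offer", '
    '"url": "https://www.otto.de/p/soccx-rundhalspullover-mit-laengerer-rueckenpartie-1759049837/", '
    '"priceCurrency": "EUR", "price": "39.95"}}'
)
print("Product URL:", extract_offer_url(sample))
```

Returning None instead of raising keeps the callback from crashing on pages where the tag is missing or the JSON has a different shape.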
