Scrapy referer does not return a readable URL

Question · Votes: 0 · Answers: 3

While crawling a website, I want to get the referers of requests that end in a 404.

def parse_item(self, response):
    if response.status == 404:
        # Do something with this:
        referer = response.request.headers.get('Referer', None)

This works, but the returned referer always looks something like:

\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c

This looks more like a memory address than a URL. Am I missing something here?

Thank you!

Bruno

python-3.x scrapy
3 Answers
1 vote

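A leading \x escape sequence means that the next two characters are interpreted as the hex digits of a character code. (See "What does a leading \x mean in a Python string \xaa".)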

\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c

In this case there is only one \x, but what follows it is still a hex string. You can decode it to get the URL. XD

>>> # The leading \x must be removed from the string first
>>> hex_str = '68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c'
>>> bytes.fromhex(hex_str)
b'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
>>> bytes.fromhex(hex_str).decode('utf-8')
'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
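The decoding step above can be wrapped in a small helper so the leading \x does not have to be removed by hand. This is only a sketch: `decode_hex_url` is a hypothetical name, the sample value is a made-up example.com URL, and it assumes the value is UTF-8 text with at most one leading \x.

```python
def decode_hex_url(value: str) -> str:
    """Decode a hex-encoded URL, tolerating one leading backslash-x prefix."""
    # str.removeprefix needs Python 3.9+; it is a no-op if the prefix is absent.
    hex_digits = value.removeprefix("\\x")
    return bytes.fromhex(hex_digits).decode("utf-8")


# Hypothetical sample value: the hex encoding of "https://example.com/".
encoded = r"\x68747470733a2f2f6578616d706c652e636f6d2f"
print(decode_hex_url(encoded))
```

`bytes.fromhex` raises `ValueError` if the remaining text is not valid hex, so a value that is not in this format fails loudly instead of producing a wrong URL.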

1 vote

Thanks, 彦慧. You unblocked me:

It was simpler than I thought:

def parse_item(self, response):
    if response.status == 404:
        # Header values are bytes; decode them to get a readable URL.
        referer = response.request.headers.get('Referer')
        if referer is not None:
            referer = referer.decode('utf-8')
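Scrapy stores header values as bytes, and the first request of a crawl usually has no Referer at all, so calling `.decode()` directly on the result of `.get()` can fail on `None`. A minimal sketch of a guard, where `decode_referer` is a hypothetical helper name and UTF-8 is an assumed encoding:

```python
from typing import Optional


def decode_referer(raw: Optional[bytes]) -> Optional[str]:
    """Return the Referer header as text, or None when the header is absent."""
    if raw is None:
        return None
    return raw.decode("utf-8")


# In a spider callback this could be called as (not run here):
#     referer = decode_referer(response.request.headers.get("Referer"))
print(decode_referer(b"https://example.com/start"))
print(decode_referer(None))
```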

0 votes

Scrapy has a function, referer_str, that handles this for logging. You can use it in your case as well.


MRE

from collections.abc import Iterator

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.utils.request import referer_str


class ToySpider(CrawlSpider):
    name: str = "toy"
    start_urls: list[str] = ["https://quotes.toscrape.com/"]
    # Let responses with status 308 reach the callback. See the link below for details.
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html#module-scrapy.spidermiddlewares.httperror
    handle_httpstatus_list = [308]

    rules: list[Rule] = [
        Rule(
            link_extractor=LinkExtractor(),
            callback="parse_item",
            # Don't follow links after those on the ``start_urls``.
            # This keeps the example small.
            follow=False,
        )
    ]

    @staticmethod
    def parse_item(response: HtmlResponse) -> Iterator[dict[str, str | int | None]]:
        """Return the referer and requested URLs."""
        referer = referer_str(request=response.request)
        if response.status == 308:
            yield {
                "referer": referer,
                "response_url": response.url,
                "status": response.status,
            }

Example output:

{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Thomas-A-Edison', 'status': 308}
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Eleanor-Roosevelt', 'status': 308}
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Steve-Martin', 'status': 308}
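For readers without Scrapy at hand, the behaviour can be approximated with the standard library alone. This sketch assumes, based on my reading of the Scrapy source, that referer_str returns None when the header is missing and otherwise decodes the bytes while replacing undecodable ones; `referer_for_logging` is a hypothetical name, and the installed Scrapy version should be checked before relying on this.

```python
from typing import Optional


def referer_for_logging(raw: Optional[bytes]) -> Optional[str]:
    """Approximate referer_str: decode for display without ever raising."""
    if raw is None:
        return None
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    # UnicodeDecodeError, which suits values that are only going to be logged.
    return raw.decode("utf-8", errors="replace")


print(referer_for_logging(b"https://quotes.toscrape.com/"))
print(referer_for_logging(b"https://example.com/\xff"))
```

Compared with a strict `.decode('utf-8')`, this never fails on a malformed header, at the cost of possibly returning a URL that contains replacement characters.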

For an example of how referer_str is used, see the Scrapy source code.
