When crawling a site, I want to capture the referer of requests that end in a 404.
def parse_item(self, response):
    if response.status == 404:
        referer = response.request.headers.get('Referer', None)
        # do something with the referer
This works, but the returned referer always looks like:
\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c
That looks more like a memory address than a URL. Am I missing something here?
Thank you!
Bruno
A leading \x escape sequence means that the next two characters are interpreted as the hexadecimal digits of a character code. (That is what a leading \x, as in \xaa, means in a Python string.)
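To see the escape in action, here is a minimal standalone sketch (plain Python, unrelated to Scrapy):

```python
# Each \xNN escape stands for the single character with hex code NN,
# so four escapes spell out a four-character string.
s = '\x68\x74\x74\x70'
print(s)  # prints: http
```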
\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c
has only a single \x in this case, but everything after it is still a hex string.
You can decode it and get the URL back. XD
>>> # the leading \x needs to be removed from the string first
>>> hex_str = '68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c'
>>> bytes.fromhex(hex_str)
b'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
>>> bytes.fromhex(hex_str).decode('utf-8')
'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
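For completeness, the reverse direction shows how such a hex string arises in the first place (using a hypothetical URL, not the one from the question):

```python
# Encoding a URL to UTF-8 bytes and then to hex produces the kind of
# string seen above; decoding round-trips back to the original URL.
url = 'https://example.com/page.html'
hex_form = url.encode('utf-8').hex()
print(hex_form[:16])  # prints: 68747470733a2f2f  ('https://')
assert bytes.fromhex(hex_form).decode('utf-8') == url
```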
Thanks, Yanhui, that unblocked me.
It was simpler than I imagined:
def parse_item(self, response):
    if response.status == 404:
        # b'' as the default avoids an AttributeError when the header is missing
        referer = response.request.headers.get('Referer', b'').decode('utf-8')
        # do something with the referer
Scrapy also ships a helper, referer_str (in scrapy.utils.request), that handles this for logging. You can use it in your case as well. An example:
# Python 3.11.7
# Scrapy 2.11.1
from collections.abc import Iterator

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.utils.request import referer_str


class ToySpider(CrawlSpider):
    name: str = "toy"
    start_urls: list[str] = ["https://quotes.toscrape.com/"]
    # Also act on responses with status 308. See:
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html#module-scrapy.spidermiddlewares.httperror
    handle_httpstatus_list = [308]
    rules: list[Rule] = [
        Rule(
            link_extractor=LinkExtractor(),
            callback="parse_item",
            # Don't follow links beyond those on the ``start_urls`` page.
            # This keeps the example small.
            follow=False,
        )
    ]

    @staticmethod
    def parse_item(response: HtmlResponse) -> Iterator[dict[str, str | int]]:
        """Yield the referer and requested URLs."""
        # referer_str decodes the Referer header to a str for us.
        referer = referer_str(request=response.request)
        if response.status == 308:
            yield {
                "referer": referer,
                "response_url": response.url,
                "status": response.status,
            }
Example output:
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Thomas-A-Edison', 'status': 308}
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Eleanor-Roosevelt', 'status': 308}
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Steve-Martin', 'status': 308}