I am trying to scrape this site, but when I run my scraper the site blocks me with a 405 (sometimes 403) HTTP status code, as you can see in my spider log:
...
2023-12-06 10:18:38 [scrapy.core.engine] DEBUG: Crawled (405) <GET https://www.fotocasa.es/es/comprar/vivienda/avinyonet-de-puigventos/calefaccion-parking-piscina-television/179060067/d?from=pl> (referer: https://www.fotocasa.es/es/comprar/viviendas/particulares/espana/todas-las-zonas/pl) ['playwright']
2023-12-06 10:18:38 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.fotocasa.es/es/comprar/vivienda/avinyonet-de-puigventos/calefaccion-parking-piscina-television/179060067/d?from=pl>: HTTP status code is not handled or not allowed
...
I have tried using random headers with scrapy-user-agents (with its default user-agent list), and I have tried passing a proxy to Playwright through the
PLAYWRIGHT_LAUNCH_OPTIONS
setting, but the site still detects me as a bot.
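One thing worth noting: scrapy-user-agents only rotates the `User-Agent` HTTP header on Scrapy's side, while requests routed through Playwright are sent by the browser itself, so `navigator.userAgent` still reports headless Chromium. A header that disagrees with the browser fingerprint is itself a bot signal. A sketch (assuming scrapy-playwright) that instead pins a single realistic identity at the browser-context level via the `PLAYWRIGHT_CONTEXTS` setting; the UA string, locale, and viewport below are illustrative values, not recommendations:

```python
# settings.py -- sketch, assuming scrapy-playwright is installed.
# One consistent user agent applied when the browser context is created,
# so the request header and navigator.userAgent agree.
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "locale": "es-ES",
        "viewport": {"width": 1366, "height": 768},
    }
}
```

The keys inside each context entry are passed straight to Playwright's `browser.new_context()`, so anything that method accepts can go here.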
Here is my spider, which you can use to reproduce the problem:
from datetime import datetime
from typing import Generator

from playwright.async_api import Page
from scrapy import Request, Spider
from scrapy.http import HtmlResponse


class FotocasaSpider(Spider):
    name = "fotocasa"
    allowed_domains = ["www.fotocasa.es"]
    base_url = 'https://www.fotocasa.es'
    start_urls = [
        base_url + '/es/comprar/viviendas/particulares/espana/todas-las-zonas/pl'
    ]
    operations = ('alquiler', 'comprar')
    categories = (
        'edificios', 'garajes', 'locales', 'oficinas', 'terrenos',
        'trasteros', 'viviendas'
    )
    stored_posts = []

    def start_requests(self) -> Generator[Request, None, None]:
        yield Request(
            url=self.start_urls[0], meta={
                'playwright': True,
                'playwright_include_page': True
            }
        )

    async def parse(self, response: HtmlResponse) -> Request | None:
        page: Page = response.meta['playwright_page']
        for _ in range(20):
            await page.mouse.wheel(0, 500)
            # time.sleep() would block the whole Twisted reactor;
            # wait on the page instead.
            await page.wait_for_timeout(5000)
        # str(datetime.now()) contains ':' which is invalid in Windows
        # filenames, so use an explicit format.
        await page.screenshot(path=f'{datetime.now():%Y%m%d-%H%M%S}.png')
        await page.close()
        has_recommended = response.css('div.re-SearchResult-adjacentsTitle')
        if has_recommended:
            posts_xpath = '//div[@class="re-SearchResult-adjacentsTitle"]' \
                          '/preceding-sibling::article'
        else:
            posts_xpath = '//article'
        posts = response.xpath(posts_xpath)
        post_count = len(posts)
        self.log(f'Found {post_count} posts')
        if post_count <= 0:
            self.log(
                f'Could not get any posts from HTML code: {response.body}'
            )
            return None
        post_links = [post.css('a::attr(href)').get() for post in posts]
        url: str = ''
        # Guard against popping from an empty list when every link
        # has already been stored.
        while post_links:
            post_link = post_links.pop(0)
            if post_link not in self.stored_posts:
                url = f'{self.base_url}{post_link}'
                break
        if not url:
            self.log('All posts on this page were already stored')
            return None
        self.log(f'Going to first post: {url}')
        return Request(
            url=url, callback=self.parse_data, meta={'playwright': True},
            cb_kwargs={'post_links': post_links}
        )

    def parse_data(
        self, response: HtmlResponse, *,
        post_links: list[str]
    ) -> Generator[dict | Request, None, None]:
        data = {}
        data['zone'] = response.css('h2.re-DetailMap-address::text').get()
        data['category'] = (
            response
            .xpath('//div[@class="re-DetailFeaturesList-feature"][1]//p[2]/text()')
            .get()
        )
        data['title'] = response.css('h1.re-DetailHeader-propertyTitle::text').get()
        data['price'] = response.css('span.re-DetailHeader-price::text').get()
        data['meters'] = (
            response
            .xpath('//ul[@class="re-DetailHeader-features"]/li[3]/span[2]/span/text()')
            .get()
        )
        data['bathrooms'] = (
            response
            .xpath('//ul[@class="re-DetailHeader-features"]/li[2]/span[2]/span/text()')
            .get()
        )
        data['rooms'] = (
            response
            .xpath('//ul[@class="re-DetailHeader-features"]/li[1]/span[2]/span/text()')
            .get()
        )
        yield data
        if len(post_links) > 0:
            next_post = f'{self.base_url}{post_links.pop(0)}'
            self.log(f'Going to next post: {next_post}')
            yield Request(
                url=next_post, callback=self.parse_data, meta={'playwright': True},
                cb_kwargs={'post_links': post_links}
            )
These are my Scrapy settings:
BOT_NAME = "Fotocasa"

SPIDER_MODULES = ["Fotocasa.spiders"]
NEWSPIDER_MODULE = "Fotocasa.spiders"

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 4

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,
    'timeout': 90000,
    'args': ['--disable-gpu'],
    'proxy': {
        'server': 'http://rp.proxyscrape.com:6060',
        'username': 'f0us73qx0z3gnni-country-es',
        'password': '2dgdpl3zgp8jqax',
    }
}
PLAYWRIGHT_MAX_CONTEXTS = 2
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 120 * 1000

LOG_FILE = 'logs/fotocasa.log'
LOG_FILE_APPEND = False
CONCURRENT_REQUESTS = 1
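Also worth knowing when debugging blocks: as the log shows, Scrapy's HttpError middleware drops the 403/405 responses before any callback sees them, so you never get to inspect what the block page actually contains. A sketch using standard Scrapy settings (the exact code lists and retry count are illustrative choices):

```python
# settings.py -- sketch; lets blocked responses reach the spider
# callbacks for inspection instead of being silently ignored.
HTTPERROR_ALLOWED_CODES = [403, 405]

# Optionally retry those statuses a few times as well.
RETRY_HTTP_CODES = [403, 405, 429, 500, 502, 503, 504]
RETRY_TIMES = 3

# scrapy-playwright can also drive Firefox, whose fingerprint differs
# from headless Chromium and is sometimes blocked less aggressively.
PLAYWRIGHT_BROWSER_TYPE = "firefox"
```

Dumping `response.body` for an allowed 403/405 usually reveals whether you are hitting a JavaScript challenge page or a plain block, which changes what countermeasure is worth trying next.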
As you can see, I am trying to get the listing data from this site, but it keeps blocking me and I am out of ideas.
I'm Thibeau Maerevoet from ProxyScrape. I was searching the internet and noticed your post. Unfortunately I can't answer your question right away, but since I noticed you shared your proxy credentials here, I have temporarily closed your residential proxy account.
Please stay safe.
Kind regards, Thibeau M.