Blocked while scraping a website with Scrapy Playwright


I am trying to scrape this website, but when I run my scraper the site blocks me by returning HTTP 405 (and sometimes 403) status codes, as you can see in my spider log:

...
2023-12-06 10:18:38 [scrapy.core.engine] DEBUG: Crawled (405) <GET https://www.fotocasa.es/es/comprar/vivienda/avinyonet-de-puigventos/calefaccion-parking-piscina-television/179060067/d?from=pl> (referer: https://www.fotocasa.es/es/comprar/viviendas/particulares/espana/todas-las-zonas/pl) ['playwright']
2023-12-06 10:18:38 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.fotocasa.es/es/comprar/vivienda/avinyonet-de-puigventos/calefaccion-parking-piscina-television/179060067/d?from=pl>: HTTP status code is not handled or not allowed
...
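As a debugging note: the second log line is Scrapy's HttpErrorMiddleware silently dropping the 405 response before it reaches any callback. To see what the block page actually contains, the blocked codes can be whitelisted with the standard handle_httpstatus_list attribute. A throwaway sketch (the spider name and logging are just for illustration):

from scrapy import Spider


class BlockDebugSpider(Spider):
    """Throwaway spider to inspect what the anti-bot block page returns."""
    name = 'fotocasa_block_debug'
    start_urls = [
        'https://www.fotocasa.es/es/comprar/viviendas/particulares'
        '/espana/todas-las-zonas/pl'
    ]
    # Without this, HttpErrorMiddleware drops 403/405 before parse() runs.
    handle_httpstatus_list = [403, 405]

    def parse(self, response):
        if response.status in (403, 405):
            # Anti-bot vendors usually leave a recognizable marker (for
            # example a challenge <script>) near the top of the HTML.
            self.logger.warning(
                'Blocked (%s): %r', response.status, response.text[:300]
            )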

I have tried using random headers via scrapy-user-agents (with its default user-agent list), and passing a proxy to Playwright through the PLAYWRIGHT_LAUNCH_OPTIONS setting, but the site still detects me as a bot.
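One mismatch I have not ruled out: scrapy-user-agents only randomizes the User-Agent header, while the Playwright browser keeps reporting its own navigator.userAgent to page scripts, and anti-bot systems can compare the two. A minimal sketch that pins both to the same string instead (with RandomUserAgentMiddleware disabled; the UA string below is just an example):

# settings.py sketch: pin one realistic user agent and make the browser
# context report the same string, instead of randomizing only the header.
UA = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
)

USER_AGENT = UA  # header sent on every Scrapy request

PLAYWRIGHT_CONTEXTS = {
    # These kwargs are forwarded to browser.new_context(), so pages in
    # this context expose the same navigator.userAgent as the header.
    'default': {'user_agent': UA},
}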

Here is my spider, which you can use to reproduce my problem:

from datetime import datetime
from typing import Generator

from playwright.async_api import Page
from scrapy import Request, Spider
from scrapy.http import HtmlResponse


class FotocasaSpider(Spider):
    name = "fotocasa"
    allowed_domains = ["www.fotocasa.es"]
    base_url = 'https://www.fotocasa.es'
    start_urls = [
        base_url+'/es/comprar/viviendas/particulares/espana/todas-las-zonas/pl'
    ]

    operations = ('alquiler', 'comprar')
    categories = (
        'edificios', 'garajes', 'locales', 'oficinas', 'terrenos',
        'trasteros', 'viviendas'
    )
    stored_posts = []

    def start_requests(self) -> Generator[Request, None, None]:
        yield Request(
            url=self.start_urls[0], meta={
                'playwright': True,
                'playwright_include_page': True
            }
        )

    async def parse(self, response: HtmlResponse) -> Request | None:
        page: Page = response.meta['playwright_page']
        # Scroll step by step so lazy-loaded listings render; wait_for_timeout
        # yields to the event loop instead of blocking it like time.sleep did.
        for _ in range(20):
            await page.mouse.wheel(0, 500)
            await page.wait_for_timeout(5000)
        # Colon-free timestamp so the filename is also valid on Windows.
        await page.screenshot(path=f'{datetime.now():%Y%m%d_%H%M%S}.png')
        await page.close()

        has_recommended = response.css('div.re-SearchResult-adjacentsTitle')
        if has_recommended:
            posts_xpath = '//div[@class="re-SearchResult-adjacentsTitle"]' \
                        '/preceding-sibling::article'
        else:
            posts_xpath = '//article'

        posts = response.xpath(posts_xpath)
        post_count = len(posts)

        self.log(f'Found {post_count} posts')

        if post_count <= 0:
            self.log(
                f'Could not get any posts from HTML code: {response.body}'
            )

            return None

        post_links = [post.css('a::attr(href)').get() for post in posts]

        # Take the first link not seen before; stop instead of letting
        # pop() raise IndexError once the list runs out of new links.
        url: str = ''
        while post_links:
            post_link = post_links.pop(0)

            if post_link not in self.stored_posts:
                url = f'{self.base_url}{post_link}'
                break

        if not url:
            self.log('No unseen posts on this page')
            return None

        self.log(f'Going to first post: {url}')

        return Request(
            url=url, callback=self.parse_data, meta={'playwright': True},
            cb_kwargs={'post_links': post_links}
        )

    def parse_data(
            self, response: HtmlResponse, *,
            post_links: list[str]
            ) -> Generator[dict | Request, None, None]:
        data = {}

        data['zone'] = response.css('h2.re-DetailMap-address::text').get()
        data['category'] = (
            response
            .xpath('//div[@class="re-DetailFeaturesList-feature"][1]//p[2]/text()')
            .get()
        )
        data['title'] = response.css('h1.re-DetailHeader-propertyTitle::text').get()
        data['price'] = response.css('span.re-DetailHeader-price::text').get()
        data['meters'] = (
            response
            .xpath('//ul[@class="re-DetailHeader-features"]/li[3]/span[2]/span/text()')
            .get()
        )
        data['bathrooms'] = (
            response
            .xpath('//ul[@class="re-DetailHeader-features"]/li[2]/span[2]/span/text()')
            .get()
        )
        data['rooms'] = (
            response
            .xpath('//ul[@class="re-DetailHeader-features"]/li[1]/span[2]/span/text()')
            .get()
        )

        yield data

        if len(post_links) > 0:
            next_post = f'{self.base_url}{post_links.pop(0)}'
            self.log(f'Going to next post: {next_post}')
            yield Request(
                url=next_post, callback=self.parse_data, meta={'playwright': True},
                cb_kwargs={'post_links': post_links}
            )

These are my Scrapy settings:

BOT_NAME = "Fotocasa"
SPIDER_MODULES = ["Fotocasa.spiders"]
NEWSPIDER_MODULE = "Fotocasa.spiders"
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 4
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,
    'timeout': 90000,
    'args': ['--disable-gpu'],
    'proxy': {
        'server': 'http://rp.proxyscrape.com:6060',
        'username': 'f0us73qx0z3gnni-country-es',
        'password': '2dgdpl3zgp8jqax',
    }
}
PLAYWRIGHT_MAX_CONTEXTS = 2
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 120 * 1000
LOG_FILE = 'logs/fotocasa.log'
LOG_FILE_APPEND = False
CONCURRENT_REQUESTS = 1
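On top of that, since the proxy endpoint rotates exit IPs, I also considered retrying the blocked status codes a few times. These are standard Scrapy settings; whether retrying actually helps here is untested:

# settings.py sketch: retry blocked responses, on the theory that the
# rotating proxy endpoint hands out a different exit IP on each attempt.
RETRY_ENABLED = True
RETRY_TIMES = 3
# Common transient codes plus the 403/405 seen in the log above.
RETRY_HTTP_CODES = [403, 405, 408, 429, 500, 502, 503, 504]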

As you can see, I am trying to get the post data from this website, but it keeps blocking me and I am out of ideas.

web-scraping scrapy playwright python-3.10
1 Answer

I'm Thibeau Maerevoet from ProxyScrape. I was browsing the internet and noticed your post. Unfortunately I can't answer your question right away, but since I noticed you shared your proxy credentials here, I have temporarily disabled your residential proxy account.

Please stay safe.

Kind regards, Thibeau M.
