Scrapy FormRequest for complex payloads


On a website listing lawyers' registration details, I am trying to scrape information with this 4-step flow, in which I need to make two FormRequests:

  1. Open the link with the search box where a lawyer's name is submitted (image 1) ("ali" is passed as the name query).
  2. Send the search request with a FormRequest carrying the payload, which leads to the page of matching lawyers (image 2).
  3. "Click" each lawyer's magnifying-glass button in turn, via a FormRequest, to reach the page with that lawyer's details (image 3) (this is where the error occurs).
  4. Parse each lawyer's data points from the page in image 3.

Problem: my first FormRequest works fine and I can reach the list of lawyers. Then I run into two problems:

  1. Problem 1: my for loop only handles the first lawyer found.
  2. Problem 2: the second FormRequest does not work.

My observation: inspecting the payload that the second FormRequest needs for any requested lawyer, the site adds the values of all listed lawyers to the payload in bulk, together with the index number of the requested lawyer.

Do I really have to pass every value with each request? How do I send the correct payload? In my code I tried sending only the specific lawyer's value and index as the payload, but it does not work. What kind of code should I use to get the details of all the lawyers in the list?

import scrapy
import json
import logging
from scrapy.utils.response import open_in_browser


class BarolevSpider(scrapy.Spider):
    name = 'barolev'
    allowed_domains = ['www.barobirlik.org.tr']
    start_urls = ['https://www.barobirlik.org.tr/levhaiframe/342']

    headers = {
            'authority': 'www.barobirlik.org.tr',
            'path': '/Islemler/LevhaIFrame',
            'scheme': 'https',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.9',
            'cache-control': 'max-age=0',
            'content-type': 'application/x-www-form-urlencoded',
            'origin': 'https://www.barobirlik.org.tr',
            'referer': 'https://www.barobirlik.org.tr/levhaiframe/342',
            'sec-fetch-dest': 'iframe',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
                }
    url = 'https://www.barobirlik.org.tr/Islemler/LevhaIFrame'

    def parse(self, response):

        #search inquiry body
        body = {
            'ad': 'ali',
            'soyad': '',
            'baroId': '342',
            'tip': 'Avukat',
                }
        logging.log(logging.INFO, "Search page opened")
        search = scrapy.FormRequest(self.url, callback = self.list_parser, method = "POST", headers=self.headers, formdata = body)

        yield search

    def list_parser(self, response):

        #open_in_browser(response)
        logging.log(logging.INFO, "Lawyers list's page opened")

        data = response.xpath("//div/table/tbody")
        lawyer_values = response.xpath("//input[@value]/text()")

        for counter, lawyer in enumerate(data):
            lic_number= response.xpath("//tbody/tr/td[3]/text()").get() #lawyer's licence number
            name = response.xpath(".//tbody/tr/td[4]/text()").get()
            surname = response.xpath("//tbody/tr/td[5]/text()").get()
            value = response.xpath(".//tbody/tr/td/input/@value").get()
            payload = {
                        counter : value,
                        'detay' : counter
                        }


            yield scrapy.FormRequest(self.url, callback= self.lawyer_parser, method = "POST", headers =self.headers, formdata = payload )




    def lawyer_parser(self, response):
        open_in_browser(response)
        pass

Log output:

2021-12-27 02:19:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: baro2)
2021-12-27 02:19:23 [scrapy.utils.log] INFO: Versions: lxml 4.6.4.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.9 (main, Nov 16 2021, 10:24:31) - [GCC 11.2.0], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Linux-5.14.0-kali4-amd64-x86_64-with-glibc2.32
2021-12-27 02:19:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-12-27 02:19:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'baro2',
 'NEWSPIDER_MODULE': 'baro2.spiders',
 'SPIDER_MODULES': ['baro2.spiders']}
2021-12-27 02:19:23 [scrapy.extensions.telnet] INFO: Telnet Password: 05273ea1c26c6378
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-27 02:19:23 [scrapy.core.engine] INFO: Spider opened
2021-12-27 02:19:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-27 02:19:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-27 02:19:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barobirlik.org.tr/levhaiframe/342> (referer: None)
2021-12-27 02:19:24 [root] INFO: Search page opened
2021-12-27 02:19:26 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.barobirlik.org.tr/levhaiframe/342> (referer: https://www.barobirlik.org.tr/levhaiframe/342)
2021-12-27 02:19:26 [root] INFO: Lawyers list's page opened
2021-12-27 02:19:26 [scrapy.core.scraper] ERROR: Spider error processing <POST https://www.barobirlik.org.tr/levhaiframe/342> (referer: https://www.barobirlik.org.tr/levhaiframe/342)
Traceback (most recent call last):
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/scraping/scrapyyy/baro2/baro2/spiders/barolev.py", line 61, in list_parser
    yield scrapy.FormRequest(self.url, callback= self.lawyer_parser, method = "POST", headers =self.headers, formdata = payload )
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 106, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got int
2021-12-27 02:19:26 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-27 02:19:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 943,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 8193,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.788149,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 26, 22, 19, 26, 603568),
 'httpcompression/response_bytes': 77405,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 12,
 'memusage/max': 58245120,
 'memusage/startup': 58245120,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2021, 12, 26, 22, 19, 23, 815419)}
2021-12-27 02:19:26 [scrapy.core.engine] INFO: Spider closed (finished)

Thanks in advance.

python scrapy payload scrapy-request

2 Answers

The site uses some kind of protection, so this code works only some of the time; once you are detected, you will have to wait a while until their anti-bot check clears, or use proxies.
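To make detection less likely, it may also help to throttle the spider. A minimal sketch using Scrapy's standard settings (the values are illustrative, not tuned for this site):

class BarolevSpider(scrapy.Spider):
    name = 'barolev'
    # Slow, polite crawling to reduce the chance of tripping the anti-bot check.
    custom_settings = {
        'DOWNLOAD_DELAY': 2,           # seconds to wait between requests
        'AUTOTHROTTLE_ENABLED': True,  # back off automatically when responses slow down
        'CONCURRENT_REQUESTS': 1,      # one request in flight at a time
    }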

Import this:

from re import sub

The main code:

def list_parser(self, response):
    logging.log(logging.INFO, "Lawyers list's page opened")

    # Collect one {row-counter: profile-id} pair per result row.
    post_dict = []
    for x in response.xpath("//div[@class='table-responsive']/table//tbody/tr"):
        counter_val = x.xpath("./td[1]/button[@name='detay']/@value").get()
        profile_id = x.xpath(f"./td[1]/input[@name='{counter_val}']/@value").get()
        post_dict.append({counter_val: profile_id})

    # The site expects every row's value in the body, so build the full
    # urlencoded string once...
    post_str = "detay=1"
    for x in post_dict:
        _k = list(x.keys())[0]
        _v = list(x.values())[0]
        post_str += f"&{_k}={_v}"

    # ...then swap in the index of the lawyer whose details we want.
    counter = 0
    for x in post_dict:
        counter += 1
        new_post_str = sub(r"detay=(\d+)", f"detay={counter}", post_str)
        yield scrapy.Request(self.url, callback=self.lawyer_parser, method="POST", headers=self.headers, body=new_post_str)

def lawyer_parser(self, response):
    title = response.xpath("//title").get()
    print(title)
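As a side note, the same payload could be sent with FormRequest instead of a hand-built body string, provided every key and value in formdata is a str (the question's traceback comes from passing an int key). A minimal sketch under that assumption, reusing the table XPaths from above:

def list_parser(self, response):
    # Map each row counter to its profile id, keeping everything as str.
    rows = {}
    for tr in response.xpath("//div[@class='table-responsive']/table//tbody/tr"):
        counter_val = tr.xpath("./td[1]/button[@name='detay']/@value").get()
        profile_id = tr.xpath(f"./td[1]/input[@name='{counter_val}']/@value").get()
        rows[counter_val] = profile_id

    for counter_val in rows:
        # Send all rows plus the index of the lawyer we want; FormRequest
        # urlencodes the dict (and defaults to POST when formdata is given).
        payload = dict(rows)
        payload['detay'] = counter_val
        yield scrapy.FormRequest(self.url, callback=self.lawyer_parser,
                                 headers=self.headers, formdata=payload)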

This is how it works in Postman:

POST https://www.barobirlik.org.tr/Islemler/LevhaIFrame
content-type: application/x-www-form-urlencoded
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

detay=1&1=22300955972&2=34294103512&3=47650810900&4=31457011912&5=21989082740&6=13031762182&7=15098226362&8=29465540500&9=19705078726&10=52327454912&11=38077312146&12=15053874294&13=43504737980&14=15757111312&15=49528751810&16=22454512274&17=20735008140&18=14759958840&19=25331632218&20=23717241030&21=26188728710&22=22144254712&23=40831499468&24=45721682982&25=17524950834&26=20443640946&27=43375067858&28=25958522862&29=55642518444&30=15958205476&31=67798055186&32=53125619114&33=19700638734&34=47962570370&35=61324031392&36=62770087018&37=42020121822&38=51916778712&39=63445269098&40=20968442094&41=37150116442&42=19825577734&43=34690100306&44=61117020312&45=47005726450&46=32674457828&47=28318970648&48=46675921516&49=43891755192&50=63547258662&51=14786321440&52=49882225538&53=15026163486&54=71761039790&55=19769136608&56=13592195252&57=26261321722&58=67759140784&59=15443072422&60=14358051782&61=35387076706&62=31600743554&63=29132448610&64=50134148826&65=13741738194&66=46951136320&67=48070816756&68=40531941366&69=25712046412&70=50041757234&71=34792727562&72=31373015384&73=19936276632&74=11726671214&75=50752736252&76=10825141642&77=41269955922&78=22703604262&79=16439033762&80=38872658142&81=22313706936&82=20122397994&83=22807774748&84=25522741996&85=23408214992&86=12125473522&87=34900898696&88=51511402166&89=57145071426&90=27010985158&91=43045344210&92=15517773668&93=27065110338


I no longer use FormRequest; instead, I switched to a bare-bones Request:
import json
from scrapy import Request

headers = {
    # some headers
}
payload = json.dumps(formdata)  # formdata: your dict of form fields
yield Request(
    url,
    headers=headers,
    method='POST',
    body=payload,
)
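For a form-encoded endpoint like the one in this question (content-type application/x-www-form-urlencoded rather than JSON), the body would be built with urllib.parse.urlencode instead of json.dumps. A minimal sketch; parse_detail is a hypothetical callback:

from urllib.parse import urlencode
from scrapy import Request

body = urlencode({'ad': 'ali', 'soyad': '', 'baroId': '342', 'tip': 'Avukat'})
yield Request(
    url,
    headers={'content-type': 'application/x-www-form-urlencoded'},
    method='POST',
    body=body,
    callback=self.parse_detail,  # hypothetical callback
)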