Extracting <li> and <ul> with Scrapy

Question (votes: 0, answers: 2)

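I'm new to Scrapy, and I'm having trouble building accurate selectors based on the Scrapy tutorial code. Basically, I'm trying to list every business along with its address and website. But when I try to list them, only one result comes back. If I switch every field to getall I do get all of them, but they are just dumped together in no particular order. I need them in this format:

{"address": "mazowieckie, Warszawa", "name": "Dom Development S.A.", "link": "domd.pl"}

Here is the code I'm using: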

import scrapy


class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1',
    ]

    def parse(self, response):
        for quote in response.css('ul.rp-1qtpzi4'):
            yield {
                'address': quote.css('address.rp-o9b83y::text').get(),
                'name': quote.css('h2.rp-69f2r4::text').get(),
                'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
            }

Thanks in advance.

python-3.x scrapy
2 Answers
0
votes

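You only get one result because the element selection (locator strategy)

ul.rp-1qtpzi4
is incorrect, which means it does not select all of the list items on the page. Use a correct selector instead, for example
.rp-y89gny.eboilu01 ul li
which selects all 24 items.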

import scrapy
from scrapy.crawler import CrawlerProcess

class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1']

    def parse(self, response):
        for quote in response.css('.rp-y89gny.eboilu01 ul li'):
            yield {
                'address': quote.css('address.rp-o9b83y::text').get(),
                'name': quote.css('h2.rp-69f2r4::text').get(),
                'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
            }

    
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(RynekMainSpider)
    process.start()

Output:

{'address': 'mazowieckie, Warszawa', 'name': 'Dom Development S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/dom-development-sa-955/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Ronson Development Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/ronson-development-sp-z-oo-863/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Echo Investment S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/echo-investment-sa-7478/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław, Psie Pole', 'name': 'INTER-ES Deweloper', 'link': 'https://rynekpierwotny.pl/deweloperzy/inter-es-deweloper-928/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'śląskie, Bielsko-Biała', 'name': 'Murapol S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/murapol-sa-884/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Robyg S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/robyg-sa-888/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'śląskie, cieszyński, Cieszyn', 'name': 'ATAL S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/atal-sa-1084/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'zachodniopomorskie, Szczecin', 'name': 'Assethome – Przedstawiciel Dewelopera', 'link': 'https://rynekpierwotny.pl/deweloperzy/asset-home-przedstawiciel-dewelopera-7429/'}    
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Hreit', 'link': 'https://rynekpierwotny.pl/deweloperzy/hreit-7892/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław', 'name': 'Develia', 'link': 'https://rynekpierwotny.pl/deweloperzy/develia-1048/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław, Fabryczna', 'name': 'PROFIT Development', 'link': 'https://rynekpierwotny.pl/deweloperzy/profit-development-940/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Novisa Development Sp. z o.o. Sp. J.', 'link': 'https://rynekpierwotny.pl/deweloperzy/novisa-development-sp-z-oo-sp-j-484/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'pomorskie, Gdańsk', 'name': 'Robyg', 'link': 'https://rynekpierwotny.pl/deweloperzy/robyg-grupa-deweloperska-4251/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Arche S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/arche-sp-z-oo-934/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'warmińsko-mazurskie, ełcki, Ełk', 'name': 'Rutkowski Development Sp. J.', 'link': 'https://rynekpierwotny.pl/deweloperzy/rutkowski-development-sp-j-1846/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Cordia Polska', 'link': 'https://rynekpierwotny.pl/deweloperzy/cordia-polska-3824/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Budlex Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/budlex-sp-z-oo-1684/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'pomorskie, Gdańsk', 'name': 'Euro Styl S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/euro-styl-sa-964/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'łódzkie, Skierniewice', 'name': 'JHM DEVELOPMENT S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/jhm-development-sa-892/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław', 'name': 'Lokum Deweloper', 'link': 'https://rynekpierwotny.pl/deweloperzy/lokum-deweloper-948/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'podlaskie, Łomża', 'name': 'Eldor Bud Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/eldor-bud-sp-z-oo-4355/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Nexity Polska Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/nexity-polska-sp-z-oo-2856/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Spravia', 'link': 'https://rynekpierwotny.pl/deweloperzy/spravia-1236/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'małopolskie, Kraków', 'name': 'Bryksy', 'link': 'https://rynekpierwotny.pl/deweloperzy/bryksy-914/'}

 'item_scraped_count': 24,
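As a side note on the getall attempt mentioned in the question: calling getall on each field separately returns three flat lists with nothing tying a name to its address. The sketch below is only an illustration in plain Python. The lists stand in for page-wide getall results (values copied from the output above), and `pair_fields` is a hypothetical helper, not part of Scrapy. It shows that zipping the lists works only while every listing has every field, which is one reason looping over each item is usually the safer approach.

```python
# Stand-ins for three page-wide .getall() results. The values are copied from
# the output above; the lists themselves are an assumption for illustration.
names = [
    "Dom Development S.A.",
    "INTER-ES Deweloper",
    "Murapol S.A.",
]
addresses = [
    "mazowieckie, Warszawa",
    "dolnośląskie, Wrocław, Psie Pole",
    "śląskie, Bielsko-Biała",
]
links = [
    "https://rynekpierwotny.pl/deweloperzy/dom-development-sa-955/",
    "https://rynekpierwotny.pl/deweloperzy/inter-es-deweloper-928/",
    "https://rynekpierwotny.pl/deweloperzy/murapol-sa-884/",
]


def pair_fields(names, addresses, links):
    """Combine parallel lists into one dict per business (hypothetical helper)."""
    return [
        {"address": address, "name": name, "link": link}
        for name, address, link in zip(names, addresses, links)
    ]


# With complete lists, each business gets its own fields.
items = pair_fields(names, addresses, links)
for item in items:
    print(item)

# Hypothetical failure case: the second listing has no <address> element, so
# the page-wide address list is one entry short.
short_addresses = [addresses[0], addresses[2]]
shifted = pair_fields(names, short_addresses, links)

# zip() stops at the shortest list and pairs by position, so INTER-ES
# Deweloper now carries Murapol's address and Murapol is dropped entirely.
print(shifted[1])
```

When each field is read from inside a single item, a missing element simply yields None for that one item instead of shifting every later row.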

0
votes

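response.css('ul.rp-1qtpzi4')
gives you the container of the items, not the items (the li tags) themselves. So you loop over the container (once) and only get the first item.

Change it to: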

import scrapy


class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1',
    ]

    def parse(self, response):
        for quote in response.css('ul.rp-1qtpzi4 li'):
            yield {
                'address': quote.css('address.rp-o9b83y::text').get(),
                'name': quote.css('h2.rp-69f2r4::text').get(),
                'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
            }