Scraping data from this website with Scrapy

Question (votes: 0, answers: 2)

I'm new to data scraping and still learning the ropes. I want to scrape data values from this website: https://www.twhouse.co.uk/index.php?route=product/catalog

I'm using the Scrapy shell to query the page and put my spider together. When I run

response.css('div.caption span.stat-1').get()
I get this:

<span class="stat-1"><span class="stats-label">SKU:</span> <span>8644</span></span>

I want to extract the SKU value. Thanks for your help.

I only want the SKU number from that element. When I change the selector to

response.css('div.caption span.stats-label').get()
I get this:

<span class="stats-label">SKU:</span>

and when I add ::text,

response.css('div.caption span.stats-label::text').get()
I get

SKU:

rather than the SKU value itself. How do I get that value?

python scrapy
2 Answers
2 votes

The HTML looks like this:

...
...
<span class="stat-1">
    <span class="stats-label">SKU:</span>
    <span>8811</span>
</span>
...
...

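So you want the second (last) span tag inside the outer span tag (the one with class "stat-1"). The span.stats-label selector only matches the label, which is why it returns "SKU:".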

scrapy shell https://www.twhouse.co.uk/index.php?route=product/catalog

>>> response.css('div.caption span.stat-1 span:last-child::text').get()
'8811'

If you want every matching text value, use getall(), which returns them as a list.

scrapy shell https://www.twhouse.co.uk/index.php?route=product/catalog

>>> response.css('div.caption span.stat-1 span:last-child::text').getall()
['8811', '8943', '8939', '8730', '8853', '8748', '8901', '8756', '8855', '8951', '8838', '8857', '8934', '8856', '8924', '9050', '8862', '8863', '8764', '9047', '9045', '9055', '8746', '8814', '8714', '8760', '8944', '8958', '8959', '8722', '8743', '8785', '8946', '8860', '8877', '8715', '9011', '8945', '9023', '8947', '9015', '8777', '8753', '8797', '8899', '8734', '8705', '9042', '8936', '8787', '8950', '8888', '8723', '9018', '9019', '8948', '8942', '8890', '8969', '8906', '8907', '8960', '9021', '8713', '9009', '9014', '9022', '8831', '8707', '8724', '9033', '9024', '9038', '8829', '9034', '9027', '9025', '9031', '9026', '9029', '9030', '9032', '9041', '9039', '9051', '9028', '8994', '8765', '8977', '8808', '8978', '8809', '8876', '9008', '8883', '8768', '8823', '8740', '8873', '9013']

0 votes

Try selecting the second child span of the stat-1 span:

response.css('div.caption span.stat-1 span:nth-child(2)::text').get()