Hi, I'm new to Scrapy and I'm trying to scrape the category names and their URLs from Alibaba's "Products By Categories" page, then write them to a CSV file.
When I open the file in a spreadsheet, the view I want is:
categories                     categories_urls
Agricultural Growing Media     its URL
Animal Products                its URL
...                            ...
# -*- coding: utf-8 -*-
import scrapy


class AlibabaCatagoriesSpider(scrapy.Spider):
    name = 'alibaba_catagories'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/Products?spm=a2700.8293689.scGlobalHomeHeader.352.2ce265aa7GOmOF']

    def parse(self, response):
        a = response.css('ul.sub-item-cont')
        for catag in a:
            item = {
                'categories': catag.css('li>a::text').extract(),
                'categories_url': catag.css('li>a::attr(href)').extract()
            }
            yield item
How do I scrape it into that ideal format?
It's easy with Scrapy:
def parse(self, response):
    for category_node in response.xpath('//ul[contains(@class, "sub-item-cont")]/li/a'):
        item = {
            'categories': category_node.xpath('./text()').extract_first().strip(),
            'categories_url': category_node.xpath('./@href').extract_first()
        }
        yield item
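To actually get the CSV file, Scrapy's feed exports will write the yielded items out for you; from the command line it's just scrapy crawl alibaba_catagories -o categories.csv. As a minimal sketch of the same thing as a standalone script, assuming Scrapy 2.1+ (where the FEEDS setting is available):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # Write every item the spider yields to categories.csv
    'FEEDS': {'categories.csv': {'format': 'csv'}},
})
process.crawl(AlibabaCatagoriesSpider)
process.start()  # blocks until the crawl finishes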
import requests
from bs4 import BeautifulSoup


def parser():
    url = 'https://www.alibaba.com/Products?spm=a2700.8293689.scGlobalHomeHeader.352.2ce265aa7GOmOF'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    name_dict = {}
    for l in soup.find_all('li'):
        content = l.find('a')
        if content:
            href = content.get('href')
            name = content.get_text()
            # Category links on this page carry a '_pid' marker in the URL;
            # guard against anchors that have no href attribute at all
            if href and href.find('_pid') != -1:
                name_dict[name] = href
    return name_dict
This one is built with the BeautifulSoup module, since it is easier to scrape the page with it. The function returns a dictionary with the category names as keys and their URLs as values.
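To go from that dictionary to the spreadsheet layout in the question, a minimal sketch using the standard csv module (the file name categories.csv is just an example):

import csv

name_dict = parser()
with open('categories.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['categories', 'categories_urls'])  # header row
    for name, href in name_dict.items():
        writer.writerow([name.strip(), href])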
You have to use the normalize-space function to remove the whitespace. The .css selectors either can't do this or get very convoluted, so I suggest you use XPath instead, as described here: normalize-space only works with XPath, not with CSS selectors.
An XPath example using the normalize-space function:
product = response.xpath('normalize-space(//*[@class="column one3"]/a/@href)').extract()
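One caveat worth noting (not from the answer above, but standard XPath 1.0 behaviour): normalize-space() applied to a node-set only normalizes the first matched node, so to clean every category you have to apply it per node. A sketch inside parse(), assuming the same sub-item-cont markup as the Scrapy answer:

for a in response.xpath('//ul[contains(@class, "sub-item-cont")]/li/a'):
    yield {
        # normalize-space trims and collapses whitespace per link
        'categories': a.xpath('normalize-space(./text())').extract_first(),
        'categories_url': a.xpath('./@href').extract_first(),
    }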
Try the following selector:
list(map(
    lambda x: x.replace('\n', '').strip(),
    response.xpath('//*[@class="cg-main"]//*[contains(@class, "sub-item-cont")]//li/a[@href]/text()').extract()
))
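The same thing reads more naturally as a list comprehension; the if clause (a small addition over the original) also drops entries that are pure whitespace:

names = [
    t.replace('\n', '').strip()
    for t in response.xpath('//*[@class="cg-main"]//*[contains(@class, "sub-item-cont")]//li/a[@href]/text()').extract()
    if t.strip()
]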