I'm using scrapy to fetch content inside certain URLs on a page, similar to this question: Use scrapy to get list of urls, and then scrape content inside those urls
I can get the subURLs from my start URLs (the first def), but my second def never seems to run. The output file is empty. I've tested the contents of that function in the scrapy shell and it fetches the information I want, but not when I run the spider.
import scrapy
from scrapy.selector import Selector
#from scrapy import Spider
from WheelsOnlineScrapper.items import Dealer
from WheelsOnlineScrapper.url_list import urls
import logging
from urlparse import urljoin

logger = logging.getLogger(__name__)

class WheelsonlinespiderSpider(scrapy.Spider):
    logger.info('Spider starting')
    name = 'wheelsonlinespider'
    rotate_user_agent = True  # lives in middleware.py and settings.py
    allowed_domains = ["https://wheelsonline.ca"]
    start_urls = urls  # this list is created in url_list.py
    logger.info('URLs retrieved')

    def parse(self, response):
        subURLs = []
        partialURLs = response.css('.directory_name::attr(href)').extract()
        for i in partialURLs:
            subURLs = urljoin('https://wheelsonline.ca/', i)
            yield scrapy.Request(subURLs, callback=self.parse_dealers)
            logger.info('Dealer ' + subURLs + ' fetched')

    def parse_dealers(self, response):
        logger.info('Beginning of page')
        dlr = Dealer()

        # Extracting the content using css selectors
        try:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first() + ' ' + response.css(".dealer_head_aux_name::text").extract_first()
        except TypeError:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first()
        dlr['MailingAddress'] = ','.join(response.css(".dealer_address_right::text").extract())
        dlr['PhoneNumber'] = response.css(".dealer_head_phone::text").extract_first()

        logger.info('Dealer fetched ' + dlr['DealerName'])
        yield dlr
        logger.info('End of page')
Your allowed_domains list contains the protocol (https). As described in the documentation, it should contain only the domain name:

allowed_domains = ["wheelsonline.ca"]

Also, you should see a message about this in your logs:

URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://wheelsonline.ca in allowed_domains
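This is also why your second callback never fires: Scrapy's offsite filtering compares each request's hostname against the entries in allowed_domains, and an entry that still includes the scheme can never match a plain hostname, so every sub-request gets dropped as offsite. Here is a minimal stdlib-only sketch of that kind of domain check (a simplification of what OffsiteMiddleware does; is_offsite is a hypothetical helper, not Scrapy API):

```python
# Simplified sketch of an offsite check: a request is offsite unless its
# hostname equals an allowed domain or is a subdomain of one.
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

# With the protocol included, the entry never matches any hostname:
print(is_offsite("https://wheelsonline.ca/dealers/1", ["https://wheelsonline.ca"]))  # True (request filtered out)
# With just the domain, the request is allowed through:
print(is_offsite("https://wheelsonline.ca/dealers/1", ["wheelsonline.ca"]))          # False (request allowed)
```

Once allowed_domains holds only "wheelsonline.ca", the requests yielded from parse pass this check and parse_dealers gets called.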