I am learning Scrapy, and I want to scrape this site.
In my spider:
import scrapy


class TencentHrSpider(scrapy.Spider):
    name = 'tencent_hr'
    allowed_domains = ['careers.tencent.com']
    start_urls = ['http://careers.tencent.com/search.html']

    def parse(self, response):
        # Select the job-list containers from the downloaded HTML.
        div_list = response.xpath('//div[@class="recruit-list"]')
        print(div_list)  # prints `[]`; there is no data in it
When I start the crawl, no data is output. Why?
I have already set the User-Agent request header in settings.py:
import random

# Pool of User-Agent strings to choose from.
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

# Chosen once, when the settings module is loaded.
USER_AGENT = random.choice(USER_AGENT_LIST)
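Note that `random.choice` in settings.py runs only once, when the settings module is imported, so the whole crawl ends up with a single User-Agent. A minimal sketch of per-request rotation, assuming a downloader middleware is used for it: the class name `RandomUserAgentMiddleware` and the shortened list are hypothetical, and the class would still have to be registered under `DOWNLOADER_MIDDLEWARES` in settings.py.

```python
import random

# Hypothetical, shortened list; in practice reuse USER_AGENT_LIST from settings.py.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]


class RandomUserAgentMiddleware:
    """Illustrative downloader middleware: pick a User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENT_LIST)
        # Returning None lets Scrapy continue handling the request normally.
        return None
```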
EDIT-01
Is it possible to find the cause? Are there any error logs I can trace?
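One way to narrow this down, sketched with the standard library only: check whether the raw HTML that was downloaded contains the element at all. If it does not, the XPath is not at fault, and the content is probably filled in later by JavaScript. The helper name `html_has_class` and the sample HTML strings are made up for illustration; in a spider, the input would be `response.text`.

```python
from html.parser import HTMLParser


class _ClassFinder(HTMLParser):
    """Records whether any start tag carries the wanted CSS class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.found = True


def html_has_class(html, wanted):
    """Return True if any element in `html` has the class `wanted`."""
    finder = _ClassFinder(wanted)
    finder.feed(html)
    return finder.found


# Made-up samples: a server-rendered page versus an empty JavaScript shell.
rendered = '<div class="recruit-list"><a>Job</a></div>'
shell = '<div id="app"></div><script src="app.js"></script>'
print(html_has_class(rendered, "recruit-list"))  # True
print(html_has_class(shell, "recruit-list"))     # False
```

Scrapy's interactive shell (`scrapy shell <url>`) allows a similar check: `view(response)` opens the page exactly as Scrapy downloaded it in a browser.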
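EDIT-02
Why can't Scrapy get the data when the page requests it from an API via AJAX? We know Scrapy can download the whole page; can it also run scripts the way a browser does?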
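The site uses JavaScript to load its content, which makes scraping more difficult. This site explains how to handle that. Please let me know if it helps.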
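A common approach for pages like this, sketched under assumptions: find the JSON request the page makes (the Network tab of the browser's developer tools shows it), request that URL directly, and parse the JSON instead of the HTML. The payload shape, the field names (`data`, `posts`, `title`, `location`), and the function name `extract_posts` below are hypothetical and are not taken from the site's real API; in a spider, the input would be `response.text` inside the callback.

```python
import json


def extract_posts(json_text):
    """Yield one dict per job post from a (hypothetical) JSON payload."""
    payload = json.loads(json_text)
    # Assumed shape: {"data": {"posts": [{"title": ..., "location": ...}]}}
    posts = (payload.get("data") or {}).get("posts") or []
    for post in posts:
        yield {"title": post.get("title"), "location": post.get("location")}


# Made-up sample payload standing in for the API response body.
sample = '{"data": {"posts": [{"title": "Engineer", "location": "Shenzhen"}]}}'
print(list(extract_posts(sample)))
```

Scrapy does not execute JavaScript by itself, so the alternative is to render the page with a headless browser, which is usually slower than calling the JSON endpoint.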