I'm building a job board, which involves collecting job data from company sites. I'm currently trying to scrape Twilio at https://www.twilio.com/company/jobs. However, I'm not getting the job data itself; the scraper seems to miss it entirely. Based on other questions, this could be because the data is rendered with JavaScript, but that isn't obvious from the page.
Here is the code I'm using:
import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://www.twilio.com/company/jobs'

# Connect to the URL
response = requests.get(url)
if "_job-title" in response.text:
    print("Found the jobs!")  # FAILS

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# To download the whole data set, loop through all 'a' tags (links)
for one_a_tag in soup.find_all('a', class_='_job'):
    link = one_a_tag['href']
    print(link)  # FAILS
When I run this code, nothing is printed. I've also tried urllib2 and had the same problem. Selenium works, but it's far too slow. Scrapy looks promising, but I ran into installation problems.

The basic information for all jobs across the different offices is returned dynamically by an API call, which you can find in the browser's network tab. If you extract the job IDs from that response, you can then request the detailed information for each job individually. An example is shown below:
import requests
from bs4 import BeautifulSoup as bs

listings = {}

with requests.Session() as s:
    r = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()
    for office in r['offices']:
        for dept in office['departments']:  # you could perform some filtering here or later on
            if 'jobs' in dept:
                for job in dept['jobs']:
                    listings[job['id']] = job  # store basic job info in dict

    for key in listings:
        r = s.get(f'https://boards.greenhouse.io/twilio/jobs/{key}')
        soup = bs(r.content, 'lxml')
        listings[key]['soup'] = soup  # store soup from detail page
        print(soup.select_one('.app-title').text)  # print an example element from the page
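To see the ID-extraction step in isolation (without hitting the network), here is a minimal sketch run against a hand-made sample payload. The structure (`offices` → `departments` → optional `jobs`) mirrors the API response above, but the office, department, and job values are invented for illustration:

```python
# Hypothetical sample payload mimicking the offices endpoint's JSON shape
sample = {
    'offices': [
        {'name': 'San Francisco', 'departments': [
            {'name': 'Engineering', 'jobs': [
                {'id': 101, 'title': 'Backend Engineer'},
                {'id': 102, 'title': 'Site Reliability Engineer'},
            ]},
            {'name': 'Legal'},  # department without a 'jobs' key is skipped
        ]},
    ]
}

listings = {}
for office in sample['offices']:
    for dept in office['departments']:
        if 'jobs' in dept:  # guard against departments with no jobs
            for job in dept['jobs']:
                listings[job['id']] = job  # keyed by job ID

print(sorted(listings))  # → [101, 102]
```

Once `listings` is populated this way, each key can be substituted into the detail-page URL as in the loop above.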