How to get missing HTML data when web scraping with python-requests

Question · votes: 0 · answers: 1

I am building a job board, which involves collecting job data from company sites. I am currently trying to scrape Twilio at https://www.twilio.com/company/jobs. However, I am not getting the job data itself; the scraper seems to miss it. Based on other questions, this may be because the data is rendered with JavaScript, but that is not obvious from the page.

Here is the code I am using:

import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://www.twilio.com/company/jobs'

# Connect to the URL
response = requests.get(url)

if "_job-title" in response.text:
    print("Found the jobs!")    # FAILS: the marker never appears in the raw HTML

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# Loop through all 'a' tags with the job class and print their links
for one_a_tag in soup.find_all('a', class_='_job'):
    link = one_a_tag['href']
    print(link)            # FAILS: find_all returns no matches

When I run this code, nothing is printed. I also tried urllib2 and had the same problem. Selenium works, but it is far too slow. Scrapy looked promising, but I ran into installation problems.
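
For reference, a minimal sketch of what the Selenium approach might look like (an assumption, not my exact code; it presumes chromedriver is on the PATH and that the rendered page matches the a._job selector):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.twilio.com/company/jobs')
# Wait for the JavaScript-rendered job links to appear before reading the page
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a._job')))
for tag in driver.find_elements(By.CSS_SELECTOR, 'a._job'):
    print(tag.get_attribute('href'))
driver.quit()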

Here is a screenshot of the data I am trying to access: [screenshot of the rendered job listings]

web-scraping python-requests
1 Answer

1 vote

The basic information for all jobs across the different offices is returned dynamically by an API call, which you can find in the browser's network tab. If you extract the job IDs from that response, you can then request the detailed information for each job individually. An example is shown below:

import requests
from bs4 import BeautifulSoup as bs

listings = {}

with requests.Session() as s:
    r = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()
    for office in r['offices']:
        for dept in office['departments']: #you could perform some filtering here or later on 
            if 'jobs' in dept:
                for job in dept['jobs']:
                    listings[job['id']] = job  #store basic job info in dict
    for key in listings.keys():
        r = s.get(f'https://boards.greenhouse.io/twilio/jobs/{key}')
        soup = bs(r.content, 'lxml')
        listings[key]['soup'] = soup  # store soup from the detail page with that job's basic info
        print(soup.select_one('.app-title').text)  # print an example value from the detail page
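
As a side note (an assumption about the Greenhouse job board API, not something shown in the network tab above): the same API base appears to expose a jobs endpoint that can return every listing, optionally with full descriptions, in a single request, which would avoid fetching each detail page:

import requests

# Hypothetical single-request alternative; assumes the board exposes a /jobs
# endpoint and that ?content=true includes the full job descriptions.
r = requests.get('https://api.greenhouse.io/v1/boards/twilio/jobs?content=true').json()
for job in r['jobs']:
    print(job['id'], job['title'])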