I am trying to scrape all the "AVAILABLE VACANCIES" from this URL:
https://careers.sega.co.uk/vacancies?f%5B0%5D=country%3AUnited%20Kingdom
I wrote the following code:
import requests
from lxml import html  # required for html.fromstring below

def SEGA():
    data = []
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://careers.sega.co.uk/vacancies?f%5B0%5D=country%3AUnited%20Kingdom'
    }
    url = "https://careers.sega.co.uk/vacancies?f%5B0%5D=country%3AUnited%20Kingdom"
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    # Each job title is the link text inside a listing's <h3>
    xpath = '//*[@id="content"]/section/div/div/div[*]/div[*]/div[*]/h3/a/text()'
    jobs = tree.xpath(xpath)
    for job in jobs:
        Title = job
        Location = "Brentford"
        Studio = "SEGA"
        data.append([Title, Location, Studio])
    return data
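This returns only the first 25 roles, whereas the page shows 62 roles once it has loaded. I'm struggling to load the remaining content with requests, and I'm not sure how to get the AJAX call to load everything.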
import requests
import json
uri = r'https://careers.sega.co.uk/views/ajax?f[0]=country%3AUnited%20Kingdom&_wrapper_format=drupal_ajax'
formdata = r'search=&sort_by=search_api_relevance&items_per_page=All&view_name=jobs&view_display_id=page&view_args=&view_path=%2Fvacancies&view_base_path=vacancies&view_dom_id=fb1a232671720353ae08be0eb4a72fccadb593119768c5f1f3ef208cbac1be50&pager_element=0&_drupal_ajax=1&ajax_page_state%5Btheme%5D=careers&ajax_page_state%5Btheme_token%5D=&ajax_page_state%5Blibraries%5D=bootstrap_barrio%2Fform%2Cbootstrap_barrio%2Fglobal-styling%2Cbootstrap_barrio%2Fmessages_light%2Ccareers%2Fglobal-styling%2Ccareers%2Fswiper%2Ccareers%2Fyoutube-api%2Ccareers_civic%2Fcareers-civic%2Ccareers_civic%2Fcivic%2Cfacets%2Fdrupal.facets.link-widget%2Cfacets%2Fdrupal.facets.views-ajax%2Clazy%2Flazy%2Cparagraphs%2Fdrupal.paragraphs.unpublished%2Csearch_api_autocomplete%2Fsearch_api_autocomplete%2Csystem%2Fbase%2Cviews%2Fviews.ajax%2Cviews%2Fviews.module'
headers = {
    'Accept': 'text/javascript',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}
response = requests.post(uri, data=formdata, headers=headers)
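# The response decodes to a list of AJAX command dicts, not a single object,
# so attribute access like data.html fails. The rendered listing is expected
# in the "data" field of the "insert" command(s).
data = json.loads(response.text)
html = ''.join(cmd['data'] for cmd in data if cmd.get('command') == 'insert' and cmd.get('data'))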
It looks like you still have to parse the HTML in this case, but it does contain all the entries.
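For that last parsing step, here is a minimal sketch. It assumes the decoded response is a list of command dicts and that the job titles still sit in h3/a elements, as the XPath in the question suggests; neither is confirmed against the live site. The names TitleParser and extract_titles are made up for illustration. It uses the standard library's html.parser so it runs without extra dependencies; lxml's html.fromstring with an XPath such as //h3/a/text() would likely work just as well.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <a> nested inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.in_link = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
            self.in_link = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def extract_titles(commands):
    """Return job titles from a decoded list of AJAX command dicts."""
    parser = TitleParser()
    for command in commands:
        # Assumption: the rendered view arrives in "insert" commands under "data".
        if command.get("command") == "insert" and command.get("data"):
            parser.feed(command["data"])
    parser.close()
    return [title.strip() for title in parser.titles if title.strip()]
```

With the request above, this would be called as extract_titles(json.loads(response.text)), and each title can then be appended to the result list together with the Location and Studio values, as in the original SEGA() function.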