Python requests: loading AJAX content

Question (votes: 0, answers: 1)

I am trying to scrape all of the "AVAILABLE VACANCIES" from this URL:

https://careers.sega.co.uk/vacancies?f%5B0%5D=country%3AUnited%20Kingdom

I wrote the following code:

import requests
from lxml import html  # provides html.fromstring, used below

def SEGA():

    data = []
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://careers.sega.co.uk/vacancies?f%5B0%5D=country%3AUnited%20Kingdom'
    }
    url = "https://careers.sega.co.uk/vacancies?f%5B0%5D=country%3AUnited%20Kingdom"
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    xpath = '//*[@id="content"]/section/div/div/div[*]/div[*]/div[*]/h3/a/text()'
    jobs = tree.xpath(xpath)
    for job in jobs:
        Title = (job)
        Location = "Brentford"
        Studio = "SEGA"
        data.append([Title,Location,Studio])
    return data

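This returns only the first 25 roles, whereas the fully loaded page should list 62. I am struggling to load the rest of the content with requests, and I am not sure how to get the AJAX call to return everything.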

python python-requests screen-scraping
1 Answer
0 votes

import requests
import json

uri = r'https://careers.sega.co.uk/views/ajax?f[0]=country%3AUnited%20Kingdom&_wrapper_format=drupal_ajax'

formdata = r'search=&sort_by=search_api_relevance&items_per_page=All&view_name=jobs&view_display_id=page&view_args=&view_path=%2Fvacancies&view_base_path=vacancies&view_dom_id=fb1a232671720353ae08be0eb4a72fccadb593119768c5f1f3ef208cbac1be50&pager_element=0&_drupal_ajax=1&ajax_page_state%5Btheme%5D=careers&ajax_page_state%5Btheme_token%5D=&ajax_page_state%5Blibraries%5D=bootstrap_barrio%2Fform%2Cbootstrap_barrio%2Fglobal-styling%2Cbootstrap_barrio%2Fmessages_light%2Ccareers%2Fglobal-styling%2Ccareers%2Fswiper%2Ccareers%2Fyoutube-api%2Ccareers_civic%2Fcareers-civic%2Ccareers_civic%2Fcivic%2Cfacets%2Fdrupal.facets.link-widget%2Cfacets%2Fdrupal.facets.views-ajax%2Clazy%2Flazy%2Cparagraphs%2Fdrupal.paragraphs.unpublished%2Csearch_api_autocomplete%2Fsearch_api_autocomplete%2Csystem%2Fbase%2Cviews%2Fviews.ajax%2Cviews%2Fviews.module'

headers = {
    'Accept' : 'text/javascript',
    'Content-Type' : 'application/x-www-form-urlencoded; charset=UTF-8',
}

response = requests.post(uri, data=formdata, headers=headers)

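data = json.loads(response.text)

# json.loads returns plain Python objects, so attribute access (data.html) fails.
# Assumption: the endpoint returns a JSON list of Drupal AJAX command objects,
# with the rendered markup in the "data" field of each "insert" command.
html = "".join(
    cmd["data"] for cmd in data
    if isinstance(cmd, dict)
    and cmd.get("command") == "insert"
    and isinstance(cmd.get("data"), str)
)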

It looks like you will still need to parse the HTML in this case, but it does contain all of the entries.
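
As a rough sketch of that remaining parsing step, the snippet below pulls the link text out of every `<h3><a>` pair, mirroring the `h3/a/text()` XPath from the question. It rests on assumptions that are not confirmed here: that the decoded response is a list of Drupal AJAX command dicts with the markup in the `"data"` field of the `"insert"` commands, and that job titles are still rendered as links inside `<h3>` elements. The names `TitleParser` and `extract_titles` and the sample payload are made up for illustration. It uses only the standard library; `lxml` with the question's XPath should work just as well.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <a> nested inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.in_link = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "h3":
            self.in_h3 = False
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data  # text may arrive in several chunks


def extract_titles(commands):
    """Return job titles from a decoded AJAX response (assumed shape)."""
    # Join the markup carried by all "insert" commands; skip everything else.
    markup = "".join(
        cmd["data"] for cmd in commands
        if isinstance(cmd, dict)
        and cmd.get("command") == "insert"
        and isinstance(cmd.get("data"), str)
    )
    parser = TitleParser()
    parser.feed(markup)
    parser.close()
    return [t.strip() for t in parser.titles if t.strip()]


# Made-up sample standing in for json.loads(response.text).
sample = [
    {"command": "settings", "settings": {}},
    {
        "command": "insert",
        "method": "replaceWith",
        "data": '<div><h3><a href="/vacancies/1">QA Tester</a></h3>'
                '<h3><a href="/vacancies/2"> Game Designer </a></h3></div>',
    },
]

titles = extract_titles(sample)
# Same row layout as the question's data list: [Title, Location, Studio].
rows = [[title, "Brentford", "SEGA"] for title in titles]
```

With the real response, `extract_titles(json.loads(response.text))` would take the place of the sample call. If the result comes back empty, the markup structure probably differs from what is assumed here and the tag checks need adjusting.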
