Python - Web scraping using HTML tags

Question

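I am trying to scrape a web page to list the jobs posted at this URL: https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad

See the image for details from the browser's web inspector.

Inspecting the page shows the following:

Each listed job is in an HTML li with class="jobs-list-item". The parent Div inside that li carries the following HTML attributes and data: data-ph-at-job-title-text="Software Engineer II", data-ph-at-job-category-text="Engineering", data-ph-at-job-post-date-text="2018-03-19T16:33:00".
The first child Div of the parent Div, with class="information", contains the URL href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II".
The third child Div of the parent Div, with class="description au-target", holds the short job description.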

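My requirement is to extract the following information for each job:

Job title
Job category
Job post date
Job post time
Job URL
Job short description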

I tried the following Python code to scrape the page, but I was unable to extract the required information.

import requests
from bs4 import BeautifulSoup

def ms_jobs():
    url = 'https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad'
    resp = requests.get(url)

    if resp.status_code == 200:
        print("Successfully opened the web page")
        # parse the returned HTML and dump it to inspect what the server sent
        soup = BeautifulSoup(resp.text, 'html.parser')
        print(soup)
    else:
        print("Error")

ms_jobs()

Answer 1

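If you want to do this with requests, you will need to reverse engineer the site. Open the dev tools in Chrome, select the Network tab, and fill out the search form.

This shows how the site loads its data. If you dig into it, you will see that the site fetches the data by doing a POST to this endpoint: https://careers.microsoft.com/widgets. The dev tools also show the payload the site sends. The site uses cookies, so all you have to do is create a session that keeps the cookie, get one, and copy/paste the payload.

This way you can pull the same JSON data that the JavaScript fetches to populate the page dynamically.

Below is a working example of what that looks like. All that is left is to parse the JSON as you see fit.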

import requests
from pprint import pprint

# create a session to grab a cookie from the site
session = requests.Session()
r = session.get("https://careers.microsoft.com/us/en/")

# these params are the ones that the dev tools show that site sets when using the website form
payload = {
    "lang":"en_us",
    "deviceType":"desktop",
    "country":"us",
    "ddoKey":"refineSearch",
    "sortBy":"",
    "subsearch":"",
    "from":0,
    "jobs":"true",
    "counts":"true",
    "all_fields":["country","state","city","category","employmentType","requisitionRoleType","educationLevel"],
    "pageName":"search-results",
    "size":20,
    "keywords":"",
    "global":"true",
    "selected_fields":{"city":["Hyderabad"],"country":["India"]},
    "sort":"null",
    "locationData":{}
}

# this is the endpoint the site uses to fetch json
url = "https://careers.microsoft.com/widgets"
r = session.post(url, json=payload)
data = r.json()
job_list = data['refineSearch']['data']['jobs']

# job_list will hold 20 jobs (you can set the "size" parameter in the payload to a higher number if you please; I tested 100, and that returned 100 jobs)
job = job_list[0]
pprint(job)
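
As for parsing the JSON, here is a minimal sketch of pulling the requested fields out of one job dict. The key names in ASSUMED_KEYS ("title", "category", "postedDate", "descriptionTeaser") are guesses for illustration, not confirmed names from the /widgets response, and summarize_job is a hypothetical helper; check the pprint(job) output and adjust the mapping. The sample values are the ones quoted in the question.

```python
from datetime import datetime

# Hypothetical mapping: output name -> key in the job dict.
# These key names are assumptions; replace them with whatever
# pprint(job) actually shows.
ASSUMED_KEYS = {
    "title": "title",
    "category": "category",
    "posted": "postedDate",
    "description": "descriptionTeaser",
}


def summarize_job(job, keys=ASSUMED_KEYS):
    """Pick the requested fields from one job dict; missing keys become None."""
    summary = {name: job.get(key) for name, key in keys.items()}
    posted = summary.pop("posted", None)
    summary["post_date"] = None
    summary["post_time"] = None
    if posted:
        try:
            # Split an ISO 8601 timestamp such as 2018-03-19T16:33:00
            stamp = datetime.fromisoformat(posted)
            summary["post_date"] = stamp.date().isoformat()
            summary["post_time"] = stamp.time().isoformat()
        except (ValueError, TypeError):
            # Unknown format: keep the raw value instead of failing
            summary["post_date"] = posted
    return summary


# Sample dict built from the values in the question, using the assumed keys
sample_job = {
    "title": "Software Engineer II",
    "category": "Engineering",
    "postedDate": "2018-03-19T16:33:00",
}
summary = summarize_job(sample_job)
print(summary)
```

With the real response you would run it over the whole list, for example [summarize_job(j) for j in job_list].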

Cheers.
