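I am trying to scrape this website for job openings:
I looked at the dev tools and saw that the page makes an XHR request to this site to retrieve the job openings as JSON objects:
So I thought, "Great! I can parse this in two seconds with a Python program like this:"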
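```python
import json

import requests
from bs4 import BeautifulSoup


def crawl():
    # Plain GET with no request body against the endpoint the page calls via XHR.
    union = requests.get('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults').content
    soup = BeautifulSoup(union, 'html.parser')
    # The response is JSON, so the parsed "soup" is just the raw text.
    newDict = json.loads(str(soup))
    for job in newDict['opportunities']:
        print(job['Title'])


crawl()
```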
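It turns out this endpoint returns only 20 of the 62 openings. So I went back to the page and loaded all of it (by clicking "View More Opportunities"),
and the dev tools show that it sent another XHR request to the same link, but when I look at that link I still see only 20 records.
How can I scrape all the records from this page? It would be great if someone could explain what is happening behind the scenes. I'm fairly new to web scraping, so any help is appreciated.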
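You don't need to scrape anything. As you said, the API that returns all the JSON is https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults, but you need to send these parameters in the request body: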
import requests
headers = {
    'Content-Type': 'application/json'
}
data = '{\n "opportunitySearch": {\n "Top": 62,\n "Skip": 0,\n "QueryString": "",\n "OrderBy": [\n {\n "Value": "postedDateDesc",\n "PropertyName": "PostedDate",\n "Ascending": false\n }\n ],\n "Filters": [\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 4,\n "extra": null,\n "values": [\n \n ]\n },\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 5,\n "extra": null,\n "values": [\n \n ]\n },\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 6,\n "extra": null,\n "values": [\n \n ]\n }\n ]\n },\n "matchCriteria": {\n "PreferredJobs": [\n \n ],\n "Educations": [\n \n ],\n "LicenseAndCertifications": [\n \n ],\n "Skills": [\n \n ],\n "hasNoLicenses": false,\n "SkippedSkills": [\n \n ]\n }\n}'
response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
print(response.text)
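The hand-escaped JSON string above is easy to mistype, so here is a minimal sketch that builds the same body from a Python dict. The helper name `build_payload` and its `top`/`skip` arguments are made up for illustration; the keys and values are copied from the request body shown in this answer.

```python
import json


def build_payload(top=20, skip=0):
    """Return the search body as a dict (hypothetical helper, not part of any API)."""
    return {
        "opportunitySearch": {
            "Top": top,    # how many records to return
            "Skip": skip,  # assumed to be the number of records to skip first
            "QueryString": "",
            "OrderBy": [
                {
                    "Value": "postedDateDesc",
                    "PropertyName": "PostedDate",
                    "Ascending": False,
                }
            ],
            # Three empty filters, identical except for fieldName (4, 5, 6).
            "Filters": [
                {"t": "TermsSearchFilterDto", "fieldName": n, "extra": None, "values": []}
                for n in (4, 5, 6)
            ],
        },
        "matchCriteria": {
            "PreferredJobs": [],
            "Educations": [],
            "LicenseAndCertifications": [],
            "Skills": [],
            "hasNoLicenses": False,
            "SkippedSkills": [],
        },
    }


# json.dumps converts False/None to the JSON literals false/null.
body = json.dumps(build_payload(top=62))
```

With requests you can skip `json.dumps` and the explicit header: `requests.post(url, json=build_payload(top=62))` serializes the dict and sets `Content-Type: application/json` for you.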
Here is the same request using pandas (pip install pandas):
import requests
import pandas as pd
pd.set_option('display.width', 1000)
headers = {
    'Content-Type': 'application/json'
}
data = '{\n "opportunitySearch": {\n "Top": 62,\n "Skip": 0,\n "QueryString": "",\n "OrderBy": [\n {\n "Value": "postedDateDesc",\n "PropertyName": "PostedDate",\n "Ascending": false\n }\n ],\n "Filters": [\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 4,\n "extra": null,\n "values": [\n \n ]\n },\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 5,\n "extra": null,\n "values": [\n \n ]\n },\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 6,\n "extra": null,\n "values": [\n \n ]\n }\n ]\n },\n "matchCriteria": {\n "PreferredJobs": [\n \n ],\n "Educations": [\n \n ],\n "LicenseAndCertifications": [\n \n ],\n "Skills": [\n \n ],\n "hasNoLicenses": false,\n "SkippedSkills": [\n \n ]\n }\n}'
response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
data = response.json()
df = pd.DataFrame.from_dict(data['opportunities'])
df = df[['Id', 'Title', 'RequisitionNumber', 'JobCategoryName', 'PostedDate']]
print(df.head(5))
The "Top": 62 entry in the data is what limits the number of results:
{
  "opportunitySearch": {
    "Top": 62,
    "Skip": 0,
    "QueryString": "",
    "OrderBy": [
      {
        "Value": "postedDateDesc",
        "PropertyName": "PostedDate",
        "Ascending": false
      }
    ],
    "Filters": [
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 4,
        "extra": null,
        "values": []
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 5,
        "extra": null,
        "values": []
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 6,
        "extra": null,
        "values": []
      }
    ]
  },
  "matchCriteria": {
    "PreferredJobs": [],
    "Educations": [],
    "LicenseAndCertifications": [],
    "Skills": [],
    "hasNoLicenses": false,
    "SkippedSkills": []
  }
}
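Hard-coding "Top": 62 stops working as soon as the number of openings changes. If "Skip" behaves as an offset (an assumption based on the field names, not something confirmed here), you can page through the results instead. Below is a rough sketch: `fetch_all`, `send`, `page_size`, and `max_pages` are names invented for this example, and `send` stands in for whatever function actually posts the body and returns the decoded JSON.

```python
import copy


def fetch_all(send, base_payload, page_size=20, max_pages=100):
    """Collect records page by page until a short (or empty) page comes back.

    send: hypothetical callable taking the request body (dict) and returning
          the decoded JSON response (dict) with an 'opportunities' list.
    """
    records = []
    skip = 0
    for _ in range(max_pages):  # guard in case the server ignores Skip
        payload = copy.deepcopy(base_payload)  # leave the caller's dict untouched
        payload["opportunitySearch"]["Top"] = page_size
        payload["opportunitySearch"]["Skip"] = skip
        page = send(payload).get("opportunities", [])
        records.extend(page)
        if len(page) < page_size:  # last page reached
            break
        skip += page_size
    return records
```

With requests, `send` could be something like `lambda body: requests.post(url, json=body).json()`, where `url` is the LoadSearchResults link and `base_payload` is the JSON above loaded as a dict.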