Getting only part of the JSON from a website I'm trying to scrape with Python, BeautifulSoup, and Requests: 20 of 62 results returned


I'm trying to scrape this site for job openings:

https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/?q=&o=postedDateDesc&w=&wc=&we=&wpst=

I looked at the dev tools and saw that the page makes an XHR request to this URL to retrieve the job openings as a JSON object:

https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults

So I thought, "Great! I can parse this in two seconds with a Python program like this:"

from bs4 import BeautifulSoup
import json
import requests

def crawl():
    # Plain GET to the XHR endpoint; the response body is JSON, not HTML.
    union = requests.get('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults').content
    soup = BeautifulSoup(union, 'html.parser')
    newDict = json.loads(str(soup))
    for job in newDict['opportunities']:
        print(job['Title'])

crawl()

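It turns out that this URL returns only 20 of the 62 openings. So I went back to the page and loaded the whole list (by clicking "View More Opportunities").

The dev tools showed that it sent another XHR request to the same URL, but when I looked at the response it still contained only 20 records.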

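How can I scrape all the records from this page? It would also be great if someone could explain what is happening behind the scenes. I'm fairly new to web scraping, so any help is appreciated.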

python json web-scraping beautifulsoup xmlhttprequest
1 Answer

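You don't need to scrape any HTML. As you said, the API that returns all the JSON is https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults, but you need to POST to it and set these parameters in the request body: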

import requests

headers = {
    'Content-Type': 'application/json'
}

data = '{\n  "opportunitySearch": {\n    "Top": 62,\n    "Skip": 0,\n    "QueryString": "",\n    "OrderBy": [\n      {\n        "Value": "postedDateDesc",\n        "PropertyName": "PostedDate",\n        "Ascending": false\n      }\n    ],\n    "Filters": [\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 4,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 5,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 6,\n        "extra": null,\n        "values": [\n          \n        ]\n      }\n    ]\n  },\n  "matchCriteria": {\n    "PreferredJobs": [\n      \n    ],\n    "Educations": [\n      \n    ],\n    "LicenseAndCertifications": [\n      \n    ],\n    "Skills": [\n      \n    ],\n    "hasNoLicenses": false,\n    "SkippedSkills": [\n      \n    ]\n  }\n}'

response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
print(response.text)
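If the long escaped string is awkward to edit, the same body can be built as a Python dict and passed through the `json=` argument of `requests.post`, which serializes it and sets the `Content-Type: application/json` header for you. This is only a sketch: `build_payload` and `fetch_opportunities` are names I made up for illustration, and the field values are copied from the request body above rather than from any official documentation.

```python
import json

URL = ('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/'
       '74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults')


def build_payload(top=62, skip=0):
    """Return the search body as a dict (hypothetical helper)."""
    # The three empty filters mirror the ones the page itself sends.
    filters = [
        {"t": "TermsSearchFilterDto", "fieldName": field,
         "extra": None, "values": []}
        for field in (4, 5, 6)
    ]
    return {
        "opportunitySearch": {
            "Top": top,      # how many records to return
            "Skip": skip,    # assumed to be an offset into the results
            "QueryString": "",
            "OrderBy": [{"Value": "postedDateDesc",
                         "PropertyName": "PostedDate",
                         "Ascending": False}],
            "Filters": filters,
        },
        "matchCriteria": {
            "PreferredJobs": [], "Educations": [],
            "LicenseAndCertifications": [], "Skills": [],
            "hasNoLicenses": False, "SkippedSkills": [],
        },
    }


def fetch_opportunities(top=62, skip=0):
    """POST the payload and return the list of openings (hypothetical helper)."""
    import requests  # third-party; imported here so build_payload works without it
    response = requests.post(URL, json=build_payload(top, skip), timeout=30)
    response.raise_for_status()
    return response.json()['opportunities']


if __name__ == '__main__':
    # Show the body that would be sent, without touching the network.
    print(json.dumps(build_payload(), indent=2))
```

Calling `fetch_opportunities()` should then give you the same list as `response.json()['opportunities']` in the code above, assuming the endpoint still behaves the same way.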

Here is the same thing using pandas (pip install pandas):

import requests
import pandas as pd
pd.set_option('display.width', 1000)

headers = {
    'Content-Type': 'application/json'
}

data = '{\n  "opportunitySearch": {\n    "Top": 62,\n    "Skip": 0,\n    "QueryString": "",\n    "OrderBy": [\n      {\n        "Value": "postedDateDesc",\n        "PropertyName": "PostedDate",\n        "Ascending": false\n      }\n    ],\n    "Filters": [\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 4,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 5,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 6,\n        "extra": null,\n        "values": [\n          \n        ]\n      }\n    ]\n  },\n  "matchCriteria": {\n    "PreferredJobs": [\n      \n    ],\n    "Educations": [\n      \n    ],\n    "LicenseAndCertifications": [\n      \n    ],\n    "Skills": [\n      \n    ],\n    "hasNoLicenses": false,\n    "SkippedSkills": [\n      \n    ]\n  }\n}'

response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
data = response.json()
df = pd.DataFrame.from_dict(data['opportunities'])
df = df[['Id', 'Title', 'RequisitionNumber', 'JobCategoryName', 'PostedDate']]
print(df.head(5))

The "Top": 62 value in the data is what limits the number of results you get back:

{
  "opportunitySearch": {
    "Top": 62,
    "Skip": 0,
    "QueryString": "",
    "OrderBy": [
      {
        "Value": "postedDateDesc",
        "PropertyName": "PostedDate",
        "Ascending": false
      }
    ],
    "Filters": [
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 4,
        "extra": null,
        "values": [

        ]
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 5,
        "extra": null,
        "values": [

        ]
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 6,
        "extra": null,
        "values": [

        ]
      }
    ]
  },
  "matchCriteria": {
    "PreferredJobs": [

    ],
    "Educations": [

    ],
    "LicenseAndCertifications": [

    ],
    "Skills": [

    ],
    "hasNoLicenses": false,
    "SkippedSkills": [

    ]
  }
}
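Hard-coding 62 stops working as soon as the number of openings changes. One option, assuming that "Skip" is an offset into the result list (the name suggests it, but I have not verified it against this API), is to request fixed-size pages and stop when a page comes back short. Here is a rough sketch; `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake stands in for a real function that would POST the body above with the given "Top" and "Skip".

```python
def fetch_all(fetch_page, page_size=20):
    """Collect every record by paging with Top/Skip.

    fetch_page(top, skip) must return a list of records; in real use it
    would wrap the requests.post call with "Top" and "Skip" filled in.
    """
    results = []
    skip = 0
    while True:
        page = fetch_page(page_size, skip)
        results.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        skip += page_size
    return results


def fake_fetch_page(top, skip):
    """Stand-in for the real endpoint: 62 fake openings, sliced like Top/Skip."""
    jobs = [{'Title': f'Job {i}'} for i in range(62)]
    return jobs[skip:skip + top]


all_jobs = fetch_all(fake_fetch_page)
print(len(all_jobs))  # 62
```

With the fake data this makes four requests (20, 20, 20, and 2 records). If the total happens to be an exact multiple of the page size, the loop makes one extra request that returns an empty page and then stops.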