Understanding bearer authorization for web scraping with Python 3.8 and Requests


I would like to scrape the following website:

https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland

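The problem I'm running into with the Python Requests library is that the request must carry an Authorization header containing some kind of token. I can get the scraper working if I open the page manually, copy the token, paste it into my program, and run it, but I'd like to know how to avoid that manual step. (After all, what's the point of a scraper if I still have to visit the site by hand to retrieve the authorization token?)

I'm not familiar with authorization/bearer headers, and I'm hoping someone can explain how the browser obtains the token and how I can replicate that. Here is my code: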

import requests
import json
import datetime

today = datetime.datetime.today()

url = "https://hyland.csod.com/services/x/career-site/v1/search"

# actual site: https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland

headers = {
    'authority': 'hyland.csod.com',
    'origin': 'https://hyland.csod.com',
    'authorization': 'Bearer eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCIsImNsaWQiOiI0bDhnbnFhbGk3NjgifQ.eyJzdWIiOi0xMDMsImF1ZCI6IjRxNTFzeG5oY25yazRhNXB1eXZ1eGh6eCIsImNvcnAiOiJoeWxhbmQiLCJjdWlkIjoxLCJ0emlkIjoxNCwibmJkIjoiMjAxOTEyMzEyMTE0MTU5MzQiLCJleHAiOiIyMDE5MTIzMTIyMTUxNTkzNCIsImlhdCI6IjIwMTkxMjMxMjExNDE1OTM0In0.PlNdWXtb1uNoMuGIhI093ZbheRN_DwENTlkNoVr0j7Zah6JHd5cukudVFnZEiQmgBZ_nlDU4C-9JO_2We380Vg',
    'content-type': 'application/json',
    'accept': 'application/json; q=1.0, text/*; q=0.8, */*; q=0.1',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'csod-accept-language': 'en-US',
    'referer': 'https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland',
    'accept-encoding': 'gzip, deflate, br',
    'cookie': 'CYBERU_lastculture=en-US; ASP.NET_SessionId=4q51sxnhcnrk4a5puyvuxhzx; cscx=hyland^|-103^|1^|14^|KumB4VhzYXML22MnMxjtTB9SKgHiWW0tFg0HbHnOek4=; c-s=expires=1577909201~access=/clientimg/hyland/*^!/content/hyland/*~md5=78cd5252d2efff6eb77d2e6bf0ce3127',
}



data = ['{"careerSiteId":4,"pageNumber":1,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}',
        '{"careerSiteId":4,"pageNumber":2,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}']

def hyland(url, data):
    """Post one search payload and print the job titles in the response."""
    dirty = requests.post(url, headers=headers, data=data).text

    # An expired or missing token produces an "Unauthorized" body, so stop
    # here instead of letting json.loads() fail on it below.
    if 'Unauthorized' in dirty:
        print(dirty)
        print("There was an error connecting. Check Info")
        return

    clean = json.loads(dirty)
    print("Openings at Hyland Software in Westlake as of {}".format(today.strftime('%m-%d-%Y')))
    # Iterate over the postings actually returned rather than a fixed range.
    requisitions = clean["data"]["requisitions"]
    for job in requisitions:
        print(job["displayJobTitle"])
        print("")
        print("")
    print("{} Openings at Hyland".format(len(requisitions)))

for datum in data:
    hyland(url, data=datum)

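So my code sends a POST request to the URL above, together with the headers and the required data, to retrieve what I want. The scraper works for a short while, but if I leave and come back a few hours later it no longer works, apparently because of the authorization (at least that is my conclusion).

Any help or explanation of how all this works would be appreciated.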
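The value after Bearer has the shape of a JSON Web Token: three base64url segments separated by dots. One way to check the expiry theory is to decode the middle segment and look at claims such as iat and exp. This is only a minimal sketch: decode_jwt_payload is a hypothetical helper, not part of Requests, and it reads the claims without verifying the signature.

```python
import base64
import json


def decode_jwt_payload(token):
    """Return the claims of a JWT as a dict, without verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a token of the form header.payload.signature")
    payload = parts[1]
    # JWTs drop the base64 padding, so restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

For the headers dict above, decode_jwt_payload(headers['authorization'].split(' ', 1)[1]) returns the claims as a dict, which can then be printed and inspected.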

python-3.x web-scraping python-requests authorization bearer-token
1 Answer

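Your code has a few problems:

  • As you mentioned, you have to obtain the bearer token.

  • You have to send your requests through requests.session(), because this site seems to pay attention to the cookies you send.

  • Optional: your headers contain many unnecessary entries that can be removed.

All in all, here is the working code: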
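A minimal sketch of these steps, under one loud assumption: that the career-site page embeds the token in its HTML as a "token":"..." field. Nothing in this thread confirms that, so the pattern may need adjusting after inspecting the page source. The helper names (extract_token, search_payload, fetch_job_titles) are hypothetical, and the session argument is expected to be a requests.Session so that the cookies from the first request accompany the second.

```python
import re

PAGE_URL = "https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland"
SEARCH_URL = "https://hyland.csod.com/services/x/career-site/v1/search"

# Assumption: the token appears in the page source as "token":"<jwt>".
TOKEN_PATTERN = re.compile(r'"token"\s*:\s*"([^"]+)"')


def extract_token(html):
    """Return the bearer token embedded in the page HTML, or raise ValueError."""
    match = TOKEN_PATTERN.search(html)
    if match is None:
        raise ValueError("no token found in page source")
    return match.group(1)


def search_payload(page_number, page_size=25):
    """Build the search body used in the question for one result page."""
    return {
        "careerSiteId": 4, "pageNumber": page_number, "pageSize": page_size,
        "cultureId": 1, "searchText": "", "cultureName": "en-US",
        "states": ["oh"], "countryCodes": [], "cities": [], "placeID": "",
        "radius": None, "postingsWithinDays": None,
        "customFieldCheckboxKeys": [], "customFieldDropdowns": [],
        "customFieldRadios": [],
    }


def fetch_job_titles(session, page_number=1):
    """Fetch a fresh token, then query one page of postings with it."""
    page = session.get(PAGE_URL)
    page.raise_for_status()
    token = extract_token(page.text)

    response = session.post(
        SEARCH_URL,
        headers={"authorization": "Bearer " + token},
        json=search_payload(page_number),  # also sets the JSON content-type
    )
    response.raise_for_status()
    requisitions = response.json()["data"]["requisitions"]
    return [job["displayJobTitle"] for job in requisitions]
```

To try it against the live site, import requests and call fetch_job_titles(requests.Session(), 1). Because the token is fetched on every call, an expired token is never reused.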


Hope this helps.
