Can't get the expected results from a website that issues POST requests

Question (votes: 0, answers: 1)

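I'm trying to fetch results from a webpage using the script below. These are the steps that populate the results on that site: click the AGREE button at the bottom of this webpage, then the EDIT SEARCH button, and finally the SHOW RESULTS button without changing anything.

I've tried the following: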

import requests
from bs4 import BeautifulSoup

url = 'http://finra-markets.morningstar.com/BondCenter/Results.jsp'
post_url = 'http://finra-markets.morningstar.com/bondSearch.jsp'

payload = {
    'postData': {'Keywords':[]},
    'ticker': '',
    'startDate': '',
    'endDate': '',
    'showResultsAs': 'B',
    'debtOrAssetClass': '1,2',
    'spdsType': ''
}

payload_second = {
    'count': '20',
    'searchtype': 'B',
    'query': {"Keywords":[{"Name":"debtOrAssetClass","Value":"3,6"},{"Name":"showResultsAs","Value":"B"}]},
    'sortfield': 'issuerName',
    'sorttype': '1',
    'start': '0',
    'curPage': '1'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    s.headers['Referer'] = 'http://finra-markets.morningstar.com/BondCenter/UserAgreement.jsp'
    r = s.post(url,json=payload)
    s.headers['Access-Control-Allow-Headers'] = r.headers['Access-Control-Allow-Headers']
    s.headers['cf-request-id'] = r.headers['cf-request-id']
    s.headers['CF-RAY'] = r.headers['CF-RAY']
    s.headers['X-Requested-With'] = 'XMLHttpRequest'
    s.headers['Origin'] = 'http://finra-markets.morningstar.com'
    s.headers['Referer'] = 'http://finra-markets.morningstar.com/BondCenter/Results.jsp'
    r = s.post(post_url,json=payload_second)
    print(r.content)

This is the output I get when I run the script above:

b'\n\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\n\n\n{}'

How can I make the script return the expected results from that site?

P.S. I don't want to use selenium for this task.

python python-3.x web-scraping http-post
1 Answer
2 votes

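The response from http://finra-markets.morningstar.com/BondCenter/Results.jsp does not contain the search results, so the page must be fetching the data asynchronously.

An easy way to find out which network request returns the search results is to take a value from one of the results and search the recorded requests for it in Firefox's Dev Tools: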

The search button

Searching the requests

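To convert the HTTP request into a Python request, I copied the request from Firefox as cURL, imported it into Postman, and then exported it as Python code (a bit long-winded (and lazy), I know!).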

Copy request as CURL

Import button in Postman

Import CURL request in Postman

The 'code' button in Postman

The dropdown to select the code language in Postman

All of that resulted in the following code:

import requests

url = "http://finra-markets.morningstar.com/bondSearch.jsp"

payload = "count=20&searchtype=B&query=%7B%22Keywords%22%3A%5B%7B%22Name%22%3A%22debtOrAssetClass%22%2C%22Value%22%3A%223%2C6%22%7D%2C%7B%22Name%22%3A%22showResultsAs%22%2C%22Value%22%3A%22B%22%7D%5D%7D&sortfield=issuerName&sorttype=1&start=0&curPage=1"
headers = {
    'User-Agent': "...",
    'Accept': "text/plain, */*; q=0.01",
    'Accept-Language': "en-US,en;q=0.5",
    'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8",
    'X-Requested-With': "XMLHttpRequest",
    'Origin': "http://finra-markets.morningstar.com",
    'DNT': "1",
    'Connection': "keep-alive",
    'Referer': "http://finra-markets.morningstar.com/BondCenter/Results.jsp",
    'Cookie': "...",
    'cache-control': "no-cache"
    }

response = requests.request("POST", url, data=payload, headers=headers)

print(response.text)
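
The payload string above is just the form-encoded version of the dictionary from the question, with the nested query value serialized to a JSON string first. Here is a minimal sketch of building it with the standard library instead of copying it from Postman. The helper name build_payload and its parameters are made up for illustration. This likely also explains the empty {} in the question: json=payload_second sends a JSON body, while the captured request is form-encoded.

```python
import json
from urllib.parse import urlencode


def build_payload(query, count=20, start=0, cur_page=1):
    """Form-encode the search parameters (hypothetical helper, not a site API)."""
    fields = {
        'count': str(count),
        'searchtype': 'B',
        # The nested query travels as a compact JSON string, not as a nested object.
        'query': json.dumps(query, separators=(',', ':')),
        'sortfield': 'issuerName',
        'sorttype': '1',
        'start': str(start),
        'curPage': str(cur_page),
    }
    return urlencode(fields)


query = {"Keywords": [{"Name": "debtOrAssetClass", "Value": "3,6"},
                      {"Name": "showResultsAs", "Value": "B"}]}
payload = build_payload(query)
print(payload)
```

The printed string can be passed as data=payload exactly like the hard-coded one. Since requests form-encodes a dict given to data=, passing the fields dict itself (with the query already dumped to a string) should be equivalent.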

The response is not 100% JSON, so I just stripped the outer whitespace and then the {B:..} wrapper:

>>> text = response.text.strip()[3:-1]
>>> import json
>>> data = json.loads(text)
>>> data['Columns'][0]
{'moodyRating': {'ratingText': '', 'ratingNumber': 0},
 'fitchRating': {'ratingText': None, 'ratingNumber': None},
 'standardAndPoorRating': {'ratingText': '', 'ratingNumber': 0},
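
The fixed slice [3:-1] silently produces garbage if the wrapper ever looks different, for example the bare {} from the question. A slightly more defensive sketch checks the wrapper before parsing. The function name unwrap_response and the sample string are made up for illustration, and the sample only imitates the shape described above rather than a real response.

```python
import json


def unwrap_response(text):
    """Strip the non-JSON '{B:' ... '}' wrapper and parse what is inside."""
    body = text.strip()
    if not (body.startswith('{B:') and body.endswith('}')):
        raise ValueError('unexpected response wrapper: %r' % body[:20])
    return json.loads(body[3:-1])


# Made-up sample in the assumed shape; real responses carry far more fields.
sample = '\n\n\r\n{B:{"Columns":[{"issuerName":"Example Corp"}]}}\n'
data = unwrap_response(sample)
print(data['Columns'][0]['issuerName'])
```

With this check, an empty response like the one in the question raises a ValueError instead of failing later inside json.loads.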