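I am trying to use Python and BeautifulSoup to scrape all the data in the table, across every page of this website, into a dictionary, as shown in the code below. However, it only returns an empty list.
I am also trying to scrape each company, which has its own separate page, into the dictionary.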
from bs4 import BeautifulSoup
import requests
from pprint import pprint
case_data = []
case_url = 'https://www.dataquest.io'
case_page = requests.get(case_url)
soup_case = BeautifulSoup(case_page.content, 'html.parser')
case_table = soup_case.find('div',{'class':'slds-table slds-table--bordered slds-max-medium-table_stacked cCaseList'})
pprint(case_table)
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
import pandas as pd
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://masked_per_user_request/")
time.sleep(2)
df = pd.read_html(driver.page_source)[0]
df.to_csv('result.csv', index=False)
driver.quit()
Output: click here
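Note that the data is rendered from JSON that the back end returns to an XHR request. You can therefore call that XHR URL directly with a POST request that includes the JSON body data and the cookies, along the lines of the following: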
import requests
data = {
'message': '{"actions":[{"id":"108;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.richText.RichTextController/ACTION$getParsedRichTextValue","callingDescriptor":"UNKNOWN","params":{"html":"<p style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The RSPO aspires to ensure transparency throughout the complaints process and the reporting thereof. Decisions not to disclose information through the RSPO website require motivation on genuine grounds that disclosure will go against the interest of the complaints process and/or may jeopardize the well-being or safety of stakeholders involved, and that non-disclosure does not undermine adherence to the principles and objectives of RSPO:</span></p><p><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The non-disclosed information relates to a legitimate aim, i.e. peaceful and constructive resolution of complaints in accordance with RSPO objectives and P&C;</span></li></ul><p><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The disclosure of said information threatens harm to that aim; and</span></li></ul><p style=\"text-align: justify;\"><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The harm to the aim is greater than the public interest in having the information disclosed.</span></li></ul><p> </p>"},"version":"47.0","storable":true},{"id":"88;a","descriptor":"apex://ComplaintsCaseController/ACTION$searchCaseList","callingDescriptor":"markup://c:CaseList","params":{"searchString":"","pageNumber":1,"defaultPageSize":"10"}},{"id":"111;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.controller.HeadlineController/ACTION$getInitData","callingDescriptor":"UNKNOWN","params":{"uniqueNameOrId":"","pageType":""},"version":"47.0","storable":true}]}',
'aura.context': '{"mode":"PROD","fwuid":"5fuxCiO1mNHGdvJphU5ELQ","app":"siteforce:communityApp","loaded":{"APPLICATION@markup://siteforce:communityApp":"0luQG4JZE_TU28tAfQgGSA"},"dn":[],"globals":{},"uad":false}',
'aura.pageURI': '/Complaint/s/casetracker',
'aura.token': 'undefined'
}
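# Aura endpoints normally read these values as form fields, so they are sent with data= rather than json=.
r = requests.post("https://masked_per_user_request/", data=data).json()
print(r)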
You will need to work out the cookies parameter yourself.
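One possible way to handle both the cookies and the "all pages" part of the question is sketched below. It is only a sketch under several assumptions: the helper names `build_payload` and `fetch_case_page` are hypothetical, the `searchCaseList` action is assumed to work when sent on its own with a different `pageNumber`, the `fwuid` and `loaded` values are copied from the request above and may have changed, and the endpoint is assumed to accept form-encoded fields. Building the message with `json.dumps` also avoids hand-escaping quotes inside a Python string.

```python
import json

# Descriptor copied from the request body shown above.
SEARCH_DESCRIPTOR = "apex://ComplaintsCaseController/ACTION$searchCaseList"


def build_payload(page_number, page_size=10):
    """Build the form fields for one page of the case list (hypothetical helper)."""
    message = {
        "actions": [
            {
                "id": "88;a",
                "descriptor": SEARCH_DESCRIPTOR,
                "callingDescriptor": "markup://c:CaseList",
                "params": {
                    "searchString": "",
                    "pageNumber": page_number,
                    "defaultPageSize": str(page_size),
                },
            }
        ]
    }
    # Context values copied from the request above; they may need refreshing.
    context = {
        "mode": "PROD",
        "fwuid": "5fuxCiO1mNHGdvJphU5ELQ",
        "app": "siteforce:communityApp",
        "loaded": {
            "APPLICATION@markup://siteforce:communityApp": "0luQG4JZE_TU28tAfQgGSA"
        },
        "dn": [],
        "globals": {},
        "uad": False,
    }
    return {
        "message": json.dumps(message),
        "aura.context": json.dumps(context),
        "aura.pageURI": "/Complaint/s/casetracker",
        "aura.token": "undefined",
    }


def fetch_case_page(session, page_url, xhr_url, page_number):
    """Fetch one page of results; `session` is expected to be a requests.Session."""
    # Visiting the normal page first lets the session store any cookies the site sets.
    session.get(page_url)
    # Assumption: form-encoded fields; use json= instead if the site expects a JSON body.
    response = session.post(xhr_url, data=build_payload(page_number))
    response.raise_for_status()
    return response.json()
```

With `requests` installed, a call might look like `fetch_case_page(requests.Session(), page_url, xhr_url, 1)`, where `page_url` and `xhr_url` stand in for the masked addresses. Looping over `page_number` until a response contains no more cases would then cover every page, although the stopping condition depends on the response structure, which is not shown here.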