I have the following content returned from a Python request:
{"error":{"ErrorMessage":"
<div>
<p>To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here
<a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/p><\\/div>","CodeName":"Success","ErrorStatus":0},"calendar":{"calendar":"
<div class=\\"wsResponse\\">To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here
<a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/div>"},"binCollections":{"tile":[["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Organic Collection Service (Brown Organic Bin)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/brown bin.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your brown organic bin collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Friday, 29 May 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3427\\">Read more about the Organic Collection Service ><\\/a><\\/div><\\/div>"],["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Recycling Collection Service (Recycling Sacks)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/SH_two_rec_sacks.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your recycling sacks collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Friday, 29 May 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3383\\">Read more about the Recycling Collection Service ><\\/a><\\/div><\\/div>"],["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Refuse Collection Service (Grey Refuse Bin)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/grey bin.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your grey refuse bin collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Thursday, 04 June 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3384\\">Read more about the Refuse Collection Service ><\\/a><\\/div><\\/div>"]]}}
I want to extract the following for each collectionDiv (there are 3):

[Organic Collection Service (Brown Organic Bin), Friday, 29 May 2020]
[Recycling Collection Service (Recycling Sacks), Friday, 29 May 2020]
[Refuse Collection Service (Grey Refuse Bin), Thursday, 04 June 2020]

Thanks for any help.
The request code:
import requests

url = "https://southhams.fccenvironment.co.uk/mycollections"
response = requests.get(url)
cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name, cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"
payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}
response = requests.post(url, headers=headers, data=payload)
print(response.status_code)
You are getting JSON back directly, so you can index into it to get the HTML value. Once you have that, parse the HTML with BeautifulSoup and print out the text found in the tags of interest:
import requests
from bs4 import BeautifulSoup

url = "https://southhams.fccenvironment.co.uk/mycollections"
response = requests.get(url)
cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name, cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"
payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}
jsonData = requests.post(url, headers=headers, data=payload).json()

data = jsonData['binCollections']['tile']
for each in data:
    soup = BeautifulSoup(each[0], 'html.parser')
    collection = soup.find('div', {'class': 'collectionDiv'}).find('h3').text.strip()
    date = soup.find_all('b')[-1].text.strip()
    print(collection, date)
Output:
Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020
Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020
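If installing beautifulsoup4 is not an option, the same two fields can be recovered with the standard library's html.parser module. This is a minimal sketch, run here against a shortened tile string modeled on the response in the question (the real code would feed it each `tile` entry from the JSON):

```python
from html.parser import HTMLParser

class TileParser(HTMLParser):
    """Collects the <h3> service name and the text of every <b> tag."""
    def __init__(self):
        super().__init__()
        self._tag = None     # tag we are currently inside, if h3 or b
        self.service = ""    # accumulated <h3> text
        self.bold = []       # text of each <b>; the last one is the date

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.bold.append("")
        if tag in ("h3", "b"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("h3", "b"):
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h3":
            self.service += data
        elif self._tag == "b":
            self.bold[-1] += data

# shortened sample tile, modeled on the JSON shown in the question
tile = ("<div class='collectionDiv'><h3>Organic Collection Service "
        "(Brown Organic Bin)</h3>Your next scheduled collection is "
        "<b>Friday, 29 May 2020</b>.</div>")

p = TileParser()
p.feed(tile)
print(p.service, p.bold[-1])
# → Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
```

This trades bs4's convenience for zero dependencies; for the malformed markup this site returns, both parsers are equally forgiving since neither validates nesting.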
The HTML document this particular site returns is malformed. I still managed to work around it, although inefficiently (it scans on the order of 1000 tags), so this can be improved:
headers = soup.find_all('h3')
names = [tag.text[:tag.text.find('<')] for tag in headers]
dates = [tag.find_all('b')[2].text[:tag.find_all('b')[2].text.find('<')] for tag in headers]
print(names)
print(dates)

#Output
['Organic Collection Service (Brown Organic Bin)', 'Recycling Collection Service (Recycling Sacks)', 'Refuse Collection Service (Grey Refuse Bin)']
['Friday, 29 May 2020', 'Friday, 29 May 2020', 'Thursday, 04 June 2020']
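Because the markup is malformed anyway, a plain regular expression over the raw payload sidesteps the tag scanning entirely. A sketch over a trimmed sample of the response; it assumes the date is always the `<b>` immediately following the phrase "next scheduled collection is", and that the escaped `\/` from the JSON may or may not still be present:

```python
import re

# trimmed sample modeled on the raw response in the question,
# with the JSON-escaped closing tags (<\/h3>, <\/b>) left in place
raw = r"""
<h3>Organic Collection Service (Brown Organic Bin)<\/h3>
Your next scheduled collection is <b>Friday, 29 May 2020<\/b>.
<h3>Refuse Collection Service (Grey Refuse Bin)<\/h3>
Your next scheduled collection is <b>Thursday, 04 June 2020<\/b>.
"""

# <\\?/...> tolerates both escaped (<\/h3>) and plain (</h3>) closers
names = re.findall(r"<h3>(.*?)<\\?/h3>", raw)
dates = re.findall(r"next scheduled collection is\s*<b>(.*?)<\\?/b>", raw)
print(list(zip(names, dates)))
```

If real date objects are needed for sorting or comparison, the extracted strings parse with `datetime.strptime(date, '%A, %d %B %Y')` under the default English locale.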