Best way to extract specific parts from an HTML / JSON page?

Question    Votes: 1    Answers: 2

I have the following content returned from a Python request:

{"error":{"ErrorMessage":"
<div>
<p>To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here 
    <a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/p><\\/div>","CodeName":"Success","ErrorStatus":0},"calendar":{"calendar":"
        <div class=\\"wsResponse\\">To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here 
            <a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/div>"},"binCollections":{"tile":[["
                <div class=\'collectionDiv\'>
                    <div class=\'fullwidth\'>
                        <h3>Organic Collection Service (Brown Organic Bin)<\\/h3><\\/div>
                            <div class=\\"collectionImg\\">
                                <img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/brown bin.png\\" \\/><\\/div>\\n                    
                                <div class=\'wdshDetWrap\'>Your brown organic bin collection is 
                                    <b>Fortnightly<\\/b> on a 
                                        <b>Thursday<\\/b>.
                                            <br\\/> \\n                    Your next scheduled collection is 
                                            <b>Friday, 29 May 2020<\\/b>. 
                                                <br\\/>
                                                <br\\/>
                                                <a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3427\\">Read more about the Organic Collection Service &gt;<\\/a><\\/div><\\/div>"],["
                                                    <div class=\'collectionDiv\'>
                                                        <div class=\'fullwidth\'>
                                                            <h3>Recycling Collection Service (Recycling Sacks)<\\/h3><\\/div>
                                                                <div class=\\"collectionImg\\">
                                                                    <img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/SH_two_rec_sacks.png\\" \\/><\\/div>\\n                    
                                                                    <div class=\'wdshDetWrap\'>Your recycling sacks collection is 
                                                                        <b>Fortnightly<\\/b> on a 
                                                                            <b>Thursday<\\/b>.
                                                                                <br\\/> \\n                    Your next scheduled collection is 
                                                                                <b>Friday, 29 May 2020<\\/b>. 
                                                                                    <br\\/>
                                                                                    <br\\/>
                                                                                    <a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3383\\">Read more about the Recycling Collection Service &gt;<\\/a><\\/div><\\/div>"],["
                                                                                        <div class=\'collectionDiv\'>
                                                                                            <div class=\'fullwidth\'>
                                                                                                <h3>Refuse Collection Service (Grey Refuse Bin)<\\/h3><\\/div>
                                                                                                    <div class=\\"collectionImg\\">
                                                                                                        <img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/grey bin.png\\" \\/><\\/div>\\n                    
                                                                                                        <div class=\'wdshDetWrap\'>Your grey refuse bin collection is 
                                                                                                            <b>Fortnightly<\\/b> on a 
                                                                                                                <b>Thursday<\\/b>.
                                                                                                                    <br\\/> \\n                    Your next scheduled collection is 
                                                                                                                    <b>Thursday, 04 June 2020<\\/b>. 
                                                                                                                        <br\\/>
                                                                                                                        <br\\/>
                                                                                                                        <a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3384\\">Read more about the Refuse Collection Service &gt;<\\/a><\\/div><\\/div>"]]}}

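For each of the three collectionDiv blocks, I want to extract the following:

Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020

Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020

Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020

Thanks for looking.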

Request code:

    import requests

    # First request: obtain the session cookie
    url = "https://southhams.fccenvironment.co.uk/mycollections"
    response = requests.request("GET", url)

    cookiejar = response.cookies
    for cookie in cookiejar:
        print(cookie.name, cookie.value)

    # Second request: `cookie` is whichever cookie the loop above saw last
    url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"
    payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
    headers = {
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
    }

    response = requests.request("POST", url, headers=headers, data=payload)
    print(response.status_code)

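Answer 1 (3 votes):

You can get the JSON directly and then pull the HTML values out of it. Once you have them, parse the HTML with BeautifulSoup and print the text of the tags you find: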

    import requests
    from bs4 import BeautifulSoup

    url = "https://southhams.fccenvironment.co.uk/mycollections"
    response = requests.get(url)

    cookiejar = response.cookies
    for cookie in cookiejar:
        print(cookie.name, cookie.value)

    url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"
    payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
    headers = {
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
    }

    # .json() decodes the response body, so the HTML fragments come back unescaped
    jsonData = requests.post(url, headers=headers, data=payload).json()

    data = jsonData['binCollections']['tile']
    for each in data:
        soup = BeautifulSoup(each[0], 'html.parser')
        collection = soup.find('div', {'class': 'collectionDiv'}).find('h3').text.strip()
        # The last <b> tag in each tile holds the next collection date
        date = soup.find_all('b')[-1].text.strip()
        print(collection, date)

Output:

    Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
    Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020
    Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020
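
If BeautifulSoup is not available, the same idea (decode the JSON first, then read the `<h3>` text and the last `<b>` text of each tile) can be sketched with only the standard library. This is an illustrative alternative, not the answerer's code: `TileParser`, `summarise`, and the shortened `sample` payload are hypothetical names and data that merely mirror the shape of the response shown in the question.

```python
import json
from html.parser import HTMLParser


class TileParser(HTMLParser):
    """Collect the text of the <h3> tag and of every <b> tag in one tile."""

    def __init__(self):
        super().__init__()
        self._current = None   # tag whose text is currently being captured
        self.title = ""
        self.bold = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "b"):
            self._current = tag
            if tag == "b":
                self.bold.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h3":
            self.title += data
        elif self._current == "b":
            self.bold[-1] += data


def summarise(raw_json):
    """Return a (service name, next collection date) pair for each tile."""
    tiles = json.loads(raw_json)["binCollections"]["tile"]
    results = []
    for tile in tiles:
        parser = TileParser()
        parser.feed(tile[0])
        parser.close()
        # Assumes, as in the response above, that the last <b> is the date.
        results.append((parser.title.strip(), parser.bold[-1].strip()))
    return results


# Hypothetical, shortened payload with the same shape as the real response.
sample = json.dumps({"binCollections": {"tile": [
    ["<div class='collectionDiv'>"
     "<h3>Refuse Collection Service (Grey Refuse Bin)</h3>"
     "<div>Your grey refuse bin collection is <b>Fortnightly</b> on a "
     "<b>Thursday</b>.<br/>Your next scheduled collection is "
     "<b>Thursday, 04 June 2020</b>.</div></div>"],
]}})

for name, date in summarise(sample):
    print(name, date)
```

The sketch relies on position (the last `<b>` in a tile), just as the BeautifulSoup version does, so it would need adjusting if the site changes its markup.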
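
Answer 2 (0 votes):

The HTML from this particular site is not well-formed. I still managed to work around it, although the approach is inefficient at a scale of around 1000 tags, so it could be improved.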

    # Assumption: `soup` was built from the raw, still JSON-escaped response text,
    # so unparsed closing tags such as <\/h3> remain inside each tag's text.
    headers = soup.find_all('h3')
    # Keep only the text before the first '<'
    names = [tag.text[:tag.text.find('<')] for tag in headers]
    dates = [tag.find_all('b')[2].text[:tag.find_all('b')[2].text.find('<')] for tag in headers]
    print(names)
    print(dates)

    # Output
    # ['Organic Collection Service (Brown Organic Bin)', 'Recycling Collection Service (Recycling Sacks)', 'Refuse Collection Service (Grey Refuse Bin)']
    # ['Friday, 29 May 2020', 'Friday, 29 May 2020', 'Thursday, 04 June 2020']
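
One likely reason the HTML looks malformed is that it is still JSON-escaped when read as raw text: JSON allows `/` to be written as `\/`, so closing tags appear as `<\/h3>` until the body is decoded. The sketch below illustrates this with the standard `json` module. The `raw` string is a hypothetical, shortened fragment, and it assumes the real response uses this standard escape.

```python
import json

# Hypothetical fragment of the raw response body, before JSON decoding.
raw = r'{"tile": "<h3>Organic Collection Service (Brown Organic Bin)<\/h3>"}'

# In the raw text the closing tag is written as <\/h3>, which an HTML
# parser does not recognise as a closing tag.
print(r"<\/h3>" in raw)

# json.loads turns the JSON escape \/ back into a plain slash,
# leaving an ordinary, well-formed <h3>...</h3> element.
decoded = json.loads(raw)["tile"]
print(decoded)
```

Decoding first (for example with `response.json()` in requests) should therefore make the truncation at `'<'` unnecessary.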

python html json beautifulsoup lxml