Iterating over 24 pages from A to Z with a parser script: fetching the hospital lists from the USA down to Cyprus via requests and XPath...

I am currently building a scraper for "Doctors and medical facilities: worldwide list",

which lists English-speaking doctors, medical facilities and practitioners around the world, to help British nationals abroad get health care.

Note: there is an A-to-Z list:

https://www.gov.uk/government/collections/doctors-and-medical-facilities-worldwide-list#b

See, for example, the list of hospitals for Cyprus and the list of medical facilities for Israel:

https://www.gov.uk/government/publications/cyprus-list-of-hospitals
https://www.gov.uk/government/publications/israel-list-of-medical-facilities

My approach is to first get an overview of the pages.

As a first step, I pick one page out of the many (see:

https://www.gov.uk/government/collections/doctors-and-medical-facilities-worldwide-list#b

I added a function to handle the case where the table is not found, plus some error handling to make the script more robust:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape data from a given URL
def scrape_medical_facilities(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the table
    table = soup.find('table')
    # Check if the table exists
    if table:
        # Initialize lists to store data
        names = []
        addresses = []
        # Iterate through rows in the table
        for row in table.find_all('tr')[1:]:  # Skip header row
            columns = row.find_all('td')
            name = columns[0].get_text(strip=True)
            address = columns[1].get_text(strip=True)
            names.append(name)
            addresses.append(address)
        # Create a DataFrame
        df = pd.DataFrame({'Name': names, 'Address': addresses})
    else:
        # If the table is not found, create an empty DataFrame
        df = pd.DataFrame(columns=['Name', 'Address'])
    return df

# URLs for medical facilities in Israel
israel_medical_facilities_urls = [
    'https://www.gov.uk/government/publications/israel-list-of-medical-facilities',
    # Add more URLs if there are multiple pages
]

# Scrape data from each URL and concatenate into a single DataFrame
df_israel = pd.concat(
    [scrape_medical_facilities(url) for url in israel_medical_facilities_urls],
    ignore_index=True)

# Save the DataFrame to a CSV file
df_israel.to_csv('israel_medical_facilities.csv', index=False)
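One way to check the table-parsing logic and the empty-table fallback without hitting the network is to run it on an inline HTML sample. The sketch below factors the parsing out of the HTTP request; the function name and the sample markup are mine for illustration, not taken from gov.uk:

```python
from bs4 import BeautifulSoup
import pandas as pd

def parse_facilities(html):
    """Parse a Name/Address table from raw HTML; return an empty frame if no table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if not table:
        # No table on this page: return an empty DataFrame with the same columns
        return pd.DataFrame(columns=['Name', 'Address'])
    rows = []
    for row in table.find_all('tr')[1:]:   # skip the header row
        cols = row.find_all('td')
        if len(cols) >= 2:                 # guard against short/malformed rows
            rows.append({'Name': cols[0].get_text(strip=True),
                         'Address': cols[1].get_text(strip=True)})
    return pd.DataFrame(rows, columns=['Name', 'Address'])

# Illustrative sample mimicking a two-column facilities table
sample = """
<table>
  <tr><th>Name</th><th>Address</th></tr>
  <tr><td>Example Clinic</td><td>1 Example St</td></tr>
</table>
"""
df = parse_facilities(sample)
print(df)
print(parse_facilities('<p>no table here</p>').empty)
```

Splitting fetching from parsing this way also makes the real scraper easier to harden later (timeouts, `response.raise_for_status()`), since the parsing half stays testable offline.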
python apache web-scraping google-colaboratory
1 Answer
Your question is not clear, but I think I understand what you want: you want to get all the links from A to Z. I have this snippet that does the job:

s = BeautifulSoup(requests.get('https://www.gov.uk/government/collections/doctors-and-medical-facilities-worldwide-list#b').text, 'html.parser')
links = [div.find("a") for div in s.find_all('div', {'class': 'gem-c-document-list__item-title'})]
Basically, it finds all the a tags inside the div tags that have the class gem-c-document-list__item-title.

If this doesn't answer your question, please try rephrasing it and post the correct links.
