BeautifulSoup: iterating over 24 letter pages (A to Z) fails: reducing complexity to get a first look at the dataset

Question (votes: 0, answers: 1)

I have a list of Spanish insurers on a website, grouped under 24 letter headings; see the following:

Insurers in Spain, full list: https://www.unespa.es/en/directory

Split into 24 pages, from https://www.unespa.es/en/directory/#A to https://www.unespa.es/en/directory/#Z

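The idea and the goal: I want to fetch the data from these pages with BS4 and requests and finally save it to a DataFrame. Scraping a list from a website with BeautifulSoup (BS4) and requests in Python seems like a suitable approach; I think we need the following steps: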

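a. First, import the necessary libraries: BeautifulSoup, requests and pandas.
b. Then use the requests library to fetch the HTML content of each page of interest, i.e. pages A to Z.
c. Then parse the HTML content with BeautifulSoup.
d. Next, extract the relevant information (the insurer names) from the parsed HTML.
e. Finally, store the extracted data in a pandas DataFrame.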

But this does not work, and neither does the iteration from A to Z:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/"

# List to store all insurers
all_insurers = []

# Loop through each page (A to Z)
for char in range(65, 91):  # ASCII codes for A to Z
    page_url = f"{base_url}#{chr(char)}"
    insurers = scrape_insurers(page_url)
    all_insurers.extend(insurers)

# Convert the list of insurers to a pandas DataFrame
df = pd.DataFrame({'Insurer': all_insurers})

# Display the DataFrame
print(df.head())

# Save DataFrame to a CSV file
df.to_csv('insurers_spain.csv', index=False)

...it fails with the following result:

Failed to retrieve data from https://www.unespa.es/en/directory/#A
Failed to retrieve data from https://www.unespa.es/en/directory/#B
Failed to retrieve data from https://www.unespa.es/en/directory/#C
Failed to retrieve data from https://www.unespa.es/en/directory/#D
Failed to retrieve data from https://www.unespa.es/en/directory/#E

and so on.

Well, I think it will be easier to reduce the complexity first.

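I think it is best to take a single URL that I want to access and test what the request returns. Once that is done, I can evaluate the response and then use the Beautiful Soup library to look for the specific fields the entries have in common. I should avoid doing three things in one step, which can obviously go badly wrong.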

So I did this for the first character, A:

import requests
from bs4 import BeautifulSoup

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/#"

# Define the character we want to fetch data for
char = 'A'

# Construct the URL for the specified character
url = base_url + char

# Fetch and print data for the specified character
insurers_char = scrape_insurers(url)
print(f"Insurers for character '{char}':")
print(insurers_char)

But see the output here:

Failed to retrieve data from https://www.unespa.es/en/directory/#A
Insurers for character 'A':
[]
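
One check that needs no network access is to look at which URL actually gets requested. The part after `#` is a URL fragment; HTTP clients keep it on the client side and do not send it to the server, so the 26 URLs built in the loop all target the same page. A minimal sketch using only the standard library (`urldefrag` from `urllib.parse`); the variable names `urls` and `requested` are illustrative:

```python
import string
from urllib.parse import urldefrag

base_url = "https://www.unespa.es/en/directory/"

# Build the same 26 URLs as the A-to-Z loop (#A ... #Z)
urls = [f"{base_url}#{letter}" for letter in string.ascii_uppercase]

# Strip the fragment from each URL; what remains is what a request targets
requested = {urldefrag(u).url for u in urls}

print(len(urls))   # 26
print(requested)   # {'https://www.unespa.es/en/directory/'}
```

If that holds, the identical failure for every letter points at the request itself (a status code other than 200) rather than at the letter in the URL; printing `response.status_code` in the `else` branch of `scrape_insurers` would show which code the server returned.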
python dataframe web-scraping beautifulsoup request
1 answer
0 votes

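Try the following. The `#A` to `#Z` suffixes are URL fragments, which are not sent to the server, so a single request to the directory page is enough; the request also sends a browser-like User-Agent header: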

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.unespa.es/en/directory/"

# Send a browser-like User-Agent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

data = []
for c in soup.select(".contact-item"):
    # Unwrap inline <span>/<a> tags and merge adjacent strings so that
    # each "Label: value" pair becomes a single text node
    for t in c.select("span, a"):
        t.unwrap()
    c.smooth()

    # The first text chunk is the company name; the rest are "Label: value" pairs
    title, *other = c.get_text(separator="|||", strip=True).split("|||")
    data.append(
        {"Title": title, **{(s := d.split(":", maxsplit=1))[0]: s[1] for d in other}}
    )

df = pd.DataFrame(data)
print(df)

Prints:

                                                                                      Title                         Tfno.                           Fax                                                         Web                                                                                           Dirección                                          Email
0                               A.M.A., AGRUPACIÓN MUTUAL ASEGURADORA, MUTUA DE SEGUROS APF                  91 343 47 00                (91) 343 47 68                                   http://www.amaseguros.com                                                              VÍA DE LOS POBLADOS, 3 28033  (MADRID)                                            NaN
1                                                  ABANCA GENERALES DE SEGUROS Y REASEGUROS         881920742 / 881920744                           NaN                                                         NaN                                                  AV. LINARES RIVAS 30, 3º 15005 A CORUÑA (A CORUÑA)                                            NaN
2                                     ABANCA VIDA Y PENSIONES DE SEGUROS Y REASEGUROS, S.A.                   981 188 075                           NaN                                                         NaN                                         AVENIDA DE LA MARINA, 1-3ª PLANTA 15001 A CORUÑA (A CORUÑA)                                            NaN
3                                          ADMIRAL EUROPE COMPAÑIA DE SEGUROS S.A.U. (AECS)                           NaN                           NaN                              https://www.admiraleurope.com/                                               RODRÍGUEZ MARÍN, 61 - 1ª PLANTA 28016 MADRID (MADRID)                                            NaN
4                                    AEGON ESPAÑA, SOCIEDAD ANÓNIMA DE SEGUROS Y REASEGUROS                  91 563 62 22                           NaN                                         http://www.aegon.es                 VÍA DE LOS POBLADOS, 3 - EDIFICIO 4B - PARQUE EMPRESARIAL CRISTALIA 28033  (MADRID)                                            NaN
5                                          AGROPELAYO SOCIEDAD DE SEGUROS, SOCIEDAD ANÓNIMA                           NaN                           NaN                                                         NaN                                                             SANTA ENGRACIA, 67 - 69 28010  (MADRID)                                            NaN


...
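
The dictionary comprehension in the answer assumes that every text chunk after the title contains a colon. A slightly more defensive variant of that one step could look like the sketch below. `parse_contact_chunks` is a hypothetical helper (not part of the answer or of any library) that takes the already-split chunks, strips whitespace, and skips chunks without a colon. The sample chunks are assembled from the first row of the output above; the exact chunk text on the page is an assumption.

```python
def parse_contact_chunks(chunks):
    """Turn ['NAME', 'Label: value', ...] into a flat dict (hypothetical helper)."""
    title, *other = chunks
    record = {"Title": title.strip()}
    for chunk in other:
        # partition splits on the first colon only, so values such as
        # 'http://...' keep their own colon
        label, sep, value = chunk.partition(":")
        if not sep:
            # No colon at all: not a 'Label: value' pair, skip it
            continue
        record[label.strip()] = value.strip()
    return record


# Sample chunks built from the first row of the printed DataFrame (assumed format)
sample = [
    "A.M.A., AGRUPACIÓN MUTUAL ASEGURADORA, MUTUA DE SEGUROS APF",
    "Tfno.: 91 343 47 00",
    "Fax: (91) 343 47 68",
    "Web: http://www.amaseguros.com",
    "stray text without a colon",
]
print(parse_contact_chunks(sample))
```

Inside the answer's loop, this would stand in for the dictionary built in `data.append(...)`, called on the list returned by `c.get_text(separator="|||", strip=True).split("|||")`.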