How to crawl all pages of a website to find specific content

Question (votes: 0, answers: 1)

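I am trying to find all profiles on a website that contain a keyword. Specifically, I am looking for football players who have Jamaica as their second nationality, which cannot be done with a simple query on the site. Profiles are stored in the following format: https://www.soccerdonna.de/en/[PERSON_NAME]/profile/[VARIABLE].html, for example https://www.soccerdonna.de/en/liya-brooks/profil/spieler_69968.html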

For each profile, I am looking for the keyword "Jamaica".

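I tried the BeautifulSoup library, but it does not find all the profiles: it only searches for the keyword on the specific page it is given, and it returns nothing.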

import requests
from bs4 import BeautifulSoup


def extract_profiles_with_keyword(url, keyword):
    # Fetch webpage content
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        profiles = []

        # Find all profile elements
        profile_elements = soup.find_all('div', class_='table-responsive')

        # Iterate through profile elements
        for profile_element in profile_elements:
            # Check if keyword is present in profile text
            if keyword.lower() in profile_element.text.lower():
                profiles.append(profile_element.text.strip())

        return profiles
    else:
        print("Failed to fetch the webpage.")
        return None


# URL of the website
url = "https://www.soccerdonna.de/en"

# Keyword to search for
keyword = "Jamaica"

# Extract profiles with the keyword
profiles = extract_profiles_with_keyword(url, keyword)

# Print the profiles
if profiles:
    for profile in profiles:
        print(profile)
else:
    print("No profiles found.")
Tags: python web-scraping
1 answer
0 votes

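Here is an example of how you can get a player's information even if you don't know the player's name (you have to increment the number after spieler_ in the URL):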

import requests
from bs4 import BeautifulSoup

base_url = "https://www.soccerdonna.de/en/NOT-IMPORTANT/profil/spieler_{number}.html"

for number in range(1, 10):
    url = base_url.format(number=number)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    name = soup.select_one(".tabelle_spieler h1")
    print(url)
    print(name.text.strip())

    for tr in soup.select(".tabelle_grafik .tabelle_spieler tr"):
        k, v = [
            td.get_text(strip=True, separator=" ").strip(":") for td in tr.select("td")
        ]
        print(f"{k:<50} {v}")
    print()

Prints:


...

https://www.soccerdonna.de/en/NOT-IMPORTANT/profil/spieler_7.html
Danesha Adams
Date of birth                                      06.06.1986
Place of birth                                     Bellflower, California
Age                                                37
Name in native country                             Danesha La Vonne Adams
Height                                             1,68
Nationality                                        United States
Position                                           Striker
Foot                                               right
Last match                                         07.11.2015 for Medkila

https://www.soccerdonna.de/en/NOT-IMPORTANT/profil/spieler_8.html
Claudia Aelker
Date of birth                                      25.01.1971
Age                                                53
Height                                             1,63
Nationality                                        Germany
Position                                           Defence
Last match                                         05.09.2004 for Heike Rheine

...
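
The loop above assumes that every number maps to an existing profile and that every table row has exactly two cells. If a number has no profile, `name` would be `None`, and a row with a different cell count would break the `k, v` unpacking. The following is a minimal, more defensive sketch that reuses the same CSS selectors but returns a dict per page. The function name `parse_profile` and the `SAMPLE_HTML` markup are hypothetical and only imitate the structure those selectors imply; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped only to match the selectors used in the answer.
SAMPLE_HTML = """
<table class="tabelle_spieler"><tr><td><h1>Danesha Adams</h1></td></tr></table>
<div class="tabelle_grafik">
  <table class="tabelle_spieler">
    <tr><td>Nationality:</td><td>United States</td></tr>
    <tr><td>Position:</td><td>Striker</td></tr>
    <tr><td>a row with a single cell</td></tr>
  </table>
</div>
"""


def parse_profile(html):
    """Return a dict of profile fields, or None if no player heading is found."""
    soup = BeautifulSoup(html, "html.parser")

    name = soup.select_one(".tabelle_spieler h1")
    if name is None:
        # Probably not a profile page (e.g. an unused spieler_ number).
        return None

    record = {"Name": name.get_text(strip=True)}
    for tr in soup.select(".tabelle_grafik .tabelle_spieler tr"):
        cells = [
            td.get_text(strip=True, separator=" ").strip(":")
            for td in tr.select("td")
        ]
        # Keep only key/value rows; skip anything with a different shape.
        if len(cells) == 2:
            record[cells[0]] = cells[1]
    return record


profile = parse_profile(SAMPLE_HTML)
print(profile)
```

In the loop, the `html` argument would be the body returned by `requests.get(url)`, and pages for which `parse_profile` returns `None` can simply be skipped.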

You can store this information in, for example, a pandas DataFrame, save it as CSV, and filter as the last step (e.g. for the word "Jamaica").
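
The filtering step could look roughly like the sketch below. It assumes each profile has already been collected as a dict of field names and values, and that a second nationality appears inside the same "Nationality" value as the first. The helper name `filter_profiles` is hypothetical, and the third record is made up purely to give the filter something to match.

```python
import pandas as pd


def filter_profiles(records, keyword, column="Nationality"):
    """Build a DataFrame from the records and keep rows whose column mentions keyword."""
    df = pd.DataFrame(records)
    # Missing values become empty strings so they never match.
    mask = df[column].fillna("").str.contains(keyword, case=False, regex=False)
    return df[mask]


records = [
    {"Name": "Danesha Adams", "Nationality": "United States", "Position": "Striker"},
    {"Name": "Claudia Aelker", "Nationality": "Germany", "Position": "Defence"},
    # Made-up row for illustration only, not taken from the site.
    {"Name": "Example Player", "Nationality": "United States Jamaica", "Position": "Midfield"},
]

matches = filter_profiles(records, "Jamaica")

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = matches.to_csv(index=False)
print(csv_text)
```

Passing a file name, for example `matches.to_csv("profiles.csv", index=False)`, writes the result to disk instead. Restricting the search to one column avoids false matches in other fields such as the place of birth.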
