I get an error when trying to extract all the URLs from this website. What can I do?

I'm getting this error:
    hostIP = socket.gethostbyname(hostonly)
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)

I tried skipping URLs whose labels are empty or too long. The error occurs after most of the URLs have already been scraped.

This is the code I ran:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, quote

def scrape(site):
    visited = set()
    queue = [site]

    while queue:
        current_url = queue.pop(0)
        if current_url in visited:
            continue

        visited.add(current_url)

        try:
            r = requests.get(current_url, timeout=5)
        except requests.exceptions.RequestException as e:
            print(f"Error connecting to {current_url}: {e}")
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        for link in soup.find_all("a"):
            href = link.get("href")
            if href is not None:
                full_url = urljoin(site, quote(href, safe='/:?=&'))
                if site in full_url and full_url not in visited:
                    queue.append(full_url)
                    print(full_url)

if __name__ == "__main__":
    site = "http://www.nuigalway.ie"
    scrape(site)

I want a list of all the URLs on the site.

python web-scraping url beautifulsoup
1 Answer

I ran your code and found that one link is causing the problem:

http://www.universityofgalway.ie/sustainability/

It uses http, but it really should be https.
So you can simply use try/except to catch the UnicodeError and then retry with the URL's scheme changed to https (the added s):

visited.add(current_url)

try:
    r = requests.get(current_url, timeout=5)
except UnicodeError:
    # Swap only the leading scheme so an https URL never becomes "httpss"
    r = requests.get(current_url.replace('http://', 'https://', 1), timeout=5)
...
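
If you would rather skip such URLs before requesting them, which is what the question says it tried, here is a minimal sketch that uses only the standard library. The helper names `host_is_encodable` and `force_https` are my own and not part of requests or BeautifulSoup. It assumes the failure comes from a hostname label that Python's built-in `idna` codec rejects (empty, or longer than 63 characters). I have not confirmed that this check flags the specific link above, since the traceback does not show the value of `hostonly`.

```python
from urllib.parse import urlsplit


def host_is_encodable(url):
    """Return True if the URL has a hostname that the idna codec accepts."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # urlsplit rejects some malformed URLs (e.g. a broken IPv6 literal)
        return False
    if not host:
        # No hostname at all, e.g. mailto: or javascript: links
        return False
    try:
        host.encode("idna")
    except UnicodeError:
        # Raised for empty labels ("http://.example.com") or labels over 63 chars
        return False
    return True


def force_https(url):
    """Swap only a leading http:// for https://; leave every other URL alone."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url
```

In the crawler from the question, you could call `host_is_encodable(full_url)` just before `queue.append(full_url)` and skip the link when it returns False, and use `force_https(current_url)` for the retry instead of a bare `str.replace`.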