Getting this error -
hostIP = socket.gethostbyname(hostonly)
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)
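I am trying to skip URLs whose label is empty or too long. The error occurs after most of the URLs have already been scraped.
This is the code I ran -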
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, quote

def scrape(site):
    visited = set()
    queue = [site]

    while queue:
        current_url = queue.pop(0)
        if current_url in visited:
            continue
        visited.add(current_url)

        try:
            r = requests.get(current_url, timeout=5)
        except requests.exceptions.RequestException as e:
            print(f"Error connecting to {current_url}: {e}")
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        for link in soup.find_all("a"):
            href = link.get("href")
            if href is not None:
                full_url = urljoin(site, quote(href, safe='/:?=&'))
                if site in full_url and full_url not in visited:
                    queue.append(full_url)
                    print(full_url)

if __name__ == "__main__":
    site = "http://www.nuigalway.ie"
    scrape(site)
I want a list of all the URLs on the website.
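I ran your code and found that one link is causing the problem:
http://www.universityofgalway.ie/sustainability/
requested over http. It should really be https.
So you can simply use try/except to catch the UnicodeError and then replace http with https in the URL (adding the s):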
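        visited.add(current_url)

        try:
            r = requests.get(current_url, timeout=5)
        except UnicodeError:
            # Retry over HTTPS. Only the leading scheme is replaced, so a URL
            # that is already https (or contains "http" elsewhere) is left intact.
            r = requests.get(current_url.replace('http://', 'https://', 1), timeout=5)
        ...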
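
If you would rather skip such URLs, as the question asks, one option is to check the hostname before queuing a link. This is only a minimal sketch and not part of the code above: has_encodable_host is a hypothetical helper name, and it assumes the failure comes from Python's built-in idna codec rejecting a hostname label that is empty or longer than 63 characters, which is what the error message suggests.

```python
from urllib.parse import urlsplit


def has_encodable_host(url):
    """Return True if the URL's hostname passes Python's built-in 'idna' codec.

    Hypothetical helper for illustration. The codec raises UnicodeError when a
    hostname label is empty (e.g. "a..b") or longer than 63 characters.
    """
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # urlsplit rejects some malformed URLs outright (e.g. bad IPv6 brackets)
        return False
    if not host:
        # No hostname at all, e.g. mailto: links
        return False
    try:
        host.encode("idna")
    except UnicodeError:
        return False
    return True
```

In the crawler, calling this just before queue.append(full_url) would drop malformed links early. It only inspects the URL as written, so if a request is redirected to a host with a bad label, requests.get can still raise UnicodeError and the try/except shown above is still needed.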