requests.exceptions.MissingSchema: Invalid URL

Question

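I'm trying to scrape a web page to collect articles, but the links don't include the http: part, so I get a requests.exceptions.MissingSchema: Invalid URL error.

I know I need to do something like 'http:' + href, but I can't figure out where to put it.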

import time

import requests
from bs4 import BeautifulSoup

url = 'https://mainichi.jp/english/search?q=cybersecurity&t=kiji&s=match&p={}'
pages = 6

for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".list-typeD li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text,"lxml")
        date = sauce.select(".post p")
        date = date[0].text
        title = sauce.select_one(".header-box h1").text
        content = [elem.text for elem in sauce.select(".main-text p")]
        print(f'{date}\n {title}\n {content}\n')

        time.sleep(3)

I want to get the date, title, and content of every article from all of the pages.

Answer 1

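This happens because, in the statement

resp = requests.get(item.get("href"))

you are not sending the request to a valid URL. MissingSchema means the string passed to requests.get has no scheme such as https:. The href attribute probably holds a relative or scheme-relative URL (one that starts with //) rather than an absolute one. Try prepending the missing part to item.get("href").

This should do it: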

resp = requests.get("https:"+item.get("href"))
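If the site mixes link styles, string concatenation can produce a broken URL for hrefs that are already absolute or that start with a single slash. A more forgiving option is urljoin from the standard library, which resolves an href against a base URL. The sketch below is only an illustration: the helper name absolute_url, the BASE_URL constant, and the sample href values are assumptions, not taken from the actual page.

```python
from urllib.parse import urljoin

# Assumed base: the search page URL from the question, without its query string.
BASE_URL = "https://mainichi.jp/english/search"


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; an already absolute href is returned unchanged."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only.
print(absolute_url("//mainichi.jp/english/articles/example"))        # scheme-relative
print(absolute_url("/english/articles/example"))                     # root-relative
print(absolute_url("https://mainichi.jp/english/articles/example"))  # already absolute
```

Inside the loop from the question, the call would then become resp = requests.get(absolute_url(item.get("href"))).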
