How do I extract URLs from a website using Python?


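I'm working on a university project in which I want to use web scraping and text mining to analyse the characteristics of the most popular TV shows.

So I tried the following Python code to scrape the URL of each TV show listed at https://www.imdb.com/chart/toptv/, but the output contains only the website URL.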

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/toptv/m"
df = pd.DataFrame()
links = []
def extract_links(url):
    print("source url",url)
    global links
    source_url = requests.get(url)
    soup = BeautifulSoup(source_url.content,"html.parser")
    for link in soup.find_all('a',href=True):
        try:
            if len(links) >=100:
                return
            if link.get('href').startswith("https://") and link.get("href") not in links:
                links.append(link.get('href'))
                extract_links(link.get('href'))

        except Exception as e:
            print("Unhandled exception",e)

extract_links(url)
df = pd.DataFrame({"links":links})
df.to_csv("links.csv")

I also searched online and found this code, but it doesn't work either.

import requests
from bs4 import BeautifulSoup

# send a GET request to the website
url = 'https://www.imdb.com/chart/toptv/'
response = requests.get(url)

# parse the HTML content of the page with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# find all links on the page
links = soup.find_all('a')

# print the href attribute of each link
for link in links:
    print(link.get('href'))

Can someone help me and tell me what I'm doing wrong?

python web-scraping beautifulsoup python-requests
1 Answer

As already mentioned, the website checks your request headers and expects a user-agent, so try providing one to avoid a 403 error.
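A minimal sketch of what supplying the header can look like, assuming the `requests` library from the question. The helper names `build_request` and `fetch` and the `Mozilla/5.0` string are placeholders chosen for illustration, not values the site is known to require. `raise_for_status()` is there so a 403 fails loudly instead of an error page being parsed silently.

```python
import requests

# Assumption: any browser-like string; the exact value is a placeholder.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_request(url):
    """Prepare a GET request that carries the user-agent header (not sent yet)."""
    return requests.Request("GET", url, headers=HEADERS).prepare()


def fetch(url, timeout=10):
    """Send the prepared request and raise on 4xx/5xx responses such as 403."""
    with requests.Session() as session:
        response = session.send(build_request(url), timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch("https://www.imdb.com/chart/toptv/")` should then return the page HTML if the header is accepted; if the site still refuses the request, the call raises `requests.HTTPError` rather than returning an empty result.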

To select the links you are looking for directly by pattern, you can also use CSS selectors:

soup.select('a[href^="/title"]:has(h3)')
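To see what that selector matches without touching the network, here is a small sketch run against a made-up HTML fragment. The markup, the `tt0000001` ID and the show name are invented for illustration, and the real IMDb page may be structured differently. It assumes a Beautiful Soup version whose `select()` supports `:has()` (4.7 or later, via soupsieve).

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; the real page markup may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/title/tt0000001/"><h3>1. Example Show</h3></a></li>
  <li><a href="/title/tt0000001/"><img alt="poster"></a></li>
  <li><a href="/chart/top/">Top movies</a></li>
  <li><a href="https://example.com/title/x"><h3>External</h3></a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href^="/title" keeps anchors whose href starts with /title;
# :has(h3) keeps only those anchors that contain an <h3> element.
matches = soup.select('a[href^="/title"]:has(h3)')

hrefs = [a["href"] for a in matches]
# Strip the leading rank ("1. ") the same way the answer's example does.
titles = [a.text.split(". ")[-1] for a in matches]

print(hrefs)   # ['/title/tt0000001/']
print(titles)  # ['Example Show']
```

Only the first anchor survives: the second has no `<h3>`, the third does not point at `/title`, and the fourth is an absolute URL that does not start with `/title`.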
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/toptv/"

soup = BeautifulSoup(
    requests.get(url,headers={'user-agent':'some-agent'}).content,
    "html.parser"
)

pd.DataFrame(
    [
        {
            'link':f'https://www.imdb.com{link.get("href")}',
            'title': link.text.split('. ')[-1]
        }
        for link in soup.select('a[href^="/title"]:has(h3)')
    ]
)
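The selector matches hrefs that start with `/title`, i.e. relative links, which is also why the `startswith("https://")` filter in the question skips them. As an alternative to prefixing the domain by hand, the standard library's `urljoin` can resolve them against the page URL. The sketch below also drops any query string and fragment; that is an assumption on my part that such suffixes are not wanted, and the `to_absolute` name and the `?ref_=example` suffix are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com/chart/toptv/"


def to_absolute(href, base=BASE_URL):
    """Resolve href against the page URL and drop its query string and fragment."""
    parts = urlsplit(urljoin(base, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical href with a made-up ID and query suffix.
print(to_absolute("/title/tt0000001/?ref_=example"))
# https://www.imdb.com/title/tt0000001/
```

The cleaned URLs can then be placed in the `link` column of the DataFrame above and written out with `to_csv("links.csv")`, as the question originally intended.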