How to extract URLs from a website using Python?

Question

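I am working on a university project in which I want to use web scraping and text mining to analyze the characteristics of the most popular TV shows.

To do that, I tried to scrape the URL of each TV show in the list at https://www.imdb.com/chart/toptv/ with the following Python code, but the output contains only the URL of the site itself.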

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/toptv/m"
df = pd.DataFrame()
links = []
def extract_links(url):
    print("source url",url)
    global links
    source_url = requests.get(url)
    soup = BeautifulSoup(source_url.content,"html.parser")
    for link in soup.find_all('a',href=True):
        try:
            if len(links) >=100:
                return
            if link.get('href').startswith("https://") and link.get("href") not in links:
                links.append(link.get('href'))
                extract_links(link.get('href'))

        except Exception as e:
            print("Unhandled exception",e)

extract_links(url)
df = pd.DataFrame({"links":links})
df.to_csv("links.csv")

I also searched online and found this code, but it does not work either.

import requests
from bs4 import BeautifulSoup

# send a GET request to the website
url = 'https://www.imdb.com/chart/toptv/'
response = requests.get(url)

# parse the HTML content of the page with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# find all links on the page
links = soup.find_all('a')

# print the href attribute of each link
for link in links:
    print(link.get('href'))

Can someone help me and tell me what I am doing wrong?

Answer 1

As already mentioned, the website checks your request headers and expects a user-agent, so try supplying one to avoid a 403 error.
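To see whether the request is actually being rejected, it can help to set the header once and check the response status before parsing. The following is only a minimal sketch: the helper names `build_session` and `fetch_html`, the timeout, and the placeholder agent string are assumptions made for illustration, and a real browser-like user-agent string may be needed.

```python
import requests

# Placeholder value taken from the answer's example; an assumption, not a
# string the site is known to accept.
DEFAULT_USER_AGENT = "some-agent"


def build_session(user_agent=DEFAULT_USER_AGENT):
    """Return a requests.Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch url and return the body as bytes, raising on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    # raise_for_status() turns a 403 into requests.HTTPError instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(build_session(), "https://www.imdb.com/chart/toptv/")` would then either return the page content or raise an error that shows the status code, which makes a blocked request easier to spot than an empty list of links.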

To select the links you are looking for directly by pattern, you can also use css selectors:

soup.select('a[href^="/title"]:has(h3)')
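To illustrate what this selector matches, here is a small sketch that runs it on a hand-written HTML fragment. The markup, IDs and show names are made up for the example and are not the real IMDb page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two title links wrap an <h3>, one title link wraps
# only an image, and one link points somewhere else entirely.
html = """
<ul>
  <li><a href="/title/tt0000001/"><h3>1. First Show</h3></a></li>
  <li><a href="/title/tt0000001/"><img alt="poster"></a></li>
  <li><a href="/chart/top/">Top movies</a></li>
  <li><a href="/title/tt0000002/"><h3>2. Second Show</h3></a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# href^="/title" keeps anchors whose href starts with /title;
# :has(h3) keeps only those that contain an <h3> element.
matches = soup.select('a[href^="/title"]:has(h3)')

hrefs = [a["href"] for a in matches]
# Split once on the first ". " to drop the rank prefix.
titles = [a.get_text(strip=True).split(". ", 1)[-1] for a in matches]

print(hrefs)   # ['/title/tt0000001/', '/title/tt0000002/']
print(titles)  # ['First Show', 'Second Show']
```

The image-only link and the chart link are skipped, so each show appears once even when the page links to it several times.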

Example

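import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/toptv/"

# send a user-agent header so the request is not rejected with a 403
soup = BeautifulSoup(
    requests.get(url, headers={'user-agent': 'some-agent'}).content,
    "html.parser"
)

# one row per show: absolute link plus the title text after its leading number
df = pd.DataFrame(
    [
        {
            'link': f'https://www.imdb.com{link.get("href")}',
            # split only once so titles that contain ". " stay intact
            'title': link.text.split('. ', 1)[-1]
        }
        for link in soup.select('a[href^="/title"]:has(h3)')
    ]
)
print(df)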
