I am working on a university project in which I want to use web scraping and text mining to analyze the characteristics of the most popular TV shows.
So I tried to use the following Python code to scrape the URL of every TV show listed on https://www.imdb.com/chart/toptv/, but the output only contains general site URLs, not the show pages.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.imdb.com/chart/toptv/m"
df = pd.DataFrame()
links = []
def extract_links(url):
    print("source url", url)
    global links
    source_url = requests.get(url)
    soup = BeautifulSoup(source_url.content, "html.parser")
    for link in soup.find_all('a', href=True):
        try:
            if len(links) >= 100:
                return
            if link.get('href').startswith("https://") and link.get("href") not in links:
                links.append(link.get('href'))
                extract_links(link.get('href'))
        except Exception as e:
            print("Unhandled exception", e)

extract_links(url)
df = pd.DataFrame({"links": links})
df.to_csv("links.csv")
I also searched online and found the following code, but it doesn't work either.
import requests
from bs4 import BeautifulSoup
# send a GET request to the website
url = 'https://www.imdb.com/chart/toptv/'
response = requests.get(url)
# parse the HTML content of the page with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# find all links on the page
links = soup.find_all('a')
# print the href attribute of each link
for link in links:
    print(link.get('href'))
Can someone help me and tell me what I am doing wrong?
The site rejects requests that don't send a user-agent header, so try providing that information to avoid a 403 error. Then pick out just the show links with CSS selectors:

soup.select('a[href^="/title"]:has(h3)')
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/toptv/"

# Pass a user-agent header so the request is not rejected with a 403.
soup = BeautifulSoup(
    requests.get(url, headers={'user-agent': 'some-agent'}).content,
    "html.parser"
)

# Each chart entry is an <a href="/title/..."> wrapping an <h3> whose text
# looks like "1. Breaking Bad"; splitting on ". " drops the rank prefix.
df = pd.DataFrame(
    [
        {
            'link': f'https://www.imdb.com{link.get("href")}',
            'title': link.text.split('. ')[-1]
        }
        for link in soup.select('a[href^="/title"]:has(h3)')
    ]
)
print(df)
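If you also want to write the result to a CSV file, as in your first script, you can add a status check and a to_csv call. Here is a minimal sketch building on the snippet above (the Mozilla/5.0 user-agent value is just an illustrative placeholder, and raise_for_status() makes a 403 fail loudly instead of silently parsing an error page):

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/toptv/"
# Illustrative browser-style identifier; IMDb's exact filtering rules are undocumented.
response = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
response.raise_for_status()  # raises on 403/404 instead of parsing an error page

soup = BeautifulSoup(response.content, "html.parser")
df = pd.DataFrame(
    [
        {
            'link': f'https://www.imdb.com{link.get("href")}',
            'title': link.text.split('. ')[-1]
        }
        for link in soup.select('a[href^="/title"]:has(h3)')
    ]
)
df.to_csv("links.csv", index=False)  # index=False drops the pandas row index column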