Scraping link text from a div on Wikipedia into a list/DataFrame using the "a" tag with BeautifulSoup

Question

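I'm still a beginner at coding. I'm trying to scrape the text of the song links, via the "a" tag, from a div. However, I only get the first song for each letter of the alphabet. I'm stripping the link text rather than reading the title attribute because some links have no title in the HTML. If anyone can help, thanks!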

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Category:Song_recordings_produced_by_John_Lennon'

data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
div = soup.find("div", {"class": "mw-category mw-category-columns"})


songs = []

for song in div:
    songs.append(song.find_next("a").text.strip())

print(songs)

Output:

['Air Talk', "Baby's Heartbeat", 'Cambridge 1969', 'Dear John (John Lennon song)', 
'Every Man Has a Woman Who Loves Him', 'F Is Not a Dirty Word', 'Gimme Some Truth', 
'Happy Xmas (War Is Over)', "I Don't Wanna Be a Soldier", 'Jamrag (song)', 
'Kiss Kiss Kiss (Yoko Ono song)', 'Listen, the Snow Is Falling', 'Many Rivers to Cross', 
'New York City (John Lennon and Yoko Ono song)', "O'Wind (Body Is the Scar of Your Mind)", 
'Paper Shoes', 'Radio Play (song)', 'Scared (John Lennon song)', 'Telephone Piece', 
'Waiting for the Sunrise (song)', 'Yang Yang (song)']
python pandas web-scraping beautifulsoup wikipedia
1 Answer

You can use this example to collect all 182 songs into a list:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Category:Song_recordings_produced_by_John_Lennon"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")


songs = [a.text for a in soup.select("#mw-pages li a")]

print(*songs, sep="\n")
print()
print(f"Songs total={len(songs)}")

Prints:


...

Yellow Girl (Stand by for Life)
Yes, I'm Your Angel
You (Yoko Ono song)
You Are Here (song)
You're the One (Yoko Ono song)

Songs total=182
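
As for why the original loop returned only one song per letter: iterating over the div yields its direct children, which appear to be one group per letter, and find_next("a") returns only the first link after each child. The sketch below reproduces this offline on a small, made-up HTML fragment. The markup is a simplified assumption about the page structure and "Another A Song" is a placeholder title, not taken from the real page. It also shows one way to load the result into a pandas DataFrame, since the question mentions one. The column name "title" is an arbitrary choice.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, simplified stand-in for the category page markup.
# The children of the outer div are written without whitespace between
# them so that iterating the div yields only the per-letter groups.
SAMPLE_HTML = (
    "<div id='mw-pages'>"
    "<div class='mw-category mw-category-columns'>"
    "<div class='mw-category-group'><h3>A</h3><ul>"
    "<li><a href='/wiki/Air_Talk'>Air Talk</a></li>"
    "<li><a href='/wiki/Placeholder'>Another A Song</a></li>"
    "</ul></div>"
    "<div class='mw-category-group'><h3>B</h3><ul>"
    "<li><a href='/wiki/Baby'>Baby's Heartbeat</a></li>"
    "</ul></div>"
    "</div>"
    "</div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", {"class": "mw-category mw-category-columns"})

# Original approach: one find_next("a") per direct child (per letter group),
# so only the first link of each group is collected.
first_per_group = [child.find_next("a").text.strip() for child in div]

# Collect every link inside the div instead.
all_titles = [a.text.strip() for a in div.find_all("a")]

# Same result with the CSS selector used in the answer above.
selected = [a.text.strip() for a in soup.select("#mw-pages li a")]

# Optional: put the titles into a one-column DataFrame.
df = pd.DataFrame({"title": all_titles})

print(first_per_group)  # ['Air Talk', "Baby's Heartbeat"]
print(all_titles)       # ['Air Talk', 'Another A Song', "Baby's Heartbeat"]
print(df)
```

On the real page, either div.find_all("a") or the #mw-pages selector should pick up every song link, provided the live markup matches the structure assumed here.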