In the example below, the page lists all of the alumni-relations chapters for Virginia Tech. I want to drill down into each chapter and create a CSV file with every piece of information listed for it. I have tried using BeautifulSoup, but without success.
Any help on this topic is greatly appreciated, thanks!
URL: https://www.alumni.vt.edu/chapters/chapter_list.html
from bs4 import BeautifulSoup
import requests
website = 'https://www.alumni.vt.edu/chapters/chapter_list.html'
result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())
Here is an example of how to grab every link found on the chapter list page and get some information from each sub-page:
import requests
from bs4 import BeautifulSoup

url = "https://www.alumni.vt.edu/chapters/chapter_list.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# collect the chapter links from the list page
links = []
for a in soup.select(".general-body li > a"):
    links.append(a["href"])

for u in links:
    print(f"Opening {u}")
    soup = BeautifulSoup(requests.get(u).content, "html.parser")

    # get some info here:
    contact = soup.select_one(".general-body strong:-soup-contains(Contact)")
    if contact:
        c = contact.next_element.next_element
        c = c.text.strip()
        print(contact.text, c)
Prints:
Opening https://alumni.vt.edu/chapters/chapter_list/alleghany_highlands.html
Contact: Kathleen All
Opening https://alumni.vt.edu/chapters/chapter_list/augusta.html
Contact: [email protected]
Opening https://alumni.vt.edu/chapters/chapter_list/central_virginia.html
Contact: Sammy Paris
Opening https://alumni.vt.edu/chapters/chapter_list/charlottesville.html
Contact: Martin Harar
Opening https://alumni.vt.edu/chapters/chapter_list/commonwealth.html
Contact: Volunteers Needed
...
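Since the goal is a CSV file rather than printed output, the loop above can collect `(url, label, value)` tuples into a list and write them out with Python's standard `csv` module. A minimal sketch follows; the sample rows are taken from the printed output above and stand in for whatever the real scraping loop collects, and the filename `chapters.csv` is just an illustrative choice:

```python
import csv

# Rows the scraping loop would accumulate; these sample values come
# from the printed output above and stand in for the full scrape.
rows = [
    ("https://alumni.vt.edu/chapters/chapter_list/alleghany_highlands.html",
     "Contact:", "Kathleen All"),
    ("https://alumni.vt.edu/chapters/chapter_list/augusta.html",
     "Contact:", "[email protected]"),
]

# newline="" is required so the csv module controls line endings itself
with open("chapters.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "label", "value"])  # header row
    writer.writerows(rows)
```

Inside the loop you would replace the `print` call with `rows.append((u, contact.text, c))` and write the file once after the loop finishes, so the CSV is produced in a single pass.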