What is the best way to use Python to scrape data from a site where the starting page links out to many sub-pages?


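In the example below, the page lists all of the Virginia Tech alumni relations chapters. I want to follow the link to each chapter and write each piece of information listed there to a CSV file. I tried using BeautifulSoup, but without success.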

Any help on this would be greatly appreciated, thanks!

Example of the data I am looking to scrape

url=https://www.alumni.vt.edu/chapters/chapter_list.html


from bs4 import BeautifulSoup
import requests

website = 'https://www.alumni.vt.edu/chapters/chapter_list.html'

result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, 'lxml')

print(soup.prettify())
Tags: python, web-scraping
1 Answer

Here is an example of how to visit every link found on the chapter list page and pull some information from each sub-page:

import requests
from bs4 import BeautifulSoup

url = "https://www.alumni.vt.edu/chapters/chapter_list.html"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# collect the URL of every chapter page linked from the list
links = []
for a in soup.select(".general-body li > a"):
    links.append(a["href"])

for u in links:
    print(f"Opening {u}")
    soup = BeautifulSoup(requests.get(u).content, "html.parser")

    # get some info here: find the bold "Contact" label, then step past
    # its own text node to the node that follows it
    contact = soup.select_one(".general-body strong:-soup-contains(Contact)")
    if contact:
        c = contact.next_element.next_element
        c = c.text.strip()

        print(contact.text, c)

Prints:

Opening https://alumni.vt.edu/chapters/chapter_list/alleghany_highlands.html
Contact: Kathleen All
Opening https://alumni.vt.edu/chapters/chapter_list/augusta.html
Contact:  [email protected]
Opening https://alumni.vt.edu/chapters/chapter_list/central_virginia.html
Contact:  Sammy Paris
Opening https://alumni.vt.edu/chapters/chapter_list/charlottesville.html
Contact: Martin Harar
Opening https://alumni.vt.edu/chapters/chapter_list/commonwealth.html
Contact:  Volunteers Needed

...
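
Since the question asks for a CSV file rather than printed output, here is a minimal sketch of how the collected values could be written out with the standard `csv` module. The helper name `write_contacts`, the column names `chapter_url` and `contact`, and the file name `chapters.csv` are assumptions made for illustration; the two sample rows are copied from the output above.

```python
import csv
import io


def write_contacts(rows, fileobj):
    """Write (chapter_url, contact) pairs as CSV with a header row.

    Hypothetical helper: the column names are illustrative only.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["chapter_url", "contact"])
    for url, contact in rows:
        writer.writerow([url, contact])


# Sample rows copied from the printed output above. In the scraping loop
# you would append (u, c) to a list like this instead of printing.
rows = [
    ("https://alumni.vt.edu/chapters/chapter_list/alleghany_highlands.html", "Kathleen All"),
    ("https://alumni.vt.edu/chapters/chapter_list/central_virginia.html", "Sammy Paris"),
]

# Written to an in-memory buffer here so the sketch has no side effects.
buffer = io.StringIO()
write_contacts(rows, buffer)
csv_text = buffer.getvalue()

# To write a real file instead (newline="" is what the csv docs recommend):
# with open("chapters.csv", "w", newline="", encoding="utf-8") as f:
#     write_contacts(rows, f)
```

To capture more than the contact, the same pattern should extend naturally: add one column per field you extract from each chapter page and append one row per page.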