I'm using python/bs4 to scrape this site: https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html
There are two annoying problems I can't solve:
It always includes 'Resolution Agreements,/hipaa/for-professionals/compliance-enforcement/agreements/index.html' as the first entry in the CSV.
The site's actual first entry is never included. In this case, "HHS Office for Civil Rights Settles Malicious Insider Cybersecurity Investigation for $4.75 Million".
Any suggestions?
import requests
from bs4 import BeautifulSoup
import csv
# URL of the page to scrape
url = "https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html"
# Send a GET request to the URL
response = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all <a> tags containing the links
links = soup.find_all('a')
# Create and open a CSV file in write mode
with open('hipaa_links.csv', mode='w', newline='', encoding='utf-8') as file:
    # Create a CSV writer object
    writer = csv.writer(file)
    # Write the header row
    writer.writerow(['Title', 'URL'])
    # Iterate over each link
    for link in links:
        # Extract link URL
        link_url = link.get('href')
        # Extract link title
        link_title = link.text.strip()
        # Check if the link is related to "Resolution Agreements" or subsequent ones and if link text and URL are not empty
        if link_url and link_title and "/hipaa/for-professionals/compliance-enforcement/agreements/" in link_url:
            # Write the link title and URL to the CSV file
            writer.writerow([link_title, link_url])
print("Data has been written to hipaa_links.csv")
The problem is that your selector isn't specific enough. Try focusing on more targeted `a` tags to scrape:
soup.select('section.usa-section ul a[href*="hipaa/for-professionals/compliance-enforcement/agreements"]')
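To show what that selector filters out, here is a minimal sketch run against a stand-in HTML snippet. The snippet's structure (a sidebar `nav` link plus an agreements list wrapped in `<section class="usa-section">`) is an assumption about the real page's markup, not taken from it:

```python
from bs4 import BeautifulSoup

# Stand-in markup: a sidebar nav link plus the agreements list proper.
# The class names and nesting here are assumptions for illustration.
html = """
<nav><a href="/hipaa/for-professionals/compliance-enforcement/agreements/index.html">Resolution Agreements</a></nav>
<section class="usa-section">
  <ul>
    <li><a href="/hipaa/for-professionals/compliance-enforcement/agreements/example/index.html">Example Settlement</a></li>
  </ul>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
# Only anchors inside the usa-section list match; the sidebar nav link does not.
links = soup.select(
    'section.usa-section ul a[href*="hipaa/for-professionals/compliance-enforcement/agreements"]'
)
titles = [a.text.strip() for a in links]
print(titles)  # the 'Resolution Agreements' nav link is excluded
```

Because the selector requires the anchor to sit inside a `ul` within `section.usa-section`, the sidebar link never matches even though its `href` contains the same path.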
The issue here is that the first link doesn't match your check: it doesn't contain
"/hipaa/for-professionals/compliance-enforcement/agreements/"
which is why it never shows up.
The "Resolution Agreements" link comes from the left sidebar navigation, and the first real entry (HHS Office ... Settles ...) points to an
/about/news/
page.
I'd suggest restricting your search to the innermost
l-content
div, and dropping the filter on the link target:
content_divs = soup.find_all('div', class_='l-content')
content_div = content_divs[-1]
links = content_div.find_all('a')
...
if link_url and link_title:
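Put together, the suggestion looks like the sketch below, run against a small stand-in HTML snippet. The nested `l-content` divs and the sample links are assumptions standing in for the real page; only the technique (take the last `l-content` div, keep every non-empty link) is the point:

```python
from bs4 import BeautifulSoup
import csv
import io

# Stand-in markup: an outer l-content div holding the nav, and an inner
# l-content div holding the actual entries. This nesting is assumed.
html = """
<div class="l-content">
  <nav><a href="/hipaa/for-professionals/compliance-enforcement/agreements/index.html">Resolution Agreements</a></nav>
  <div class="l-content">
    <a href="/about/news/2024/settlement.html">HHS Office for Civil Rights Settles Investigation</a>
    <a href="/hipaa/for-professionals/compliance-enforcement/agreements/example/index.html">Example Settlement</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# find_all returns divs in document order, so [-1] is the innermost one.
content_div = soup.find_all("div", class_="l-content")[-1]

buf = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buf)
writer.writerow(["Title", "URL"])
for link in content_div.find_all("a"):
    link_url = link.get("href")
    link_title = link.text.strip()
    # No path filter: /about/news/ links now survive too.
    if link_url and link_title:
        writer.writerow([link_title, link_url])
rows = buf.getvalue().splitlines()
print(rows)
```

Scoping to the innermost `l-content` drops the sidebar link without any `href` filtering, and removing the path filter lets entries that point outside the agreements directory (like the `/about/news/` press release) through.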