Can't fix a problem with my Python scraping code

Problem description

I'm using python/bs to scrape this site: https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html

There are two annoying problems I can't solve:

  1. It always includes 'Resolution Agreements, /hipaa/for-professionals/compliance-enforcement/agreements/index.html' as the first entry in the CSV.

  2. The site's actual first entry is never included. In this case, "HHS Office for Civil Rights Settles Malicious Insider Cybersecurity Investigation for $4.75 Million".

Any suggestions?

import requests
from bs4 import BeautifulSoup
import csv

# URL of the page to scrape
url = "https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all <a> tags containing the links
links = soup.find_all('a')

# Create and open a CSV file in write mode
with open('hipaa_links.csv', mode='w', newline='', encoding='utf-8') as file:
    # Create a CSV writer object
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['Title', 'URL'])

    # Iterate over each link
    for link in links:
        # Extract link URL
        link_url = link.get('href')

        # Extract link title
        link_title = link.text.strip()

        # Check if the link is related to "Resolution Agreements" or subsequent ones and if link text and URL are not empty
        if link_url and link_title and "/hipaa/for-professionals/compliance-enforcement/agreements/" in link_url:
            # Write the link title and URL to the CSV file
            writer.writerow([link_title, link_url])

print("Data has been written to hipaa_links.csv")
python web-scraping beautifulsoup
2 Answers
Answer 1 (1 vote)
  1. The problem is that your selection isn't specific enough. Try focusing on the a elements inside the listing to scrape:

    soup.select('section.usa-section ul a[href*="hipaa/for-professionals/compliance-enforcement/agreements"]')
    
  2. The problem here is that the first link doesn't satisfy your condition:

    https://www.hhs.gov/about/news/2024/02/06/hhs-office-civil-rights-settles-malicious-insider-cybersecurity-investigation.html

does not contain:

"/hipaa/for-professionals/compliance-enforcement/agreements/"

That's why it never shows up.
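
Putting both points together, a minimal sketch of how that select() call could replace the page-wide find_all in the original script (the section.usa-section selector is taken from point 1 above; the comment notes an optional widening of the selector if the /about/news/ entry should be kept as well):

import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Scope the search to the listing instead of the whole page, so the
# sidebar "Resolution Agreements" navigation link no longer matches.
links = soup.select(
    'section.usa-section ul a[href*="hipaa/for-professionals/compliance-enforcement/agreements"]'
)
# If the /about/news/ entry should also be kept, widen the selector to
# 'section.usa-section ul a[href]' (point 2 above explains why it is
# otherwise excluded).

with open('hipaa_links.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'URL'])
    for link in links:
        link_url = link.get('href')
        link_title = link.text.strip()
        if link_url and link_title:
            writer.writerow([link_title, link_url])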


Answer 2 (0 votes)

The "Resolution Agreements" link comes from the left sidebar/navigation, and the first link (HHS Office ... Settles ...) goes to an /about/news/ page.

I'd suggest restricting your search to the innermost l-content div and dropping the filter on the link target:

content_divs = soup.find_all('div', class_='l-content')
content_div = content_divs[-1]
links = content_div.find_all('a')
...
        if link_url and link_title:
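
For completeness, a sketch of how those changes could slot into the original loop, assuming (as this answer suggests) that the last l-content div on the page is the one wrapping the main article body:

import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Restrict the search to the innermost l-content div so the sidebar and
# navigation links never reach the CSV.
content_divs = soup.find_all('div', class_='l-content')
content_div = content_divs[-1]
links = content_div.find_all('a')

with open('hipaa_links.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'URL'])
    for link in links:
        link_url = link.get('href')
        link_title = link.text.strip()
        # No substring filter on the target: keep every non-empty link in
        # the content area, including the /about/news/ item.
        if link_url and link_title:
            writer.writerow([link_title, link_url])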