I'm using python/bs4 to scrape this site: https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html
There are two annoying problems I can't solve:
It always includes 'Resolution Agreements,/hipaa/for-professionals/compliance-enforcement/agreements/index.html' as the first entry in the CSV.
The site's actual first entry is never included. In this case, "HHS Office for Civil Rights Settles Malicious Insider Cybersecurity Investigation for $4.75 Million".
Any suggestions?
import requests
from bs4 import BeautifulSoup
import csv
# URL of the page to scrape
url = "https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html"
# Send a GET request to the URL
response = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all <a> tags containing the links
links = soup.find_all('a')
# Create and open a CSV file in write mode
with open('hipaa_links.csv', mode='w', newline='', encoding='utf-8') as file:
    # Create a CSV writer object
    writer = csv.writer(file)
    # Write the header row
    writer.writerow(['Title', 'URL'])
    # Iterate over each link
    for link in links:
        # Extract link URL
        link_url = link.get('href')
        # Extract link title
        link_title = link.text.strip()
        # Check if the link is related to "Resolution Agreements" or subsequent ones and if link text and URL are not empty
        if link_url and link_title and "/hipaa/for-professionals/compliance-enforcement/agreements/" in link_url:
            # Write the link title and URL to the CSV file
            writer.writerow([link_title, link_url])
print("Data has been written to hipaa_links.csv")
The problem is that your selector isn't specific enough. Try focusing on more targeted `a` tags to scrape:
soup.select('section.usa-section ul a[href*="hipaa/for-professionals/compliance-enforcement/agreements"]')
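To show what that selector filters out, here is a minimal sketch run against a stand-in HTML snippet. The snippet's structure (a sidebar `nav` link plus an agreements list wrapped in `<section class="usa-section">`) is an assumption about the real page's markup, not taken from it:

```python
from bs4 import BeautifulSoup

# Stand-in markup: a sidebar nav link plus the agreements list proper.
# The class names and nesting here are assumptions for illustration.
html = """
<nav><a href="/hipaa/for-professionals/compliance-enforcement/agreements/index.html">Resolution Agreements</a></nav>
<section class="usa-section">
  <ul>
    <li><a href="/hipaa/for-professionals/compliance-enforcement/agreements/example/index.html">Example Settlement</a></li>
  </ul>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
# Only anchors inside the usa-section list match; the sidebar nav link does not.
links = soup.select(
    'section.usa-section ul a[href*="hipaa/for-professionals/compliance-enforcement/agreements"]'
)
titles = [a.text.strip() for a in links]
print(titles)  # the 'Resolution Agreements' nav link is excluded
```

Because the selector requires the anchor to sit inside a `ul` within `section.usa-section`, the sidebar link never matches even though its `href` contains the same path.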
The issue here is that the first link doesn't match your check: it doesn't contain
"/hipaa/for-professionals/compliance-enforcement/agreements/"
which is why it never shows up.
The "Resolution Agreements" link comes from the left sidebar navigation, and the first real entry (HHS Office ... Settles ...) points to an
/about/news/
page.
I'd suggest restricting your search to the innermost
l-content
div, and dropping the filter on the link target:
content_divs = soup.find_all('div', class_='l-content')
content_div = content_divs[-1]
links = content_div.find_all('a')
...
if link_url and link_title:
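Put together, the suggestion looks like the sketch below, run against a small stand-in HTML snippet. The nested `l-content` divs and the sample links are assumptions standing in for the real page; only the technique (take the last `l-content` div, keep every non-empty link) is the point:

```python
from bs4 import BeautifulSoup
import csv
import io

# Stand-in markup: an outer l-content div holding the nav, and an inner
# l-content div holding the actual entries. This nesting is assumed.
html = """
<div class="l-content">
  <nav><a href="/hipaa/for-professionals/compliance-enforcement/agreements/index.html">Resolution Agreements</a></nav>
  <div class="l-content">
    <a href="/about/news/2024/settlement.html">HHS Office for Civil Rights Settles Investigation</a>
    <a href="/hipaa/for-professionals/compliance-enforcement/agreements/example/index.html">Example Settlement</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# find_all returns divs in document order, so [-1] is the innermost one.
content_div = soup.find_all("div", class_="l-content")[-1]

buf = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buf)
writer.writerow(["Title", "URL"])
for link in content_div.find_all("a"):
    link_url = link.get("href")
    link_title = link.text.strip()
    # No path filter: /about/news/ links now survive too.
    if link_url and link_title:
        writer.writerow([link_title, link_url])
rows = buf.getvalue().splitlines()
print(rows)
```

Scoping to the innermost `l-content` drops the sidebar link without any `href` filtering, and removing the path filter lets entries that point outside the agreements directory (like the `/about/news/` press release) through.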