Can't remove the header text from scraped data


I have the following code that scrapes this site: https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/index.html

It scrapes the links and titles about a quarter of the way down the page. It also opens each link and scrapes all of the text on each linked page. Here is an example of one of those pages: https://www.hhs.gov/about/news/2024/02/06/hhs-office-civil-rights-settles-malicious-insider-cybersecurity-investigation.html

It works well, except that every Description field begins with the site-wide government banner text:

"An official website of the United States government. Here's how you know. Official websites use .gov. A .gov website belongs to an official government organization in the United States. Secure .gov websites use HTTPS. A lock ( ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites."

Any suggestions on how to get rid of it? I can't seem to figure it out.

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

# Function to scrape text from a given URL
def scrape_page_text(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all paragraphs and concatenate their text
    paragraphs = soup.find_all('p')
    text = ' '.join([p.get_text() for p in paragraphs])
    return text

# Base URL
base_url = "https://www.hhs.gov/hipaa/for-professionals/compliance-enforcement/agreements/"

# URL of the page to scrape
url = urljoin(base_url, "index.html")

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find the innermost l-content div
content_divs = soup.find_all('div', class_='l-content')
content_div = content_divs[-1]

# Find all <a> tags containing the links within the innermost div
links = content_div.find_all('a')

# Create and open a CSV file in write mode
with open('hipaa_links.csv', mode='w', newline='\n', encoding='utf-8') as file:
    # Create a CSV writer object
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['Title', 'URL', 'Description'])

    # Iterate over each link
    for link in links:
        # Extract link URL
        link_url = urljoin(base_url, link.get('href'))

        # Extract link title
        link_title = link.text.strip()

        # Scrape text from the linked page
        description = scrape_page_text(link_url)

        # Check if both link URL and title exist
        if link_url and link_title:
            writer.writerow([link_title, link_url, description])

print("Data has been written to hipaa_links.csv")
1 Answer
Here is an updated
scrape_page_text()
function that only grabs the text of the page content:

# Function to scrape text from a given URL
def scrape_page_text(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all paragraphs and concatenate their text
    paragraphs = soup.select(".l-content p")
    text = " ".join([p.get_text() for p in paragraphs])
    return text
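Scoping the selector to `.l-content p` skips the banner because the banner sits outside that container. As a fallback in case a page lacks the `.l-content` wrapper, you can instead remove the banner element itself before collecting paragraphs. HHS pages appear to use the standard USWDS banner markup, which typically lives in a `.usa-banner` section (an assumption worth verifying against the actual page source); `decompose()` drops it from the parse tree. A minimal sketch, using inline HTML in place of a fetched page:

```python
from bs4 import BeautifulSoup

def scrape_page_text_no_banner(html):
    """Drop the government-site banner, then join the remaining <p> text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the USWDS banner block if present (the .usa-banner class is an
    # assumption based on standard USWDS markup; check the real page source).
    for banner in soup.select(".usa-banner"):
        banner.decompose()
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

# Inline HTML standing in for a response fetched with requests.get(url).text
html = """
<section class="usa-banner"><p>An official website of the United States government</p></section>
<div class="l-content"><p>Real content here.</p></div>
"""
print(scrape_page_text_no_banner(html))  # → Real content here.
```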