Using Python and BeautifulSoup to return text with <h2> tags where Heading 2 exists in the original web page

Question

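I am scraping an article from a website and want to preserve the original formatting of the text as far as possible. My idea is to have BeautifulSoup return all the text, then use some Python code to put "h2" tags around the pieces of text that are defined as "h2" in the original web page. But I am having trouble coding this.

Suppose the HTML I am scraping looks something like this: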

<section name="articleBody">
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 1</h2>
  <p>Text on Point 1</p>
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 2</h2>
  <p>Text on Point 2</p>
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 3</h2>
  <p>Text on Point 3</p>
</section>

My desired output is:

<h2> Point 1 </h2>
Text on Point 1
<h2> Point 2 </h2>
Text on Point 2
<h2> Point 3 </h2>
Text on Point 3

I have tried the following code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import re
import requests
import os

url = "http://www.test.com"
options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get(url)

# Parse HTML
soup = BeautifulSoup(driver.page_source, "html.parser")

# Extract article text and include image tags
article_body = soup.find("section", {"name": "articleBody"})
article_text = ""

# Iterate through each element in article_body
def process_elements(elements):
    global article_text
    for element in elements:
        if element.name == "div":  # Check for div elements
            if "h2" in element.find_all(True):  # Check if div contains h2
                article_text += f"<h2>{element.find('h2').get_text()}</h2>\n"
            else:
                article_text += element.get_text(strip=True) + "\n"
            process_elements(element.children)  # Recurse through nested elements
        else:
            # Handle non-div elements directly
            if element.name == "h2":
                article_text += f"<h2>{element.get_text()}</h2>\n"
            else:
                article_text += element.get_text(strip=True) + "\n"


# Apply the processing function to article_body and its children
process_elements(article_body.children)

print(article_text)

However, although the above approach correctly outputs the headings wrapped in h2 tags, it repeats every line of the paragraph text.

Answer 1

If I understand correctly, you want to group the <p> elements by their corresponding <h2> heading tags. Here is an example of how to do that:

from bs4 import BeautifulSoup

html_code = """\
<section name="articleBody">
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 1</h2>
  <p>Text on Point 1</p>
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 2</h2>
  <p>Text on Point 2</p>
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 3</h2>
  <p>Text on Point 3</p>
</section>"""

soup = BeautifulSoup(html_code, "html.parser")

# find appropriate section (<h2> tags followed by <p>)

out = {}
for p_tag in soup.select(":has(> h2 ~ p) p"):
    prev_h2 = p_tag.find_previous("h2")
    t = p_tag.get_text(strip=True, separator="\n")

    if not prev_h2:
        out[None] = t
    else:
        out[prev_h2.get_text(strip=True)] = t

print(out)

Prints:

{
  'Point 1': 'Text on Point 1', 
  'Point 2': 'Text on Point 2', 
  'Point 3': 'Text on Point 3'
}
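
If the goal is the flat, h2-tagged text shown in the question rather than a dictionary, one possible approach is to walk the <h2> and <p> tags in document order and visit each tag exactly once. This is only a minimal sketch, assuming the article consists of <h2> and <p> elements inside the articleBody section; the names lines and article_text are illustrative. The duplication described in the question probably comes from calling get_text() on a container, which already returns the text of all its descendants, and then recursing into that container's children.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (assumed structure of the real page).
html_code = """\
<section name="articleBody">
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 1</h2>
  <p>Text on Point 1</p>
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 2</h2>
  <p>Text on Point 2</p>
  <h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 3</h2>
  <p>Text on Point 3</p>
</section>"""

soup = BeautifulSoup(html_code, "html.parser")
article_body = soup.find("section", {"name": "articleBody"})

lines = []
# find_all() searches all descendants in document order, so headings and
# paragraphs nested inside wrapper elements are each visited once.
for element in article_body.find_all(["h2", "p"]):
    text = element.get_text(strip=True)
    if element.name == "h2":
        lines.append(f"<h2> {text} </h2>")  # re-wrap headings in a bare h2 tag
    else:
        lines.append(text)

article_text = "\n".join(lines)
print(article_text)
```

Because only <h2> and <p> tags are collected, text that sits in other elements (for example list items) would be skipped; the list passed to find_all() would need extending for such pages.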
