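I'm scraping an article from a website and want to preserve the text's original formatting as much as possible. My idea is to have BeautifulSoup return all of the text, then use some Python code to put "h2" tags around the text fragments that are defined as "h2" in the original web page. But I'm running into problems with the code.
Suppose the HTML I'm scraping looks something like this: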
<section name="articleBody">
<h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 1</h2>
<p>Text on Point 1</p>
<h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 2</h2>
<p>Text on Point 2</p>
<h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 3</h2>
<p>Text on Point 3</p>
</section>
The output I want is:
<h2> Point 1 </h2>
Text on Point 1
<h2> Point 2 </h2>
Text on Point 2
<h2> Point 3 </h2>
Text on Point 3
I tried the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import re
import requests
import os
url = "http://www.test.com"
options = Options()
options.add_argument("--headless=new")  # the options.headless setter was removed in recent Selenium 4 releases
driver = webdriver.Chrome(options=options)
driver.get(url)
# Parse HTML
soup = BeautifulSoup(driver.page_source, "html.parser")
# Extract article text and include image tags
article_body = soup.find("section", {"name": "articleBody"})
article_text = ""
# Iterate through each element in article_body
def process_elements(elements):
    global article_text
    for element in elements:
        if element.name == "div":  # Check for div elements
            if "h2" in element.find_all(True):  # Check if div contains h2
                article_text += f"<h2>{element.find('h2').get_text()}</h2>\n"
            else:
                article_text += element.get_text(strip=True) + "\n"
            process_elements(element.children)  # Recurse through nested elements
        else:
            # Handle non-div elements directly
            if element.name == "h2":
                article_text += f"<h2>{element.get_text()}</h2>\n"
            else:
                article_text += element.get_text(strip=True) + "\n"
# Apply the processing function to article_body and its children
process_elements(article_body.children)
print(article_text)
However, with the approach above, the headings are correctly wrapped in h2 tags, but every line of paragraph text is repeated.
If I understand correctly, you want to group the <p> elements under their corresponding <h2> heading tags. Here is an example of how to do that:
from bs4 import BeautifulSoup
html_code = """\
<section name="articleBody">
<h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 1</h2>
<p>Text on Point 1</p>
<h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 2</h2>
<p>Text on Point 2</p>
<h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 3</h2>
<p>Text on Point 3</p>
</section>"""
soup = BeautifulSoup(html_code, "html.parser")
# find appropriate section (<h2> tags followed by <p>)
out = {}
for p_tag in soup.select(":has(> h2 ~ p) p"):
    prev_h2 = p_tag.find_previous("h2")
    t = p_tag.get_text(strip=True, separator="\n")
    if not prev_h2:
        out[None] = t
    else:
        out[prev_h2.get_text(strip=True)] = t
print(out)
Prints:
{
'Point 1': 'Text on Point 1',
'Point 2': 'Text on Point 2',
'Point 3': 'Text on Point 3'
}
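If you need the flat text layout from the question rather than a dictionary, something along these lines may work. This is only a sketch: the helper name `render_article` is made up for illustration, and the sample HTML wraps one paragraph in a `<div>` on the assumption that the real page nests its content that way. The duplicated lines in the question most likely come from calling `get_text()` on a container and then recursing into its children, so the same paragraph gets added twice. Walking the section with `find_all(["h2", "p"])` avoids that, because each matching tag is visited once, in document order, however deeply it is nested.

```python
from bs4 import BeautifulSoup

# Sample markup; the <div> around the first paragraph is an assumption
# about how the real page might nest its content.
html_code = """\
<section name="articleBody">
<h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 1</h2>
<div><p>Text on Point 1</p></div>
<h2 id="link-a1b2c3d4" style="color:rgb(18, 18, 18)">Point 2</h2>
<p>Text on Point 2</p>
</section>"""


def render_article(html: str) -> str:
    """Return the article text with headings re-wrapped in bare <h2> tags."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("section", {"name": "articleBody"})
    if body is None:
        return ""

    lines = []
    # find_all() yields every <h2> and <p> descendant exactly once,
    # in document order, so nothing is emitted twice.
    for tag in body.find_all(["h2", "p"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            lines.append(f"<h2> {text} </h2>")
        else:
            lines.append(text)
    return "\n".join(lines)


article_text = render_article(html_code)
print(article_text)
```

With the Selenium setup from the question, you would pass `driver.page_source` to `render_article` instead of `html_code`. Text that sits outside any `<h2>` or `<p>` tag is skipped by this version, so add more tag names to the `find_all` list if the page uses them.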