BeautifulSoup joins words from different paragraphs

Question

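I have an EPUB file that I need to work with. I am trying to extract the text from the HTML files inside it. When I run soup.get_text() on the extracted HTML content, all the paragraphs are concatenated, so the last word of one paragraph runs into the first word of the next.

I have tried replacing all <br> and </br> tags with spaces. I have also tried changing the parser from html.parser to html5lib.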

with self._epub.open(html_file) as chapter:
    html_content = chapter.read().decode('utf-8')
    html_content = html_content.replace('</br>', ' ')
    html_content = html_content.replace('<br>', ' ')
    soup = bs4.BeautifulSoup(html_content, features="html5lib")
    clean_content = soup.get_text()

Input HTML:

<p>Paragraph1. Line 1</p>

<p>Line 2</p>

Expected output:

Paragraph1. Line 1 Line 2

Actual output:

Paragraph1. Line 1Line 2

Answer 1

Once you have the HTML, you can do it like this: select each paragraph and join their text with a space.

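from bs4 import BeautifulSoup

# Two sibling paragraphs, as in the question
html = '''<p>Paragraph1. Line 1</p><p>Line 2</p>'''

soup = BeautifulSoup(html, 'html.parser')
itemtext = ''
# Append each paragraph's text followed by a space
for item in soup.select('p'):
    itemtext += item.text + ' '

# Remove the trailing space before printing
print(itemtext.strip())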

Output:

Paragraph1. Line 1 Line 2
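
As an alternative to looping over the paragraphs, get_text() accepts a separator argument that is placed between text fragments, and strip=True trims whitespace from each fragment. The following is a minimal sketch rather than part of the original answer. It assumes the two-paragraph input from the answer, and the helper name html_to_text is made up for illustration.

```python
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the text of an HTML fragment with a space between text fragments."""
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " is inserted between adjacent text nodes;
    # strip=True trims each node and skips whitespace-only ones.
    return soup.get_text(separator=" ", strip=True)


html = "<p>Paragraph1. Line 1</p><p>Line 2</p>"
print(html_to_text(html))
```

Note that the separator is inserted at every tag boundary, not only between paragraphs, so text split by inline tags such as <b> or <i> also gets a space. If that matters for your EPUB content, the per-paragraph loop gives finer control.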
