Splitting text on markup in Python

Question

I have the following line of text:

<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>

Using Python, I want to break it apart at the markup entities to get the following list:

['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ ', '<pre class="mermaid">', 'stuff', '</pre>']

So far, I have used:

import re

markup = re.compile(r"(<(?P<tag>[a-z]+).*>)(.*?)(<\/(?P=tag)>)")
# Raw string so that the backslash in \LaTeX is kept literally.
text = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
words = re.split(markup, text)

But it produces:

['<code>', 'code', 'stuff', '</code>', ' and stuff and $\\LaTeX$ ', '<pre class="mermaid">', 'pre', 'stuff', '</pre>']

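The problem is that the named group `(?P<tag>...)` is added to the list because it is a capturing group. I capture it only so that the backreference `(?P=tag)` matches the nearest closing tag.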

Assuming the code only processes one line at a time, how can I remove it from the resulting list?

Answer 1

You can use `xml`, a module designed for XML files, whose syntax is similar to that of HTML.

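import xml.etree.ElementTree as ET

text = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'

# Wrap the fragment in a dummy root so it parses as a single XML document.
root = ET.fromstring(f'<root>{text}</root>')

result = []
if root.text:  # text before the first tag
    result.append(root.text)

for element in root:  # top-level elements only; nested tags are not expanded
    # Rebuild the opening tag, including its attributes.
    attrs = ''.join(f' {k}="{v}"' for k, v in element.attrib.items())
    result.append(f'<{element.tag}{attrs}>')
    if element.text:
        result.append(element.text)
    result.append(f'</{element.tag}>')
    if element.tail:  # text between this element and the next one
        result.append(element.tail)

print(result)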
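The approach above relies on the line being well-formed XML once it is wrapped in a root element, which ordinary HTML often is not. The following is a minimal sketch of how one might check that before parsing; the helper name `is_well_formed` is an illustration and not part of the answer.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a dummy root."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<code>stuff</code> and more"))  # True
print(is_well_formed("line one<br>line two"))         # False: <br> is never closed
```

If the check fails, a tolerant tokenizer is probably a better fit than `xml.etree.ElementTree`.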

Answer 2

Regex is not suited to parsing HTML. However, it is often good enough for tokenization. With `re.finditer`, tokenization becomes a one-liner:

list(map(lambda x: x.group(0), re.finditer(r"(?:<(?:.*?>)?)|[^<]+", s)))

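Explanation:

Only non-capturing groups `(?:...)` are used; no specific captures are needed here.

The pattern matches either a "tag" `<(?:.*?>)?` (possibly invalid, such as a lone `<`; it is recognized only by its opening `<` and runs up to the next `>`) or plain text `[^<]+`.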
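To make the two alternatives concrete, here is a small sketch that applies the same pattern with `re.findall`, which returns whole matches when the pattern has no capturing groups. The names `TOKEN` and `tokenize` are only illustrative.

```python
import re

# Pattern from the answer: a "tag" (or a lone "<"), or a run of non-"<" text.
TOKEN = re.compile(r"(?:<(?:.*?>)?)|[^<]+")

def tokenize(line):
    """Return the tag and text tokens found in one line."""
    # With no capturing groups, findall returns the full matches.
    return TOKEN.findall(line)

print(tokenize("<b>bold</b> text"))  # ['<b>', 'bold', '</b>', ' text']
print(tokenize("1 < 2"))             # ['1 ', '<', ' 2']
```

Note that a stray `<` is emitted as its own token only when no `>` follows it on the same line; otherwise everything up to that `>` is treated as one "tag".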

The one-liner handles your test case

s = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'

correctly, producing

['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']

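Note, however, that a full-fledged HTML tokenizer needs a more complex regular grammar to correctly handle, for example, attributes such as `onclick = "console.log(1 < 2)"`. You are better off using an off-the-shelf library to do the markup parsing (or even just the tokenization) for you.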
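As one possible illustration of the library route, the sketch below uses `html.parser.HTMLParser` from the standard library to collect tags and text into a flat list. The class name `TagTokenizer` and the function `tokenize_html` are assumptions made for this example, not something the answer prescribes.

```python
from html.parser import HTMLParser

class TagTokenizer(HTMLParser):
    """Collect start tags, end tags and text as a flat list of strings."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the input.
        self.tokens.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser reports only the tag name, so the end tag is rebuilt.
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.append(data)

def tokenize_html(line):
    parser = TagTokenizer()
    parser.feed(line)
    parser.close()  # flush any buffered trailing text
    return parser.tokens

line = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
print(tokenize_html(line))
print(tokenize_html('<button onclick="console.log(1 < 2)">go</button>'))
```

This sketch is lossy in a few ways: end tags are rebuilt from the lowercased tag name, comments and declarations are dropped, and character references in text are decoded, so it suits tokenization better than exact round-tripping.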

Answer 3

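s = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'

# Start with one empty token so that text before the first tag also works.
tokens = [""]

for ch in s:
    if ch == "<":
        tokens.append(ch)    # a "<" starts a new token
    elif ch == ">":
        tokens[-1] += ch     # a ">" ends the current tag ...
        tokens.append("")    # ... and the next character starts a new token
    else:
        tokens[-1] += ch

tokens = [t for t in tokens if t]  # drop the empty tokens left between tags
print(tokens)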

Output:

['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']
