How to add markup to XML text using Python

Question (0 votes, 3 answers)

I have a marked-up text in XML format. I need to add further markup, that is, wrap certain words that occur in the text in tags.

This is what I am trying:

import xml.etree.ElementTree as ET
doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''

profs=['one','two']
tag='<key>'
tag_cl='</key>'

root = ET.fromstring(doc)
for child in root:
    for word in profs:
        if word in child.text:
            child.text=child.text.replace(word, f'{tag}{word}{tag_cl}')
    print(child.text)

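This works when the text contains no nested tags. When there is a nested tag ("fr" in this example), child.text only holds the text before the first nested tag. Surely there must be a simple solution for the task I describe. Can you give me a hint?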
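To make the behaviour described in the question concrete, here is a minimal sketch of how ElementTree splits the mixed content of `par`. It reuses the document from the question; the variable names `par` and `fr` are only illustrative.

```python
import xml.etree.ElementTree as ET

doc = "<root><par>An <fr>example</fr> text with key words one and two</par></root>"
par = ET.fromstring(doc).find("par")
fr = par.find("fr")

# Text before the first child element belongs to the parent's .text
print(repr(par.text))  # 'An '
# Text inside the child belongs to the child's .text
print(repr(fr.text))   # 'example'
# Text after the child's closing tag belongs to the child's .tail
print(repr(fr.tail))   # ' text with key words one and two'
```

So the key words in this example are stored in `fr.tail`, not in `par.text`, which is why the loop in the question never sees them.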

python xml elementtree
3 Answers

0 votes

Here is an XSLT 2.0 implementation of the task.

Input XML

<?xml version="1.0"?>
<root>
    <par>An <fr>example</fr> text with key words one and two</par>
</root>

XSLT 2.0

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="utf-8"
                omit-xml-declaration="no"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="text()">
        <xsl:call-template name="OneTwoSequence"/>
    </xsl:template>

    <xsl:template name="OneTwoSequence">
        <xsl:param name="string" select="string(.)"/>
        <xsl:analyze-string select="$string" regex="one|two">
            <xsl:matching-substring>
                <key>
                    <xsl:value-of select="."/>
                </key>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>

Output

<?xml version='1.0' encoding='utf-8' ?>
<root>
  <par>An 
    <fr>example</fr> text with key words 
    <key>one</key> and 
    <key>two</key>
  </par>
</root>
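For readers who stay in Python, the core of the stylesheet is `xsl:analyze-string`, which cuts a text node into matching and non-matching substrings. A rough Python analogue can be sketched with `re.finditer`. The helper name `analyze_string` is hypothetical and not part of any library, and this is only an approximation of the XSLT instruction.

```python
import re


def analyze_string(text, regex):
    """Yield (is_match, substring) pairs in the order they occur in text."""
    pos = 0
    for m in re.finditer(regex, text):
        if m.start() > pos:
            # Text between the previous match and this one
            yield False, text[pos:m.start()]
        yield True, m.group(0)
        pos = m.end()
    if pos < len(text):
        # Remaining text after the last match
        yield False, text[pos:]


pieces = list(analyze_string(" text with key words one and two", "one|two"))
print(pieces)
# [(False, ' text with key words '), (True, 'one'), (False, ' and '), (True, 'two')]
```

Note that the regex `one|two`, both here and in the stylesheet, also matches inside longer words such as "someone". Adding word boundaries (`\b` in Python) would be one way to restrict it to whole words.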

0 votes

You are very close, but to get there you have to use lxml instead of ElementTree:

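from lxml import html as lh

#same document as in the question
doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''
root = lh.fromstring(doc)

#locate the relevant element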
target = root.xpath('//fr')[0]

#convert the relevant element to string and copy it to a new string
#that is a necessary step because we're going to have to delete the
#original string
target_str = lh.tostring(target).decode()

#make the necessary changes to the string
profs=['one','two']
for word in profs:
    if word in target_str:
        target_str = target_str.replace(word, f'<key>{word}</key>')    

#locate the destination for the new element
destination = root.xpath('//par')[0]
#remove the original target
destination.remove(target)
#insert the new string, converted into a new element
destination.insert(0,lh.fromstring(target_str))
print(lh.tostring(root))

The output should be what you expect.


0 votes

What you are looking for is the tail of the element. If necessary, you can duplicate the if condition for elem.text:

import xml.etree.ElementTree as ET
doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''

profs=['one','two']
tag= ET.Element('key')

root = ET.fromstring(doc)

for elem in root.iter():
    #print(elem.text)
    #print(elem.tail)
    for word in profs:
        if elem.tail is not None and word in elem.tail:
            tag.text = word
            elem.tail = elem.tail.replace(word, ET.tostring(tag).decode())

    if elem.tail is not None:
        print(elem.tail)

Output:

text with key words <key>one</key> and <key>two</key>
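One caveat with the snippet above: the markup is stored as a plain string in `elem.tail`, so serializing the tree with `ET.tostring(root)` would escape it as `&lt;key&gt;`. A possible way around this, sketched below with the standard library only, is to split each `text` and `tail` on the key words and insert real `key` elements. The helper `wrap_keywords` is a hypothetical name, and the whole-word matching via `\b` is an assumption that goes beyond the substring matching used in the question.

```python
import re
import xml.etree.ElementTree as ET


def wrap_keywords(root, words, tag="key"):
    """Wrap each whole-word occurrence of `words` in a real <tag> element."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")

    def split(text):
        # With one capture group, re.split returns
        # [plain, match, plain, match, ..., plain].
        parts = pattern.split(text)
        return parts[0], list(zip(parts[1::2], parts[2::2]))

    def make(word, tail):
        key = ET.Element(tag)
        key.text = word
        key.tail = tail
        return key

    # Snapshot the elements first, because the loop inserts new ones.
    for parent in list(root.iter()):
        # Tail of each child: new elements go right after that child.
        for child in list(parent):
            if child.tail:
                head, pairs = split(child.tail)
                child.tail = head
                index = list(parent).index(child) + 1
                for offset, (word, tail) in enumerate(pairs):
                    parent.insert(index + offset, make(word, tail))
        # Text before the first child: new elements go at the front.
        if parent.text:
            head, pairs = split(parent.text)
            parent.text = head
            for offset, (word, tail) in enumerate(pairs):
                parent.insert(offset, make(word, tail))
    return root


doc = "<root><par>An <fr>example</fr> text with key words one and two</par></root>"
result = ET.tostring(wrap_keywords(ET.fromstring(doc), ["one", "two"]), encoding="unicode")
print(result)
# <root><par>An <fr>example</fr> text with key words <key>one</key> and <key>two</key></par></root>
```

Because the key words become elements rather than strings, the serialized output contains actual `key` tags and the nested `fr` element is left untouched.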