lxml does not find a tag when parsing with iterparse

Question (votes: 0, answers: 2)

I have a very strange problem with lxml. I am trying to parse my XML file with iterparse, like this:

for event, elem in etree.iterparse(input_file, events=('start', 'end')):
    if elem.tag == 'tuv' and event == 'start':
        if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
            if elem.find('seg') is not None:
                write_in_some_file
        elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
            if elem.find('seg') is not None:
                write_in_some_file

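It is very simple and works well: it walks through my XML file quickly, and whenever an element is a tuv it checks whether the language attribute is 'en' or 'de', then checks whether there is a seg child, and if so writes its value to a file.

However, one <seg> in the file appears not to exist: elem.find('seg') returns None for it. You can see it in the context below; it is <seg>! keine Spalten und Ventile</seg>.

I don't understand why this seemingly well-formed tag causes a problem (I cannot use its .text). Note that every other tag is found without trouble.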

<tu tuid="235084307" datatype="Text">
<prop type="score">1.67647</prop>
<prop type="score-zipporah">0.6683</prop>
<prop type="score-bicleaner">0.7813</prop>
<prop type="lengthRatio">0.740740740741</prop>
<tuv xml:lang="en">
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
 <seg>! no gaps and valves</seg>
</tuv>
<tuv xml:lang="de">
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
 <seg>! keine Spalten und Ventile</seg>
</tuv>
</tu>
python xml tags lxml iterparse
2 Answers
1
vote

The lxml docs have this warning:

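WARNING: During the 'start' event, any content of the element, such as the descendants, following siblings or text, is not yet available and should not be accessed. Only attributes are guaranteed to be set.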
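To make the warning concrete, here is a minimal sketch that records the order in which iterparse delivers events. The one-line document and the names `doc` and `order` are made up for illustration. The point is that the 'start' event of tuv is reported before seg has been seen at all, and only the 'end' event of tuv is guaranteed to come after its children are complete.

```python
from io import BytesIO
from lxml import etree

# Hypothetical miniature document, not the asker's real file.
doc = b'<tuv xml:lang="de"><prop>p</prop><seg>text</seg></tuv>'

# Record (event, tag) pairs in the order the parser reports them.
order = [
    (event, elem.tag)
    for event, elem in etree.iterparse(BytesIO(doc), events=("start", "end"))
]

for item in order:
    print(item)
# ('start', 'tuv')
# ('start', 'prop')
# ('end', 'prop')
# ('start', 'seg')
# ('end', 'seg')
# ('end', 'tuv')
```

Whether find('seg') happens to work during 'start' depends on how far the parser has already read ahead, which would explain why most tags were found and one was not.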

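Maybe instead of calling find() on tuv to get the seg element, change your "if" statement to match seg and the "end" event.

You can use getparent() to get the xml:lang attribute value from the parent tuv.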

Example ("test.xml", with an extra "tu" element for testing)

<tus>
    <tu tuid="235084307" datatype="Text">
        <prop type="score">1.67647</prop>
        <prop type="score-zipporah">0.6683</prop>
        <prop type="score-bicleaner">0.7813</prop>
        <prop type="lengthRatio">0.740740740741</prop>
        <tuv xml:lang="en">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! no gaps and valves</seg>
        </tuv>
        <tuv xml:lang="de">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! keine Spalten und Ventile</seg>
        </tuv>
    </tu>
    <tu tuid="235084307A" datatype="Text">
        <prop type="score">1.67647</prop>
        <prop type="score-zipporah">0.6683</prop>
        <prop type="score-bicleaner">0.7813</prop>
        <prop type="lengthRatio">0.740740740741</prop>
        <tuv xml:lang="en">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! no gaps and valves #2</seg>
        </tuv>
        <tuv xml:lang="de">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! keine Spalten und Ventile #2</seg>
        </tuv>
    </tu>
</tus>

Python example

from lxml import etree

for event, elem in etree.iterparse("test.xml", events=("start", "end")):

    if elem.tag == "seg" and event == "end":
        current_lang = elem.getparent().get("{http://www.w3.org/XML/1998/namespace}lang")
        if current_lang == "en":
            print(f"Writing en text \"{elem.text}\" to file...")
        elif current_lang == "de":
            print(f"Writing de text \"{elem.text}\" to file...")
        else:
            print(f"Unable to determine language. Not writing \"{elem.text}\" to any file.")

    if event == "end":
        elem.clear()

Printed output

Writing en text "! no gaps and valves" to file...
Writing de text "! keine Spalten und Ventile" to file...
Writing en text "! no gaps and valves #2" to file...
Writing de text "! keine Spalten und Ventile #2" to file...
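A variant that stays closer to the question's code is to keep matching tuv but act on its 'end' event, when its children are guaranteed to be parsed. The sketch below assumes lxml's `tag` keyword for iterparse to filter the events; the inline `SAMPLE` data and the helper name `collect_segments` are hypothetical, and collecting into a list stands in for writing to a file.

```python
from io import BytesIO
from lxml import etree

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, shortened sample in the same shape as test.xml.
SAMPLE = b"""<tus>
  <tu tuid="1">
    <tuv xml:lang="en"><prop type="source-document">a</prop><seg>! no gaps and valves</seg></tuv>
    <tuv xml:lang="de"><prop type="source-document">a</prop><seg>! keine Spalten und Ventile</seg></tuv>
  </tu>
</tus>"""


def collect_segments(source):
    """Return (lang, text) pairs for every en/de tuv that has a seg child."""
    segments = []
    # Only 'end' events for tuv are delivered, so seg is fully parsed here.
    for _event, tuv in etree.iterparse(source, events=("end",), tag="tuv"):
        lang = tuv.get(XML_LANG)
        seg = tuv.find("seg")
        if lang in ("en", "de") and seg is not None:
            segments.append((lang, seg.text))
        tuv.clear()  # release the processed subtree to keep memory low
    return segments


result = collect_segments(BytesIO(SAMPLE))
print(result)
# [('en', '! no gaps and valves'), ('de', '! keine Spalten und Ventile')]
```

Reading the attribute before calling clear() matters, because clear() also removes the element's attributes.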

1
vote

I'm not sure whether this is what you are looking for (I'm fairly new to this myself), but

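from lxml import etree

# Note: this reads children during the 'start' event, which lxml does not
# guarantee to be parsed yet, so it may only work on small inputs.
for event, elem in etree.iterparse('xml_try.txt', events=('start', 'end')):
    if elem.tag == 'tuv' and event == 'start':
        if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
            if elem.find('seg') is not None:
                print(elem[2].text)  # third child of tuv is the seg element
        elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
            if elem.find('seg') is not None:
                print(elem[2].text)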

produces this output:

! no gaps and valves
! keine Spalten und Ventile

If this is not what you wanted, my apologies.
