从字符串中解析XML报告文件中的“垃圾”,我不知道如何找到它

问题描述 投票:1回答:1

我正在尝试使用元素树解析XML字符串。此字符串来自多个dict值,这些值连接在一起。没有根节点,但第一次运行良好。

我第一次这样做,它起作用了:

   for value in data.values():
        myxml = ' '.join(value)
        tree = ET.fromstring(myxml)

但是在相同情况下,只有另一本字典,它不起作用。我要做的代码很简单:

values = [x for x in dict_fasi.values()]
    myxml_fasi = ' '.join(values)
tree2 = ET.fromstring(myxml_fasi)

我也像以前一样尝试过循环,但是没有用。错误显示:xml.etree.ElementTree.ParseError:文档元素后出现垃圾:第8行,第20列

第8行应为:

</new_line> <new_line>

并且XML字符串是:

<new_line>
          <text font="NUMPTY+ImprintMTnum" bbox="297.284,540.828,300.188,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">della quale non conosce che una parte;] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="322.455,540.839,328.251,553.566" colourspace="DeviceGray" ncolour="0" size="12.727">prima</text>
          <text font="NUMPTY+ImprintMTnum" bbox="331.206,545.345,334.683,552.834" colourspace="DeviceGray" ncolour="0" size="7.489">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="177.602,528.028,180.850,540.510" colourspace="DeviceGray" ncolour="0" size="12.482">che nonconosce ancora appieno;</text>
          <text font="NUMPTY+ImprintMTnum" bbox="189.430,532.545,192.908,540.034" colourspace="DeviceGray" ncolour="0" size="7.489">2</text>
          <text font="NUMPTY+ImprintMTnum" bbox="203.879,528.028,208.975,540.510" colourspace="DeviceGray" ncolour="0" size="12.482">che</text>
        </new_line> <new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="315.109,462.272,319.863,472.957" colourspace="DeviceGray" ncolour="0" size="10.685">5</text>
          <text font="NUMPTY+ImprintMTnum" bbox="368.916,461.828,372.743,474.310" colourspace="DeviceGray" ncolour="0" size="12.482">avederci]</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="86.577,449.039,92.373,461.766" colourspace="DeviceGray" ncolour="0" size="12.727">sps.a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="167.611,449.028,172.707,461.510" colourspace="DeviceGray" ncolour="0" size="12.482">dove io andava a</text>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="68.031,421.672,72.786,432.357" colourspace="DeviceGray" ncolour="0" size="10.685">5</text>
          <text font="NUMPTY+ImprintMTnum" bbox="137.296,421.228,140.200,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">tante libertà] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="161.868,421.239,167.664,433.966" colourspace="DeviceGray" ncolour="0" size="12.727">prima</text>
          <text font="NUMPTY+ImprintMTnum" bbox="170.784,425.745,174.262,433.234" colourspace="DeviceGray" ncolour="0" size="7.489">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="174.297,421.228,183.920,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
          <text font="MUVAOR+Symbol" bbox="194.367,421.612,199.376,431.672" colourspace="DeviceGray" ncolour="0" size="10.060">&lt;&gt;</text>
          <text font="NUMPTY+ImprintMTnum" bbox="208.349,425.745,211.827,433.234" colourspace="DeviceGray" ncolour="0" size="7.489">2</text>
          <text font="NUMPTY+ImprintMTnum" bbox="244.601,421.228,250.976,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">certe lib</text>
          <text font="MUVAOR+Symbol" bbox="250.901,421.612,255.910,431.672" colourspace="DeviceGray" ncolour="0" size="10.060">&lt;</text>
          <text font="NUMPTY+ImprintMTnum" bbox="269.331,421.228,274.426,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">ertà</text>
          <text font="MUVAOR+Symbol" bbox="274.363,421.612,279.373,431.672" colourspace="DeviceGray" ncolour="0" size="10.060">&gt;</text>
        </new_line> <new_line>

第一个工作的XML字符串是这样的:

<new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="234.782,118.872,239.536,129.558" colourspace="DeviceGray" ncolour="0" size="10.685">80</text>
          <text font="NUMPTY+ImprintMTnum" bbox="360.280,118.428,363.184,130.911" colourspace="DeviceGray" ncolour="0" size="12.482">pazienza, e la prudenza.] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="369.339,118.440,375.135,131.167" colourspace="DeviceGray" ncolour="0" size="12.727">da</text>
          <text font="NUMPTY+ImprintMTnum" bbox="113.588,105.629,118.684,118.111" colourspace="DeviceGray" ncolour="0" size="12.482">pa-zienza</text>
          <text font="MUVAOR+Symbol" bbox="120.415,105.707,124.422,117.543" colourspace="DeviceGray" ncolour="0" size="11.835">=</text>
        </new_line>
<new_line>
          <text font="NUMPTY+ImprintMTnum" bbox="194.095,105.629,196.999,118.111" colourspace="DeviceGray" ncolour="0" size="12.482">Cristoforo] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="214.031,105.640,219.827,118.367" colourspace="DeviceGray" ncolour="0" size="12.727">sts.a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="241.600,81.508,247.396,93.991" colourspace="DeviceGray" ncolour="0" size="12.482">Galdino 72</text>
          <text font="SZWUPJ+ImprintExpertMT" bbox="272.785,614.422,276.490,625.380" colourspace="DeviceGray" ncolour="0" size="10.958">  </text>
          <text font="NUMPTY+ImprintMTnum" bbox="53.923,592.408,58.102,602.646" colourspace="DeviceGray" ncolour="0" size="10.238">34c</text>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="72.640,592.472,77.394,603.157" colourspace="DeviceGray" ncolour="0" size="10.685">80</text>
          <text font="NUMPTY+ImprintMTnum" bbox="187.701,592.028,190.605,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">troverà … immaginare] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="201.265,592.039,204.169,604.766" colourspace="DeviceGray" ncolour="0" size="12.727">da </text>
          <text font="NUMPTY+ImprintMTnum" bbox="305.701,592.028,310.796,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">qualche rimedio inaspe</text>
          <text font="MUVAOR+Symbol" bbox="310.691,592.412,315.701,602.472" colourspace="DeviceGray" ncolour="0" size="10.060">&lt;</text>
          <text font="NUMPTY+ImprintMTnum" bbox="331.518,592.028,337.314,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">ttato</text>
          <text font="MUVAOR+Symbol" bbox="337.154,592.412,342.163,602.472" colourspace="DeviceGray" ncolour="0" size="10.060">&gt;</text>
        </new_line>

也许是new_line标签的打开和关闭的问题,但我不知道如何解决。

python xml elementtree
1个回答
1
投票

错误消息中的“垃圾”一词似乎是一种相当不公平的价值判断;但是,这意味着解析器希望看到单个顶级元素,并且到达该元素的末尾(以及所有尾随注释或PI)时,它希望看到文件的末尾。如果还有另一个元素开始标记,则它不是格式正确的XML文档。

© www.soinside.com 2019 - 2024. All rights reserved.