Merging lines in XML based on a character in Python


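I have multiple XML files, which are XML versions of PDF documents. First I have to merge the XML files, then find the words that end with a hyphen. When a word is hyphenated at the end of a line, a separate tag (TCL CHAR='-') is created in the XML. I need to identify these tags and merge the last word of the previous line with the first word of the next line, which sits in a separate TC tag. I have the following code for merging the files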

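import xml.etree.ElementTree as ET

def run(files):
    """Merge the children of every file's root into the first file's root."""
    first = None
    for filename in files:
        data = ET.parse(filename).getroot()
        if first is None:
            first = data
        else:
            first.extend(data)
    if first is None:
        return None  # no input files were given
    # Serialized bytes; parse them with ET.fromstring() before iterating over elements
    return ET.tostring(first)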

and the following code for merging the words

beg_line_cont = []
end_line_cont = []
for block in root:
    for para in block:
        for line in para:
            for word in line:
                if word.tag == 'TC':
                    line = word.text
                if word.tag == 'TCL' and word.attrib['CHAR']=='-':
                    beg_line_cont.append(line)
                    if word.tag == 'TC':
                        line = word.text
                        end_line_cont.append(line)

The word-merging code doesn't work: I can get the text of the line before TCL CHAR='-', but not the text of the next line. Can anyone help?

A sample of the XML file is here:

</PAR>
<LPAR PBDPL="[D]137[L]120" PBCAMGTI="[G]LP6[T]Lead VJ" STRIKE="0"></LPAR>
<PAR PBDPL="[D]3360[P]3m" PBCAMGTI="[G]I2AS[I]0" TAPARADV="[HYP]1" BLMODE="3" STRIKE="0" UNIQID="d180d82ee84ff937">
<LINE>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Diese Angebotsunterlage (die &#132;</TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Angebotsunterlage</TC>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>&#147;) beschreibt das freiwillige &#246;ffentliche &#220;bernahme</TC>
<TCL CHAR="-" WIDTH="67" CTLCHAR="-" CTLSTR="" TYPE="SYSTEMHYPHEN" VISIBLE="1" USE_SF_LDRVALUES="1"/></LINE>
<LINE>
<TC>angebot in Form eines Tauschangebots (das &#132;</TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Angebot</TC>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>&#147;) der ADO Properties S.A., einer Aktiengesell</TC>
<TCL CHAR="-" WIDTH="67" CTLCHAR="-" CTLSTR="" TYPE="SYSTEMHYPHEN" VISIBLE="1" USE_SF_LDRVALUES="1"/></LINE>
<LINE>
<TC>schaft nach luxemburgischem Recht </TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ITALIC" PNTSZSTR="" FONTNAME="" FACE="I" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>(soci&#233;t&#233; anonyme)</TC>
<FRMDEF NAME="ITALIC" PNTSZSTR="" FONTNAME="" FACE="I" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC> mit Sitz in Senningerberg, eingetragen im </TC>
</LINE>
<LINE>
python-3.x xml elementtree
1 Answer

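When you encounter a hyphen, you need to:

  1. Remove the last word from the previous line.
  2. Start a new line.
  3. Put that word directly in front of the first word of the next line, with no space in between.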

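The idea is to keep a "prefix" variable for the new line. That is, if you have blah-<newline>blahblah, you will see "blah" first and store it as the prefix for the next line; when you then see "blahblah", you append it to "blah".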

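text = prefix = ''
for block in root:
    for para in block:
        for line in para:
            for word in line:
                if word.tag == 'TC':
                    # Prepend the word fragment carried over from a hyphenated line
                    text += prefix + (word.text or '')
                    prefix = ''
                elif word.tag == 'TCL' and word.attrib.get('CHAR') == '-':
                    # Find the last word and carry it over as the prefix.
                    # rfind returns -1 when there is no space, so "+ 1" also
                    # covers that case and keeps the space before the word.
                    last_space_index = text.rfind(' ')
                    prefix = text[last_space_index + 1:]
                    text = text[:last_space_index + 1]
text += prefix  # flush a fragment left over if the input ends on a hyphen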
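
To try the prefix idea without the full files, here is a minimal self-contained sketch. The `SAMPLE` string is a trimmed-down, hypothetical fragment modelled on the XML in the question, and `join_hyphenated` is a helper name introduced only for this illustration. It assumes that every hyphenated line ends with a `TCL CHAR="-"` element and that the word fragments sit in `TC` elements, as in the sample above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample modelled on the XML from the question.
SAMPLE = """<PAR>
<LINE><TC>einer Aktiengesell</TC><TCL CHAR="-" TYPE="SYSTEMHYPHEN"/></LINE>
<LINE><TC>schaft nach luxemburgischem Recht </TC><TC>mit Sitz in Senningerberg</TC></LINE>
</PAR>"""


def join_hyphenated(par):
    """Return the text of one PAR element with end-of-line hyphens resolved."""
    text = prefix = ''
    for line in par.iter('LINE'):
        for node in line:
            if node.tag == 'TC':
                # Glue a carried-over fragment directly onto this text
                text += prefix + (node.text or '')
                prefix = ''
            elif node.tag == 'TCL' and node.get('CHAR') == '-':
                # Split off the last word; rfind gives -1 if there is no space,
                # so cut becomes 0 and the whole text is carried over
                cut = text.rfind(' ') + 1
                prefix, text = text[cut:], text[:cut]
    # Keep a fragment that was never followed by another TC
    return text + prefix


print(join_hyphenated(ET.fromstring(SAMPLE)))
# einer Aktiengesellschaft nach luxemburgischem Recht mit Sitz in Senningerberg
```

Splitting at `rfind(' ') + 1` keeps the space in front of the carried word, so the rejoined word does not run into the word before it.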