Extracting text and (hlStart and hlEnd) tags from simple HTML

Question (votes: 0, answers: 3)

I have the following fragment of an HTML/XML file:

<p><hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend> <hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>jeszcze trzy <hlstart ana="#ann224159"></hlstart>dni,<hlend ana="#ann224159"></hlend></p>

I want to extract the text and the tags and arrange them in a table, for example:

text, nonana
text, ana

where ana is the value of the tag's ana attribute, for example #ann224094 in:

<hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend>

and nonana means the text has no ana tag.

przed, #ann224094
nami, #ann224160
jeszcze trzy, nonana

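I have already used bs4 and HTMLParser on other parts of the XML data, but I don't understand how to handle this part. I can export the whole text, all strings, with the .text method, but I need to know which words carry an ana tag. In addition, every word with an ana tag will later receive a specific label in my file.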

One approach, using BeautifulSoup:

from pprint import pprint

from bs4 import BeautifulSoup

txt = '''<p><hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend> <hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>jeszcze trzy <hlstart ana="#ann224159"></hlstart>dni,<hlend ana="#ann224159"></hlend></p>'''

soup = BeautifulSoup(txt, 'html.parser')

out = []
# Walk every text node, skipping whitespace-only ones.
for t in soup.find_all(string=True):
    if t.strip() == '':
        continue

    # An empty <hlstart> right before the text marks it as annotated.
    # prev is None when the text has no preceding sibling tag at all.
    prev = t.find_previous_sibling()
    if prev is not None and prev.name == 'hlstart':
        out.append((t, prev['ana']))
    else:
        out.append((t, 'noana'))

# print it to screen:
pprint(out)

Prints:

[('Przed', '#ann224094'),
 ('nami ', '#ann224160'),
 ('jeszcze trzy ', 'noana'),
 ('dni,', '#ann224159')]
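
Since the question asks for the pairs arranged as a table, here is a minimal sketch of one way to turn a list like the one printed above into CSV text with the standard csv module. The helper name to_csv and the header row text,ana are assumptions made for illustration, not part of the original answer.

```python
import csv
import io

# Pairs exactly as printed by the BeautifulSoup loop above.
out = [('Przed', '#ann224094'),
       ('nami ', '#ann224160'),
       ('jeszcze trzy ', 'noana'),
       ('dni,', '#ann224159')]


def to_csv(pairs):
    """Return (text, ana) pairs as CSV text with a header row (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(['text', 'ana'])
    for text, ana in pairs:
        # Strip the trailing spaces that the text nodes carry.
        writer.writerow([text.strip(), ana])
    return buf.getvalue()


table = to_csv(out)
print(table)
```

Using csv.writer rather than joining with commas by hand matters here because the token dni, itself contains a comma, so the writer quotes that field.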

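Another approach, using lxml:

    import lxml.html as lh

    ana = """your html above"""  # paste the <p>...</p> fragment from the question here

    doc = lh.fromstring(ana)
    # The <hlstart> elements are empty, so the annotated text is each one's tail.
    targets = doc.xpath('//hlstart[@ana]')
    # Text that follows any other element (e.g. </hlend>) is unannotated.
    nont = doc.xpath('//*[name() != "hlstart"]')

    for target in targets:
        if target.tail is not None:
            print(target.attrib['ana'], target.tail.strip())

    for n in nont:
        if n.tail is not None and len(n.tail.strip()) > 0:
            print('noana', n.tail.strip())

Output (annotated text first, then unannotated text, so not in document order):

#ann224094 Przed
#ann224160 nami
#ann224159 dni,
noana jeszcze trzy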

Another approach, using SimplifiedDoc:

from simplified_scrapy import SimplifiedDoc,utils
html = '''
<p>
<hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend> 
<hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>
jeszcze trzy 
<hlstart ana="#ann224159"></hlstart>dni,<hlend ana="#ann224159"></hlend></p>'''

doc = SimplifiedDoc(html)
for h in doc.p.hlstarts:
    text = h.nextText()
    if text: print(h.ana,text)
for h in doc.p.hlends:
    text = h.nextText()
    if text: print('noana',text)

Result:

#ann224094 Przed
#ann224160 nami
#ann224159 dni,
noana jeszcze trzy
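
The question also mentions HTMLParser, so for comparison here is a rough sketch using only the standard library's html.parser. It treats hlstart and hlend as milestones: text seen after an hlstart and before the next hlend gets that ana value, and everything else is labelled noana. The class name AnaCollector and the function extract are hypothetical names chosen for this example, and the sketch assumes the milestones are never nested.

```python
from html.parser import HTMLParser


class AnaCollector(HTMLParser):
    """Collect (text, ana) pairs in document order (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.current = None  # ana value of the currently open span, if any
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so hlStart arrives as 'hlstart'.
        if tag == 'hlstart':
            self.current = dict(attrs).get('ana')
        elif tag == 'hlend':
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.rows.append((text, self.current or 'noana'))


def extract(html):
    """Parse an HTML fragment and return its (text, ana) rows."""
    parser = AnaCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = ('<p><hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend> '
          '<hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>'
          'jeszcze trzy <hlstart ana="#ann224159"></hlstart>dni,'
          '<hlend ana="#ann224159"></hlend></p>')

rows = extract(sample)
for text, ana in rows:
    print(text, ana, sep=', ')
```

Unlike the two-loop variants above, this keeps the rows in document order, which may be preferable if the table has to follow the running text.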
python r xml-parsing html-parsing tei