I have the following fragment of an HTML/XML file:
<p><hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend> <hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>jeszcze trzy <hlstart ana="#ann224159"></hlstart>dni,<hlend ana="#ann224159"></hlend></p>
I want to extract the text and the tags and arrange them in a table, like this:
text, nonana
text, ana
where ana is the tag's value, for example #ann224094 taken from:
<hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend>
and nonana means the text is not covered by an ana tag. For example:
przed, #ann224094
nami, #ann224160
jeszcze trzy, nonana
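I have already used bs4 and htmlparser on other parts of the XML data, but I cannot work out this part. With the .text method I can export the whole text, all the strings, but I need to know which words carry an ana tag. In addition, every word with an ana tag will later be given a specific label in my file.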
from pprint import pprint

from bs4 import BeautifulSoup

txt = '''<p><hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend> <hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>jeszcze trzy <hlstart ana="#ann224159"></hlstart>dni,<hlend ana="#ann224159"></hlend></p>'''

soup = BeautifulSoup(txt, 'html.parser')

out = []
# Walk every text node in document order.
for t in soup.find_all(string=True):
    if t.strip() == '':
        continue  # skip whitespace-only nodes
    prev = t.find_previous_sibling()
    # Text directly after an <hlstart> takes that tag's ana value;
    # prev is None when the text has no preceding sibling element.
    if prev is not None and prev.name == 'hlstart':
        out.append((t, prev['ana']))
    else:
        out.append((t, 'noana'))

# print it to screen:
pprint(out)
Prints:
[('Przed', '#ann224094'),
 ('nami ', '#ann224160'),
 ('jeszcze trzy ', 'noana'),
 ('dni,', '#ann224159')]
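The sibling lookup above relies on each piece of text sitting right after its hlstart. As an alternative, here is a minimal sketch using only the standard library's html.parser: it remembers the ana of the most recent hlstart until the matching hlend appears. The class name AnaCollector and the helper extract_rows are made up for illustration, and the sketch assumes spans are not nested.

```python
from html.parser import HTMLParser


class AnaCollector(HTMLParser):
    """Collect (text, label) pairs in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.current = None  # ana of the open hlstart/hlend span, if any
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "hlstart":
            # Entering an annotated span: remember its ana value.
            self.current = dict(attrs).get("ana")
        elif tag == "hlend":
            # Leaving the span: following text is unannotated.
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.rows.append((text, self.current or "noana"))


def extract_rows(markup):
    parser = AnaCollector()
    parser.feed(markup)
    parser.close()
    return parser.rows


txt = '''<p><hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend> <hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>jeszcze trzy <hlstart ana="#ann224159"></hlstart>dni,<hlend ana="#ann224159"></hlend></p>'''

rows = extract_rows(txt)
for text, label in rows:
    print(f"{text}, {label}")
```

Because the state is cleared at hlend rather than inferred from the previous sibling, text that is split across several nodes inside one span would still get the same label.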
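import lxml.html as lh

# Placeholder: paste the <p>...</p> fragment from the question here.
ana = """your html above"""

doc = lh.fromstring(ana)
# <hlstart> elements carry the ana attribute; the annotated text is their tail.
targets = doc.xpath('//hlstart[@ana]')
# Every other element: a non-empty tail here is unannotated text.
nont = doc.xpath('//*[name() != "hlstart"]')

for target in targets:
    if target.tail is not None:
        print(target.attrib['ana'], target.tail.strip())
for n in nont:
    if n.tail is not None and len(n.tail.strip()) > 0:
        print('noana', n.tail.strip())

Output:
#ann224094 Przed
#ann224160 nami
#ann224159 dni,
noana jeszcze trzy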
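Note that the two loops above list all annotated text first and the unannotated text last, so the original word order is lost. If order matters, the same tail idea can be applied in a single pass. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml, and assumes the fragment is well-formed XML (every hlstart and hlend properly closed); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

txt = '''<p><hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend> <hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>jeszcze trzy <hlstart ana="#ann224159"></hlstart>dni,<hlend ana="#ann224159"></hlend></p>'''

root = ET.fromstring(txt)

ordered = []
# iter() visits elements in document order, so tails come out in reading order.
for el in root.iter():
    tail = (el.tail or "").strip()
    if not tail:
        continue  # no text (or only whitespace) follows this element
    # Text after <hlstart> is annotated; text after anything else is not.
    label = el.get("ana") if el.tag == "hlstart" else "noana"
    ordered.append((tail, label))

for text, label in ordered:
    print(label, text)
```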
from simplified_scrapy import SimplifiedDoc,utils
html = '''
<p>
<hlstart ana="#ann224094"></hlstart>Przed<hlend ana="#ann224094"></hlend>
<hlstart ana="#ann224160"></hlstart>nami <hlend ana="#ann224160"></hlend>
jeszcze trzy
<hlstart ana="#ann224159"></hlstart>dni,<hlend ana="#ann224159"></hlend></p>'''
doc = SimplifiedDoc(html)
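# Text that follows each <hlstart> is annotated with that tag's ana value.
for h in doc.p.hlstarts:
    text = h.nextText()
    if text: print(h.ana,text)
# Text that follows an <hlend> lies outside any annotated span.
for h in doc.p.hlends:
    text = h.nextText()
    if text: print('noana',text)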
Result:
#ann224094 Przed
#ann224160 nami
#ann224159 dni,
noana jeszcze trzy
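Since the question asks for the pairs arranged as a table, one possible last step is to write them out as CSV. This is only a sketch: the rows list is typed in by hand from the results shown in the answers, and the helper name rows_to_csv and the column headers text and ana are assumptions.

```python
import csv
import io

# Pairs copied by hand from the outputs above (whitespace stripped).
rows = [
    ("Przed", "#ann224094"),
    ("nami", "#ann224160"),
    ("jeszcze trzy", "noana"),
    ("dni,", "#ann224159"),
]


def rows_to_csv(pairs):
    """Return the (text, label) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["text", "ana"])
    writer.writerows(pairs)
    return buf.getvalue()


table = rows_to_csv(rows)
print(table)
```

The csv module quotes fields that contain a comma, so a token such as dni, survives a round trip through csv.reader instead of being split into two columns.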