I've recently been trying to extract a nested structure from HTML, but I can't figure out how to do it. I have an HTML snippet like this:
<ul>
<li><object type="text/sitemap"><param name="Name" value="level1"/></object>
<ul>
<li><object type="text/sitemap"><param name="Name" value="data1"/></object></li>
<li><object type="text/sitemap"><param name="Name" value="level2"/></object>
<ul>
<li><object type="text/sitemap"><param name="Name" value="data2"/></object></li>
<li><object type="text/sitemap"><param name="Name" value="data3"/></object></li>
</ul>
</ul>
</ul>
This is the output I want:
output = [level1 --- data1, level1 --- level2 ----data2, level1 ---- level2 ----data3]
How can I do this?
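I'm not sure whether some closing tags are missing from your HTML. After adding them, here is the code and its result using SimplifiedDoc: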
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<ul>
<li><object type="text/sitemap"><param name="Name" value="level1"/></object>
<ul>
<li><object type="text/sitemap"><param name="Name" value="data1"/></object></li>
<li><object type="text/sitemap"><param name="Name" value="level2"/></object>
<ul>
<li><object type="text/sitemap"><param name="Name" value="data2"/></object></li>
<li><object type="text/sitemap"><param name="Name" value="data3"/></object></li>
</ul>
</ul>
</ul>'''
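def test(ul, l_name, lst):
    # Walk the direct children of this <ul>; each <li> is either a level or a leaf.
    nodes = ul.children
    count = len(nodes)
    for i in range(count):
        node = nodes[i]
        if node.tag != 'li':
            continue  # nested <ul> nodes are handled through the preceding <li>
        nxt = nodes[i + 1] if i + 1 < count else None
        name = node.param.value
        # Use a local path so sibling items do not inherit a deeper prefix.
        path = l_name + '-' + name if l_name else name
        if nxt is not None and nxt.tag == 'ul':
            # An <li> followed by a <ul> names a level: recurse into that <ul>.
            test(nxt, path, lst)
        else:
            # Otherwise the <li> is a leaf: record its full path.
            lst.append(path)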
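doc = SimplifiedDoc()
# Close each <li> that is left open before a nested <ul>, so the <ul> becomes its next sibling.
doc.loadHtml(doc.replaceReg(html, r'</object>[\s]*<ul', '</object></li><ul'))
ul = doc.ul
lst = []
test(ul, '', lst)
print(lst)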
Result:
['level1-data1', 'level1-level2-data2', 'level1-level2-data3']
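
The `replaceReg` call in the code above appears to insert the missing `</li>` before each nested `<ul>`, so that the `<ul>` ends up as a sibling of the `<li>` that names its level. As a rough illustration of that normalization step only, here is the same substitution with the standard library's `re.sub`; the `snippet` variable is made up for this example and is not part of the original answer.

```python
import re

# A level entry whose <li> is never closed before the nested <ul> starts.
snippet = '<li><object type="text/sitemap"><param name="Name" value="level1"/></object>\n<ul>'

# Insert the missing </li> wherever </object> is followed directly by <ul.
fixed = re.sub(r'</object>\s*<ul', '</object></li><ul', snippet)

print(fixed)
# <li><object type="text/sitemap"><param name="Name" value="level1"/></object></li><ul>
```

Entries that are already closed (`</object></li>`) do not match the pattern, so they are left untouched.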
You can find more SimplifiedDoc examples here.
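
If you would rather not depend on a third-party package, the same paths can probably be built with the standard library's `html.parser`. This is only a sketch under a few assumptions: every entry carries a `<param name="Name" value="...">`, a name that is immediately followed by a `<ul>` is a level, and any other name is a leaf. The class name `SitemapPathParser` and the helper `extract_paths` are invented for this example. Because it reacts only to `<param>` and `<ul>` tags, it does not need the missing `</li>` tags to be repaired first.

```python
from html.parser import HTMLParser


class SitemapPathParser(HTMLParser):
    """Collects 'level-...-leaf' paths from nested <ul> sitemap markup."""

    def __init__(self):
        super().__init__()
        self.levels = []     # one entry per open <ul>; None for an unnamed <ul>
        self.pending = None  # last name seen, not yet known to be level or leaf
        self.paths = []

    def _flush(self):
        # A pending name that was not followed by <ul> is a leaf.
        if self.pending is not None:
            parts = [n for n in self.levels if n is not None] + [self.pending]
            self.paths.append('-'.join(parts))
            self.pending = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'param' and attrs.get('name') == 'Name':
            self._flush()
            self.pending = attrs.get('value', '')
        elif tag == 'ul':
            # The name seen just before this <ul> (if any) becomes a level.
            self.levels.append(self.pending)
            self.pending = None

    def handle_endtag(self, tag):
        if tag == 'ul':
            self._flush()
            if self.levels:
                self.levels.pop()


def extract_paths(html):
    parser = SitemapPathParser()
    parser.feed(html)
    parser.close()
    return parser.paths


sample_html = '''
<ul>
<li><object type="text/sitemap"><param name="Name" value="level1"/></object>
<ul>
<li><object type="text/sitemap"><param name="Name" value="data1"/></object></li>
<li><object type="text/sitemap"><param name="Name" value="level2"/></object>
<ul>
<li><object type="text/sitemap"><param name="Name" value="data2"/></object></li>
<li><object type="text/sitemap"><param name="Name" value="data3"/></object></li>
</ul>
</ul>
</ul>'''

print(extract_paths(sample_html))
# ['level1-data1', 'level1-level2-data2', 'level1-level2-data3']
```

A level whose `<ul>` is empty produces no path in this sketch, since only leaves are recorded.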