使用BeautifulSoup提取 元素并包含存储结构

问题描述 投票:0回答:1

最近,我正在研究如何提取结构的问题。但是我不怎么解决。我有一个这样的HTML报废:

<ul>
    <li><object type="text/sitemap"><param name="Name" value="level1"/)</object>
    <ul>
        <li><object type="text/sitemap"><param name="Name" value="data1"/></object></li>
        <li><object type="text/sitemap"><param name="Name" value="level2"/></object>
        <ul>
            <li><object type="text/sitemap"><param name="Name" value="data2"/></object></li>    
            <li><object type="text/sitemap"><param name="Name" value="data3"/></object></li>
        </ul>
    </ul>
</ul>

我想要这样获得所需的输出:

output = [level1 --- data1, level1 --- level2 ----data2, level1 ---- level2 ----data3]

我该怎么办?

beautifulsoup html-lists
1个回答
0
投票

我不知道您的HTML标签是否丢失。添加后,使用简化文档的运行结果如下

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<ul>
    <li><object type="text/sitemap"><param name="Name" value="level1"/></object>
    <ul>
        <li><object type="text/sitemap"><param name="Name" value="data1"/></object></li>
        <li><object type="text/sitemap"><param name="Name" value="level2"/></object>
        <ul>
            <li><object type="text/sitemap"><param name="Name" value="data2"/></object></li>    
            <li><object type="text/sitemap"><param name="Name" value="data3"/></object></li>
        </ul>
    </ul>
</ul>'''
def test(ul,l_name,lst):
  l1s = ul.children
  l1_len=len(l1s)
  for i in range(0,l1_len-1): # level1
    l1 = l1s[i]
    l1n = l1s[i+1]
    if l1.tag=='li':
      if l1n.tag=='ul':
        l_name = l_name+'-'+l1.param.value if l_name else l1.param.value
        test(l1n,l_name,lst)
      else:
        lst.append(l_name+'-'+l1.param.value)
    if i==l1_len-2:
      if l1n.tag=='li':
        lst.append(l_name+'-'+l1n.param.value)
doc = SimplifiedDoc()
doc.loadHtml(doc.replaceReg(html,'</object>[\s]*<ul','</object></li><ul'))
ul = doc.ul
lst = []
test(ul,'',lst)
print (lst)

结果:

['level1-data1', 'level1-level2-data2', 'level1-level2-data3']

您可以获得SimplifiedDoc here的示例>

© www.soinside.com 2019 - 2024. All rights reserved.