Parsing an HTML table with Python pandas to get hidden values and links


This is an excerpt of the page I am trying to parse with Python and pandas:

<!DOCTYPE html><html><head><title>website</title><link rel='stylesheet' type='text/css' href='css/global.css'><META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'></head><body><script src="analyticstracking.js"></script>

</h3><table class='gene'><tr><th>header1<br>info</th>
<th><a href='useful.php#cods'>header2</a><br>info</th><th><a href='useful.php#cods'>header3</a><br>info</th><th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header4</a><br><span class='td'>info</span></th>
<th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header5</a><br><span class='td'>info</span></th>
<th>header6<br>info</th><th>header7</th><th>header8</th><th><a href='useful.php'>header9<br>info</a></th></tr>



<tr class='even'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
    <td>stuff</td><td>stuff</td><td>stuff</td><td>stuff</td><td class='center'><span class='dm' title='some extra info'>stuff</span> </td><td>stuff</td><td><a href='http://www.link1' target=ref onclick="trackOutboundLink('http://www.link1'); return false;">link1</a><br><span class='td'><a href='http://www.link2' target=ref onclick="trackOutboundLink('http://www.link2'); return false;">link2</span><br><span class='td'><a href='http://www.link3' target=ref onclick="trackOutboundLink('http://www.link3'); return false;">link3</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span>  <span class='gen' title='extra_info2'>stuff2</span>   <a href='http://www.out' target='out' title='Link to out' onclick="trackOutboundLink('http://www.out'); return false;"><span class='dbs'>out</span></a> </td></tr>

<tr class='even'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value2'><input type='submit' value='value2'></form></td>
    <td>stuff2</td><td>stuff2</td><td>stuff2</td><td>stuff2</td><td class='center'><span class='dm' title='some extra info'>stuff2</span> </td><td>stuff2</td><td><a href='http://www.link4' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link4</a><br><span class='td'><a href='http://www.link5' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link5</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span>  <span class='gen' title='extra_info2'>stuff</span>   <a href='http://www.out2' target='out2' title='Link to out2' onclick="trackOutboundLink('http://www.out2'); return false;"><span class='dbs'>out2</span></a> </td></tr>

<tr class='odd'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value3'><input type='submit' value='value3'></form></td>
    <td>stuff3</td><td>stuff3</td><td>stuff3</td><td>stuff3</td><td class='center'><span class='dm' title='extrainfo'>stuff3</span> </td><td>stuff3</td><td><a href='http://www.link6' target=ref onclick="trackOutboundLink('http://www.link6'); return false;">link6</a></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff3</span>  <span class='gen' title='extra_info2'>stuff3</span>  </td></tr>

</table>

The table contains hidden values (in the header6 and header9 columns); hovering over them shows the following information:

screenshot

When I try pandas, I get the following:

import pandas as pd

# Read the saved page, then let pandas extract every <table> it finds
with open("/root/Downloads/adad.html", "r") as content_file:
    f = content_file.read()
dfs = pd.read_html(f)
dfs

screenshot2

I would like to get the following:

[   header1info    header2info    header3info    header4info    header5info    header6info         header7          header8                header9info
0   value1         stuff          stuff          stuff          stuff          stuff(extra_info)   stuff            link1(http://link1)    stuff(extra_info) stuff2(extra_info2) out(http://out)
                                                                                                                    link2(http://link2)
                                                                                                                    link3(http://link3)
1   value2         stuff2         stuff2         stuff2         stuff2         stuff2              stuff2           link4(http://link4)    stuff(extra_info) stuff(extra_info2) out2(http://out)
                                                                                                                    link5(http://link5)  
2   value3         stuff3        stuff3          stuff3         stuff3         stuff3              stuff3           link6(http://link6)    stuff3(extra_info) stuff3(extra_info2)]

Is this possible with pandas? If so, how do I get the desired output?

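Sorry, I am no expert in pandas, and I am not sure whether there is another way to parse this information. The only thing I could think of was splitting the lines and picking out the pieces I need, but you can imagine how laborious that would be...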

python pandas python-2.x
1 Answer

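Short answer: no. By default, pd.read_html keeps only the text of each cell, so attribute values such as title, href, and the value of the hidden input are dropped.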
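Outside pandas, the attributes the question asks for can be collected with an HTML parser directly. The following is a minimal sketch, assuming Python 3 and only the standard library's html.parser. The class name AnnotatedTableParser, the trimmed sample markup, and the text(attribute) output format are illustrative choices, not something pandas or the original page provides. It attaches each href or title to the next piece of text, so it also copes with the unclosed <a> tags in the excerpt.

```python
from html.parser import HTMLParser


class AnnotatedTableParser(HTMLParser):
    """Collect table cells, keeping hidden input values, titles and hrefs."""

    def __init__(self):
        super().__init__()
        self.headers = []         # text of the <th> cells
        self.rows = []            # one list of cell strings per data row
        self._row = None          # cells of the row being read
        self._cell = None         # text pieces of the cell being read
        self._is_header = False   # True while inside a <th>
        self._header_row = False  # True if the current row contains a <th>
        self._pending = None      # attribute value waiting for the next text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._row = []
            self._header_row = False
        elif tag in ("td", "th"):
            self._cell = []
            self._is_header = tag == "th"
            self._header_row = self._header_row or self._is_header
            self._pending = None
        elif self._cell is None or self._is_header:
            return  # ignore markup outside cells and inside header cells
        elif tag == "input" and attrs.get("type") == "hidden":
            self._cell.append(attrs.get("value", ""))
        elif tag == "a" and attrs.get("href"):
            self._pending = attrs["href"]
        elif tag == "span" and attrs.get("title"):
            self._pending = attrs["title"]

    def handle_data(self, data):
        text = data.strip()
        if self._cell is None or not text:
            return
        if self._pending is not None:
            # Attach the remembered href/title to this piece of text
            text = "{}({})".format(text, self._pending)
            self._pending = None
        self._cell.append(text)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            separator = "" if self._is_header else " "
            if self._row is not None:
                self._row.append(separator.join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._header_row:
                self.headers = self._row
            else:
                self.rows.append(self._row)
            self._row = None


# Trimmed-down sample modelled on the third row of the excerpt (assumption)
sample = """
<table class='gene'><tr><th>header1<br>info</th><th>header6<br>info</th><th>header8</th></tr>
<tr class='odd'><td class='center'><form action='get.php' method='GET'>
    <input type='hidden' name='acc' value='value3'><input type='submit' value='value3'></form></td>
<td class='center'><span class='dm' title='extrainfo'>stuff3</span> </td>
<td><a href='http://www.link6' target=ref>link6</a></td></tr>
</table>
"""

parser = AnnotatedTableParser()
parser.feed(sample)
parser.close()
print(parser.headers)
print(parser.rows)
```

If a DataFrame is still wanted, and every row has as many cells as there are headers, the result can be passed on with pd.DataFrame(parser.rows, columns=parser.headers).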
