Extracting table information from HTML (as a text file)


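I am trying to extract information from tables in an HTML file. I want to work with the file as text, because I can only access it through a VPN, so I have already downloaded all the HTML files I need.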

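Specifically, I want to get the information from every table that shares the same table class, but when I try, nothing is returned. I have attached the code I tried to use for this, without success.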

Below is the HTML I have been trying to extract information from. It is large, so I hope that is not a problem.

Table Information

<table class="region-table">
 <thead>
  <tr>
   <th>Region</th>
   <th>Type</th>
   <th>From</th>
   <th>To</th>
   <th colspan="2">Most similar known cluster</th>
   <th>Similarity</th>
  </tr>
 </thead>
 <tbody>
 <tr class="linked-row odd" data-anchor="#r1c1">
     
     
     <td class="regbutton NRPS-like r1c1">
      <a href="#r1c1">Region&nbsp;1.1</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps-like" target="_blank">NRPS-like</a>
     </td>
     <td class="digits">21,469</td>
     <td class="digits table-split-left">62,957</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001740/1" target="_blank">phthoxazolin</a></td>
      <td>NRP + Polyketide</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 4%, #ffffff00 4%)">4%</td>
     
   </tr>
 <tr class="linked-row even" data-anchor="#r1c2">
     
     
     <td class="regbutton NRPS r1c2">
      <a href="#r1c2">Region&nbsp;1.2</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
     </td>
     <td class="digits">74,163</td>
     <td class="digits table-split-left">124,963</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001709/1" target="_blank">nystatin</a></td>
      <td>Polyketide</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 10%, #ffffff00 10%)">10%</td>
     
   </tr>
 
 </tbody>
</table>
<table class="region-table">
 <thead>
  <tr>
   <th>Region</th>
   <th>Type</th>
   <th>From</th>
   <th>To</th>
   <th colspan="2">Most similar known cluster</th>
   <th>Similarity</th>
  </tr>
 </thead>
 <tbody>
 <tr class="linked-row odd" data-anchor="#r2c1">
     
     
     <td class="regbutton terpene r2c1">
      <a href="#r2c1">Region&nbsp;2.1</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#terpene" target="_blank">terpene</a>
     </td>
     <td class="digits">3,800</td>
     <td class="digits table-split-left">23,263</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001580/1" target="_blank">ebelactone</a></td>
      <td>Polyketide</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 5%, #ffffff00 5%)">5%</td>
     
   </tr>
 <tr class="linked-row even" data-anchor="#r2c2">
     
     
     <td class="regbutton NRPS-like r2c2">
      <a href="#r2c2">Region&nbsp;2.2</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps-like" target="_blank">NRPS-like</a>
     </td>
     <td class="digits">55,320</td>
     <td class="digits table-split-left">97,088</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0000727/1" target="_blank">indigoidine</a></td>
      <td>Saccharide</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 17%, #ffffff00 17%)">17%</td>
     
   </tr>
 <tr class="linked-row odd" data-anchor="#r2c3">
     
     
     <td class="regbutton NRPS r2c3">
      <a href="#r2c3">Region&nbsp;2.3</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
     </td>
     <td class="digits">144,740</td>
     <td class="digits table-split-left">193,599</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0000368/1" target="_blank">streptobactin</a></td>
      <td>NRP</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(210, 105, 30, 0.3), rgba(210, 105, 30, 0.3) 70%, #ffffff00 70%)">70%</td>
     
   </tr>
 <tr class="linked-row even" data-anchor="#r2c4">
     
     
     <td class="regbutton siderophore r2c4">
      <a href="#r2c4">Region&nbsp;2.4</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#siderophore" target="_blank">siderophore</a>
     </td>
     <td class="digits">347,862</td>
     <td class="digits table-split-left">362,833</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001593/1" target="_blank">ficellomycin</a></td>
      <td>NRP</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 3%, #ffffff00 3%)">3%</td>
     
   </tr>
 <tr class="linked-row odd" data-anchor="#r2c5">
     
     
     <td class="regbutton lassopeptide r2c5">
      <a href="#r2c5">Region&nbsp;2.5</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#lassopeptide" target="_blank">lassopeptide</a>
     </td>
     <td class="digits">548,017</td>
     <td class="digits table-split-left">570,561</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001435/1" target="_blank">ikarugamycin</a></td>
      <td>NRP + Polyketide:Iterative type I</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 12%, #ffffff00 12%)">12%</td>
     
   </tr>
 <tr class="linked-row even" data-anchor="#r2c6">
     
     
     <td class="regbutton NRPS r2c6">
      <a href="#r2c6">Region&nbsp;2.6</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
     </td>
     <td class="digits">628,834</td>
     <td class="digits table-split-left">683,050</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001117/1" target="_blank">himastatin</a></td>
      <td>NRP</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 12%, #ffffff00 12%)">12%</td>
     
   </tr>
 <tr class="linked-row odd" data-anchor="#r2c7">
     
     
         
     
     <td class="regbutton NRPS,terpene hybrid r2c7">
      <a href="#r2c7">Region&nbsp;2.7</a>
     </td>
     <td>
       
       <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>,<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#terpene" target="_blank">terpene</a>
     </td>
     <td class="digits">1,043,511</td>
     <td class="digits table-split-left">1,104,786</td>
     
      
        
      
      <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0002024/1" target="_blank">nargenicin</a></td>
      <td>Polyketide</td>
      <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 11%, #ffffff00 11%)">11%</td>
     
   </tr>
 
 </tbody>
</table>

Code snippet

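from bs4 import BeautifulSoup

# "index.html" is a placeholder for one of the downloaded files
with open("index.html", encoding="utf-8") as fh:
    html = fh.read()

soup = BeautifulSoup(html, "lxml")
# Note: find() returns only the first table with this class
gdp_table = soup.find("table", attrs={"class": "region-table"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # rows of the first table (2 here)
print("Extracted {num} rows".format(num=len(gdp_table_data)))
print(gdp_table_data[0])  # print first row
print(gdp_table_data[1])  # print second row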

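Ideally, I would like to take an HTML file as input, extract the information from all of the tables, merge it into one large table, and output it as CSV.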

python html beautifulsoup html-parsing
1 answer

Read the HTML data from the file, then append the extracted data to an output file.
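One way this could look is sketched below. It is not necessarily the code the answer originally had in mind. It assumes BeautifulSoup with the built-in `html.parser`, and that every table of interest carries the class `region-table`, as in the sample above. The function names `extract_region_rows` and `append_to_csv` are made up for illustration. The key difference from the snippet in the question is `find_all()`, which returns every matching table, whereas `find()` stops at the first one. A heading with `colspan="2"` is repeated so that the header has as many columns as the data rows.

```python
import csv
import os

from bs4 import BeautifulSoup


def extract_region_rows(html):
    """Return (header, rows) gathered from every table with class 'region-table'."""
    soup = BeautifulSoup(html, "html.parser")
    header, rows = None, []
    # find_all() returns every matching table; find() would return only the first.
    for table in soup.find_all("table", class_="region-table"):
        if header is None and table.thead is not None:
            header = []
            for th in table.thead.find_all("th"):
                # Repeat a heading that spans several columns (e.g. colspan="2").
                header.extend([th.get_text(strip=True)] * int(th.get("colspan", 1)))
        body = table.tbody if table.tbody is not None else table
        for tr in body.find_all("tr"):
            # &nbsp; is parsed as "\xa0"; turn it into a plain space.
            cells = [td.get_text(strip=True).replace("\xa0", " ")
                     for td in tr.find_all("td")]
            if cells:  # rows without <td> cells (header rows) are skipped
                rows.append(cells)
    return header, rows


def append_to_csv(html_path, csv_path):
    """Append all region rows from one HTML file to csv_path; return the row count."""
    with open(html_path, encoding="utf-8") as fh:
        header, rows = extract_region_rows(fh.read())
    # Write the header only when the CSV does not exist yet or is still empty.
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        if is_new and header:
            writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Calling `append_to_csv("index.html", "regions.csv")` once per downloaded file (both file names are placeholders) would merge all tables from all files into one CSV. Values such as `21,469` are kept as text, so the `csv` module quotes them; strip the commas first if numeric columns are needed.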
