How do I get all the elements of each href in an HTML file as rows and append them to a pandas DataFrame?


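I am trying to extract the values that belong to each href element in an HTML file. Each distinct href element in the file should produce one row, with one value for every field in the header list I provide.

This is the HTML file: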


  <tbody>
    
  <tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
  
</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">191.04

</span></td><td>191.32</td>
<td>194.51</td>
<td>193.92</td>
<td>191.01</td>
<td>380,544</td>
<td>73,122,008.42</td>
<td>2,793</td>
<td>-3.19</td><td>-1.64</td></tr><tr role="row" class="even">
  <td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a>
  </td>
  <td><span class="series">B</span>
  </td><td>03:20</td><td>
    <span class="">22.5</span></td><td>0</td>
    <td>22.5</td><td>0</td><td>0

    </td><td>3</td><td>67.20</td>
    <td>1</td><td>0</td><td>0</td></tr>
    <tr role="row" class="odd">
      <td class="sorting_1">
        <a href="/es/mercados/cotizacion/6096">ACTINVR</a></td>
      <td><span class="series">B</span></td><td>03:20</td><td>
        <span class="">15.13</span></td><td>0</td><td>15.13</td><td>0</td>
        <td>0</td><td>13</td><td>196.69</td><td>4</td><td>0</td>
        <td>0</td></tr><tr role="row" class="even"><td class="sorting_1">
          <a href="/es/mercados/cotizacion/339083">AGUA</a></td>
          <td><span class="series">*</span>
          </td><td>03:20</td><td>
            <span class="color-1">29</span>
          </td><td>28.98</td><td>28.09</td>
            <td>29</td><td>28</td><td>296,871</td>
            <td>8,491,144.74</td><td>2,104</td><td>0.89</td>
            <td>3.17</td></tr><tr role="row" class="odd"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/30">ALFA</a></td><td><span class="series">A</span></td>
              <td>03:20</td>
              <td><span class="color-2">13.48</span>
              </td><td>13.46</td>
              <td>13.53</td><td>13.62</td><td>13.32</td>
              <td>2,706,398</td>
              <td>36,494,913.42</td><td>7,206</td><td>-0.07</td>
              <td>-0.52</td>
            </tr><tr role="row" class="even"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/7684">ALPEK</a></td><td><span class="series">A</span>
              </td><td>03:20</td><td><span class="color-2">10.65</span>
            </td><td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td>
            <td>1,284,847</td><td>13,729,368.46</td><td>6,025</td><td>-0.34</td>
            <td>-3.10</td></tr><tr role="row" class="odd"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/1729">ALSEA</a></td><td><span class="series">*</span>
            </td><td>03:20</td><td><span class="color-2">65.08</span></td><td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td><td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td></tr>
            <tr role="row" class="even"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/424518">ALTERNA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td><td>0</td><td>1.5</td>
              <td>0</td><td>0</td><td>2</td><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr role="row" class="odd"><td class="sorting_1">
              <a href="/es/mercados/cotizacion/1862">AMX</a></td>
              <td><span class="series">B</span></td><td>03:20</td>
              <td><span class="color-2">14.56</span></td><td>14.58</td>
              <td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td>
              <td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td>
              <td>-0.75</td></tr><tr role="row" class="even">
                <td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a>
              </td><td><span class="series">10</span></td><td>03:20</td><td>
                <span class="color-2">21.09</span>
              </td><td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td>
              <td>51,005</td><td>1,076,281.67</td>
              <td>22</td><td>-0.34</td><td>-1.59</td></tr>
      </tbody>

My current code produces an empty DataFrame:

# create empty pandas dataframe
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers=['EMISORA', 'SERIE', 'HORA', 'ÚLTIMO', 'PPP', 'ANTERIOR', 'MÁXIMO', 'MÍNIMO', 'VOLUMEN', 'IMPORTE', 'OPS.', 'VAR PUNTOS', 'VAR %']
df = pd.DataFrame(columns=headers)

# fetch rows into pandas dataframe
# You can find children with multiple tags by passing a list of strings
rows = soup.find_all('tr', {"role":"row"})
#rows

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string

        #print("The value in this cell is %s" % value)

        # append row in dataframe

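I would like to know whether I can get a pandas DataFrame whose columns are the fields in the headers list and whose rows hold the values for each href element. Is it possible to create such a dataset?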

python html pandas web-scraping dataset
1 Answer

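You can use BeautifulSoup to parse the HTML and, for each table row, extract the anchor (its href and text) together with the values of the row's cells. Then use that information to build a pandas DataFrame.

Try this: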
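Neither the question's code nor the answer's code defines the `soup` object they use. A minimal sketch of one way to build it from a local file follows. The helper name `load_soup`, the file name `cotizaciones.html`, and the choice of the built-in `html.parser` backend are assumptions made for illustration, not details from the question.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path):
    """Parse a local HTML file into a BeautifulSoup tree (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        # "html.parser" ships with Python, so no extra parser is needed
        return BeautifulSoup(handle, "html.parser")


# Demo: write a one-row sample to a temporary file, then parse it.
sample = (
    '<tbody><tr role="row" class="odd"><td class="sorting_1">'
    '<a href="/es/mercados/cotizacion/1959">AC</a></td></tr></tbody>'
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "cotizaciones.html"  # hypothetical file name
    sample_path.write_text(sample, encoding="utf-8")
    soup = load_soup(sample_path)

# Collect the href of every anchor that has one
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

If the page is fetched over HTTP instead, the response text can be passed to `BeautifulSoup` in the same way.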

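# Collect one list of cell values per table row
data = []

# Find all rows
rows = soup.find_all('tr')

# Iterate over each row
for row in rows:
    # Find the anchor tag in the row; rows without one are skipped
    anchor = row.find('a')
    if anchor:
        # Extract the href and the text content
        href = anchor['href']
        text = anchor.text.strip()

        # Find all cells in the row
        cells = row.find_all('td')
        # Extract the cell values
        cell_values = [cell.text.strip() for cell in cells]

        # The first cell already holds the anchor text, so skip it here;
        # this keeps 13 values per row, matching the 13 headers
        row_data = [text, *cell_values[1:]]

        # Append the row data to the main data list
        data.append(row_data)

# Create the DataFrame
df = pd.DataFrame(data, columns=headers)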
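The approach above can be put together as a self-contained sketch. It is an illustration under stated assumptions rather than a definitive implementation: the HTML is a two-row excerpt of the table from the question, wrapped in a `<table>` element; the function name `rows_to_frame` and the extra `HREF` column are hypothetical additions; and rows whose cell count does not match the header list are simply skipped.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Two-row excerpt of the table from the question.
html = """
<table><tbody>
<tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">191.04
</span></td><td>191.32</td><td>194.51</td><td>193.92</td><td>191.01</td>
<td>380,544</td><td>73,122,008.42</td><td>2,793</td><td>-3.19</td><td>-1.64</td></tr>
<tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a></td>
<td><span class="series">B</span></td><td>03:20</td>
<td><span class="">22.5</span></td><td>0</td><td>22.5</td><td>0</td><td>0</td>
<td>3</td><td>67.20</td><td>1</td><td>0</td><td>0</td></tr>
</tbody></table>
"""

headers = ['EMISORA', 'SERIE', 'HORA', 'ÚLTIMO', 'PPP', 'ANTERIOR', 'MÁXIMO',
           'MÍNIMO', 'VOLUMEN', 'IMPORTE', 'OPS.', 'VAR PUNTOS', 'VAR %']


def rows_to_frame(html_text):
    """Build one DataFrame row per <tr role="row"> that contains an anchor."""
    soup = BeautifulSoup(html_text, "html.parser")
    records = []
    for row in soup.find_all("tr", {"role": "row"}):
        anchor = row.find("a")
        if anchor is None:
            continue  # not a data row
        # get_text(strip=True) drops the newlines and padding inside cells
        values = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(values) != len(headers):
            continue  # assumption: skip malformed rows instead of failing
        record = dict(zip(headers, values))
        record["HREF"] = anchor["href"]  # hypothetical extra column
        records.append(record)
    return pd.DataFrame(records, columns=headers + ["HREF"])


df = rows_to_frame(html)
print(df)
```

All values stay as strings here, including numbers with thousands separators such as `73,122,008.42`, so any numeric conversion would be a separate step.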
