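I am trying to get the different values that go with each href element in an HTML file. For each distinct href element in the HTML file, there should be one row whose values match each field in the headers provided.
This is the HTML file: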
<tbody>
<tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">191.04
</span></td><td>191.32</td>
<td>194.51</td>
<td>193.92</td>
<td>191.01</td>
<td>380,544</td>
<td>73,122,008.42</td>
<td>2,793</td>
<td>-3.19</td><td>-1.64</td></tr><tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a>
</td>
<td><span class="series">B</span>
</td><td>03:20</td><td>
<span class="">22.5</span></td><td>0</td>
<td>22.5</td><td>0</td><td>0
</td><td>3</td><td>67.20</td>
<td>1</td><td>0</td><td>0</td></tr>
<tr role="row" class="odd">
<td class="sorting_1">
<a href="/es/mercados/cotizacion/6096">ACTINVR</a></td>
<td><span class="series">B</span></td><td>03:20</td><td>
<span class="">15.13</span></td><td>0</td><td>15.13</td><td>0</td>
<td>0</td><td>13</td><td>196.69</td><td>4</td><td>0</td>
<td>0</td></tr><tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/339083">AGUA</a></td>
<td><span class="series">*</span>
</td><td>03:20</td><td>
<span class="color-1">29</span>
</td><td>28.98</td><td>28.09</td>
<td>29</td><td>28</td><td>296,871</td>
<td>8,491,144.74</td><td>2,104</td><td>0.89</td>
<td>3.17</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/30">ALFA</a></td><td><span class="series">A</span></td>
<td>03:20</td>
<td><span class="color-2">13.48</span>
</td><td>13.46</td>
<td>13.53</td><td>13.62</td><td>13.32</td>
<td>2,706,398</td>
<td>36,494,913.42</td><td>7,206</td><td>-0.07</td>
<td>-0.52</td>
</tr><tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/7684">ALPEK</a></td><td><span class="series">A</span>
</td><td>03:20</td><td><span class="color-2">10.65</span>
</td><td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td>
<td>1,284,847</td><td>13,729,368.46</td><td>6,025</td><td>-0.34</td>
<td>-3.10</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/1729">ALSEA</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">65.08</span></td><td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td><td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td></tr>
<tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/424518">ALTERNA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td><td>0</td><td>1.5</td>
<td>0</td><td>0</td><td>2</td><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/1862">AMX</a></td>
<td><span class="series">B</span></td><td>03:20</td>
<td><span class="color-2">14.56</span></td><td>14.58</td>
<td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td>
<td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td>
<td>-0.75</td></tr><tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a>
</td><td><span class="series">10</span></td><td>03:20</td><td>
<span class="color-2">21.09</span>
</td><td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td>
<td>51,005</td><td>1,076,281.67</td>
<td>22</td><td>-0.34</td><td>-1.59</td></tr>
</tbody>
My current code produces an empty dataframe:
# create empty pandas dataframe
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = ['EMISORA', 'SERIE', 'HORA', 'ÚLTIMO', 'PPP', 'ANTERIOR', 'MÁXIMO', 'MÍNIMO', 'VOLUMEN', 'IMPORTE', 'OPS.', 'VAR PUNTOS', 'VAR %']
df = pd.DataFrame(columns=headers)
# fetch rows into pandas dataframe
# You can find children with multiple tags by passing a list of strings
rows = soup.find_all('tr', {"role": "row"})
#rows
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        #print("The value in this cell is %s" % value)
    # append row in dataframe
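I would like to know whether it is possible to get a pandas dataframe whose columns are the ones in the headers list and whose rows come from each href element. Is it possible to build such a dataset?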
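You can use BeautifulSoup to parse the HTML and extract the necessary information from each row that contains an href attribute. Then use that information to build a pandas DataFrame.
Try this: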
data = []
rows = soup.find_all('tr')
for row in rows:
    # Find the anchor tag in the row
    anchor = row.find('a')
    if anchor:
        # Extract the href and the text content
        href = anchor['href']
        text = anchor.text.strip()
        # Find all cells in the row
        cells = row.find_all('td')
        # Extract the other cell values (the first cell only holds the anchor text)
        cell_values = [cell.text.strip() for cell in cells[1:]]
        # Combine all values into a single row
        row_data = [text, *cell_values]
        # Append the row data to the main data list
        data.append(row_data)
df = pd.DataFrame(data, columns=headers)
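To check the approach end to end, here is a minimal self-contained sketch. It assumes a shortened two-row copy of the table from the question, parsed with Python's built-in "html.parser". The extra HREF column is my own addition (it is not in the original headers list), included only to show how the link could be kept alongside the 13 fields.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened two-row sample in the same shape as the table in the question
html = """
<table><tbody>
<tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">191.04
</span></td><td>191.32</td><td>194.51</td><td>193.92</td><td>191.01</td>
<td>380,544</td><td>73,122,008.42</td><td>2,793</td><td>-3.19</td><td>-1.64</td></tr>
<tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a></td>
<td><span class="series">B</span></td><td>03:20</td><td><span class="">22.5</span></td>
<td>0</td><td>22.5</td><td>0</td><td>0</td><td>3</td><td>67.20</td><td>1</td><td>0</td><td>0</td></tr>
</tbody></table>
"""

headers = ['EMISORA', 'SERIE', 'HORA', 'ÚLTIMO', 'PPP', 'ANTERIOR', 'MÁXIMO',
           'MÍNIMO', 'VOLUMEN', 'IMPORTE', 'OPS.', 'VAR PUNTOS', 'VAR %']

soup = BeautifulSoup(html, "html.parser")

data = []
for row in soup.find_all("tr", {"role": "row"}):
    anchor = row.find("a")
    if anchor is None:
        continue  # skip rows without a link
    # get_text(strip=True) also removes the stray newlines inside the cells
    values = [td.get_text(strip=True) for td in row.find_all("td")]
    # Keep only rows whose cell count matches the headers
    if len(values) == len(headers):
        data.append(values + [anchor["href"]])

# "HREF" is a hypothetical extra column, not one of the original headers
df = pd.DataFrame(data, columns=headers + ["HREF"])
print(df[["EMISORA", "ÚLTIMO", "HREF"]])
```

All values come back as strings. Numbers such as 380,544 would still need their thousands separators removed before being converted to a numeric type.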