How to extract specific text and save it in a DataFrame

Problem description · Votes: 0 · Answers: 1

I'm new to web scraping. I managed to write the code below, but I can't get the text out of it.

Any help or suggestions?

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.tfrrs.org/results/xc/22268/_FURMAN_XC_INVITE'

header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
        }
resp = requests.get(url, headers=header)
soup = BeautifulSoup(resp.text, 'html.parser')

pl_team = []
atleta = []

tabla = soup.find_all("table")
for t in tabla:
  tbody = soup.find_all('tbody')
  for y in tbody:
    tr = soup.find_all('tr')
    for row in tr:
      pl_team.append(row.find('td'))
      atleta.append(row.find_all('a'))

df = pd.DataFrame({"Marca":pl_team, "atleta":atleta})
df = df.reset_index(drop=True)
df.to_csv('tffrs2.csv')
print (df)
python web beautifulsoup screen-scraping
1 Answer

Make your life easier and use .read_html from pandas.

You can get all the tables like this:

import requests
import pandas as pd

# read_html parses every <table> on the page and returns a list of DataFrames
tables = pd.read_html(
    requests.get("https://www.tfrrs.org/results/xc/22268/_FURMAN_XC_INVITE").text,
    flavor="bs4",
)
print(tables[0])

Sample output:

   PL                 Team Total Time Avg. Time  Score  ...   3   4   5     6     7
0   1               Furman    1:44:29     20:53     25  ...   5   6   7  12.0  15.0
1   2            Tennessee    1:46:13     21:14     42  ...   9  10  13  16.0  17.0
2   3              Clemson    1:50:31     22:06     80  ...  19  20  22  25.0  26.0
3   4  Charleston Southern    1:54:36     22:55    116  ...  27  31  33  34.0  35.0
4   5            Charlotte    1:56:15     23:15    128  ...  24  28  32  36.0  37.0
5   6             Montreat    2:01:51     24:22    177  ...  38  39  41  42.0  43.0
6   7    Southern Wesleyan    2:15:18     27:03    247  ...  51  52  53  55.0  60.0
7   8            Bob Jones    2:16:32     27:18    254  ...  50  54  57  58.0  64.0
8   9         Gardner-Webb    2:19:14     27:50    258  ...  49  61  62   NaN   NaN
9  10         USC-Beaufort    2:42:04     32:24    309  ...  63  65  66   NaN   NaN

[10 rows x 12 columns]
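
Since the goal in the question was a CSV file, you can index into that list and write the table you want with .to_csv. A minimal sketch, assuming (as the output above suggests) that tables[0] is the team summary; which index holds the individual athlete results depends on the page layout:

import requests
import pandas as pd

resp = requests.get("https://www.tfrrs.org/results/xc/22268/_FURMAN_XC_INVITE")
# one DataFrame per <table> on the page
tables = pd.read_html(resp.text, flavor="bs4")

# tables[0] is the team summary printed above; the individual results,
# if needed, sit at another index that depends on the page layout
tables[0].to_csv('tffrs2.csv', index=False)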