I'm new to web scraping. I managed to write this code, but I can't get the text out of the results.
Any help or suggestions?
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.tfrrs.org/results/xc/22268/_FURMAN_XC_INVITE'
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
resp = requests.get(url, headers=header)
soup = BeautifulSoup(resp.text, 'html.parser')
pl_team = []
atleta = []
tabla = soup.find_all("table")
for t in tabla:
    tbody = soup.find_all('tbody')
    for y in tbody:
        tr = soup.find_all('tr')
        for row in tr:
            pl_team.append(row.find('td'))
            atleta.append(row.find_all('a'))
df = pd.DataFrame({"Marca":pl_team, "atleta":atleta})
df = df.reset_index(drop=True)
df.to_csv('tffrs2.csv')
print (df)
Make your life easier and use .read_html from pandas.
You can get all the tables like this:
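from io import StringIO

import pandas as pd
import requests

url = "https://www.tfrrs.org/results/xc/22268/_FURMAN_XC_INVITE"
# read_html returns a list of DataFrames, one per <table> on the page.
# Wrapping the HTML in StringIO avoids the deprecation warning that
# newer pandas versions raise for literal HTML strings.
dfs = pd.read_html(StringIO(requests.get(url).text), flavor="bs4")
print(dfs[0])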
Sample output:
   PL                 Team  Total Time  Avg. Time  Score  ...   3   4   5     6     7
0   1               Furman     1:44:29      20:53     25  ...   5   6   7  12.0  15.0
1   2            Tennessee     1:46:13      21:14     42  ...   9  10  13  16.0  17.0
2   3              Clemson     1:50:31      22:06     80  ...  19  20  22  25.0  26.0
3   4  Charleston Southern     1:54:36      22:55    116  ...  27  31  33  34.0  35.0
4   5            Charlotte     1:56:15      23:15    128  ...  24  28  32  36.0  37.0
5   6             Montreat     2:01:51      24:22    177  ...  38  39  41  42.0  43.0
6   7    Southern Wesleyan     2:15:18      27:03    247  ...  51  52  53  55.0  60.0
7   8            Bob Jones     2:16:32      27:18    254  ...  50  54  57  58.0  64.0
8   9         Gardner-Webb     2:19:14      27:50    258  ...  49  61  62   NaN   NaN
9  10         USC-Beaufort     2:42:04      32:24    309  ...  63  65  66   NaN   NaN

[10 rows x 12 columns]
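If you would rather stay with BeautifulSoup, the likely reason the text is missing is that the lists in the question store Tag objects (row.find('td'), row.find_all('a')) instead of their text, and the nested loops search the whole soup again instead of the current table. Here is a minimal sketch of that fix. SAMPLE_HTML and extract_rows are made up for illustration, and the real page markup may well differ:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the results page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <thead><tr><th>PL</th><th>Team</th></tr></thead>
  <tbody>
    <tr><td>1</td><td><a href="/teams/furman">Furman</a></td></tr>
    <tr><td>2</td><td><a href="/teams/tennessee">Tennessee</a></td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return one dict per body row, holding text rather than Tag objects."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Search inside each table (table.find_all, not soup.find_all),
    # so the same rows are not collected once per table.
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            first_cell = row.find("td")
            if first_cell is None:  # header rows only contain <th> cells
                continue
            # get_text() pulls the string out of a Tag
            links = [a.get_text(strip=True) for a in row.find_all("a")]
            rows.append({
                "Marca": first_cell.get_text(strip=True),
                "atleta": ", ".join(links),
            })
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

From there, pd.DataFrame(rows) should give you a frame with plain strings that you can write out with to_csv as in the question.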