I'm trying to scrape part of a website because I want to move the data into Excel, where it's easier to manipulate.
The website is this link.
My code works fine for the first page of data, but as you can see the list spans several pages, and to reach those pages &page=#number of page
has to be appended to the address. I figure I can loop my code and append each page's rows to a pandas DataFrame, but I can't work out how to detect the last page.
Is this the right way to go about it when the data is split across several pages? Thanks for your help.
import requests
import pandas as pd
from bs4 import BeautifulSoup

pd.set_option('display.max_colwidth', -1)
pd.options.display.float_format = '{:,.2f}'.format

url = "https://www.boursorama.com/bourse/produits-de-bourse/levier/warrants/resultats?\
warrant_filter%5Bnature%5D=1&\
warrant_filter%5BunderlyingType%5D=&\
warrant_filter%5BunderlyingName%5D=TESLA&\
warrant_filter%5Bmaturity%5D=0&\
warrant_filter%5BdeltaMin%5D=&\
warrant_filter%5BdeltaMax%5D=&\
warrant_filter%5Bissuer%5D=&\
warrant_filter%5Bsearch%5D="

def parse_html_table(table):
    n_columns = 0
    n_rows = 0
    column_names = []

    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):
        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows += 1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)

        # Handle column names if we find them
        th_tags = row.find_all('th')
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())

    # Safeguard on Column Titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0, n_columns)
    df = pd.DataFrame(columns=columns, index=range(0, n_rows))

    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker, column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1

    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass

    return df

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
#import pdb; pdb.set_trace()
table = soup.find_all('table')[0]
df = parse_html_table(table)
df = df.replace({'\n': ''}, regex=True)
Why not grab the last pagination link (either the >>, or in your example's URL the 8) and extract the last page from its href attribute? Something like this:
pagination_links = soup.findAll("a", {"class" : "c-pagination__link"})
last_page = pagination_links[-1]['href'].split('page=')[-1]
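Once the last page number is known, you can loop over every page and stitch the frames together. A minimal sketch, assuming the last pagination link's href really does end in page=<last page> and that pages are addressed with &page=; it reuses url, soup and parse_html_table from your question:

pagination_links = soup.findAll("a", {"class": "c-pagination__link"})
last_page = int(pagination_links[-1]['href'].split('page=')[-1])

frames = []
for page in range(1, last_page + 1):
    # Fetch each page and parse its first table with your helper
    r = requests.get(url + '&page={}'.format(page))
    page_soup = BeautifulSoup(r.text, 'lxml')
    table = page_soup.find_all('table')[0]
    frames.append(parse_html_table(table))

all_pages = pd.concat(frames, ignore_index=True)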
Normally I would grab the last page number and loop over all the pages from it, but this site wouldn't let me get the last page, so this was done after checking the last page by hand. With pandas.read_html it's easy:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.boursorama.com/bourse/produits-de-bourse/levier/warrants/resultats?\
warrant_filter%5Bnature%5D=1&\
warrant_filter%5BunderlyingType%5D=&\
warrant_filter%5BunderlyingName%5D=TESLA&\
warrant_filter%5Bmaturity%5D=0&\
warrant_filter%5BdeltaMin%5D=&\
warrant_filter%5BdeltaMax%5D=&\
warrant_filter%5Bissuer%5D=&\
warrant_filter%5Bsearch%5D="
frames = []
for i in range(1, 20):  # pages 1 to 19, checked manually on the site
    r = requests.get(url + '&page={}'.format(i))
    df_list = pd.read_html(r.text)
    df = df_list[0]
    frames.append(df)
res = pd.concat(frames, ignore_index=True)
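Since the end goal is Excel, the combined frame can then be written out directly; this assumes openpyxl (or xlsxwriter) is installed, and the file name warrants.xlsx is just a placeholder:

# Export the concatenated result for manipulation in Excel
res.to_excel('warrants.xlsx', index=False)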