How do I scrape a table into a DataFrame using selenium / requests / beautifulsoup?

Question · votes: 0 · answers: 2

My goal: on the site https://data.eastmoney.com/executive/000001.html, when you scroll down you will find a large table (the 000001 table). I want to turn it into a DataFrame in Python. Is BeautifulSoup enough for this, or do I have to use Selenium?

Someone on Stack Overflow said that BeautifulSoup cannot scrape table data from the web, so I tried Selenium, with the following code:

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://data.eastmoney.com/executive/000001.html')
table_element = driver.find_element_by_xpath("//table")
item_element = table_element.find_element_by_xpath("//tr[2]/td[3]")
item_text = item_element.text
df = pd.DataFrame([item_text], columns=["Item"])
print(df)
driver.quit()

Here is the result:

Traceback (most recent call last):
  File "selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 85, in handle_data
    driver = webdriver.Chrome()
  File "selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

Basically it says "the chromedriver executable needs to be in PATH". The problem is that I am using an online backtesting platform called JoinQuant (www.joinquant.com), and none of the Python files (e.g. the file "selenium/webdriver/common/service.py") are local; they are not on my computer's disk drive. So this is complicated for Selenium. Do I have to use Selenium to scrape data like this from the internet and turn it into a DataFrame in Python? Or can I use something else, like BeautifulSoup? With BeautifulSoup, at least there is no "driver needs to be in PATH" problem.

As for BeautifulSoup, here is what I tried:

# Web crawler
# Send an HTTP request to get the page content
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.eastmoney.com/executive/000001.html'
response = requests.get(url)
html_content = response.text

# Check if the request is successful
if response.status_code == 200:
    # Parse the HTML with BeautifulSoup and locate the table
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find_all('table')
    # Acquire the rows and columns of the table
    rows = table.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        row_data = []
        for col in cols:
            row_data.append(col.text.strip())
        data.append(row_data)
else:
    print("Failed to Retrieve the Webpage.")

# Set up DataFrame
dataframe = pd.DataFrame(data)
# Print DataFrame
print(dataframe)

Here is the output:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 114, in handle_data
    rows = table.find_all('tr')
  File "bs4/element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

However, if you change

table = soup.find_all('table')

to

table = soup.find('table')

the result is as follows:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 114, in handle_data
    rows = table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'

To sum up: which one should I use, Selenium or BeautifulSoup? Or something else entirely? How should I solve this?

python selenium-webdriver web-scraping beautifulsoup selenium-chromedriver
2 Answers

0 votes

The main problem I see is that you are using a very old version of Selenium, and you have not specified the path to ChromeDriver (nor is it on your PATH), so Selenium cannot find it. If you update to a recent Selenium (at least 4.6), it includes a built-in browser manager, Selenium Manager, which takes care of downloading and configuring whatever driver you need.

If you do go that route, note that once you upgrade Selenium, the .find_element_by_*() methods, e.g. .find_element_by_xpath(), no longer work. You need to change them to the new syntax:

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, "//table")

As for your Beautiful Soup error... I have no experience with it, but looking at the stack trace and the error message, I think the problem is in these lines:

table = soup.find_all('table')
rows = table.find_all('tr')

table is a ResultSet (a collection of matches), and you called .find_all() on that collection, which does not work. If you change .find_all('table') to .find('table'), it should work. Updated code below:

table = soup.find('table')
rows = table.find_all('tr')
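The find() vs find_all() distinction can be checked offline. Below is a minimal, self-contained sketch against a stand-in HTML snippet (the table markup is invented purely for illustration). One caveat: on the real eastmoney page the table is rendered by JavaScript, so the HTML that requests receives contains no <table> at all, which is why soup.find('table') returned None there.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in HTML; this markup is invented purely for illustration.
html = """
<table>
  <tr><th>Name</th><th>Shares</th></tr>
  <tr><td>Alice</td><td>100</td></tr>
  <tr><td>Bob</td><td>200</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")  # ResultSet: a list of Tags, has no .find_all()
table = soup.find("table")       # a single Tag: .find_all() works on this

rows = []
for tr in table.find_all("tr"):
    cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
    rows.append(cells)

df = pd.DataFrame(rows[1:], columns=rows[0])  # first row as header
print(df)
```

Calling tables.find_all("tr") here would reproduce exactly the ResultSet AttributeError from the question.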

0 votes

There is no need to use selenium or beautifulsoup here; in my opinion the simplest and most direct way is to use the API the page itself pulls the data from:

URL:

https://datacenter-web.eastmoney.com/api/data/v1/get

Parameters:

reportName: RPT_EXECUTIVE_HOLD_DETAILS
columns: ALL
filter: (SECURITY_CODE="000001")
pageNumber: 1
pageSize: 100 #increase this to avoid paging
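The filter value contains parentheses and quotes, so it must be URL-encoded in the final request. Rather than hand-encoding the query string as in the example below, you can pass the parameters as a plain dict and let requests encode them (a sketch; the parameter names are taken from the list above):

```python
import requests

url = "https://datacenter-web.eastmoney.com/api/data/v1/get"
params = {
    "reportName": "RPT_EXECUTIVE_HOLD_DETAILS",
    "columns": "ALL",
    "filter": '(SECURITY_CODE="000001")',
    "pageNumber": 1,
    "pageSize": 100,  # increase this to avoid paging
}

# Build the request without sending it, to inspect the encoded URL.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)
# To actually fetch the data: requests.get(url, params=params).json()
```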
Example
import requests
import pandas as pd

pd.DataFrame(
    requests.get('https://datacenter-web.eastmoney.com/api/data/v1/get?reportName=RPT_EXECUTIVE_HOLD_DETAILS&columns=ALL&filter=(SECURITY_CODE%3D%22000001%22)&pageNumber=1&pageSize=30')\
        .json().get('result').get('data')
)
SECURITY_CODE DERIVE_SECURITY_CODE SECURITY_NAME CHANGE_DATE PERSON_NAME CHANGE_SHARES AVERAGE_PRICE CHANGE_AMOUNT CHANGE_REASON CHANGE_RATIO CHANGE_AFTER_HOLDNUM HOLD_TYPE DSE_PERSON_NAME POSITION_NAME PERSON_DSE_RELATION ORG_CODE GGEID BEGIN_HOLD_NUM END_HOLD_NUM
0 000001 000001.SZ 平安银行 2021-09-06 00:00:00 谢永林 26700 18.01 480867 竞价交易 0.0001 26700 A股 谢永林 董事 本人 10004085 173000004782302008 26700
1 000001 000001.SZ 平安银行 2021-09-06 00:00:00 项有志 4000 18.46 73840 竞价交易 0.0001 26000 A股 项有志 董事、副行长、首席财务官 本人 10004085 173000004782302010 26000
...
32 000001 000001.SZ 平安银行 2009-08-19 00:00:00 刘巧莉 46200 21.04 972048 竞价交易 0.0015 A股 马黎民 监事 10004085 140000000281406241
33 000001 000001.SZ 平安银行 2007-07-09 00:00:00 王魁芝 1600 27.9 44640 二级市场买卖 0.0001 7581 A股 王魁芝 监事 本人 10004085 173000001049726006 5981 7581
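One caveat on the chained .get() calls: if the request fails or "result" comes back null, .get('data') is called on None and raises AttributeError. A slightly more defensive variant, run offline here against the JSON shape assumed above ({"result": {"data": [...]}}; the row values are placeholders, not real API output):

```python
import pandas as pd

# Assumed response shape; field names follow the table above,
# row values are placeholders for illustration only.
payload = {
    "result": {
        "data": [
            {"SECURITY_CODE": "000001", "PERSON_NAME": "person_a",
             "CHANGE_DATE": "2021-09-06 00:00:00", "CHANGE_SHARES": 26700},
            {"SECURITY_CODE": "000001", "PERSON_NAME": "person_b",
             "CHANGE_DATE": "2021-09-06 00:00:00", "CHANGE_SHARES": 4000},
        ]
    }
}

result = (payload or {}).get("result") or {}
records = result.get("data") or []  # falls back to [] if the API returned nothing
df = pd.DataFrame(records)
print(df.shape)
```

With a real response you would replace payload with requests.get(...).json() and keep the same fallbacks.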