How do I build an Etherscan web scraper? [duplicate]

I am building a web scraper that continuously refreshes a batch of etherscan URLs every 30 seconds; whenever a new, previously unrecorded transfer occurs, it should send me an email notification with a link to the relevant address on etherscan so that I can check it out manually.

Here is one of the addresses I want to keep an eye on:

https://etherscan.io/token/0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd?a=0xd071f6e384cf271282fc37eb40456332307bb8af

What I have so far:

    from urllib.request import Request, urlopen
    from bs4 import BeautifulSoup as soup

    url = 'https://etherscan.io/token/0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd?a=0x94f52b6520804eced0accad7ccb93c73523af089'
    # I got this header from another post, since "uClient = uReq(URL)" and
    # "page_html = uClient.read()" would not work (I believe etherscan is
    # attempting to block web scraping or something?)
    req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
    with urlopen(req, timeout=20) as resp:  # context manager closes the connection
        response = resp.read()

    page_soup = soup(response, "html.parser")
    Transfers_info_table_1 = page_soup.find("div", {"class": "table-responsive"})
    print(Transfers_info_table_1)

Interestingly, when I run this I get the following output:

    <div class="table-responsive" style="visibility:hidden;">
        <iframe frameborder="0" id="tokentxnsiframe" scrolling="no" src="" style="width: 100px; height: 600px; min-width: 100%;"></iframe>
    </div>

I was expecting to get the entire transfers table as output. Where did I go wrong?
python-3.x web-scraping beautifulsoup web-crawler etherscan
1 Answer
The table sits inside an iframe, so the outer page you fetched does not contain it. Copy the iframe's src value and fetch that URL's content with a request instead.
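As a quick illustration (reusing page_soup from the question's code), the iframe's src attribute is empty in the statically fetched HTML because it is only filled in by JavaScript at runtime, which is why you have to copy the real URL from the browser's developer tools:

    iframe = page_soup.find("iframe", id="tokentxnsiframe")
    print(iframe.get("src"))  # prints "" -- the src is only set by JavaScript in a browser

With that URL in hand, fetch and parse it directly: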

    from urllib.request import Request, urlopen
    from bs4 import BeautifulSoup as soup
    import pandas as pd

    # This is the URL the iframe loads: etherscan's generic-tokentxns2
    # endpoint, with the same contract address and holder address.
    url = 'https://etherscan.io/token/generic-tokentxns2?m=normal&contractAddress=0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd&a=0xd071f6e384cf271282fc37eb40456332307bb8af'
    # A browser-like User-Agent, so the request is not blocked.
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'})
    with urlopen(req, timeout=20) as resp:
        response = resp.read()

    page_soup = soup(response, "html.parser")
    Transfers_info_table_1 = page_soup.find("table", {"class": "table table-md-text-normal table-hover mb-4"})

    # pandas parses the HTML table straight into a DataFrame.
    df = pd.read_html(str(Transfers_info_table_1))[0]
    df.to_csv("TransferTable.csv", index=False)
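Note that pd.read_html needs an HTML parser backend installed (lxml, or html5lib together with BeautifulSoup); without one it raises an ImportError.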

The resulting table is written to TransferTable.csv.

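The question also asks for a 30-second refresh loop with email alerts. Here is a rough, untested sketch of how that could sit on top of this approach; the SMTP host, the sender/recipient addresses, and the way the newest transaction hash is pulled out of the table row are all placeholder assumptions, not something from the original post:

    import smtplib
    import time
    from email.message import EmailMessage
    from urllib.request import Request, urlopen

    from bs4 import BeautifulSoup

    # The iframe URL from the answer above.
    IFRAME_URL = ('https://etherscan.io/token/generic-tokentxns2?m=normal'
                  '&contractAddress=0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd'
                  '&a=0xd071f6e384cf271282fc37eb40456332307bb8af')
    # The page to link to in the notification email.
    PAGE_URL = ('https://etherscan.io/token/0xd6a55c63865affd67e2fb9f284f87b7a9e5ff3bd'
                '?a=0xd071f6e384cf271282fc37eb40456332307bb8af')
    HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

    def latest_tx_hash():
        """Fetch the transfers table and return the newest row's tx hash."""
        req = Request(IFRAME_URL, headers=HEADERS)
        with urlopen(req, timeout=20) as resp:
            page = BeautifulSoup(resp.read(), 'html.parser')
        first_row = page.find('table').find('tbody').find('tr')
        # Assumes the first link in the newest row is the transaction hash;
        # adjust the selector to the actual table layout.
        return first_row.find('a').text.strip()

    def send_alert(tx_hash):
        """Email a notification -- SMTP host and addresses are placeholders."""
        msg = EmailMessage()
        msg['Subject'] = 'New transfer: ' + tx_hash
        msg['From'] = 'me@example.com'
        msg['To'] = 'me@example.com'
        msg.set_content('New transfer detected, check ' + PAGE_URL)
        with smtplib.SMTP('smtp.example.com') as server:
            server.send_message(msg)

    last_seen = None
    while True:
        newest = latest_tx_hash()
        if last_seen is not None and newest != last_seen:
            send_alert(newest)
        last_seen = newest
        time.sleep(30)  # refresh every 30 seconds, as the question asks

Keep in mind that polling etherscan this aggressively may get your IP blocked; for anything long-running, the official Etherscan API is probably a safer bet.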
