I am looking for a public list of the world's banks.
I don't need branch offices or full addresses - just names and websites. I'm thinking of data in XML, CSV, or similar, with these fields: bank name; country name or country code (two-letter ISO); website; optionally the city of the bank's headquarters. One record per bank per country it operates in. By the way: the small banks are especially interesting.
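To pin down the target format, something like this CSV layout - one record per bank and country. The rows here are invented placeholders, just to illustrate the schema:

```python
import pandas as pd

# Hypothetical example rows illustrating the desired schema
# (bank names, sites, and cities are invented placeholders)
rows = [
    {"Bank Name": "Example Bank AG", "Country Code": "CH",
     "Website": "https://example-bank.ch", "City": "Zuzwil"},
    {"Bank Name": "Example Bank AG", "Country Code": "LI",
     "Website": "https://example-bank.li", "City": "Vaduz"},
]
df = pd.DataFrame(rows)
csv_text = df.to_csv(index=False)
print(csv_text)
```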
I found a great page that is very, very comprehensive - look - it lists 9,000 European banks:
Browse it from A to Z:
**A**
https://thebanks.eu/search?bank=&country=Albania
https://thebanks.eu/search?bank=&country=Andorra
https://thebanks.eu/search?bank=&country=Anguilla
**B**
https://thebanks.eu/search?bank=&country=Belgium
**U**
https://thebanks.eu/search?bank=&country=Ukraine
https://thebanks.eu/search?bank=&country=United+Kingdom
Look at a detail page: https://thebanks.eu/banks/9563
I need this data - the contact details:
Mitteldorfstrasse 48, 9524, Zuzwil SG, Switzerland
071 944 15 51, 071 944 27 52
https://www.bankbiz.ch/
Approach: my plan is to use bs4, requests, and pandas.
btw: maybe we could simply count from 0 to 100,000 to reach every bank stored in the database, since the detail pages follow the pattern https://thebanks.eu/banks/9563.
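That ID-walk idea could be sketched like this. It is stdlib-only; the actual fetch loop is left as comments because it needs `requests`, a 404 check, and a polite delay between hits (`parse_detail_page` is a hypothetical parser, not defined here):

```python
import time

BASE_URL = "https://thebanks.eu/banks/{}"

def bank_urls(start=0, stop=100_000):
    """Yield candidate detail-page URLs for sequential bank IDs."""
    for bank_id in range(start, stop + 1):
        yield BASE_URL.format(bank_id)

# Example: the first three candidate URLs
first_three = [url for url, _ in zip(bank_urls(), range(3))]
print(first_three)

# The fetch loop itself (not run here) could look like:
#   for url in bank_urls():
#       resp = requests.get(url, timeout=10)
#       if resp.status_code == 200:         # skip gaps / deleted IDs
#           parse_detail_page(resp.content)  # hypothetical parser
#       time.sleep(1)  # throttle requests to be polite to the server
```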
I ran this on Colab:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Function to scrape bank data from my URL
def scrape_bank_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # here we try to find bank name, country, and website
    bank_name = soup.find("h1", class_="entry-title").text.strip()
    country = soup.find("span", class_="country-name").text.strip()
    website = soup.find("a", class_="site-url").text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    bank_links = soup.find_all("div", class_="search-bank")
    for bank_link in bank_links:
        bank_url = "https://thebanks.eu" + bank_link.find("a").get("href")
        bank_info = scrape_bank_data(bank_url)
        bank_data.append(bank_info)

# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)
Look at what came back:
Empty DataFrame
Columns: []
Index: []
It looks to me like something in the scraping process is going wrong. I tried a few different approaches and checked the elements on the page again and again, to make sure I'm extracting the right information.
I should also print some extra debug information to help diagnose the problem.
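One way to get that debug information: before trusting any selectors, check whether the fetched HTML actually contains the elements we expect. A small helper (a sketch, assuming BeautifulSoup; combine it with printing `response.status_code` and `response.text[:300]` - a Cloudflare challenge page shows up immediately there):

```python
from bs4 import BeautifulSoup

def report_selectors(html, selectors):
    """Count how many elements match each (tag, class) pair -- a quick
    check of whether the page really contains what the scraper expects."""
    soup = BeautifulSoup(html, "html.parser")
    return {f"{tag}.{cls}": len(soup.find_all(tag, class_=cls))
            for tag, cls in selectors}

# Tiny demo document standing in for a fetched page
sample = '<h1 class="entry-title">Demo Bank</h1><div class="search-bank"></div>'
counts = report_selectors(sample, [("h1", "entry-title"),
                                   ("div", "search-bank"),
                                   ("a", "site-url")])
print(counts)
```

A selector reporting 0 matches on the real page means the class name is wrong, or the real page never arrived.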
Update: Good evening, dear @Asish M. and @eternal_white - many thanks for your comments and for sharing your ideas (food for thought). As for Selenium - I think it's a good idea - and for running it (Selenium) on Google Colab I learned from Jacob Padilla @Jacob / @user:21216449 :: see Jacob's page: https://github.com/jpjacobpadilla and Google-Colab-Selenium: https://github.com/jpjacobpadilla/Google-Colab-Selenium with its default options:
The google-colab-selenium package is preconfigured with a set of default options optimized for Google Colab environments. These defaults include:
• --headless: Runs Chrome in headless mode (without a GUI).
• --no-sandbox: Disables the Chrome sandboxing feature, necessary in the Colab environment.
• --disable-dev-shm-usage: Prevents issues with limited shared memory in Docker containers.
• --lang=en: Sets the language to English.
Well, I think this approach is worth considering, so we could proceed like this:
Using Selenium in Google Colab to get past the Cloudflare blocking (which you mentioned, eternal_white) and scrape the data could be a viable approach. Here are some thoughts on a step-by-step setup using Jacob Padilla's google-colab-selenium package:
Install google-colab-selenium:
You can install the google-colab-selenium package using pip:
!pip install google-colab-selenium
We also need to install Selenium itself:
!pip install selenium
Import Necessary Libraries:
Import the required libraries in your Colab notebook:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from google.colab import output
import time
Then we set up the Selenium WebDriver, configuring the Chrome WebDriver with the necessary options:
# Set up options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)
Here we define the scraping function - a function that uses Selenium to collect the bank data:
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # first of all - we let the page load completely
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}
Then we can go scraping - now we use the function we defined to collect the data:
# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # hmm - we could add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# now we can iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))
# and now we can convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# Print the DataFrame
print(df)
And - everything in one single run:
# first of all we need to install the required packages - e.g. the packages for Jacob's Selenium approach:
!pip install google-colab-selenium
!apt-get update  # update Ubuntu so apt install runs correctly
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
# and afterwards we import all the necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # Add this option

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=chrome_options)

# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

# Close the WebDriver
driver.quit()
Look at what I got back - on Google Colab:
TypeError Traceback (most recent call last)
<ipython-input-4-76a7abf92dba> in <cell line: 21>()
19
20 # Create a new instance of the Chrome driver
---> 21 driver = webdriver.Chrome('chromedriver', options=chrome_options)
22
23 # Define function to scrape bank data using Selenium
TypeError: WebDriver.__init__() got multiple values for argument 'options'
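For reference: this TypeError comes from Selenium's changed constructor, not from the page. Selenium 4 dropped the old positional executable-path argument, so the first positional parameter of webdriver.Chrome() is now `options`, and passing 'chromedriver' positionally collides with the options keyword. A stdlib-only toy function with the same parameter shape reproduces the clash:

```python
# Mimics Selenium 4's WebDriver.__init__ parameter order: options first.
def chrome(options=None, service=None):
    return options

try:
    # 'chromedriver' binds to `options`, then options= is given again
    chrome('chromedriver', options='my-options')
except TypeError as err:
    print(err)  # ... got multiple values for argument 'options'
```

The fix, as the answers below use, is keyword-only: `driver = webdriver.Chrome(options=chrome_options)`, or with an explicit binary, `webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=chrome_options)` via `selenium.webdriver.chrome.service.Service`.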
The website is protected by Cloudflare, so it is best to go through a proxy to bypass it:
import requests
from bs4 import BeautifulSoup
from lxml import etree
import pandas as pd
from urllib.parse import urlencode
import json

# Get your own api_key from scrapeops or some other proxy vendor
API_KEY = "api_key"

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

# Function to scrape contact details from a bank detail-page URL
def scrape_bank_data(url):
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))
    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # The website link is inside an 'a' tag, hence these conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
        # The email is also inside an 'a' tag but comes back as an
        # email-protection URL instead of the address, so we take it
        # from the JSON script block instead
        if contact_str and contact_str.count("email") > 0:
            json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
            data_dict = json.loads(json_str)
            contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())
    return ", ".join(contact_details)

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))
    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")
    for bank in bank_details:
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)

# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)
Output:
Bank Name Country Website Contacts
0 Alpha Bank - Albania S.A. Albania https://thebanks.eu/banks/19331 Street of Kavaja, G - KAM Business Center, 2 f...
1 American Bank of Investments S.A. Albania https://thebanks.eu/banks/19332 Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2 Bank of Albania Albania https://thebanks.eu/banks/19343 Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3 Banka Kombetare Tregstare SH.A. Albania https://thebanks.eu/banks/19336 Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4 Credins Bank S.A. Albania https://thebanks.eu/banks/19333 Municipal Borough no. 5, street "Vaso Pasha", ...
5 First Investment Bank, Albania S.A. Albania https://thebanks.eu/banks/19334 Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6 Intesa Sanpaolo Bank Albania S.A. Albania https://thebanks.eu/banks/19335 Street “Ismail Qemali”, No. 27, Tirana, Albani...
7 OTP Bank Albania S.A Albania https://thebanks.eu/banks/19337 Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8 Procredit Bank S.A. Albania https://thebanks.eu/banks/19338 Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9 Raiffeisen Bank S.A. Albania https://thebanks.eu/banks/19339 Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10 Tirana Bank S.A. Albania https://thebanks.eu/banks/19340 Street, Tirana, Albania, 2269 616, 2233 417, h...
11 Union Bank S.A. Albania https://thebanks.eu/banks/19341 Blv. "Zogu I", 13 floor building, in front of ...
12 United Bank of Albania S.A. Albania https://thebanks.eu/banks/19342 Municipal Borough nr. 7, street, 1023, Tirana,...
If you only want to use Selenium, then a headless browser and undetected_chrome won't help here - both get blocked by Cloudflare. If you run it on your local machine with a local (visible) browser, it works fine.
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time
import json
from lxml import etree

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--headless')  # headless gets blocked by Cloudflare
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # Add this option
chrome_options.page_load_strategy = 'eager'

# Define function to scrape contact details using Selenium
def scrape_bank_data_with_selenium(url):
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    driver.quit()
    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # The website link is inside an 'a' tag, hence these conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
        # The email comes back as an email-protection URL instead of the
        # address, so we take it from the JSON script block instead
        if contact_str and contact_str.count("email") > 0:
            json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
            data_dict = json.loads(json_str)
            contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())
    return ", ".join(contact_details)

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    # Create a new instance of the Chrome driver
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    # Close the WebDriver
    driver.quit()
    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")
    for bank in bank_details:
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data_with_selenium(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)
        time.sleep(1)

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)
Output:
Bank Name Country Website Contacts
0 Alpha Bank - Albania S.A. Albania https://thebanks.eu/banks/19331 Street of Kavaja, G - KAM Business Center, 2 f...
1 American Bank of Investments S.A. Albania https://thebanks.eu/banks/19332 Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2 Bank of Albania Albania https://thebanks.eu/banks/19343 Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3 Banka Kombetare Tregstare SH.A. Albania https://thebanks.eu/banks/19336 Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4 Credins Bank S.A. Albania https://thebanks.eu/banks/19333 Municipal Borough no. 5, street "Vaso Pasha", ...
5 First Investment Bank, Albania S.A. Albania https://thebanks.eu/banks/19334 Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6 Intesa Sanpaolo Bank Albania S.A. Albania https://thebanks.eu/banks/19335 Street “Ismail Qemali”, No. 27, Tirana, Albani...
7 OTP Bank Albania S.A Albania https://thebanks.eu/banks/19337 Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8 Procredit Bank S.A. Albania https://thebanks.eu/banks/19338 Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9 Raiffeisen Bank S.A. Albania https://thebanks.eu/banks/19339 Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10 Tirana Bank S.A. Albania https://thebanks.eu/banks/19340 Street, Tirana, Albania, 2269 616, 2233 417, h...
11 Union Bank S.A. Albania https://thebanks.eu/banks/19341 Blv. "Zogu I", 13 floor building, in front of ...
12 United Bank of Albania S.A. Albania https://thebanks.eu/banks/19342 Municipal Borough nr. 7, street, 1023, Tirana,...