BeautifulSoup4 and Pandas return empty DataFrame columns: Update: now using Selenium on Google Colab

Problem description

I am looking for a public list of the banks of the world.

I do not need the branches and full addresses - just the name and the website. I am thinking of data... XML, CSV... with these fields: bank name, country name or country code (ISO two-letter), website; optional: the city where the bank is headquartered. One record per bank and country. By the way: especially the small banks are interesting.
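The target record could look roughly like this - a quick sketch, with field names borrowed from the scraping code below rather than from any existing file:

# Hypothetical target schema - one record per bank and country
record = {
    "Bank Name": "Example Bank AG",        # placeholder value
    "Country": "CH",                       # ISO 3166-1 alpha-2 code
    "Website": "https://www.example.ch",   # the bank's own site
    "City": "Zuzwil",                      # optional: headquarters city
}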

I found a great page, very very comprehensive - look - it has 9,000 banks across Europe:

Browsing it from A to Z:

https://thebanks.eu/search

**A**
https://thebanks.eu/search?bank=&country=Albania
https://thebanks.eu/search?bank=&country=Andorra
https://thebanks.eu/search?bank=&country=Anguilla

**B**
https://thebanks.eu/search?bank=&country=Belgium


**U** 
https://thebanks.eu/search?bank=&country=Ukraine
https://thebanks.eu/search?bank=&country=United+Kingdom

Look at a detail page: https://thebanks.eu/banks/9563

I need this data - the contact details:

Mitteldorfstrasse 48, 9524, Zuzwil SG, Switzerland
071 944 15 51, 071 944 27 52
https://www.bankbiz.ch/

Approach: my approach is to use bs4, requests and pandas.

btw: maybe we could count from 0 to 100,000 in order to fetch all the banks stored in the database:

See, for example, the detail page: https://thebanks.eu/banks/9563
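A minimal sketch of that counting idea - purely illustrative: the upper bound of 100,000 is only a guess, and the site may rate-limit or block such an enumeration (see the Cloudflare discussion further down):

import time
import requests

# walk through the numeric bank IDs and note which ones resolve
for bank_id in range(1, 100_000):
    url = f"https://thebanks.eu/banks/{bank_id}"
    response = requests.get(url)
    if response.status_code == 200:
        print("found:", url)
    time.sleep(1)  # be polite: roughly one request per second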

I run this on Colab:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape bank data from my URL
def scrape_bank_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # here we try to find bank name, country, and website
    bank_name = soup.find("h1", class_="entry-title").text.strip()
    country = soup.find("span", class_="country-name").text.strip()
    website = soup.find("a", class_="site-url").text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    #  we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    bank_links = soup.find_all("div", class_="search-bank")

    for bank_link in bank_links:
        bank_url = "https://thebanks.eu" + bank_link.find("a").get("href")
        bank_info = scrape_bank_data(bank_url)
        bank_data.append(bank_info)

#  and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)

Look at what is returned:

Empty DataFrame
Columns: []
Index: []

It seems to me that something goes wrong in the scraping process. I have tried a few different approaches and inspected the elements on the web page again and again to make sure I am extracting the right information from the page.

I should probably also print some additional debugging information to help diagnose the problem - for example along these lines:
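Here is a small self-contained debugging sketch - my own suggestion, not part of the original code - that shows whether the request succeeds at all, whether the expected elements are present, and whether a Cloudflare challenge page came back instead:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://thebanks.eu/search?bank=&country=Albania")
print("status code:", response.status_code)
print("content length:", len(response.content))

soup = BeautifulSoup(response.content, "html.parser")
print("page title:", soup.title.string if soup.title else None)
print("div.search-bank count:", len(soup.find_all("div", class_="search-bank")))
# a Cloudflare challenge page usually mentions itself in the HTML
print("cloudflare hint:", "cloudflare" in response.text.lower())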

Update: Good evening dear @Asish M. and @eternal_white - thank you very much for your comments and for sharing your ideas - food for thought! As for Selenium: I think this is a good idea - and for running it (Selenium) on Google Colab I learned from Jacob Padilla (@Jacob / @user:21216449): see Jacob's page https://github.com/jpjacobpadilla and Google-Colab-Selenium https://github.com/jpjacobpadilla/Google-Colab-Selenium, which comes with these default options:

The google-colab-selenium package is preconfigured with a set of default options optimized for Google Colab environments. These defaults include:
    • --headless: Runs Chrome in headless mode (without a GUI). 
    • --no-sandbox: Disables the Chrome sandboxing feature, necessary in the Colab environment. 
    • --disable-dev-shm-usage: Prevents issues with limited shared memory in Docker containers. 
    • --lang=en: Sets the language to English.
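If we go with that package, basic usage would presumably look like this (module and class names as shown in the project README; gcs.Chrome() applies the defaults listed above):

import google_colab_selenium as gcs

driver = gcs.Chrome()  # Chrome preconfigured with the Colab-friendly defaults
driver.get('https://thebanks.eu/search')
print(driver.title)
driver.quit()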

Well, I think this approach is worth considering - so we could do it like this:

Using Selenium in Google Colab to get past the Cloudflare blocking (which you mentioned, eternal_white) and scrape the required data could well be a feasible approach. Here are some thoughts on a step-by-step approach - and how to set it up with Jacob Padilla's google-colab-selenium package:

Install google-colab-selenium:
You can install the google-colab-selenium package using pip:


!pip install google-colab-selenium

We also need to install Selenium itself:


!pip install selenium

Import Necessary Libraries:
Import the required libraries in your Colab notebook:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from google.colab import output
import time

Then we need to set up the Selenium WebDriver and configure the Chrome driver with the necessary options:

# Set up options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)

Here we define the function used for scraping - a function that uses Selenium to collect the bank data:

def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # first of all - we let the page load completely
    
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}
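A side note: a possibly more robust alternative to the fixed time.sleep(5) would be an explicit wait that returns as soon as the element we need is actually present (the 10-second timeout is my own choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_bank_name(driver, timeout=10):
    # returns the element once it appears, or raises TimeoutException
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'entry-title'))
    )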

Then we can go and scrape the data - now we use the function defined above:

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # hmm - we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# now we can iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# and now we can convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

And - everything in one single shot:

# first of all we need to install all the required packages - for example the packages of Jacob's Selenium approach etc.:
!pip install google-colab-selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

# and afterwards we need to import all the necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # common workaround for "DevToolsActivePort" errors in containers

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=chrome_options)

# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

# Close the WebDriver
driver.quit()

Look at what I got back - on Google Colab:

TypeError                                 Traceback (most recent call last)

<ipython-input-4-76a7abf92dba> in <cell line: 21>()
     19 
     20 # Create a new instance of the Chrome driver
---> 21 driver = webdriver.Chrome('chromedriver', options=chrome_options)
     22 
     23 # Define function to scrape bank data using Selenium

TypeError: WebDriver.__init__() got multiple values for argument 'options'
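For what it's worth, this TypeError is a Selenium 4 symptom: webdriver.Chrome() no longer accepts the driver path as its first positional argument, so 'chromedriver' gets bound to the options parameter, which is then passed a second time via the keyword. A constructor call along the following lines should avoid it - assuming the chromedriver binary sits at /usr/bin/chromedriver, where the cp command above put it:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Selenium 4 style: the driver path goes into a Service object,
# never into a positional argument
service = Service('/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)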
Tags: python, web-scraping, beautifulsoup
1 Answer

The site is protected by Cloudflare, so it is better to bypass it with a proxy:

import requests
from bs4 import BeautifulSoup
from lxml import etree
import pandas as pd
from urllib.parse import urlencode
import json

# Get your own api_key from scrapeops or some other proxy vendor
API_KEY = "api_key"
def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

# Function to scrape bank data from my URL
def scrape_bank_data(url):
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))

    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # Web site link is inside 'a' tag hence using some conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
            # the email is available inside the 'a' tag, but it comes back as a Cloudflare email-protection URL instead of the address, hence we take it from the embedded JSON script
            if (contact_str and contact_str.count("email") > 0):
                json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
                data_dict = json.loads(json_str)
                contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())

    return ", ".join(contact_details)

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    #  we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))
    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")

    for bank in bank_details:
        bank_info = {}
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)

#  and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)

Output:

                              Bank Name  Country                          Website                                           Contacts
0             Alpha Bank - Albania S.A.  Albania  https://thebanks.eu/banks/19331  Street of Kavaja, G - KAM Business Center, 2 f...
1     American Bank of Investments S.A.  Albania  https://thebanks.eu/banks/19332  Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2                       Bank of Albania  Albania  https://thebanks.eu/banks/19343  Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3       Banka Kombetare Tregstare SH.A.  Albania  https://thebanks.eu/banks/19336  Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4                     Credins Bank S.A.  Albania  https://thebanks.eu/banks/19333  Municipal Borough no. 5, street "Vaso Pasha", ...
5   First Investment Bank, Albania S.A.  Albania  https://thebanks.eu/banks/19334  Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6     Intesa Sanpaolo Bank Albania S.A.  Albania  https://thebanks.eu/banks/19335  Street “Ismail Qemali”, No. 27, Tirana, Albani...
7                  OTP Bank Albania S.A  Albania  https://thebanks.eu/banks/19337  Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8                   Procredit Bank S.A.  Albania  https://thebanks.eu/banks/19338  Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9                  Raiffeisen Bank S.A.  Albania  https://thebanks.eu/banks/19339  Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10                     Tirana Bank S.A.  Albania  https://thebanks.eu/banks/19340  Street, Tirana, Albania, 2269 616, 2233 417, h...
11                      Union Bank S.A.  Albania  https://thebanks.eu/banks/19341  Blv. "Zogu I", 13 floor building, in front of ...
12          United Bank of Albania S.A.  Albania  https://thebanks.eu/banks/19342  Municipal Borough nr. 7, street, 1023, Tirana,...

If you only want to use Selenium, then a headless browser or undetected_chrome will not help here - both get blocked by Cloudflare. If you run it on your local PC with a local (non-headless) browser, it works fine:

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time
import json
from lxml import etree

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
#chrome_options.add_argument('--headless')  # deliberately disabled - headless Chrome gets blocked by Cloudflare
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # common workaround for "DevToolsActivePort" errors in containers
chrome_options.page_load_strategy = 'eager'


# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    driver.quit()

    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # Web site link is inside 'a' tag hence using some conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
            # the email is available inside the 'a' tag, but it comes back as a Cloudflare email-protection URL instead of the address, hence we take it from the embedded JSON script
            if (contact_str and contact_str.count("email") > 0):
                json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
                data_dict = json.loads(json_str)
                contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())

    return ", ".join(contact_details)

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    # Create a new instance of the Chrome driver
    driver = webdriver.Chrome(options=chrome_options)

    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    # Close the WebDriver
    driver.quit()

    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")
    for bank in bank_details:
        bank_info = {}
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data_with_selenium(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)
        time.sleep(1)

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

Output:

                              Bank Name  Country                          Website                                           Contacts
0             Alpha Bank - Albania S.A.  Albania  https://thebanks.eu/banks/19331  Street of Kavaja, G - KAM Business Center, 2 f...
1     American Bank of Investments S.A.  Albania  https://thebanks.eu/banks/19332  Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2                       Bank of Albania  Albania  https://thebanks.eu/banks/19343  Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3       Banka Kombetare Tregstare SH.A.  Albania  https://thebanks.eu/banks/19336  Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4                     Credins Bank S.A.  Albania  https://thebanks.eu/banks/19333  Municipal Borough no. 5, street "Vaso Pasha", ...
5   First Investment Bank, Albania S.A.  Albania  https://thebanks.eu/banks/19334  Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6     Intesa Sanpaolo Bank Albania S.A.  Albania  https://thebanks.eu/banks/19335  Street “Ismail Qemali”, No. 27, Tirana, Albani...
7                  OTP Bank Albania S.A  Albania  https://thebanks.eu/banks/19337  Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8                   Procredit Bank S.A.  Albania  https://thebanks.eu/banks/19338  Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9                  Raiffeisen Bank S.A.  Albania  https://thebanks.eu/banks/19339  Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10                     Tirana Bank S.A.  Albania  https://thebanks.eu/banks/19340  Street, Tirana, Albania, 2269 616, 2233 417, h...
11                      Union Bank S.A.  Albania  https://thebanks.eu/banks/19341  Blv. "Zogu I", 13 floor building, in front of ...
12          United Bank of Albania S.A.  Albania  https://thebanks.eu/banks/19342  Municipal Borough nr. 7, street, 1023, Tirana,...
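Since the goal stated at the top is a CSV or XML export, the resulting DataFrame can be written out directly - the file names here are just examples; DataFrame.to_xml() needs lxml, which this script already imports:

# persist the scraped data in the formats mentioned in the question
df.to_csv("banks_albania.csv", index=False)
df.to_xml("banks_albania.xml", index=False)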