Scraping a website that requires input with Python, using Selenium ChromeDriver (Chrome) and BeautifulSoup

Question

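I am trying to scrape the Western Union send-money website to get the current "Euro Blue" exchange rate against the Argentine peso. Western Union is the only company that gives you the real exchange rate, the one that is also traded on the street. If you are interested in how the parallel currency market in Argentina works, look up "Dollar Blue".

My goal is to get the current exchange rate from euros to Argentine pesos. When you open the site, you first have to click the "Accept" button and then type the name of the country you want to send money to; the exchange rate only appears after that step.

I first tried this with requests, but since it cannot handle JavaScript I switched to Selenium, and I am now quite close.

My code is as follows: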

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup

    WesternUnion = 'https://www.westernunion.com/de/en/web/send-money'

    # create a new Chrome session
    driver = webdriver.Chrome()
    driver.implicitly_wait(30)
    driver.get(WesternUnion)

    # accept the fraud warning
    python_button = driver.find_element(By.ID, 'button-fraud-warning-accept')
    python_button.click()

    time.sleep(0.25)
    # click the country field to focus it
    python_button = driver.find_element(By.ID, 'country')
    python_button.click()
    time.sleep(0.15)
    # type the destination country
    text_area = driver.find_element(By.ID, 'country')
    text_area.send_keys("Argentina")

    # parse the rendered page
    soup = BeautifulSoup(driver.page_source, 'lxml')

    div = soup.find('div', id="OptimusApp")
    div2 = soup.find('div', class_="text-center")

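The problem is that when I do this with Python (screenshot navigated automatic with python), the exchange rate is not shown, whereas when I go through exactly the same steps by hand (screenshot navigated by hand), the exchange rate is shown.

I am quite new to scraping and Python. Does anyone have a simple solution to this problem?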

python selenium google-chrome selenium-webdriver selenium-chromedriver
1 Answer

I took your code, made a few modifications, added several Chrome options, and on execution got the following result:

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    # Remove the enable-automation switch and the automation extension.
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    # Path to the local ChromeDriver binary, passed through a Service object (Selenium 4).
    driver = webdriver.Chrome(service=Service(r'C:\WebDrivers\chromedriver.exe'), options=options)
    driver.get('https://www.westernunion.com/de/en/web/send-money')
    # Accept the fraud warning once the button is clickable.
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#button-fraud-warning-accept"))).click()
    # Focus the country field and type the destination country.
    country_input = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#country")))
    country_input.click()
    country_input.send_keys("Argentina")
    # Print the text of the exchange-rate element once it is visible.
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span#smoExchangeRate"))).text)
    
  • Console output:

    1.00 EUR = Argentine Peso (ARS)
    
  • Observation: my result matches yours; the exchange rate is not displayed.
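To tell an empty label such as the console output above apart from a populated one, the text of the rate element can be checked before it is used. The following is only a sketch: the populated format ("1.00 EUR = 95.1234 Argentine Peso (ARS)") is an assumption, since only the empty form appears in this thread, and `parse_rate`, `wait_for_rate` and `read_label` are hypothetical names introduced for illustration.

```python
import re
import time
from typing import Callable, Optional

# Assumed populated label: "1.00 EUR = 95.1234 Argentine Peso (ARS)".
# The empty label seen in the console output has no number after "=".
RATE_PATTERN = re.compile(r"=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_rate(label: str) -> Optional[float]:
    """Return the number that follows '=' or None if the label has no rate."""
    match = RATE_PATTERN.search(label)
    if match is None:
        return None
    return float(match.group(1))


def wait_for_rate(read_label: Callable[[], str],
                  timeout: float = 5.0,
                  poll: float = 0.5) -> Optional[float]:
    """Call read_label repeatedly until a rate can be parsed or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        rate = parse_rate(read_label())
        if rate is not None:
            return rate
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

With Selenium, `read_label` could be something like `lambda: driver.find_element(By.CSS_SELECTOR, "span#smoExchangeRate").text`. If the site withholds the rate from automated sessions, the loop simply times out and returns `None`; it does not work around the blocking.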


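Deep dive

When inspecting the DOM tree of the web page, you will find that some of the <script> and <link> tags refer to JavaScript and CSS files whose paths contain the keyword dist. For example: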

  • <script src="/content/wucom/dist/2.7.1.8f57d9b1/js/smo-configs/smo-config.de.js"></script>
  • <link rel="stylesheet" type="text/css" href="/content/wucom/dist/2.7.1.8f57d9b1/css/responsive_css.min.css">
  • <link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/liveform-web-vendor-f84dfc85d6.css">
  • <link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/kampyle/liveform-web-style-a4ce961d15.css">
  • <script src="https://nebula-cdn.kampyle.com/resources/dist/assets/js/liveform-web-vendor-919a2c71c3.js"></script>
  • <script src="https://nebula-cdn.kampyle.com/resources/dist/assets/js/liveform-web-app-2c4e3adeb6.js"></script>
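This indicates that the website is protected by the bot-management service provider Distil Networks, and that navigation by ChromeDriver is being detected and subsequently blocked.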
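The manual DOM inspection described above can be approximated with the standard library alone. This is a minimal sketch, not a detection tool: `DistReferenceFinder` and `find_dist_references` are hypothetical names, and the check only looks for a `/dist/` path segment in the `src` or `href` of `<script>` and `<link>` tags. Because `dist` is also a common name for build-output directories, a match is a hint to look closer rather than proof of any particular vendor.

```python
from html.parser import HTMLParser
from typing import List


class DistReferenceFinder(HTMLParser):
    """Collect src/href values of <script> and <link> tags containing '/dist/'."""

    def __init__(self) -> None:
        super().__init__()
        self.matches: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only script and link tags are of interest here.
        if tag not in ("script", "link"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value and "/dist/" in value:
                self.matches.append(value)


def find_dist_references(html: str) -> List[str]:
    """Return all matching resource URLs in document order."""
    finder = DistReferenceFinder()
    finder.feed(html)
    finder.close()
    return finder.matches
```

With Selenium, the input could be `driver.page_source`; the function itself needs no browser.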

Distil

According to the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content-scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


