I am working on a Python web-scraping project. The site I am trying to pull data from contains information about all the medicines sold in India, and it requires users to log in before that information can be accessed.
I want to visit all of the links on this URL https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand and store them in an array.
Here is my code for logging in to the site:
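Once the page source is in hand, collecting every link into an array needs no extra dependencies. Here is a minimal sketch using only the standard library's `html.parser`; the sample HTML below is made up, since the real markup is only available after logging in:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag into self.links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for the real page source fetched after logging in
sample_html = """
<ul class="drug-list">
  <li><a href="/india/drug/info/abacavir">Abacavir</a></li>
  <li><a href="/india/drug/info/abciximab">Abciximab</a></li>
</ul>
"""
parser = LinkCollector()
parser.feed(sample_html)
print(parser.links)  # -> ['/india/drug/info/abacavir', '/india/drug/info/abciximab']
```

The same parser can be fed whatever `read()` or `page_source` returns from either method below.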
##################################### Method 1
import mechanize
import http.cookiejar as cookielib
from bs4 import BeautifulSoup
import html2text
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
br.open('https://sso.mims.com/Account/SignIn')
# View available forms
for f in br.forms():
    print(f)
br.select_form(nr=0)
# User credentials
br.form['EmailAddress'] = '<USERNAME>'  # replace with your email address
br.form['Password'] = '<PASSWORD>'      # replace with your password
# Login
br.submit()
print(br.open('https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand').read())
The problem is that after I submit the credentials, an intermediate page pops up with the following message:
You will be redirected to your destination shortly.
This page submits a hidden form before showing the final page that I actually want to reach. But br.open('https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand').read()
fetches the intermediate page and prints that instead.
How can I wait for the intermediate page to submit its hidden form, and then access the contents of the final page?
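The intermediate page usually contains nothing but an auto-submitting form (a browser's JavaScript fires it; mechanize does not), so one approach is to select and submit that form yourself. As a sketch, the hidden form's action and fields can be inspected with only the standard library; the form action and field names here are invented for illustration, the real ones will differ:

```python
from html.parser import HTMLParser

class HiddenFormParser(HTMLParser):
    """Records the form action and all hidden <input> name/value pairs."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden":
            self.fields[attrs.get("name")] = attrs.get("value", "")

# Hypothetical intermediate page; the real action and field names will differ
intermediate = """
<form method="post" action="https://mims.com/Account/Callback">
  <input type="hidden" name="code" value="abc123">
  <input type="hidden" name="state" value="xyz">
</form>
<p>You will be redirected to your destination shortly.</p>
"""
p = HiddenFormParser()
p.feed(intermediate)
print(p.action, p.fields)

# In mechanize itself the same step collapses to selecting and
# submitting the intermediate page's only form:
#   br.select_form(nr=0)   # the auto-submitting hidden form
#   br.submit()            # follow it through to the final page
```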
##################################### Method 2
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from time import sleep
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 10)
driver.maximize_window()
driver.get("https://sso.mims.com/")
el = wait.until(EC.element_to_be_clickable((By.ID, "EmailAddress")))
el.send_keys("[email protected]")
el = wait.until(EC.element_to_be_clickable((By.ID, "Password")))
el.send_keys("password")
el = wait.until(EC.element_to_be_clickable((By.ID, "btnSubmit")))
el.click()
wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "profile-section-header"))) # we logged in successfully
driver.get("http://mims.com/india/browse/alphabet/a?cat=drug")
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "searchicon")))
print(driver.page_source)
# do what you need with the source code