How do I use Selenium and Google Chrome to scrape a website?
And what about virtualenv? Is it required? Why use it, or why not?
# Install Google Chrome
wget -c https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
apt-get -f install
# Install Selenium
apt-get install python-dev python-pip
pip install selenium
# selenium_scrape.py
A simple script to check that it works:
import time
from selenium import webdriver
driver = webdriver.Chrome()
time.sleep(5)
driver.quit()
# Command
python selenium_scrape.py
# Error
Traceback (most recent call last):
  File "selenium_scrape.py", line 4, in <module>
    driver = webdriver.Chrome('/lib/modules/3.16.0-4-amd64/kernel/drivers/platform/chrome')
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/chrome/webdriver.py", line 61, in __init__
    self.service.start()
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/common/service.py", line 74, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chrome' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home
Exception AttributeError: "'Service' object has no attribute 'process'" in <bound method Service.__del__ of <selenium.webdriver.chrome.service.Service object at 0x7f88e9347190>> ignored
# Full script
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def init_driver():
    driver = webdriver.Chrome()
    # Attach a reusable 5-second explicit wait to the driver
    driver.wait = WebDriverWait(driver, 5)
    return driver

def lookup(driver, query):
    driver.get("http://www.google.com")
    try:
        box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, "q")))
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, "btnK")))
        box.send_keys(query)
        button.click()
    except TimeoutException:
        print("Box or Button not found in google.com")

if __name__ == "__main__":
    driver = init_driver()
    lookup(driver, "Selenium")
    time.sleep(5)
    driver.quit()
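The difference is that you can't use the packaged Chrome browser by itself; you need a special driver: chromedriver.
Get the latest version here: Chromedriver
You now have two options: either move the downloaded chromedriver somewhere it is always reachable (option 1), or tell your script where to find it (option 2).
For option 1, move it so that it can be found when you call webdriver.Chrome():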
sudo mv /path/to/download/chromedriver /usr/bin
Also make it executable:
sudo chmod a+x /usr/bin/chromedriver
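One way to confirm that the move and the chmod took effect is to ask Python whether chromedriver is both reachable and executable. This is only a minimal sketch using the standard library and assuming Python 3; the helper name `find_chromedriver` and its `extra_dirs` parameter are made up for illustration and are not part of Selenium.

```python
import os
import shutil


def find_chromedriver(extra_dirs=()):
    """Return the path of an executable chromedriver, or None.

    Hypothetical helper: searches the directories on PATH, followed by
    any directories passed in extra_dirs.
    """
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    dirs.extend(extra_dirs)
    # shutil.which only returns files that exist and are executable,
    # so a missing execute bit shows up here as None.
    return shutil.which("chromedriver", path=os.pathsep.join(dirs))


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver not found on PATH, or not executable")
    else:
        print("chromedriver found at", path)
```

If this reports that nothing was found even though the file is in /usr/bin, the execute permission is the first thing to check.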
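Or, for option 2, you can define the path in the script:
import os
from selenium import webdriver
chromedriver = "/Users/you/Downloads/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
# Older Selenium releases take the driver path as the first argument;
# Selenium 4 passes it via selenium.webdriver.chrome.service.Service instead.
driver = webdriver.Chrome(chromedriver)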
(Note: the original question was about Chrome, so my answer is about Chrome, not Firefox.)
For me, it worked once I simply extracted chromedriver into the same folder as the script.
I then run it like this:
Xvfb :99 -ac -screen 0 1280x1024x16 &
echo 'Starting the test'
PATH=$PATH:. python selenium_scrape.py
This starts Xvfb and adds the current directory, which contains chromedriver, to PATH.
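The same PATH change can also be made from inside the script instead of on the command line. The sketch below assumes it runs before webdriver.Chrome() is called, so that the driver lookup sees the updated PATH; the helper name `add_dir_to_path` is hypothetical.

```python
import os


def add_dir_to_path(directory):
    """Append directory to this process's PATH if it is not already listed.

    Hypothetical helper; returns the resulting PATH string.
    """
    directory = os.path.abspath(directory)
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.append(directory)
        os.environ["PATH"] = os.pathsep.join(entries)
    return os.environ["PATH"]
```

Calling `add_dir_to_path(os.getcwd())` at the top of the script is roughly equivalent to the `PATH=$PATH:.` prefix above, and it only affects the running Python process and its children.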
And here is the modified version that worked for me:
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# comment this out to run on the real display
os.environ['DISPLAY'] = ':99'

def init_driver():
    driver = webdriver.Chrome()
    driver.wait = WebDriverWait(driver, 5)
    return driver

def lookup(driver, query):
    driver.get("http://www.google.com")
    try:
        box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, "q")))
        # once we type the query, this button disappears
        # button = driver.wait.until(EC.element_to_be_clickable(
        #     (By.NAME, "btnK")))
        box.send_keys(query)
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, "btnG")))
        button.click()
    except TimeoutException:
        print("Box or Button not found in google.com")

if __name__ == "__main__":
    driver = init_driver()
    lookup(driver, "Selenium")
    time.sleep(5)
    driver.quit()
The question (as it currently stands) is about an indentation error. That is easy to fix:
def lookup(driver, query):
    driver.get("http://www.google.com")
    try:
        box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, "q")))
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, "btnK")))
        box.send_keys(query)
        button.click()
    except TimeoutException:
        print("Box or Button not found in google.com")
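Indentation problems like this can be caught before any browser starts, because Python reports them when it compiles the source. The following is an illustration only: `has_indentation_error` is a hypothetical helper, and the two snippets are shortened stand-ins for the function above.

```python
def has_indentation_error(source):
    """Return True if compiling the given source raises IndentationError."""
    try:
        compile(source, "<snippet>", "exec")
    except IndentationError:
        return True
    return False


# The function body is not indented, so Python rejects it
broken = "def lookup(driver, query):\ndriver.get('http://www.google.com')\n"
# The same code with the body indented under the def
fixed = "def lookup(driver, query):\n    driver.get('http://www.google.com')\n"

print(has_indentation_error(broken))  # True
print(has_indentation_error(fixed))   # False
```

For a whole file, running `python -m py_compile selenium_scrape.py` performs the same check from the command line.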