Headless Chrome and the "html.parser" string

Question (votes: 0, answers: 2)

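I'm currently using Selenium and BeautifulSoup to scrape a website, but I've run into two main problems. First, I can't get Chrome to start in headless mode; it reports several "unexpected end of input" errors (photo of said errors). Second, I keep getting an error on the line containing "html.parser" saying that a 'str' object is not callable. Any advice on these issues would be greatly appreciated, thanks.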

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import urllib.request
import lxml
import html5lib
import time
from bs4 import BeautifulSoup

#config options
options = Options()
options.headless = True

# Set the URL you want to webscrape from
url = 'https://tokcount.com/?user=mrsam993'

# Connect to the URL
browser = webdriver.Chrome(options=options, executable_path='D:\chromedriver') #chrome_options=options
browser.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(browser.page_source(), "html.parser")
browser.quit()

# for i in range(10):
links = soup.findAll('span', class_= 'odometer-value')
print(links)
python selenium web-scraping beautifulsoup headless-browser
2 Answers
0
votes

For headless mode, you need to create the options like this:

from selenium import webdriver

options = webdriver.ChromeOptions()
...

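page_source is a property, not a method. It already returns the HTML as a str, so adding parentheses tries to call that string. Remove the parentheses: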

browser.page_source
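
To see why the parentheses trigger "'str' object is not callable", here is a minimal sketch that needs no browser. FakeBrowser is a hypothetical stand-in written for this illustration, not a Selenium class; it only mimics the fact that page_source is a property that returns a string.

```python
class FakeBrowser:
    """Hypothetical stand-in for a WebDriver (assumption, not part of Selenium)."""

    @property
    def page_source(self):
        # A property is read like an attribute and returns its value directly
        return "<html><body>hello</body></html>"


browser = FakeBrowser()

# Correct: attribute access yields the HTML string
html = browser.page_source
print(type(html).__name__)

# Incorrect: the property returns a str, and the trailing () then tries to call it
try:
    browser.page_source()
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The same reasoning applies to the line in the question: the error comes from page_source(), not from the "html.parser" argument that happens to sit on the same line.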

0
votes

To start Chrome in headless mode and parse the page content as HTML with BeautifulSoup4, you can do this:

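# Import the necessary packages
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://tokcount.com/?user=mrsam993'

options = webdriver.ChromeOptions()
# Pass the headless flag as an argument; recent Selenium releases
# no longer support setting options.headless = True
options.add_argument('--headless')

# The context manager quits the driver automatically when the block exits
with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) as driver:
    driver.get(url)

    print("Page URL: ", driver.current_url)
    print("Page title: ", driver.title)

    # Get the page source (a property, so no parentheses)
    html = driver.page_source

# Parse the HTML with the built-in "html.parser" backend
parsed_content = BeautifulSoup(html, 'html.parser')
print(parsed_content)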
 

Make sure you have the following packages installed: Selenium, webdriver-manager, and BeautifulSoup4.

pip install selenium
pip install webdriver_manager
pip install beautifulsoup4
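
Once the HTML is available as a string, the extraction step from the question can be tried without a browser. The sketch below assumes the page contains span elements with the class odometer-value, as the question's code suggests. SAMPLE_HTML and extract_odometer_digits are hypothetical names made up for this example, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for driver.page_source (assumption for illustration)
SAMPLE_HTML = """
<div class="odometer">
  <span class="odometer-value">1</span>
  <span class="odometer-value">2</span>
  <span class="odometer-value">3</span>
</div>
"""


def extract_odometer_digits(html):
    """Return the text of every <span class="odometer-value"> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", class_="odometer-value")
    return [span.get_text(strip=True) for span in spans]


digits = extract_odometer_digits(SAMPLE_HTML)
print(digits)
print("".join(digits))
```

In the real script, html from driver.page_source would take the place of SAMPLE_HTML. find_all is the current spelling of the findAll call used in the question.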