Scraping sale prices from a grocery store - am I on the right track, or is there an easier way?


I'm new to all of this, and this is my first real coding project, so forgive me if the answer is obvious :)

I'm trying to use BeautifulSoup to pull the items on sale from [my grocery store], but the href I need is buried. Ultimately, I'm after the simplest possible way to compare the items on sale against my recipe database so I can put a meal plan together automatically. I've spent a few days trying to learn how to scrape web pages, but most tutorials and questions cover sites with much simpler layouts.

My initial approach was to scrape the html with BeautifulSoup, the way most tutorials describe it, using the code below, but it couldn't access anything inside the <body>:

import requests
from bs4 import BeautifulSoup

# Grab the raw HTML of the deals page and parse it.
page = requests.get('https://www.realcanadiansuperstore.ca/deals/all?sort=relevance&category=27985').text
soup = BeautifulSoup(page, 'html.parser')

# CSS path down to the link inside the first product tile.
print(soup.select("li.product-tile-group__list__item:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(3) > div:nth-child(1) > h3:nth-child(1) > a:nth-child(1)"))

After some searching I found that the DOM tree needs to load before the part of the html I need becomes available, and that Selenium was my best bet. Now, after several more hours of troubleshooting, I've managed to get my code to navigate to the correct page (most of the time), and last night it even managed to scrape some html (not the right section, though; I think I've corrected that, but it hasn't run well enough since to tell...).

My current code looks like this:

import os
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.service import Service as FirefoxService
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from webdriver_manager.firefox import GeckoDriverManager

options = Options()
options.headless = True

service = FirefoxService(executable_path=GeckoDriverManager().install())
driver = webdriver.Firefox(service=service, options=options)
driver.maximize_window()
print("Headless=", options.headless)
driver.get("https://www.realcanadiansuperstore.ca/deals/all?sort=relevance&category=27985")
print("-Page launched")
print("Wait for page to load location selection and click Ontario")
ontarioButton = '/html/body/div[1]/div/div[6]/div[2]/div/div/ul/li[4]/button'
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, ontarioButton))).click()
print("-Ontario clicked")
print("Wait for page to load location entry and send city")
WebDriverWait(driver, 30).until(EC.invisibility_of_element_located((By.CLASS_NAME, 'region-selector--is-loading')))
WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="location-search__search__input"]'))).click()
WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="location-search__search__input"]'))).send_keys('Oshawa',
                                                                                                   Keys.RETURN)
print("-Sent Oshawa")
print("Wait until Gibb flyer is clickable")
privacyClose = '.lds__privacy-policy__btnClose'
privacyPolicy = WebDriverWait(driver, 200).until(EC.element_to_be_clickable((By.CSS_SELECTOR, privacyClose)))
if WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '/html/body/div[2]/div/div/button'))):
    print("Closing privacy policy")
    driver.implicitly_wait(5)
    privacyPolicy.click()
    print("-PP closed")

storeFlyer = '/html/body/div[1]/div/div[2]/main/div/div/div/div/div[2]/div[1]/div[1]/div/div[2]/button'
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, storeFlyer))).click()
print("-Gibb clicked")

foodButton = '/html/body/div[1]/div/div[2]/main/div/div/div/div/div[2]/div/div[1]/div/div/div/div[1]/div/div/ul/li[1]/button'
WebDriverWait(driver, 200).until(EC.element_to_be_clickable((By.XPATH, foodButton))).click()

os.system('clear')

print('ALL DEALS:')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a'))
driver.quit()

This works most of the time, but it sometimes gets hung up with:

Traceback (most recent call last):
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/SuperstoreScraper0.04.py", line 40, in <module>
    WebDriverWait(driver, 20000000).until(EC.element_to_be_clickable((By.XPATH, storeFlyer))).click()
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 81, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 740, in _execute
    return self._parent.execute(command, params)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 430, in execute
    self.error_handler.check_response(response)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementClickInterceptedException: Message: Element <button class="flyers-location-search-item__main__content__button"> is not clickable at point (483,666) because another element <div class="lds__privacy-policy__innerWrapper"> obscures it
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:183:5
ElementClickInterceptedError@chrome://remote/content/shared/webdriver/Errors.jsm:282:5
webdriverClickElement@chrome://remote/content/marionette/interaction.js:166:11
interaction.clickElement@chrome://remote/content/marionette/interaction.js:125:11
clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:203:24
receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:91:31

This is the error I've been trying to solve:

selenium.common.exceptions.ElementClickInterceptedException: Message: Element <button class="flyers-location-search-item__main__content__button"> is not clickable at point (483,666) because another element <div class="lds__privacy-policy__innerWrapper"> obscures it

since otherwise it gets thrown 100% of the time. The main problem I'm running into right now, though, is:

  File "/mnt/1TB/PythonProjects/SuperstoreScraper/SuperstoreScraper0.04.py", line 36, in <module>
    privacyPolicy.click()
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 81, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 740, in _execute
    return self._parent.execute(command, params)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 430, in execute
    self.error_handler.check_response(response)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message: Element <button class="lds__privacy-policy__btnClose" type="button"> could not be scrolled into view
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:183:5
ElementNotInteractableError@chrome://remote/content/shared/webdriver/Errors.jsm:293:5
webdriverClickElement@chrome://remote/content/marionette/interaction.js:156:11
interaction.clickElement@chrome://remote/content/marionette/interaction.js:125:11
clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:203:24
receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:91:31

I read somewhere that it needs to be clicked with JavaScript, and I keep seeing variations of this:

WebElement element = driver.findElement(By.xpath("//a[@href='itemDetail.php?id=19']"));    
JavascriptExecutor js = (JavascriptExecutor) driver;  
js.executeScript("arguments[0].scrollIntoView();",element);
element.click();

but JavascriptExecutor isn't recognized, and I'm having a hard time finding much more about what to do next, apart from this, from here:

"Selenium supports javaScriptExecutor. No extra plugin or add-on is needed. You just need to import (org.openqa.selenium.JavascriptExecutor) in the script in order to use JavaScriptExecutor."

but no variation seems to get JavascriptExecutor to do anything...
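For what it's worth, that snippet is Java; the Python bindings have no JavascriptExecutor class to import. The driver object itself exposes execute_script, so a rough Python equivalent would look like the sketch below, using the privacy-policy close button from the script above as the target (whether a JavaScript click actually dismisses that particular overlay is untested):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait for the button to exist in the DOM (it does not need to be "clickable"),
# then scroll it into view and click it from JavaScript.
close_btn = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.lds__privacy-policy__btnClose')))
driver.execute_script("arguments[0].scrollIntoView();", close_btn)
driver.execute_script("arguments[0].click();", close_btn)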

I've put off asking any questions because I enjoy the challenge of working things out, but I'm starting to feel like I'm missing something. Am I on the right track? Or is there an easier way to go about this? Thanks in advance!

P.S. Right before hitting post, I changed the wait time on line 36 from 20 to 20000000, and it still gave the same error after the same amount of time. Am I using WebDriverWait wrong?
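One note on that last point: the number passed to WebDriverWait only bounds how long .until() keeps polling before raising a TimeoutException; once the condition returns an element, any exception raised by the .click() that follows is unaffected by it, which would explain why 20000000 behaved exactly like 20. A sketch of the pattern, assuming the storeFlyer XPath and the imports from the script above:

from selenium.common.exceptions import TimeoutException

try:
    # The 20 here only limits how long .until() polls for the condition.
    flyer_btn = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.XPATH, storeFlyer)))
    flyer_btn.click()  # any exception raised here is not governed by that timeout
except TimeoutException:
    print('storeFlyer never became clickable within 20 seconds')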

python web-scraping beautifulsoup webdriverwait javascriptexecutor
2 Answers

0 votes

I'm working on the same project right now. By inspecting my local grocery store's flyer page, I found an exposed dictionary listing the items with prices, discounts, etc.

It's under: Network tab | Fetch/XHR | products.

I can access the file using the url, but I'm worried that the access_token sitting in the url may change with every request.

Hope this helps!
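To make that concrete, here is a rough sketch of what pulling such an endpoint with requests could look like; the URL, parameters, and token are placeholders standing in for whatever actually shows up in the Network tab, not the store's real API:

import requests

# Placeholders: copy the real URL and query string from DevTools
# (Network tab -> Fetch/XHR -> the "products" request).
url = 'https://example.com/api/products'
params = {'storeId': '1234', 'access_token': '<token copied from the request>'}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()  # the exposed dictionary of items, prices, discounts, etc.
print(list(data)[:10])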


0 votes

I would suggest avoiding Selenium in this case. As Antoine suggested, you can use the inspect element feature to check whether there is an API exposed.

I think what happens when you scroll to the bottom is that the web page makes a request to the backend for more data. As Antoine suggested, you can mimic this request. Use Inspect Element on the web page, navigate to Network, then to the response. Scroll to the bottom of the page so it loads, and you will see some new requests come in.

From here on I'd recommend John Watson Rooney's videos. https://www.youtube.com/watch?v=DqtlR0y0suo
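As a rough illustration of mimicking that scroll-triggered request: find it in the Network tab, copy its URL, headers and pagination fields, and replay it with requests. Everything below (endpoint, header values, parameter names, response shape) is a placeholder for whatever DevTools actually shows:

import requests

url = 'https://example.com/api/deals'  # placeholder for the real endpoint
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json'}  # copy the real headers from DevTools

items = []
for page in range(1, 4):  # walk the first few pages of results
    resp = requests.get(url, headers=headers,
                        params={'page': page, 'pageSize': 48}, timeout=30)
    resp.raise_for_status()
    # assuming the JSON response carries a 'results' list; adjust to the real shape
    items.extend(resp.json().get('results', []))

print(len(items), 'items collected')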
