Selenium loads the page but does not print all of the HTML

Question  Votes: 1  Answers: 2

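I am trying to use Python and Selenium to scrape dynamically loaded data from a website. The problem is that only about half of the data is reported as present, when in fact all of it should be. Pausing before printing the page content, or simply looking elements up by class, does not seem to help. The site's URL is https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909. As you can see, there are 13 main sections, but I can only retrieve data from the first four games. To show the problem as clearly as possible, I am attaching the code I use to print the inner HTML of the whole page, which shows the difference between the loaded and unloaded data.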

from selenium import webdriver
import requests

url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
print(driver.execute_script("return document.documentElement.innerText;"))

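Edit: The problem is not the wait time, because I ran the code line by line and waited for the page to load completely. The issue seems to come down to Selenium not picking up all of the JS-loaded text on the page, as shown by the console output in the answer below.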
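To make "loaded" versus "not loaded" checkable rather than eyeballed, the printed innerText can be split per game and each chunk tested for its odds block. The following is only a sketch: the helper name `split_games`, the regular expression, and the shortened `sample` string are assumptions based on the console output quoted in the answer, and the real innerText may use different whitespace.

```python
import re

# Matches a game header such as "01:00 PM / Pittsburgh vs Cleveland".
GAME_HEADER = re.compile(r"\d{2}:\d{2}\s[AP]M\s/\s")

def split_games(inner_text):
    """Return (loaded, missing) lists of per-game text chunks.

    A chunk counts as loaded when it contains the "Current Line" label,
    which only appears for games whose odds data was rendered.
    """
    starts = [m.start() for m in GAME_HEADER.finditer(inner_text)]
    ends = starts[1:] + [len(inner_text)]
    loaded, missing = [], []
    for begin, end in zip(starts, ends):
        chunk = inner_text[begin:end].strip()
        (loaded if "Current Line" in chunk else missing).append(chunk)
    return loaded, missing

# Shortened, hypothetical excerpt in the shape of the quoted console output.
sample = (
    "01:00 PM / Pittsburgh vs Cleveland 453 Pittsburgh 454 Cleveland "
    "Current Line -3½+105 +3½-115 Wagers Placed 10040 54.07% 8530 45.93% "
    "01:00 PM / Jacksonville vs N.Y. Giants 461 Jacksonville 462 N.Y. Giants "
    "01:00 PM / Tampa Bay vs New Orleans 463 Tampa Bay 464 New Orleans"
)
loaded, missing = split_games(sample)
print(len(loaded), "loaded,", len(missing), "missing")
```

In a real run, `sample` would be replaced by the string returned from `driver.execute_script(...)`; the last chunk would then also contain the page footer, which does not affect the check.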

python selenium selenium-webdriver web-scraping webdriverwait
2 Answers
1
vote

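@sudonym's analysis points in the right direction. Before trying to extract the text with the execute_script() method, you need to induce WebDriverWait so that the desired elements are visible, as follows: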

  • Code block:

    # -*- coding: UTF-8 -*-
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By

    url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
    driver = webdriver.Chrome()
    driver.get(url)
    # Wait up to 30 seconds until the matched spans are visible
    WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[contains(.,'USA - National Football League')]//following::section//span[3]")))
    print(driver.execute_script("return document.documentElement.innerText;"))

  • Console output:

    SPORTSBOOK REVIEW Home Best Sportsbooks Rating Guide Blacklist Bonuses BETTING ODDS FREE PICKS Sports Picks NFL College Football NBA NCAAB MLB NHL More Sports How to Bet Tools FORUM Home Players Talk Sportsbooks & Industry Newbie Forum Handicapper Think Tank David Malinsky's Point Blank Service Plays Bitcoin Sports Betting NBA Betting NFL Betting NCAAF Betting MLB Betting NHL Betting CONTESTS EARN BETPOINTS What Are Betpoints? SBR Sportsbook SBR Casino SBR Racebook SBR Poker SBR Store Today NFL NBA NHL MLB College Football NCAA Basketball Soccer Soccer Odds Major League Soccer UEFA Champions League UEFA Nations League UEFA Europa League English Premier League World Cup 2022 Tennis Tennis Odds ATP WTA UFC Boxing More Sports CFL WNBA AFL Betting Odds/NFL Odds/Consensus TODAY | YESTERDAY | DATE ? Login ? Settings ? Bet Tracker ? Bet Card ? Favorites
    NFL Consensus for Sep 09, 2018 USA - National Football League Sunday Sep 09, 2018
    01:00 PM / Pittsburgh vs Cleveland 453 Pittsburgh 454 Cleveland Current Line -3½+105 +3½-115 Wagers Placed 10040 54.07% 8530 45.93% Amount Wagered $381,520.00 56.10% $298,550.00 43.90% Average Bet Size $38.00 $35.00 SBR Contest Best Bets 22 9
    01:00 PM / San Francisco vs Minnesota 455 San Francisco 456 Minnesota Current Line +6-102 -6-108 Wagers Placed 6250 41.25% 8900 58.75% Amount Wagered $175,000.00 29.50% $418,300.00 70.50% Average Bet Size $28.00 $47.00 SBR Contest Best Bets 5 19
    01:00 PM / Cincinnati vs Indianapolis 457 Cincinnati 458 Indianapolis Current Line -1-104 +1-106 Wagers Placed 11640 66.36% 5900 33.64% Amount Wagered $1,338,600.00 85.65% $224,200.00 14.35% Average Bet Size $115.00 $38.00 SBR Contest Best Bets 23 12
    01:00 PM / Buffalo vs Baltimore 459 Buffalo 460 Baltimore Current Line +7½-103 -7½-107 Wagers Placed 5220 33.83% 10210 66.17% Amount Wagered $78,300.00 16.79% $387,980.00 83.21% Average Bet Size $15.00 $38.00 SBR Contest Best Bets 5 17
    01:00 PM / Jacksonville vs N.Y. Giants 461 Jacksonville 462 N.Y. Giants
    01:00 PM / Tampa Bay vs New Orleans 463 Tampa Bay 464 New Orleans
    01:00 PM / Houston vs New England 465 Houston 466 New England
    01:00 PM / Tennessee vs Miami 467 Tennessee 468 Miami
    04:05 PM / Kansas City vs L.A. Chargers 469 Kansas City 470 L.A. Chargers
    04:25 PM / Seattle vs Denver 471 Seattle 472 Denver
    04:25 PM / Dallas vs Carolina 473 Dallas 474 Carolina
    04:25 PM / Washington vs Arizona 475 Washington 476 Arizona
    08:20 PM / Chicago vs Green Bay 477 Chicago 478 Green Bay
    Media Site Map Terms of use Contact Us Privacy Policy DMCA 18+. Gamble Responsibly. © Sportsbook Review. All Rights Reserved.
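For readers unfamiliar with what an explicit wait does: conceptually it polls a condition until the condition returns something truthy or a timeout expires. The sketch below illustrates that idea in plain Python so it runs without a browser. It is not Selenium's implementation, and `wait_until`, `all_sections_loaded`, and the simulated counts are hypothetical names and values chosen for illustration.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Simulated "number of game sections that contain data", growing per check.
counts = iter([4, 4, 9, 13])

def all_sections_loaded(expected=13):
    """Return the section count once it reaches `expected`, else False."""
    current = next(counts)
    return current if current >= expected else False

loaded_sections = wait_until(all_sections_loaded, timeout=5.0, poll=0.01)
print(loaded_sections)
```

With Selenium itself, `WebDriverWait(driver, timeout).until(...)` accepts any callable that takes the driver, so a condition of this shape (for example, one that counts matching elements) could be passed to it directly instead of the stand-in used here.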

1
vote

This solution is worth considering if you have many WebDriverWait calls and want to reduce run time; otherwise go with DebanjanB's approach.

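You need to wait some time for your HTML to load completely. You can also set a timeout for script execution. To add an unconditional wait to driver.get(URL) in Selenium, use driver.set_page_load_timeout(n), where n is the time in seconds, together with a loop: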

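import logging
import random
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

URL = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
n = 30                                   # timeout in seconds (example value)
driver = webdriver.Chrome()

driver.set_page_load_timeout(n)          # Set timeout of n seconds for page load
loading_finished = False                 # Flag: page not loaded yet
while not loading_finished:              # Repeat until the page has loaded
    try:
        sleep(random.uniform(0.1, 0.5))  # Wait some time
        driver.get(URL)                  # Try to load for n seconds
        loading_finished = True          # Set flag and exit while loop
        logger.info("website loaded")    # Indicate load success
    except TimeoutException:
        logger.warning("timeout - retry")  # Indicate load fail

driver.set_script_timeout(n)             # Set timeout of n seconds for script
script_finished = False                  # Flag: script not finished yet
while not script_finished:               # Second loop
    try:
        print(driver.execute_script("return document.documentElement.innerText;"))
        script_finished = True           # Set flag and exit while loop
        logger.info("script done")       # Indicate script done
    except TimeoutException:
        logger.warning("script timeout")
logger.info("if you're still missing html here, increase timeout")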
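One caveat with retry loops of this kind is that they never give up if the page keeps timing out. A bounded variant is sketched below in plain Python so it can be run without a browser. The helper `retry`, its parameters, and the `flaky_load` stand-in for `driver.get(URL)` are hypothetical and are not part of Selenium.

```python
import logging
import random
import time

logger = logging.getLogger(__name__)

def retry(action, attempts=3, retry_on=(Exception,), min_pause=0.1, max_pause=0.5):
    """Run action() up to `attempts` times, pausing briefly between tries.

    Returns the first successful result; re-raises the last error if
    every attempt fails with one of the exception types in `retry_on`.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(random.uniform(min_pause, max_pause))

# Hypothetical stand-in for driver.get(URL): times out twice, then succeeds.
calls = {"n": 0}

def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("page load timed out")
    return "loaded"

result = retry(flaky_load, attempts=5, retry_on=(TimeoutError,),
               min_pause=0.0, max_pause=0.01)
print(result, calls["n"])
```

In a Selenium script, `action` could be `lambda: driver.get(URL)` with `retry_on=(TimeoutException,)`, assuming the page-load timeout has been set as in the answer above.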