Getting "NameError: name 'tryAgain' is not defined" when running Python web scraping code [closed]

Question · Votes: 0 · Answers: 2

I'm trying to scrape a web page with Python, but I'm running into an error. Here is a simplified version of my code:

from email import header
import random
import time
import urllib.request
from bs4 import BeautifulSoup
import requests

main_url = "http://www.google.com"

main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

# Rest of the code to scrape data

However, I get the following error:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
  main_page_html  = tryAgain(main_url)
NameError: name 'tryAgain' is not defined

The tryAgain function is defined later in my code; it handles scraping URLs over a connection with intermittent problems. Here is the tryAgain function:

def tryAgain(passed_url):
    try:
        # Code to scrape the URL
        ...
    except Exception:
        # Code to retry scraping after waiting
        ...

Why am I getting this error, and how can I fix it?

python compiler-errors
2 Answers

1 vote

Think of it this way: unlike some compiled languages, Python interprets a file line by line, from top to bottom. So when it executes any given line, only the names defined on the lines above it exist (this is a simplification, but it holds here). With that said, what happens if you run this?

from email import header
import random
import time
import urllib.request
from bs4 import BeautifulSoup
import requests

main_url = "http://www.google.com"

main_page_html  = tryAgain(main_url)

Of course it says 'tryAgain' is not defined! You need to move the call below the definition (or move the definition above the call).
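A minimal sketch of the point (the greet function here is hypothetical, not from the question): a name only exists after the line that binds it has executed.

```python
# Calling the function here, before its def statement has run, would raise:
# greet("world")  # NameError: name 'greet' is not defined

def greet(name):
    # The def statement binds the name 'greet' when this line is reached.
    return "Hello, " + name + "!"

# Once the def statement has executed, the name exists and the call works:
print(greet("world"))  # prints: Hello, world!
```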


1 vote

Python is a scripting language: always define your methods/classes before you call them.

When your code's execution reaches this line

main_page_html  = tryAgain(main_url)

Python cannot find the method "tryAgain", because it is defined later in the code.

Do this instead:

from email import header
import random
import time
import urllib.request
from bs4 import BeautifulSoup
import requests


# This method scrapes a URL; if the request fails, it waits 20 seconds
# and tries again. (I use it because my internet connection sometimes
# disconnects.)

def tryAgain(passed_url):
    try:
        page  = requests.get(passed_url).text
        return page
    except Exception:
        while 1:
            print("Trying again the URL:")
            print(passed_url)
            try:
                # NOTE: `header` must be a list of header dicts and
                # `timeout_time` a number, both defined elsewhere in the full script
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                time.sleep(20)
                continue 

main_url = "http://www.google.com"
main_page_html  = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

# Scrape all TDs from TRs inside Table
for tr in main_page_soup.select("table.class_of_table"):
    for td in tr.select("td#id"):
        print(td.text)
        # For anchors inside the TD
        print(td.select("a")[0].text)
        # Value of the href attribute
        print(td.select("a")[0]["href"])
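Beyond reordering, a common convention is to put all function definitions at the top of the file and the top-level calls at the bottom, under a main guard. Here is a minimal sketch of that layout using the standard library's urllib.request in place of requests; the fetch function and its URL are illustrative, not part of the original answer:

```python
import urllib.request

def fetch(url):
    """Return the page body as text, or None on a network error."""
    try:
        # timeout must be passed by keyword: the second positional
        # parameter of urlopen() is the POST data, not the timeout.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return None

# Top-level execution goes last, after every def has already run,
# and under a main guard so importing this module has no side effects.
if __name__ == "__main__":
    html = fetch("http://www.google.com")
    if html is not None:
        print(len(html))
```

With this layout, every name is guaranteed to be bound before the line that uses it executes, so the NameError from the question cannot occur.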
