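I wrote a simple script that scrapes recipes from a website. The URL of each recipe is stored in an Excel file, which I read with pandas. I have a strange problem. Say I want to scrape 100 recipes: when the for loop reaches i = 21 it stops and the page never loads (the site keeps loading forever), but when I start the for loop from 20 it stops at 41. If I rerun the code it may stop at i = 17, so it is fairly random. Has anyone had a similar problem?
Website: https://akispetretzikis.com/en
Thank you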
import os
import random
import time
from datetime import datetime

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


# readExcel, getDataFromRecipe and writeOnExcel are my own helpers (not shown here).
def mainProgram(start):
    now = datetime.now()
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-infobars')
    options.add_argument('--disable-dev-shm-usage')
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument('--disable-blink-features=AutomationControlled')
    # Scraped values are collected column by column and written to Excel at the end.
    theDictionary = {"Link": [], "Name": [], "Time": [], "Difficulty": [],
                     "Merides": [], "Ingredients": [],
                     "ThermidesPer100gr": [], "ThermidesAnaMerida": []}
    driver = webdriver.Chrome(executable_path=r'/usr/lib/chromium-browser/chromedriver',
                              options=options)
    driver.set_window_size(1280, 960)
    # Output workbook and the workbook holding the recipe URLs.
    thePath = os.path.join(os.path.expanduser("~"), "Desktop", "ScrapeRecipes",
                           "Cooking"+str(now.year)+".xlsx")
    thePathReadExcel = os.path.join(os.path.expanduser("~"), "Desktop",
                                    "CookingUrls"+str(now.year)+".xlsx")
    UrlOfRecipes = readExcel(thePath=thePathReadExcel)
    try:
        Length = len(UrlOfRecipes)
        print(Length)
        Length = 100  # e.g. 100; the actual Length is over 1k
        for i in range(start, Length, 1):
            driver.delete_all_cookies()
            driver.get(UrlOfRecipes["Link"][i])  # this is the call that never returns
            wait = WebDriverWait(driver, 20 + round(random.uniform(0, 4), 2))
            time.sleep(30 + round(random.uniform(0, 4), 2))  # mandatory sleep
            theDictionary["Link"].append(UrlOfRecipes["Link"][i])
            theDictionary = getDataFromRecipe(driver, theDictionary)
            time.sleep(20 + round(random.uniform(0, 4), 2))
            print(i)
    except Exception as e:
        print(e)
    # Write whatever was collected, even if the loop stopped early.
    writeOnExcel(theDictionary, thePath)
I am facing the same problem, and I still have not found a solution.
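One thing that may be worth trying, although nothing in this thread confirms it is the cause: `driver.get` blocks until the page reports that it has finished loading, so a page that never finishes can stall the whole loop. Selenium's `driver.set_page_load_timeout(seconds)` makes `driver.get` raise an exception instead of waiting forever, and the failed URL can then be retried or skipped. The sketch below is only an illustration under that assumption; `load_with_retry` and its parameters are hypothetical names, not part of Selenium or of the code above, and the helper takes any callable so it can be tried without a browser.

```python
import time


def load_with_retry(load_page, url, attempts=3, pause_seconds=5.0):
    """Call load_page(url) until it succeeds or the attempts run out.

    load_page is any callable that raises on failure, for example
    driver.get once a page-load timeout has been set on the driver.
    Returns True on success, False if every attempt raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            load_page(url)
            return True
        except Exception as error:  # e.g. Selenium's TimeoutException
            print(f"attempt {attempt} for {url} failed: {error}")
            if attempt < attempts:
                time.sleep(pause_seconds)  # short pause before retrying
    return False


# Hypothetical use inside the scraping loop (not executed here):
#   driver.set_page_load_timeout(60)  # driver.get now raises instead of hanging
#   if not load_with_retry(driver.get, UrlOfRecipes["Link"][i]):
#       continue  # skip this recipe and carry on with the next one
```

With something like this, a recipe that fails to load costs a bounded amount of time rather than freezing the run, and the failed indices show up in the printed output so they can be retried later.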