Python scraping program sometimes gets a 511 error code, and sometimes doesn't

Question · Votes: 0 · Answers: 1

I'm trying to build a program that uses Selenium to open Firefox and BrowserMob Proxy to capture a HAR file, then extracts a link from that file which leads to a JSON page. Every 5 seconds the program grabs a fresh HAR file and scrapes the JSON data. The problem is that sometimes the scrape returns a 511 error -

<!DOCTYPE html><html><head><title>Apache Tomcat/8.0.32 (Ubuntu) - Error report</title><style type="text/css">H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}.line {height: 1px; background-color: #525D76; border: none;}</style> </head><body><h1>HTTP Status 511 - something went wrong with your request (2)</h1><div class="line"></div><p><b>type</b> Status report</p><p><b>message</b> <u>something went wrong with your request (2)</u></p><p><b>description</b> <u>The client needs to authenticate to gain network access.</u></p><hr class="line"><h3>Apache Tomcat/8.0.32 (Ubuntu)</h3></body></html>

Note the 511 status code: "The client needs to authenticate to gain network access."

Other times it succeeds and returns the dictionary I want -

{"alerts":[{"country":"IL","nThumbsUp":2,"city"... 

Why does this happen?

Possibly relevant: the JSON page is only valid for about 1-3 seconds, but I measured the time until the program fetches the data and it was about 0.0000385 seconds, so that doesn't seem to be the problem.
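One caveat about that 0.0000385-second figure: in the posted code the timing is done with `timeit.timeit()`, which with no arguments times executing a `pass` statement, not the code between the two calls, so the difference of two such calls is essentially noise. A minimal sketch of measuring elapsed time correctly (the `time.sleep` stands in for the actual fetch):

```python
import time

start = time.perf_counter()   # monotonic timer intended for measuring intervals
time.sleep(0.01)              # stand-in for the HAR lookup + HTTP request
elapsed = time.perf_counter() - start
print(f"fetch took {elapsed:.6f} s")
```

With `time.perf_counter()` the measured interval will reflect the real request latency, which is almost certainly far more than 0.0000385 s.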

My current theory is that because the program scrapes every x seconds, the connection gets dropped, though I'd expect that to throw a big error. My second theory, now debunked, was that it's simply rate limiting, so I paused the program for 3 seconds with time.sleep(), still without success.
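If rate limiting is in play, a fixed 3-second pause may not be enough; a common pattern is to retry with exponential backoff and only accept a 200 response. A minimal sketch, where `fetch_with_retry` is an illustrative helper and `get` is any callable returning an object with `.status_code` and `.text` (e.g. `requests.get`):

```python
import time

def fetch_with_retry(url, get, attempts=3, base_delay=2.0):
    """Retry get(url) with exponential backoff whenever the response
    status is not 200 (e.g. HTTP 511). Returns the body text, or None
    if every attempt was rejected."""
    for attempt in range(attempts):
        resp = get(url)
        if resp.status_code == 200:
            return resp.text
        time.sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s, ...
    return None
```

Usage would be `fetch_with_retry(url, requests.get)`; if every attempt still comes back 511, the problem is likely authentication rather than timing.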

Any suggested workaround, or pointing out the mistake in my code, would be a huge help.

Code (it's a bit messy right now, I haven't had a chance to clean it up):

import os
import json
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from browsermobproxy import Server
import schedule
import requests
import timeit
import time

i = 1

print("""Waze Police Scraper

Waze Police Scraper will open the Mozilla Firefox browser, onto Waze's live map website.
It'll scrape all the police locations from your preferred location, including police traps that are voted on by Waze's users.
Next to every cop location it'll show the number of upvotes and downvotes, and from that the probability of the users' report being true.

Instructions:


""")


def personalised_info():
    auto_or_manual = input("Do you want the software to scrape the Waze map manually (M) or automatically (A)? ")

    sec = input("How often, in seconds, should the scraper scrape the data? Maximum is 30 seconds, minimum is 5 seconds. Leave blank for the recommended default of 5 seconds. ")
    return auto_or_manual, sec

def start_server():
    global server, proxy, driver
    server = Server("C:\\Users\\Yahav\\Downloads\\browsermob-proxy-2.1.4-bin\\browsermob-proxy-2.1.4\\bin\\browsermob-proxy")

    server.start()
    proxy = server.create_proxy()

    #proxy.wait_for_traffic_to_stop(6000, 9000)


    profile = webdriver.FirefoxProfile()
    profile.set_proxy(proxy.selenium_proxy())
    driver = webdriver.Firefox(executable_path = "C:\\Users\\Yahav\\Downloads\\geckodriver-v0.26.0-win64\\geckodriver.exe", firefox_profile=profile)
    # Navigate to the application home page
    driver.get("https://www.waze.com/livemap?utm_source=waze_website&utm_campaign=waze_website")

urls = []
t = 1
data_parsed = {}
inner_nested_data_parsed = {}
data_list = []

def get_data(urls, t, data_parsed, inner_nested_data_parsed):
    start = time.perf_counter()  # Measure elapsed time (timeit.timeit() with no args times a no-op, not this code)
    global i
    #tag the har(network logs) with a name
    har = proxy.new_har("waze_{0}".format(i))

    # Finding the URL requests where the data is stored in JSON format
    har = str(har)
    str_1 = "https://www.waze.com/il-rtserver/web/TGeoRSS?"
    str_2 = "&types=alerts%2Ctraffic%2Cusers"

    indx_1 = har.find(str_1)
    indx_2 = har.find(str_2)

    url = har[indx_1:indx_2]

    url = url + str_2

    urls.append(url)

    print(urls)

    for d in urls:
        if d == str_2:
            data = {}
        if d != str_2:
            data_request = requests.get(url)
            time.sleep(3)
            if data_request.ok:
                data = data_request.json()  # parse the JSON body into a dict
            else:
                data = data_request.text  # error pages (e.g. 511) come back as HTML text
            end = time.perf_counter()  # Measure elapsed time
            data_list.append(data)
            print(type(data))
            print(end - start)  #Time to get data

    if url == "&types=alerts%2Ctraffic%2Cusers":  # If the user isn't moving, 'url' will equal the bare query string
        print("Move your cursor to your preferred location.")
    else:
        if type(data) is dict:
            for x in range(len(data["alerts"])):
                if (data["alerts"][x]["type"]) == "POLICE":
                    inner_nested_data_parsed["type"] = (data["alerts"][x]["type"])
                    if data["alerts"][x].get("subtype"):  # subtype can be missing or empty
                        inner_nested_data_parsed["subtype"] = (data["alerts"][x]["subtype"])
                    inner_nested_data_parsed["country"] = (data["alerts"][x]["country"])
                    inner_nested_data_parsed["nThumbsUp"] = (data["alerts"][x]["nThumbsUp"])
                    inner_nested_data_parsed["confidence"] = (data["alerts"][x]["confidence"])
                    inner_nested_data_parsed["reliability"] = (data["alerts"][x]["reliability"])
                    inner_nested_data_parsed["speed"] = (data["alerts"][x]["speed"])
                    inner_nested_data_parsed["location_x"] = (data["alerts"][x]["location"]["x"])
                    inner_nested_data_parsed["location_y"] = (data["alerts"][x]["location"]["y"])

                    data_parsed[t] = inner_nested_data_parsed

                    t += 1
                    inner_nested_data_parsed = {}  # resets the dictionary so the elements in the list "alerts" won't be added to the same value of "t" in the dictionary "data_parsed"
                else:
                    continue
        else:
            print("Unexpected data type:", type(data))

    print(data)
    """ # Logs to file
    path_log_file = "demofile3.txt"
    if os.path.exists(path_log_file):  #Checks if file exists
        f = open(path_log_file, "w")
        print(data)
        f.write(str(data))
        f.flush()
        f.close()

    else:
        f = open(path_log_file, "x")
        f = open(path_log_file, "w")
        f.write(str(data))
        f.flush()
        f.close()
    """

    server.stop()
    # close the browser window
    #driver.quit()
    i += 1
    return i

print(data_parsed)

auto_or_manual, sec = personalised_info()

if auto_or_manual == "A":
    if not sec:  # Default to the recommended interval
        sec = 5
        start_server()
        schedule.every(sec).seconds.do(get_data, urls, t, data_parsed, inner_nested_data_parsed)
    elif sec.isdigit():  # If input is a digit
        start_server()
        schedule.every(int(sec)).seconds.do(get_data, urls, t, data_parsed, inner_nested_data_parsed)
    else:
        print("Please enter a valid number.")
        personalised_info()

else:
    print(None)
    #Manual
#proxy.new_har("waze")


#driver.get("about:preferences#privacy")

while True:  # User defined
    schedule.run_pending()
    time.sleep(1)  # avoid busy-waiting between scheduler checks
Tags: python, html, selenium, web, networking
1 Answer (Score: 0)

A 511 status code means you need to authenticate before you can access more data.

The company has likely set data limits to block unauthorized scraping. Be sure to read their terms and conditions of use.
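Whatever the underlying cause, the scraper should detect the 511 instead of treating the HTML error page as data. A minimal sketch (the `parse_alerts` helper is illustrative; `resp` is any object with `.status_code` and `.text`, such as a `requests.Response`):

```python
import json

def parse_alerts(resp):
    """Parse the alerts payload only when the request succeeded.
    511 responses carry an HTML error page, not JSON, so return
    None instead of attempting to decode them."""
    if resp.status_code != 200:
        return None
    return json.loads(resp.text)
```

The caller can then skip a scrape cycle (or back off) whenever `parse_alerts` returns None, rather than crashing or storing an HTML page in the results.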
