Making requests with mixed URL formats in Python 3.x


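I have a .txt file containing a list of URLs. The structure of the URLs varies: some begin with https, some with http, some with just www, and others with just the domain name (stackoverflow.com). So an example of the .txt file contents is: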

www.google.com
microsoft.com
https://www.yahoo.com
http://www.bing.com

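What I want to do is parse the list and check whether each URL works. To do that, the URL must be correctly structured, otherwise the request will fail. Here is my code so far: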

import requests

with open('urls.txt', 'r') as f:
    urls = f.readlines()
    for url in urls:
        url = url.replace('\n', '')
        if not url.startswith('http'):  #This is to handle just domain names and those that begin with 'www'
            url = 'http://' + url
        if url.startswith('http:'):
            print("trying url {}".format(url))
            response = requests.get(url, timeout=10)
            status_code = response.status_code
            if status_code == 200:
                continue
            else:
                print("URL {} has a response code of {}".format(url,  status_code))
                print("encountered error. Now trying with https")
                url = url.replace('http://', 'https://')
                print("Now replacing http with https and trying again")
                response = requests.get(url, timeout=10)
                status_code = response.status_code
                print("URL {} has a response code of {}".format(url,  status_code))
        else:
            response = requests.get(url, timeout=10)
            status_code = response.status_code
            print("URL {} has a response code of {}".format(url,  status_code))

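I feel I have overcomplicated this, and there must be an easier way of trying each variant (i.e. just the domain name, the domain name with 'www' at the beginning, with 'http://' at the beginning, and with 'https://' at the beginning) until the site is identified as working or not (i.e. all variants have been exhausted).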

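Any suggestions on my code, or a better way to approach this? Essentially, I want to handle the formatting of the URL so that I can then check its status.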

Thanks in advance.

python python-3.x python-requests http-status-codes
1 Answer

This is a bit too long for a comment, but yes, you can simplify it, starting with the startswith and replace parts:

# Prepend a scheme only when the URL does not already have one.
if '//' not in url:
    url = 'http://' + url
response = requests.get(url, timeout=10)
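Going one step further, here is a rough sketch of how the "try each variant until one works" idea from the question could be laid out. It is only one possible arrangement, not a definitive implementation. The names `candidate_urls`, `first_ok_url` and `check_file` are made up for illustration. It assumes that "working" means a 200 status code, as in the question, and that a URL which already has a scheme should be tried as written before the other scheme.

```python
SCHEMES = ('http://', 'https://')


def candidate_urls(raw):
    """Return the URL variants to try for one line of the input file."""
    url = raw.strip()
    if not url:
        return []
    for scheme in SCHEMES:
        if url.lower().startswith(scheme):
            # A scheme is already present: try it as written, then the other one.
            bare = url[len(scheme):]
            other = SCHEMES[1] if scheme == SCHEMES[0] else SCHEMES[0]
            return [url, other + bare]
    # No scheme (bare domain or 'www...'): try http first, then https.
    return [scheme + url for scheme in SCHEMES]


def first_ok_url(raw, timeout=10):
    """Return (url, 200) for the first variant that answers 200.

    Otherwise return (None, last_status), where last_status is the last
    status code seen, or None if no variant could be reached at all.
    """
    # Imported here so candidate_urls stays usable without requests installed.
    import requests

    last_status = None
    for url in candidate_urls(raw):
        try:
            response = requests.get(url, timeout=timeout)
        except requests.RequestException:
            # DNS failure, refused connection, timeout, TLS error, etc.
            continue
        last_status = response.status_code
        if last_status == 200:
            return url, last_status
    return None, last_status


def check_file(path='urls.txt'):
    """Print one result line per non-empty line of the file."""
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            url, status = first_ok_url(line)
            if url:
                print("{} is up ({})".format(line.strip(), url))
            else:
                print("{} failed (last status: {})".format(line.strip(), status))
```

Calling `check_file('urls.txt')` would walk the file from the question. Catching `requests.RequestException` matters here because `requests.get` raises on an unreachable host instead of returning a status code, so without it the loop would stop at the first dead domain.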
