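I have a .txt file that contains a list of URLs. The structure of the URLs varies: some begin with https, some with http, others with just www, and others with just the domain name (stackoverflow.com). So an example of the .txt file's contents is: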
www.google.com
microsoft.com
https://www.yahoo.com
http://www.bing.com
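What I want to do is parse the list and check whether each URL is live. For that, the URL has to be structured correctly, otherwise the request will fail. Here is my code so far: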
import requests

with open('urls.txt', 'r') as f:
    urls = f.readlines()

for url in urls:
    url = url.replace('\n', '')
    if not url.startswith('http'):  # This is to handle just domain names and those that begin with 'www'
        url = 'http://' + url
    if url.startswith('http:'):
        print("trying url {}".format(url))
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        if status_code == 200:
            continue
        else:
            print("URL {} has a response code of {}".format(url, status_code))
            print("encountered error. Now trying with https")
            url = url.replace('http://', 'https://')
            print("Now replacing http with https and trying again")
            response = requests.get(url, timeout=10)
            status_code = response.status_code
            print("URL {} has a response code of {}".format(url, status_code))
    else:
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        print("URL {} has a response code of {}".format(url, status_code))
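I feel I've overcomplicated this, and there must be an easier way of trying the variants (i.e. the bare domain name, the domain name beginning with 'www', beginning with 'http://', and beginning with 'https://') until a site is identified as up or not (i.e. all variants have been exhausted).
Any suggestions on my code, or a better way to approach this? In essence, I want to handle the formatting of the URL so that I can then go on to check its status.
Thanks in advance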
This is a bit too long for a comment, but yes, it can be simplified, starting with the startswith part, which can be replaced:
if '//' not in url:
    url = 'http://' + url
response = requests.get(url, timeout=10)
etc.
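To go one step further with the "try each variant until one works" idea from the question, here is a rough sketch rather than a definitive implementation. The names candidate_urls, first_reachable and requests_status are my own (hypothetical), and I'm assuming that "live" means a 200 response and that http should be tried before https when no scheme is given, as in the question's code. The status lookup is passed in as a function so the loop itself can be exercised without touching the network.

```python
def candidate_urls(raw):
    """Return the URLs worth trying for one line of the input file."""
    raw = raw.strip()
    if not raw:
        return []
    if '//' in raw:
        # A scheme is already present: try it as given, then the other scheme.
        scheme, rest = raw.split('//', 1)
        other = 'https:' if scheme.lower() == 'http:' else 'http:'
        return [raw, other + '//' + rest]
    # No scheme (bare domain or 'www...'): try http first, then https.
    return ['http://' + raw, 'https://' + raw]


def first_reachable(raw, get_status):
    """Try each candidate in turn; return (url, 200) for the first hit, else None."""
    for url in candidate_urls(raw):
        try:
            status = get_status(url)
        except Exception as exc:
            # Assumption: a broad catch for the sketch. With requests you would
            # normally catch requests.exceptions.RequestException instead.
            print("URL {} failed: {}".format(url, exc))
            continue
        if status == 200:
            return url, status
        print("URL {} has a response code of {}".format(url, status))
    return None


def requests_status(url):
    """Status code via requests; imported here so the helpers above stay standalone."""
    import requests
    return requests.get(url, timeout=10).status_code


# Possible usage with the file from the question:
# with open('urls.txt') as f:
#     for line in f:
#         print(line.strip(), first_reachable(line, requests_status))
```

Catching the exception matters here: a dead host or a refused connection makes requests.get raise rather than return a status code, so without the try/except the loop would stop at the first unreachable URL.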