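I have a list of 100 websites that I want to ping to see whether they are online. I want to record the status returned for each record in a new field called "status". I have stored them in a DataFrame and would like to use an apply function to parallelize the exercise, taking advantage of up to 8 cores on my laptop. It currently takes about 3 minutes 30 seconds, and I naively hope to get that under 30 seconds. I have tried swifter, with no success. I would prefer some kind of apply function, but I am open to using the multiprocessing/multithreading modules. I am not a programmer, so this really is at the limit of my current abilities. Thanks for any ideas or suggestions.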
import pandas as pd
import requests
from urllib.parse import urlparse
import urllib3
import swifter
#Load data to dataframe
#List of sites
siteList = [[1, 'https://www.facebook.com'], [2, 'https://www.instagram.com'], [3, 'https://www.mail.com'], [4, 'https://www.thegrumpyscarecrow.com/']]
df = pd.DataFrame(siteList, columns=['id','site'])
#functions
def getStatusCode(url):
    try:
        r = requests.head(url, verify=False, timeout=5)
        return (r.status_code)
    except:
        return -1
#Run the script
df['status'] = df.swifter.allow_dask_on_strings(enable=True).apply(lambda x: getStatusCode(x['site']), axis=1, result_type='expand')
Instead of swifter, you can use ThreadPoolExecutor:
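from concurrent.futures import ThreadPoolExecutor
from requests.exceptions import RequestException

# verify=False raises an InsecureRequestWarning for each unverified HTTPS request; silence it
requests.urllib3.disable_warnings()

def getStatusCode(url):
    try:
        r = requests.head(url, verify=False, timeout=5)
        status = r.status_code
    # RequestException also covers read timeouts, which ConnectionError alone would miss
    except RequestException:
        status = -1
    return status

# The work is I/O-bound, so threads overlap the time spent waiting on each site
with ThreadPoolExecutor() as executor:
    status = executor.map(getStatusCode, df['site'])

# executor.map yields results in the same order as df['site']
df['status'] = list(status)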
Output:
>>> df
   id                                 site  status
0   1             https://www.facebook.com     200
1   2            https://www.instagram.com     200
2   3                 https://www.mail.com     200
3   4  https://www.thegrumpyscarecrow.com/      -1
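Because each check spends nearly all of its time waiting on the network, the number of threads matters more than the number of CPU cores. In recent Python versions, ThreadPoolExecutor() defaults to min(32, os.cpu_count() + 4) workers, so it may be worth setting max_workers explicitly for 100 sites. The following is a minimal sketch of that idea, not a tuned implementation: check_all and fake_check are hypothetical names introduced here, the example.com-style URLs are placeholders, and fake_check only simulates a slow request with time.sleep so the snippet runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, check, max_workers=32):
    """Run `check` on every URL in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(check, urls))


def fake_check(url):
    # Hypothetical stand-in for getStatusCode: sleeps to mimic network latency
    time.sleep(0.1)
    return 200 if url.startswith("https://up") else -1


# Placeholder URLs (assumption): 20 "online" sites followed by one "offline" site
urls = [f"https://up-{i}.example" for i in range(20)] + ["https://down.example"]

start = time.perf_counter()
statuses = check_all(urls, fake_check)
elapsed = time.perf_counter() - start

# Run serially, 21 checks at 0.1 s each would take about 2.1 s
print(f"checked {len(statuses)} sites in {elapsed:.2f}s")
```

With the real function from the answer, the call would look like df['status'] = check_all(df['site'], getStatusCode). A larger max_workers shortens the total wall time only until the network or the remote servers become the bottleneck, so the value of 32 here is a starting point to experiment with rather than a recommendation.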