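I'm using the code below to fetch information for a thousand Instagram accounts with asyncio. The first requests return the correct output, but after 10-20 calls Instagram starts returning the HTML of a loading page. What might I be doing wrong here? The Python code is below.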
import random
import asyncio
from aiohttp import ClientSession
import urllib.request
import aiohttp

async def fetch(url, session,sem):
    print("------")
    print(url)
    async with session.get(url = url) as response:
        print(await response.text())
        await response.text()
        # exit()
        if response.status == 200:
            await sem.acquire()
            fname = url[22:]
            fname = fname.split('/')
            fname = fname[0] + '.txt'
            f = open(fname , 'w')
            f.write(str(await response.text()))
            sem.release()
            # return (await response.text())

async def run(url_list):
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(2)
    # Create client session that will ensure we dont open new connection
    # per each request.
    async with ClientSession() as session:
        for url in url_list:
            task = asyncio.ensure_future(fetch(url, session,sem))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses

# making the url list here
url_list = []
file = open('url.txt', 'r')
for url in file:
    url_list.append(url)
print(url_list)
import time
old = time.time()
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(url_list))
loop.run_until_complete(future)
print(time.time() - old)
Here are some of the URLs from the url.txt file:
https://instagram.com/johanna_kre/?__a=1
https://instagram.com/channie_f/?__a=1
https://instagram.com/lilakuh68/?__a=1
https://instagram.com/nataliacallisto/?__a=1
https://instagram.com/edbastian/?__a=1
https://instagram.com/sylvana.h/?__a=1
https://instagram.com/munich_bombon/?__a=1
https://instagram.com/younotus/?__a=1
https://instagram.com/meet.herbert/?__a=1
https://instagram.com/inaaogo/?__a=1
https://instagram.com/dennisaogo/?__a=1
https://instagram.com/mrslight__/?__a=1
https://instagram.com/reneturrek/?__a=1
https://instagram.com/_eeasyyy/?__a=1
https://instagram.com/sentinobln/?__a=1
https://instagram.com/eri.ka_g/?__a=1
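Your semaphore is not limiting requests the way you intend: you should acquire it before making the request, not before processing the content.
In your current implementation you issue 100 concurrent requests (the default limit of the aiohttp client) but process only two responses at a time (by which point, from the server's perspective, the requests have already been served).
Use: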
async def fetch(url, session,sem):
    print("------")
    print(url)
    await sem.acquire()
    async with session.get(url = url) as response:
        print(await response.text())
        await response.text()
        ...
    sem.release()
    ...
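
To see the effect of acquiring the semaphore before the request, here is a minimal sketch that uses only the standard library. It is not the original code: `fake_request`, the `stats` dictionary, and the `example.invalid` URLs are hypothetical stand-ins so the snippet runs without network access, and `async with sem:` is used in place of explicit `acquire()`/`release()` calls so the semaphore is released even if the request raises.

```python
import asyncio


async def fake_request(url):
    # Hypothetical stand-in for `session.get(url)`; no real network call is made.
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def fetch(url, sem, stats):
    # The semaphore is acquired BEFORE the request starts, so at most
    # `limit` requests are in flight at any moment.
    async with sem:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        try:
            return await fake_request(url)
        finally:
            stats["in_flight"] -= 1


async def run(url_list, limit=2):
    sem = asyncio.Semaphore(limit)
    # Bookkeeping only, to observe how many requests overlap.
    stats = {"in_flight": 0, "peak": 0}
    bodies = await asyncio.gather(*(fetch(u, sem, stats) for u in url_list))
    return bodies, stats["peak"]


urls = [f"https://example.invalid/user{i}" for i in range(10)]
bodies, peak = asyncio.run(run(urls))
print(len(bodies), peak)
```

In the real script, the body of `fake_request` would be the `session.get(...)` call and the file write. If only the number of simultaneous connections matters, passing `connector=aiohttp.TCPConnector(limit=2)` to `ClientSession` is another way to lower the default of 100, though whether that alone is enough to avoid Instagram's rate limiting is not something this sketch can confirm.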