Can't use https proxies along with reusing the same session within a script based on asyncio

Problem description (votes: 1, answers: 1)

I'm trying to use https proxies with asynchronous requests built on the asyncio library. There is a clear instruction here for using http proxies, but I'm stuck when it comes to https proxies. Moreover, I'd like to reuse the same session instead of creating a new one every time I send a request.
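For reference, the http-proxy pattern referred to above boils down to something like this (a minimal sketch following the aiohttp docs; the proxy address is just a placeholder):

import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        # the proxy is passed per request as an http:// URL (placeholder address)
        async with session.get(url, proxy="http://some.proxy.com:3128") as resp:
            return await resp.text()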

This is what I've tried so far (the proxies used within the script are taken directly from a free proxy site, so consider them placeholders):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

async def get_text(url):
    global proxies,proxy_url
    proxy = f'http://{proxy_url}'
    print("trying using:",proxy)
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(url,proxy=proxy,ssl=False) as resp:
                return await resp.text()
        except Exception:
            proxy_url = proxies.pop()
            return await get_text(url)

async def field_info(field_link):              
    text = await get_text(field_link)          
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

if __name__ == '__main__':
    proxy_url = None
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(field_info(url) for url in links)))
    loop.run_until_complete(future)
    loop.close()

How can I use https proxies within the script and reuse the same session?

python python-3.x web-scraping python-asyncio aiohttp
1 Answer

1 vote

This script creates a dictionary proxy_session_map, where the key is a proxy and the value is a session. That way we know which session belongs to which proxy.

If an error occurs while using a proxy, I add that proxy to the disabled_proxies set so it isn't used again:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

from random import choice

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

disabled_proxies = set()

proxy_session_map = {}

async def get_text(url):
    while True:
        try:
            # pick only from proxies that haven't been disabled yet
            available_proxies = [p for p in proxies if p not in disabled_proxies]

            if available_proxies:
                proxy = choice(available_proxies)
            else:
                proxy = None

            # lazily create one ClientSession per proxy so it can be reused
            if proxy not in proxy_session_map:
                proxy_session_map[proxy] = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=5))

            print("trying using:",proxy)

            async with proxy_session_map[proxy].get(url,proxy=proxy,ssl=False) as resp:
                return await resp.text()

        except Exception as e:
            if proxy:
                # this proxy failed, disable it and retry with another one
                print("error, disabling:",proxy)
                disabled_proxies.add(proxy)
            else:
                # we haven't used proxy, so return empty string
                return ''


async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    tasks = [field_info(url) for url in links]

    await asyncio.gather(
        *tasks
    )

    # close all sessions:
    for s in proxy_session_map.values():
        await s.close()

if __name__ == '__main__':
    asyncio.run(main())

This prints (for example):

trying using: http://89.22.210.191:41258
trying using: http://124.41.213.211:41828
trying using: http://124.41.213.211:41828
error, disabling: http://124.41.213.211:41828
trying using: http://93.191.100.231:3128
error, disabling: http://124.41.213.211:41828
trying using: http://103.81.104.66:34717
BeautifulSoup to get image name from P class picture tag in Python
Scrap instagram public information from google cloud functions [duplicate]
Webscraping using R - the full website data is not loading
Facebook Public Data Scraping
How it is encode in javascript?

... and so on.
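A note on the design choice: since aiohttp accepts the proxy argument per request, a single shared ClientSession could in principle be reused across all proxies as well; the dictionary above simply gives each proxy its own session (and therefore its own connection pool and timeout). A minimal sketch of that single-session variant, not the code above, might look like this:

import asyncio
import aiohttp

proxies = ['http://93.191.100.231:3128']   # placeholder proxies, same as above

async def fetch(session, url, proxy):
    # only the proxy varies per request; the session object is reused
    async with session.get(url, proxy=proxy, ssl=False) as resp:
        return await resp.text()

async def main():
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    async with aiohttp.ClientSession() as session:
        for proxy in proxies:
            try:
                return await fetch(session, url, proxy)
            except Exception:
                continue   # this proxy failed, try the next one

if __name__ == '__main__':
    asyncio.run(main())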