Using requests_html in Google Cloud Functions or Cloud Run

Question (votes: 0, answers: 2)

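I have a fairly basic scraping application that I want to run in a Google Cloud environment. I am using the requests_html async library. It works fine in my local environment, but after fiddling with it for days I cannot figure out how to run it on Google Cloud. The application simply renders some JavaScript pages (listed in the urls array) with html.arender, then uses BeautifulSoup to extract the contents of a few specific tags (listed in the tags array).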

The error message I keep getting is:

"signal only works in main thread of the main interpreter"

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/functions_framework/__init__.py", line 99, in view_func
    return function(request._get_current_object())
  File "/workspace/main.py", line 53, in main
    results = asyncio.run(collect(urls,tags))
  File "/opt/python3.9/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/python3.9/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/workspace/main.py", line 32, in collect
    return await asyncio.gather(*tasks)
  File "/workspace/main.py", line 18, in getPage
    await r.html.arender(timeout=40,sleep=1)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests_html.py", line 615, in arender
    self.browser = await self.session.browser
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests_html.py", line 714, in browser
    self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pyppeteer/launcher.py", line 307, in launch
    return await Launcher(options, **kwargs).launch()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pyppeteer/launcher.py", line 159, in launch
    signal.signal(signal.SIGINT, _close_process)
  File "/opt/python3.9/lib/python3.9/signal.py", line 56, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread of the main interpreter 
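
For context, the last frames of the traceback show pyppeteer's launcher calling signal.signal, which Python only permits from the main thread. The error therefore suggests that the function is being executed in a worker thread on Cloud Functions and Cloud Run. The sketch below is only a minimal reproduction of that restriction using the standard library; the helper name try_install_handler is made up for illustration and is not part of pyppeteer or requests_html.

```python
import signal
import threading


def try_install_handler(results):
    """Try to register a SIGINT handler and record what happened."""
    try:
        previous = signal.signal(signal.SIGINT, signal.default_int_handler)
        # Restore the previous handler if registration succeeded.
        signal.signal(signal.SIGINT, previous)
        results.append("ok")
    except ValueError as exc:
        # Raised when signal.signal() is called outside the main thread.
        results.append(str(exc))


results = []
worker = threading.Thread(target=try_install_handler, args=(results,))
worker.start()
worker.join()
print(results[0])
```

Creating a new event loop does not change which thread the code runs in, which would explain why setting the loop manually has no effect on this error.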

Here is my code:

from requests_html import AsyncHTMLSession
import asyncio
from bs4 import BeautifulSoup as bs

urls = ['list of urls']
tags = ['p', 'h1']

async def getPage(s, url):
    # Fetch the page, then render its JavaScript in headless Chromium.
    r = await s.get(url)
    await r.html.arender(timeout=60, sleep=3, scrolldown=2)
    p = bs(r.html.html, "html.parser")
    # The first entry is the URL, followed by the text of every matching tag.
    elmList = [url]
    for t in tags:
        for e in p.find_all(t):
            elmList.append(e.text)
    return elmList

async def collect(urls):
    # One shared session; all pages are fetched and rendered concurrently.
    s = AsyncHTMLSession()
    tasks = (getPage(s, url) for url in urls)
    return await asyncio.gather(*tasks)

results = asyncio.run(collect(urls))

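I also tried the non-async HTMLSession and rewrote the code to process one URL at a time, but in that case I got exactly the same error message, "signal only works in main thread".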

I tried running it in both Cloud Functions and Cloud Run, with the same result.

In addition, following advice from a forum, I tried setting the event loop manually like this, but it had no effect.

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

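If anyone knows how to accomplish this with another library or approach, please let me know. It does not even have to be asynchronous; the only requirement is that the pages' JavaScript gets rendered.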

python web-scraping google-cloud-functions python-asyncio google-cloud-run
2 Answers
0 votes

Have you tried keep_page=True as an argument to r.html.arender?

The reason I ask: the error seems to occur when closing the browser used to render the JS. Maybe keep_page=True avoids that.
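
Another thing that might be worth trying, since the traceback in the question ends inside pyppeteer's launch: pyppeteer.launch accepts handleSIGINT, handleSIGTERM and handleSIGHUP options, and setting all three to False should stop it from registering signal handlers. The sketch below launches the browser that way and hands it to the session. It rests on two assumptions that are not confirmed anywhere in this thread: that requests_html only launches its own browser when the private attribute _browser is missing, and that Chromium can start in the target environment at all. The names LAUNCH_OPTIONS and render_pages are made up for this example.

```python
import asyncio

# Options for pyppeteer.launch. The three handle* flags ask pyppeteer not to
# call signal.signal(), which fails outside the main thread.
LAUNCH_OPTIONS = {
    "headless": True,
    "handleSIGINT": False,
    "handleSIGTERM": False,
    "handleSIGHUP": False,
    "args": ["--no-sandbox"],  # assumption: needed in a container
}


async def render_pages(urls, tags):
    # Imported lazily so this module loads even where the libraries are absent.
    import pyppeteer
    from bs4 import BeautifulSoup
    from requests_html import AsyncHTMLSession

    session = AsyncHTMLSession()
    # Assumption: presetting the private _browser attribute makes
    # requests_html reuse this browser instead of launching its own.
    session._browser = await pyppeteer.launch(**LAUNCH_OPTIONS)

    async def get_page(url):
        response = await session.get(url)
        await response.html.arender(timeout=60, sleep=3, scrolldown=2)
        soup = BeautifulSoup(response.html.html, "html.parser")
        found = [url]  # the URL first, then the text of each matching tag
        for tag in tags:
            found.extend(element.text for element in soup.find_all(tag))
        return found

    try:
        return await asyncio.gather(*(get_page(url) for url in urls))
    finally:
        await session.close()  # also closes the browser set above
```

It would be called the same way as in the question, for example results = asyncio.run(render_pages(urls, tags)). Because it relies on a private attribute, it may break with other versions of requests_html.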


0 votes

@manofthebear Did you solve the problem? I have the same issue.
