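I have a fairly basic scraping application that I want to run in Google Cloud. I'm using the asynchronous API of the requests_html library. It works fine in my local environment, but I can't figure out how to get it running in Google Cloud, and I've been fiddling with it for days. The purpose of the application is simply to render some JavaScript pages (listed in the urls array) using html.arender, and then use BeautifulSoup to extract the contents of some specific tags (from the tags array).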
The error message I keep getting is:
"signal only works in main thread of the main interpreter"
Traceback (most recent call last):
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/functions_framework/__init__.py", line 99, in view_func
return function(request._get_current_object())
File "/workspace/main.py", line 53, in main
results = asyncio.run(collect(urls,tags))
File "/opt/python3.9/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/python3.9/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
return future.result()
File "/workspace/main.py", line 32, in collect
return await asyncio.gather(*tasks)
File "/workspace/main.py", line 18, in getPage
await r.html.arender(timeout=40,sleep=1)
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests_html.py", line 615, in arender
self.browser = await self.session.browser
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/requests_html.py", line 714, in browser
self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pyppeteer/launcher.py", line 307, in launch
return await Launcher(options, **kwargs).launch()
File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pyppeteer/launcher.py", line 159, in launch
signal.signal(signal.SIGINT, _close_process)
File "/opt/python3.9/lib/python3.9/signal.py", line 56, in signal
handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread of the main interpreter
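For anyone reading the traceback above: the last frames show pyppeteer calling signal.signal while launching the browser, and Python only allows that call in the main thread. The following is a minimal standard-library sketch of that restriction, not the framework's actual code. The helper names try_register_sigint and run_in_worker are made up for illustration, and the sketch assumes the request handler in the traceback is running in a worker thread.

```python
import signal
import threading


def try_register_sigint():
    """Try to install a SIGINT handler; return the error text, or None on success."""
    try:
        previous = signal.signal(signal.SIGINT, signal.default_int_handler)
        if previous is not None:
            # Put back whatever handler was installed before this call.
            signal.signal(signal.SIGINT, previous)
        return None
    except ValueError as exc:
        return str(exc)


def run_in_worker(func):
    """Run func in a non-main thread and return its result."""
    box = {}
    worker = threading.Thread(target=lambda: box.update(result=func()))
    worker.start()
    worker.join()
    return box["result"]


if __name__ == "__main__":
    # In a worker thread the registration is refused with a ValueError.
    print("worker thread:", run_in_worker(try_register_sigint))
```

If this is what is happening, the fix has to stop the signal registration itself rather than change how the coroutines are scheduled.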
Here is my code:
from requests_html import AsyncHTMLSession
import asyncio
from bs4 import BeautifulSoup as bs

urls = ['list of urls']
tags = ['p','h1']

async def getPage(s, url):
    r = await s.get(url)
    # Render the page's JavaScript in headless Chromium (via pyppeteer)
    await r.html.arender(timeout=60,sleep=3, scrolldown=2)
    p = bs(r.html.html, "html.parser")
    # First entry is the URL, followed by the text of every matching tag
    elmList = []
    elmList.append(url)
    for t in tags:
        elements = p.findAll(t)
        for e in elements:
            elmList.append(e.text)
    return elmList

async def collect(urls):
    s = AsyncHTMLSession()
    # One getPage coroutine per URL, all run concurrently
    tasks = (getPage(s,url) for url in urls)
    return await asyncio.gather(*tasks)

results = asyncio.run(collect(urls))
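I also tried the non-async HTMLSession and rewrote the code to process one URL at a time, but I got exactly the same error message: "signal only works in main thread".
I also tried running it in both Cloud Functions and Cloud Run, with the same result.
After asking for advice on a forum, I also tried setting the loop manually like this, but it had no effect: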
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
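A possible reason the manual loop makes no difference: an event loop and the thread it runs in are independent things. This small standard-library sketch (the names answer, worker_report and run_in_worker are invented for the example) shows that a fresh loop in a worker thread runs coroutines fine, yet the code is still not on the main thread, which is the only thing the signal check cares about.

```python
import asyncio
import threading


async def answer():
    """Trivial coroutine used to prove the new loop works."""
    await asyncio.sleep(0)
    return 42


def worker_report():
    """Create and use a fresh event loop, then report which thread we are on."""
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        value = loop.run_until_complete(answer())
    finally:
        loop.close()
        asyncio.set_event_loop(None)
    return {
        "coroutine_result": value,
        "is_main_thread": threading.current_thread() is threading.main_thread(),
    }


def run_in_worker(func):
    """Run func in a non-main thread and return its result."""
    box = {}
    worker = threading.Thread(target=lambda: box.update(result=func()))
    worker.start()
    worker.join()
    return box["result"]


if __name__ == "__main__":
    print(run_in_worker(worker_report))
```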
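If anyone knows how to accomplish this with another library or method, please let me know. It doesn't even have to be asynchronous; the only requirement is that the pages' JavaScript gets rendered.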
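One commonly suggested workaround, sketched here under assumptions rather than as a confirmed fix: launch the browser yourself through pyppeteer with its handleSIGINT, handleSIGTERM and handleSIGHUP options set to False, so no signal handlers are registered, and hand that browser to the session. This relies on requests_html reusing a browser already stored in the private _browser attribute, which is an implementation detail and may change between versions. The names LAUNCH_OPTIONS, new_session and render are invented for the example, and the --no-sandbox argument is an assumption about the container environment.

```python
import asyncio

# pyppeteer launch options that skip signal-handler registration.
LAUNCH_OPTIONS = {
    "headless": True,
    "handleSIGINT": False,
    "handleSIGTERM": False,
    "handleSIGHUP": False,
    "args": ["--no-sandbox"],  # assumption: often required inside containers
}


async def new_session():
    """Create an AsyncHTMLSession backed by a browser launched without signal handlers."""
    # Imported here so this module can be loaded where the packages are absent.
    import pyppeteer
    from requests_html import AsyncHTMLSession

    session = AsyncHTMLSession()
    # Assumption: requests_html launches its own browser only when the
    # private _browser attribute has not been set yet.
    session._browser = await pyppeteer.launch(**LAUNCH_OPTIONS)
    return session


async def render(url):
    """Fetch url, render its JavaScript, and return the resulting HTML."""
    session = await new_session()
    try:
        r = await session.get(url)
        await r.html.arender(timeout=60, sleep=3)
        return r.html.html
    finally:
        await session.close()
```

Inside the request handler this would be called as asyncio.run(render(url)), after which the returned HTML can go through the same BeautifulSoup loop as before. Whether Chromium can actually be downloaded and started in a given Cloud Functions or Cloud Run image is a separate question that this sketch does not address.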
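Have you tried passing
keep_page=True
as an argument to r.html.arender?
The reason I ask: the error seems to occur when closing the browser that is used to render the JS. Maybe
keep_page=True
would avoid that.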
@manofthebear Did you solve the problem? I'm having the same issue.