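I'm trying to get the href links from the first table of a page loaded in a headless browser, but the error doesn't help me: it doesn't say what is wrong, it just shows a lot of ^ symbols underneath.
I had to switch to a headless browser because, with the way the site's HTML works, I was scraping empty tables, and I'll admit I don't understand how it works.
I also want to complete the links so they can be used further, which is what the last three lines of the loop in the following code are for: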
from playwright.sync_api import sync_playwright

# headless browser to scrape
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://fbref.com/en/comps/9/Premier-League-Stats")

#open the file up
with open("path", 'r') as f:
    file = f.read()

years = list(range(2024,2022, -1))
all_matches = []
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

for year in years:
    standings_table = page.locator("table.stats_table").first
    link_locators = standings_table.get_by_role("link").all()
    for l in link_locators:
        l.get_attribute("href")
    print(link_locators)
    link_locators = [l for l in links if "/squads/" in l]
    team_urls = [f"https://fbref.com{l}" for l in link_locators]
    print(team_urls)

browser.close()
The stack trace I get is just:

Traceback (most recent call last):
  File "path", line 27, in <module>
    link_locators = standings_table.get_by_role("link").all()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path\.venv\Lib\site-packages\playwright\sync_api\_generated.py", line 15936, in all
    return mapping.from_impl_list(self._sync(self._impl_obj.all()))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path\.venv\Lib\site-packages\playwright\_impl\_sync_base.py", line 102, in _sync
    raise Error("Event loop is closed! Is Playwright already stopped?")
playwright._impl._errors.Error: Event loop is closed! Is Playwright already stopped?

Process finished with exit code 1
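My code is only 33 lines long, and the failing line is the start of the loop, so I'm not sure what the last two entries in the stack trace refer to.
I just can't extract the href links; it may have something to do with .first.
I tried the solution from the following question, but it doesn't work:
Getting href links using python playwright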
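When the context manager (with) block ends, Playwright is stopped and the page and browser are closed, so you can no longer use them. Everything that touches page or browser has to run inside the block. A minimal reproduction of the error is: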
from playwright.sync_api import sync_playwright # 1.40.0

with sync_playwright() as p:
    browser = p.chromium.launch()

browser.close()
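The same thing happens with any context manager, not only Playwright's. As a rough stand-alone illustration (io.StringIO is used here as a stand-in for the browser, which is my own substitution and not part of the original code), a resource works inside the with block and raises once the block has exited:

```python
import io


def read_after_exit():
    """Use a resource inside and then after its `with` block.

    Returns the text read inside the block and the error message
    raised when reading again after the block has exited.
    """
    with io.StringIO("hello") as buf:
        inside = buf.read()  # fine: the buffer is still open here
    try:
        buf.read()  # the with block has exited, so the buffer is closed
    except ValueError as exc:
        return inside, str(exc)
    return inside, None


inside, error = read_after_exit()
print(inside)
print(error)
```

Playwright behaves analogously: leaving the with block stops it, so later calls on page or browser fail with the "Event loop is closed" error.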
Here's a suggested rewrite:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    url = "<Your URL>"
    page.goto(url, wait_until="domcontentloaded")
    team_urls = []

    # Note: year is not used yet, so each pass reads the same page again
    for year in range(2024, 2022, -1):
        standings_table = page.locator("table.stats_table").first

        for l in standings_table.get_by_role("link").all():
            href = l.get_attribute("href")

            # get_attribute returns None when the attribute is missing
            if href and "/squads/" in href:
                team_urls.append(f'https://www.fbref.com{href}')

    print(team_urls)
    browser.close()  # still inside the with block
You can also do this without Playwright, since the data is available in the static HTML:
import requests # 2.25.1
from bs4 import BeautifulSoup # 4.10.0

url = "https://fbref.com/en/comps/9/Premier-League-Stats"
soup = BeautifulSoup(requests.get(url).text, "lxml")
team_urls = []

# Note: year is not used yet, so each pass reads the same table again
for year in range(2024, 2022, -1):
    standings_table = soup.select_one("table.stats_table")

    # a[href] skips anchors that have no href attribute
    for l in standings_table.select("a[href]"):
        href = l["href"]

        if "/squads/" in href:
            team_urls.append(f'https://www.fbref.com{href}')

print(team_urls)
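Since the goal is to turn the relative hrefs into full links for later use, one option is urljoin from the standard library rather than string formatting. This is only a sketch: the build_team_urls helper and the sample hrefs are made up for illustration, and the helper also drops duplicates, which matters when the same table is read more than once.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"


def build_team_urls(hrefs, base_url=BASE_URL):
    """Keep only squad links, make them absolute, and drop duplicates in order."""
    seen = set()
    team_urls = []
    for href in hrefs:
        # Skip missing hrefs and anything that is not a squad page
        if not href or "/squads/" not in href:
            continue
        full_url = urljoin(base_url, href)
        if full_url not in seen:
            seen.add(full_url)
            team_urls.append(full_url)
    return team_urls


# Hypothetical hrefs, shaped like the ones in the standings table
sample_hrefs = [
    "/en/squads/aaa/Team-A-Stats",
    "/en/players/bbb/Some-Player",
    None,
    "/en/squads/aaa/Team-A-Stats",
]
print(build_team_urls(sample_hrefs))
```

With either approach above, you would collect the raw href values into a list first and then pass that list to the helper.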