How can I get all links from HTML when they are not in anchor tags?


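I want to get all the links from the page requested in the code below, especially the https://api.smartrecruiters.com/v1/companies/cermaticom/postings/ links. Every regular expression I found online only captures simple links such as https://api.smartrecruiters.com.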
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.cermati.com/karir/'

# Download the page and parse the HTML
x = requests.get(url)
html_doc = x.text
soup = BeautifulSoup(html_doc, "html.parser")
print(soup)

python regex beautifulsoup html-parsing
1 Answer

You can call re.findall directly on the response text (html_doc) to extract the URLs:

# Match lazily up to, but not including, the closing double quote
p = r'https://api\.smartrecruiters\.com/.*?(?=")'

urls = re.findall(p, html_doc)

Output:

['https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999896690090',
 'https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999896676342',
 'https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999896672229',
 'https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999894177703',
 'https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999874413809',
 'https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999898147783',
 'https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999898110826',
 'https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999897538273',
 'https://api.smartrecruiters.com/v1/companies/cermaticom/postings/743999897207847',
...
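
To try the same idea without hitting the network, here is a minimal sketch that runs the regex over an inline sample. The SAMPLE_HTML string, the posting IDs 111 and 222, and the extract_posting_urls helper are made up for illustration; the real page source may look different. It also swaps the lazy match plus lookahead for a negated character class, which stops at quotes, whitespace, or angle brackets and so should behave the same on double-quoted URLs while also coping with single-quoted ones.

```python
import re

# Hypothetical stand-in for the page source; the real markup may differ.
SAMPLE_HTML = '''
<script>
  var data = {"jobs": [
    {"ref": "https://api.smartrecruiters.com/v1/companies/cermaticom/postings/111"},
    {"ref": "https://api.smartrecruiters.com/v1/companies/cermaticom/postings/222"},
    {"ref": "https://api.smartrecruiters.com/v1/companies/cermaticom/postings/111"}
  ], "home": "https://www.cermati.com/karir/"};
</script>
'''

# Consume everything after the host until a quote, whitespace, or angle bracket.
POSTING_URL = re.compile(r'https://api\.smartrecruiters\.com/[^"\'\s<>]+')


def extract_posting_urls(html):
    """Return matching URLs in order of first appearance, without duplicates."""
    seen = set()
    urls = []
    for url in POSTING_URL.findall(html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    for url in extract_posting_urls(SAMPLE_HTML):
        print(url)
```

With the real page, you would pass html_doc from the question in place of SAMPLE_HTML. Dropping duplicates is optional; it is included here only because the same URL can appear more than once in a page's source.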