I am parsing links from an HTML page and want to detect every link that matches the following pattern:
http://www.example.com/category1/some-content-here/
http://www.example.com/category-12/some-content-here/
It should not match links like these:
http://www.example.com/category1/
http://www.example.org/category-12/some-content-here/
Thanks!
You can use BeautifulSoup to parse the HTML a tags, then filter the full list of results with a regular expression:
from bs4 import BeautifulSoup as soup
import re
sample = """
<div id='test'>
<a href='http://www.example.com/category1/some-content-here/'>Something</a>
<a href='http://www.example.com/category-12/some-content-here/'>Something Here</a>
<a href='http://www.example.com/category1/'>Something1</a>
<a href='http://www.example.org/category-12/some-content-here/'>Somethingelse</a>
</div>
"""
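# Raw string so \w reaches the regex engine unchanged; fullmatch requires the
# whole href to be http://<host>.com/<segment>/<segment>/ with nothing extra.
pattern = re.compile(r'http://[\w.]+\.com/[\w-]+/[\w-]+/')
# href=True skips <a> tags without an href attribute, which would raise KeyError.
a = [i['href'] for i in soup(sample, 'lxml').find_all('a', href=True) if pattern.fullmatch(i['href'])]
print(a)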
Output:
['http://www.example.com/category1/some-content-here/', 'http://www.example.com/category-12/some-content-here/']
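If you would rather not depend on a regular expression, the same filter can be sketched with only the standard library by splitting each URL into its parts and checking the host and the number of path segments. This is a different technique from the regex approach, not a drop-in equivalent: it assumes the host must be exactly www.example.com, and it does not restrict which characters a segment may contain. The names LinkCollector and is_wanted are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_wanted(url, host="www.example.com"):
    """True only for http://<host>/<segment>/<segment>/ (host is an assumption)."""
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.netloc != host:
        return False
    if parts.query or parts.fragment:
        return False
    # "/category1/some-content-here/" splits into ["", "category1", "some-content-here", ""]
    pieces = parts.path.split("/")
    if len(pieces) != 4:
        return False
    first, category, content, last = pieces
    return first == "" and last == "" and category != "" and content != ""


sample = """
<div id='test'>
<a href='http://www.example.com/category1/some-content-here/'>Something</a>
<a href='http://www.example.com/category-12/some-content-here/'>Something Here</a>
<a href='http://www.example.com/category1/'>Something1</a>
<a href='http://www.example.org/category-12/some-content-here/'>Somethingelse</a>
</div>
"""

collector = LinkCollector()
collector.feed(sample)
matches = [href for href in collector.hrefs if is_wanted(href)]
print(matches)
```

Checking the parsed parts makes the rule explicit (one host, exactly two path segments, trailing slash), which may be easier to adjust later than a single pattern.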