Python - Using a wildcard to find the right link in a post

Question

I am trying to extract a link from a forum with the Python code below. The post contains many HTML links, and I am trying to find one particular link:

<a href="https://site.html" target="_blank" class="externalLink" rel="nofollow">Daily news <img src="https://site.html/pic.png" class="bbCodeImage LbImage" alt="[IMG]" data-url="https://site.html/pic.png"></a>

This is my code:

from bs4 import BeautifulSoup
import defs
import re

def find_link(soup, date, section, URL):
    # Find the right post
    section = soup.find('li', {"data-author": "Ghostwriter"})
    # Search for the link text inside the post
    link = section.find(string=" Daily news ")
    # Get the enclosing <a> element(s) as a string
    section_new = str(link.find_parents('a'))
    # Extract the URL from that string
    link_new = re.search(r"(?P<url>https?://[^\s]+)", section_new).group("url")

The problem is that sometimes there is no space before or after "Daily news", and then my code fails:

AttributeError: 'NoneType' object has no attribute 'find_parents'

How can I make the code more flexible, for example by using some kind of wildcard? Something like:

link = section.find(string="*Daily news*")

Thanks a lot!

Answer 1

Try using tags.get, which returns a string; you should then be able to use str.startswith to do exactly what you want.
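The answer does not say which call it means, so here is one possible reading as a rough sketch: loop over the links in the post, compare the stripped link text with str.startswith, and read the URL with Tag.get("href"). The function name find_link_by_prefix and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the link in the question (placeholder URLs).
SAMPLE_POST = """
<li data-author="Ghostwriter">
  <a href="https://example.com/other" class="externalLink">Other link</a>
  <a href="https://site.html" class="externalLink" rel="nofollow">Daily news <img src="https://site.html/pic.png" alt="[IMG]"></a>
</li>
"""

def find_link_by_prefix(soup, author="Ghostwriter", prefix="Daily news"):
    """Hypothetical helper: href of the first link whose text starts with prefix, else None."""
    post = soup.find("li", {"data-author": author})
    if post is None:
        return None
    for anchor in post.find_all("a"):
        # get_text() returns a plain str; strip() removes optional
        # leading/trailing spaces before the prefix comparison.
        if anchor.get_text().strip().startswith(prefix):
            return anchor.get("href")
    return None

prefix_soup = BeautifulSoup(SAMPLE_POST, "html.parser")
prefix_url = find_link_by_prefix(prefix_soup)
print(prefix_url)
```

Because the href is read straight from the tag, the extra regex over the stringified element is not needed in this variant.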

Answer 2

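I believe you can pass a pattern built with re.compile as the string argument. This should let you create a regular expression that matches the string you are looking for. For more on Python regex, see: https://docs.python.org/3/library/re.html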
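A minimal sketch of that idea, assuming the post structure from the question. The function name find_daily_news_link and the sample HTML are made up for illustration. Beautiful Soup matches a compiled pattern with search(), so the text only has to contain "Daily news", with or without surrounding spaces. The None checks avoid the AttributeError from the question.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the link in the question (placeholder URLs).
SAMPLE_HTML = """
<li data-author="Ghostwriter">
  <a href="https://example.com/other" class="externalLink">Other link</a>
  <a href="https://site.html" class="externalLink" rel="nofollow">Daily news<img src="https://site.html/pic.png" alt="[IMG]"></a>
</li>
"""

def find_daily_news_link(soup, author="Ghostwriter", label="Daily news"):
    """Hypothetical helper: href of the first link whose text contains label, else None."""
    post = soup.find("li", {"data-author": author})
    if post is None:
        return None
    # re.escape keeps the label literal; the compiled pattern acts as the
    # "wildcard" because it may match anywhere inside the text node.
    text_node = post.find(string=re.compile(re.escape(label)))
    if text_node is None:
        return None
    anchor = text_node.find_parent("a")
    return anchor.get("href") if anchor is not None else None

regex_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
regex_url = find_daily_news_link(regex_soup)
print(regex_url)
```

If the label can also vary in case, passing re.IGNORECASE as the second argument to re.compile should cover that.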
