Web scraping emails from a website using Python

Question

In my Python code I use a regular expression to find email addresses:

soup = BeautifulSoup(driver.page_source, "html.parser")
text_email = soup.get_text()
emails1 = re.findall(r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})', str(text_email))

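In about 90% of cases this code returns the correct email addresses.

Below is an example where it returns an email in the wrong format.

On the web page https://s7health.pl/kontakt/

there is a phone number, an email address and some short text: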

71 342 88 41
[email protected]
Infolinia medyczna

The source code for the text above is:

<a class="text-decoration-underline" href="tel:+48713428841">71 342 88 41</a><br /><a class="text-decoration-underline" href="mailto:[email protected]">[email protected]</a></div><style>.porto-u-3166.porto-u-heading{text-align:left}</style></div><div class="porto-u-heading  wpb_custom_95aa9a11c17ad45cfabaf210d84ee7cc porto-u-4257"><div class="porto-u-main-heading"><h3   style="font-weight:700;color:#0c6d70;font-size:1em;line-height:24px;">Infolinia medyczna</h3></div>

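My code returns the email as: [email protected]

But it should return the email as: [email protected]

Apart from searching for emails via the mailto string (which may not be present), why are extra characters being added to the email? How can I fix this?

Regards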

Answer 1

The problem is in the code, not in the regular expression:

from bs4 import BeautifulSoup
import requests
import re
response = requests.get('http://0.0.0.0:8000/file.html')
soup = BeautifulSoup(response.content, 'html.parser')
text_email = soup.get_text()
emails1 = re.findall(r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})', str(text_email))
print(emails1[0])

Output:

[email protected]
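
One plausible cause of the extra characters, which the question does not confirm, is that soup.get_text() joins neighbouring text nodes with no separator. The phone number, the address and the next heading then sit directly next to each other, and the greedy regex picks up digits before the address and letters after the top-level domain. The sketch below reproduces this with hypothetical markup modelled on the snippet in the question. The address kontakt@example.pl is a placeholder, not the real address from the page. Passing separator=" " to get_text() keeps the nodes apart.

```python
import re
from bs4 import BeautifulSoup

# Same pattern as in the question, compiled once.
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')

# Hypothetical markup modelled on the question's snippet.
# kontakt@example.pl is a placeholder address, not the real one.
html = (
    '<div><a href="tel:+48713428841">71 342 88 41</a><br />'
    '<a href="mailto:kontakt@example.pl">kontakt@example.pl</a></div>'
    '<div><h3>Infolinia medyczna</h3></div>'
)

soup = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated with nothing between them.
glued = soup.get_text()

# With a separator, each text node is stripped and joined by a space.
spaced = soup.get_text(separator=" ", strip=True)

emails_glued = EMAIL_RE.findall(glued)
emails_spaced = EMAIL_RE.findall(spaced)

print(emails_glued)   # ['41kontakt@example.plInfo']
print(emails_spaced)  # ['kontakt@example.pl']
```

The separator only helps when the address is its own text node, as it is here. If a page writes the address inside a longer run of text without spaces, the pattern itself would need tightening.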
