I'm working on a script that extracts email addresses from a given website. One snag I've run into is that the emails I'm after usually live on a "Contact Us" or "Our People" page. What I have so far looks for emails on the main page (i.e. www.examplecompany.com) and, if it finds none, searches the pages that page links to. See below:
import requests, bs4, re, sys, logging

logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - %(levelname)s - %(message)s')

print('Fetching Website...')
target_URL = 'www.exampleURL.com'  # URL goes here
res = requests.get(target_URL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

my_list = []
for link in soup.find_all('a'):
    my_list.append(link.get('href'))

emailregex = re.compile(r'''(
    [a-zA-Z0-9._%+-:]+
    @
    [a-zA-Z0-9.-]+
    \.[a-zA-Z]{2,4}
    )''', re.VERBOSE)

# Converts each item in list to string
myemail_list = list(map(str, my_list))
# Filters out items in list that do not fit regex criteria
newlist = list(filter(emailregex.search, myemail_list))

if len(newlist) < 1:
    new_site = []
    for i in range(len(my_list)):
        new_site.append(f'{target_URL}{my_list[i]}')
    try:
        for site in range(len(new_site)):
            newthing = requests.get(new_site[site])
            newthing.raise_for_status()
            freshsoup = bs4.BeautifulSoup(newthing.text, 'lxml')
    except requests.exceptions.HTTPError as e:
        pass
    final_list = []
    for link in freshsoup.find_all('a'):
        final_list.append(link.get('href'))
    print(final_list)
else:
    print(newlist)
I think the biggest problem I need to solve is that my method of assembling and searching the relevant URLs is wrong. It works on some sites but not on others, and it's error-prone. Can anyone suggest a better approach?
By the way, if it looks like I don't know what I'm doing, you're right. I've just started learning Python, and this is a personal project to help me get a better grip on the basics, so any help is appreciated.
Thanks for your help.
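On the URL-assembly point specifically: concatenating `target_URL` with each `href` breaks whenever the link is absolute (`https://...`) or root-relative (`/contact`). The standard library's `urllib.parse.urljoin` resolves an `href` against a base URL the way a browser would. A minimal sketch (the base URL and hrefs below are made-up examples):

```python
from urllib.parse import urljoin

# Hypothetical base page; urljoin resolves each href the way a browser would.
base = "https://www.examplecompany.com/about"

absolute = urljoin(base, "https://other.example/contact")  # absolute hrefs pass through
rooted = urljoin(base, "/contact-us")                      # root-relative: resolved against the host
relative = urljoin(base, "team.html")                      # page-relative: resolved against the path

print(absolute)
print(rooted)
print(relative)
```

Using `urljoin(target_URL, href)` instead of an f-string concatenation would remove a whole class of malformed-URL errors.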
Try:
import requests
import re
from bs4 import BeautifulSoup

all_links = []
mails = []
# your url here
url = 'https://kore.ai/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = [a.attrs.get('href') for a in soup.select('a[href]')]

# Keep only links that look like contact/career/about/services pages
for i in links:
    if (('contact' in i or 'Contact' in i)
            or ('Career' in i or 'career' in i)
            or ('about' in i or 'About' in i)
            or ('Services' in i or 'services' in i)):
        all_links.append(i)
all_links = set(all_links)

def find_mails(soup):
    for name in soup.find_all('a'):
        if name is not None:
            email_text = name.text
            match = bool(re.match(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$', email_text))
            if '@' in email_text and match:
                email_text = email_text.replace(' ', '').replace('\r', '')
                email_text = email_text.replace('\n', '').replace('\t', '')
                if len(mails) == 0 or email_text not in mails:
                    print(email_text)
                    mails.append(email_text)

for link in all_links:
    if link.startswith('http') or link.startswith('www'):
        r = requests.get(link)
    else:
        # relative link: resolve against the start URL
        r = requests.get(url + link)
    soup = BeautifulSoup(r.text, 'html.parser')
    find_mails(soup)

mails = set(mails)
if len(mails) == 0:
    print("NO MAILS FOUND")
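One thing worth noting with either version: addresses often sit in a `mailto:` href rather than in a link's visible text, so scanning only `name.text` can miss them. A small sketch of scanning the raw markup instead (the HTML snippet and regex here are illustrative assumptions, not taken from either script above):

```python
import re

# Hypothetical page fragment: one address in a mailto: href, one in plain text.
html = '<a href="mailto:info@example.com">Email us</a> or write to sales@example.com'

# Scanning the raw markup catches mailto: targets as well as visible text.
email_re = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
found = sorted(set(email_re.findall(html)))
print(found)
```

Running `findall` over `response.text` before walking the `<a>` tags is a cheap first pass that often finds the address without visiting any secondary pages.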