I'm trying to get this spider to walk a list of 1,600 URLs contained in a CSV and extract emails and phone numbers from each page. If someone already has a program that does this I'd happily use it, but I'd also really like to know where I went wrong. Here's my code, which I ran through ChatGPT to tighten up and comment.
import scrapy
import pandas as pd
import os
import re
import logging


class Spider(scrapy.Spider):
    name = 'business_scrape'

    def extract_emails(self, text):
        # Extract email addresses using a comprehensive regex pattern
        emails = re.findall(
            r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
        return emails

    def extract_phone(self, text):
        # Extract phone numbers
        phone_numbers = re.findall(
            r'(?:(?:\+\d{1,2}\s?)?\(?\d{3}\)?[-.\s]?)?\d{3,4}[-.\s]?\d{4}', text)
        return phone_numbers

    def start_requests(self):
        # Read the initial CSV file with columns [name, url, category]
        csv = 'bozeman_businesses.csv'  # Specify your CSV file
        init_df = pd.read_csv(csv)

        for _, row in init_df.iterrows():
            name = row['name']
            url = row['url']
            category = row['category']
            yield scrapy.Request(url=url, callback=self.parse_link,
                                 meta={'name': name, 'category': category})

    def parse_link(self, response):
        name = response.meta['name']
        category = response.meta['category']

        # Initialize logging
        logging.basicConfig(
            filename='scrapy.log', format='%(levelname)s: %(message)s',
            level=logging.INFO)

        # Log the start of crawling
        logging.info('Crawling started.')

        for word in self.reject:
            if word in str(response.url):
                return

        html_text = str(response.text)
        try:
            # Extract email addresses using the function
            mail_list = self.extract_emails(html_text)
            # Extract phone numbers using the function
            phone_numbers = self.extract_phone(html_text)

            # Ensure 'email' and 'phone' lists have the same length
            min_length = min(len(mail_list), len(phone_numbers))
            mail_list = mail_list[:min_length]
            phone_numbers = phone_numbers[:min_length]

            dic = {'name': [name], 'category': [category], 'email': mail_list,
                   'phone': phone_numbers, 'url': [str(response.url)]}
        except Exception as e:
            # Handle the failure by setting "NA" values
            self.logger.error(f'Error scraping {response.url}: {e}')
            dic = {'name': [name], 'category': [category], 'email': ['NA'],
                   'phone': ['NA'], 'url': [str(response.url)]}

        # Check if the output file exists and prompt the user if it does
        if os.path.exists(self.path):
            response = self.ask_user('File already exists, replace?')
            if response is False:
                return

        # Create or overwrite the output file
        self.create_or_overwrite_file(self.path)

        # Append the data to the output CSV file
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False, index=False)

    # Define the reject list and output file path
    reject = ['example.com', 'example2.com']  # Adjust as needed
    path = 'output.csv'  # Adjust the output file path as needed

    def ask_user(self, question):
        response = input(question + ' y/n' + '\n')
        return response.lower() == 'y'

    def create_or_overwrite_file(self, path):
        response = False
        if os.path.exists(path):
            response = self.ask_user('File already exists, replace?')
            if response is False:
                return
        with open(path, 'wb') as file:
            file.close()
My log is long, so here are some excerpts:
2023-09-21 15:51:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hrblock.com/robots.txt> (referer: None)
2023-09-21 15:51:02 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.gallatinvalleytaxservices.com>: HTTP status code is not handled or not allowed
2023-09-21 15:51:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (308) to <GET https://www.amaticscpa.com/> from <GET http://www.amaticscpa.com>
2023-09-21 15:51:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.hrblock.com/> from <GET http://www.hrblock.com>
2023-09-21 15:51:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://amaticscpa.com/> from <GET https://www.amaticscpa.com/>
2023-09-21 15:51:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hrblock.com/> (referer: None)
2023-09-21 15:51:03 [root] INFO: Crawling started.
Looks fine so far. Then:
File "/Users/me/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 502, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File "/Users/me/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 120, in arrays_to_mgr
index = _extract_index(arrays)
File "/Users/me/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 674, in _extract_index
raise ValueError("All arrays must be of the same length"
I think this error about array lengths is the problem. I tried adding the "NA" values for when the process fails. That doesn't seem to have worked. :(
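The traceback is from pandas, not Scrapy: `pd.DataFrame` requires every column to have the same length, and in `parse_link` the `name`, `category`, and `url` columns are one-element lists while `email` and `phone` can have any length (including zero). Note also that the `pd.DataFrame(dic)` call sits outside the `try` block, so the "NA" fallback never fires for this error. A minimal reproduction of the failure:

```python
import pandas as pd

# 'name' has one element while 'email' has two, so DataFrame
# construction fails before to_csv is ever reached.
try:
    pd.DataFrame({'name': ['Acme'], 'email': ['a@x.com', 'b@x.com']})
except ValueError as e:
    print(e)  # All arrays must be of the same length
```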
This error also showed up:
2023-09-21 15:52:03 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.hpwcpas.com/robots.txt> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-09-21 15:52:03 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.hpwcpas.com/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
The rest of the log is basically repeats of the above.
What I tried and what I expected:
Well, I made a few adjustments and had some success:
Solution 1: move the `self.create_or_overwrite_file` call into `start_requests` instead of `parse_link`; write the DataFrame header if `os.path.getsize(path) == 0`; and remove the `ask_user` prompt from `parse_link`.
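The header handling from Solution 1 can be sketched roughly like this (the `init_output` helper name and the exact column order are my assumptions, not from the original code):

```python
import os

def init_output(path):
    # Write the CSV header only when the file is new or empty, so
    # parse_link can always append rows with header=False afterwards.
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        with open(path, 'w', newline='') as f:
            f.write('name,category,email,phone,url\n')
```

Doing this once in `start_requests` also avoids the `input()` prompt firing in the middle of an asynchronous crawl, which blocks the Twisted reactor.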
Solution 2: add this to the `start_requests` function. The CSV contains the string "Not found" for some URLs.
# Check if the URL is "Not found" and skip it
if url.strip() == 'Not found':
    self.logger.warning(f'Skipping invalid URL for {name}')
    continue
Solution 3: I made sure to convert the lists to comma-separated strings.
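Solution 3 can be sketched like this: collapsing each variable-length list into a single comma-separated string makes every column a one-element list, which satisfies `pd.DataFrame` (the `to_row` helper name is mine):

```python
def to_row(name, category, emails, phones, url):
    # Collapse the variable-length lists into single strings so
    # every column holds exactly one value for this row.
    return {
        'name': [name],
        'category': [category],
        'email': [', '.join(emails) or 'NA'],
        'phone': [', '.join(phones) or 'NA'],
        'url': [url],
    }
```

With this shape, `pd.DataFrame(to_row(...))` always builds a one-row frame, even when no emails or phone numbers were found, so the truncation via `min_length` is no longer needed.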
The general scraping approach still has problems. I want emails from these sites, but I need a more robust system for walking pages and metadata to find them.
If you copy my code, make sure you exclude image files, which usually take the form [email protected]/jpg, from the email search.
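One way to do that filtering (the extension list is my guess at the common cases): image filenames like `logo@2x.png` match the plain email regex, so drop any candidate ending in an image extension.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
IMAGE_EXT_RE = re.compile(r'\.(?:png|jpe?g|gif|svg|webp)$', re.IGNORECASE)

def extract_emails(text):
    # Keep only matches that do not end in an image file extension,
    # e.g. "logo@2x.png" is rejected while "info@site.com" is kept.
    return [m for m in EMAIL_RE.findall(text)
            if not IMAGE_EXT_RE.search(m)]
```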