Running a Scrapy spider, but the output is blank (Python)

Problem description

I'm trying to get this spider to work through a list of 1600 URLs contained in a CSV and scrape email addresses and phone numbers from each page. If someone already has a program like this I'd be happy to use it, but I'd also like to know where I went wrong. Here is my code, which I ran through ChatGPT to tighten up and comment.

import scrapy
import pandas as pd
import os
import re
import logging


class Spider(scrapy.Spider):
    name = 'business_scrape'

    def extract_emails(self, text):
        # Extract email addresses using a comprehensive regex pattern
        emails = re.findall(
            r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
        return emails

    def extract_phone(self, text):
        # Extract phone numbers
        phone_numbers = re.findall(
            r'(?:(?:\+\d{1,2}\s?)?\(?\d{3}\)?[-.\s]?)?\d{3,4}[-.\s]?\d{4}', text)
        return phone_numbers

    def start_requests(self):
        # Read the initial CSV file with columns [name, url, category]
        csv = 'bozeman_businesses.csv'  # Specify your CSV file
        init_df = pd.read_csv(csv)

        for _, row in init_df.iterrows():
            name = row['name']
            url = row['url']
            category = row['category']

            yield scrapy.Request(url=url, callback=self.parse_link, meta={'name': name, 'category': category})

    def parse_link(self, response):
        name = response.meta['name']
        category = response.meta['category']

        # Initialize logging
        logging.basicConfig(
            filename='scrapy.log', format='%(levelname)s: %(message)s', level=logging.INFO)

        # Log the start of crawling
        logging.info('Crawling started.')
        for word in self.reject:
            if word in str(response.url):
                return

        html_text = str(response.text)
        try:
            # Extract email addresses using the function
            mail_list = self.extract_emails(html_text)

            # Extract phone numbers using the function
            phone_numbers = self.extract_phone(html_text)

            # Ensure 'email' and 'phone' lists have the same length
            min_length = min(len(mail_list), len(phone_numbers))
            mail_list = mail_list[:min_length]
            phone_numbers = phone_numbers[:min_length]

            dic = {'name': [name], 'category': [category], 'email': mail_list,
                   'phone': phone_numbers, 'url': [str(response.url)]}

        except Exception as e:
            # Handle the failure by setting "NA" values
            self.logger.error(f'Error scraping {response.url}: {e}')
            dic = {'name': [name], 'category': [category], 'email': ['NA'],
                   'phone': ['NA'], 'url': [str(response.url)]}

        # Check if the output file exists and prompt the user if it does
        if os.path.exists(self.path):
            response = self.ask_user('File already exists, replace?')
            if response is False:
                return

        # Create or overwrite the output file
        self.create_or_overwrite_file(self.path)

        # Append the data to the output CSV file
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False, index=False)

    # Define the reject list and output file path
    reject = ['example.com', 'example2.com']  # Adjust as needed
    path = 'output.csv'  # Adjust the output file path as needed

    def ask_user(self, question):
        response = input(question + ' y/n' + '\n')
        return response.lower() == 'y'

    def create_or_overwrite_file(self, path):
        response = False
        if os.path.exists(path):
            response = self.ask_user('File already exists, replace?')
            if response is False:
                return

        with open(path, 'wb') as file:
            file.close()

My log is long, so here are some excerpts:

2023-09-21 15:51:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hrblock.com/robots.txt> (referer: None)
2023-09-21 15:51:02 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.gallatinvalleytaxservices.com>: HTTP status code is not handled or not allowed
2023-09-21 15:51:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (308) to <GET https://www.amaticscpa.com/> from <GET http://www.amaticscpa.com>
2023-09-21 15:51:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.hrblock.com/> from <GET http://www.hrblock.com>
2023-09-21 15:51:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://amaticscpa.com/> from <GET https://www.amaticscpa.com/>
2023-09-21 15:51:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hrblock.com/> (referer: None)
2023-09-21 15:51:03 [root] INFO: Crawling started.

So far this looks fine.

file "/Users/me/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 502, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/Users/me/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 120, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/Users/me/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 674, in _extract_index
    raise ValueError("All arrays must be of the same length"

I think this error about the array lengths is the problem. I tried to add NA values when the process fails, but that doesn't seem to work :(
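For what it's worth, I can reproduce the exact same error outside Scrapy with a toy dictionary (made-up values) whose lists have different lengths, so I'm fairly sure this is what happens inside parse_link:

import pandas as pd

# One-element lists mixed with two-element lists: pandas cannot line the rows up
dic = {'name': ['Acme Tax'], 'category': ['accounting'],
       'email': ['info@acme.example', 'books@acme.example'],
       'phone': ['406-555-0100', '406-555-0101'],
       'url': ['http://acme.example']}
pd.DataFrame(dic)  # ValueError: All arrays must be of the same length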

This error also appeared:

2023-09-21 15:52:03 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.hpwcpas.com/robots.txt> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-09-21 15:52:03 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.hpwcpas.com/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

The rest of the log is mostly repetition of the same things.

What I did and what I expected:

  • I tried adding a try/except block to skip problem sites and write NA values instead.
  • I expected that even when I couldn't scrape the information I wanted, I would still end up with a CSV containing the business name, URL, and NA values (my guess about why that doesn't happen is sketched after this list).
  • I'm also interested in how to debug this better on my own.
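My current suspicion (an assumption on my part, not something I have verified) is that when the download itself fails, like the ConnectionLost errors above or the ignored 403, parse_link is never called at all, so its try/except has no chance to write an NA row. If that's right, an errback on the request would be the place to handle those cases; handle_error below is just my own name for it:

def start_requests(self):
    ...
    yield scrapy.Request(url=url, callback=self.parse_link,
                         errback=self.handle_error,  # called on download failures
                         meta={'name': name, 'category': category})

def handle_error(self, failure):
    # Failures at the download level (DNS errors, ConnectionLost, filtered 403s, ...)
    # never reach parse_link, so the NA row would have to be written here instead.
    request = failure.request
    self.logger.error(f'Request failed for {request.url}: {failure.value}')
    # ...write the NA row using request.meta['name'] and request.meta['category']...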
Tags: python, web-scraping, logging, scrapy, pyspider
1 Answer

OK, I made a few adjustments and got it working. The issues were:

  1. The output file was being overwritten for every URL.
  2. Invalid URLs caused the spider to stop.
  3. The scraped information was put into lists and then added to a dictionary, and lists of different lengths caused the error.

Solution 1: Move the self.create_or_overwrite_file call into start_requests instead of parse_link; write the DataFrame header only if os.path.getsize(path) == 0; and remove the ask_user check from parse_link.
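Roughly what that looks like, as a sketch of just the two affected methods; the email/phone extraction, the reject list, and create_or_overwrite_file stay exactly as they were in the question:

def start_requests(self):
    # Create/overwrite the output file once, before any requests go out,
    # instead of re-creating it for every URL inside parse_link
    self.create_or_overwrite_file(self.path)

    init_df = pd.read_csv('bozeman_businesses.csv')
    for _, row in init_df.iterrows():
        yield scrapy.Request(url=row['url'], callback=self.parse_link,
                             meta={'name': row['name'], 'category': row['category']})

def parse_link(self, response):
    # ...build `dic` from the extracted emails/phones exactly as before...
    df = pd.DataFrame(dic)
    # Write the header only while the file is still empty, so it appears exactly once
    write_header = os.path.getsize(self.path) == 0
    df.to_csv(self.path, mode='a', header=write_header, index=False)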

Solution 2: Add this inside the start_requests loop; some of the URLs in the CSV are just the string 'Not found'.

        # Check if the URL is "Not found" and skip it
        if url.strip() == 'Not found':
            self.logger.warning(f'Skipping invalid URL for {name}')
            continue

Solution 3: I made sure to convert the lists into comma-separated strings before building the dictionary.
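Inside parse_link that ends up looking roughly like this, joining with ', ' and falling back to 'NA' when nothing was found:

mail_list = self.extract_emails(html_text)
phone_numbers = self.extract_phone(html_text)

# One string per column instead of lists of varying length
dic = {'name': [name],
       'category': [category],
       'email': [', '.join(mail_list) or 'NA'],
       'phone': [', '.join(phone_numbers) or 'NA'],
       'url': [str(response.url)]}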

The general scraping approach still has problems. I want the emails from these sites, but it needs a more robust system that walks through pages and metadata to find them.

If you copy what I did, make sure to exclude image files from the email search; they usually take the form [email protected]/jpg.
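One simple way to do that (my own filter, not part of the original code; extend the extension list as needed) is to post-filter the regex matches:

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp')

def extract_emails(self, text):
    candidates = re.findall(
        r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
    # Drop matches that are really image filenames, e.g. retina assets like logo@2x.png
    return [c for c in candidates if not c.lower().endswith(IMAGE_EXTENSIONS)]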
