URLWarning: allowed_domains accepts only domains, not URLs


I am new to Python and web crawling, and I need help understanding the following error, which occurs for every link I try to fetch from the starting URL: ['https://www.eskom.co.za/category/news/']

2024-01-31 17:02:42 [py.warnings] WARNING: C:\Users\27671\PycharmProjects\Web crawling\venv\Lib\site-packages\scrapy\spidermiddlewares\offsite.py:74: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry
 https://www.eskom.co.za/2023/02/ in allowed_domains.
  warnings.warn(message, URLWarning)
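
For context, Scrapy's offsite middleware accepts bare domain names only, so a URL entry like the one in the warning is simply ignored. A minimal illustration of the difference (these lines are not from the question's code):

allowed_domains = ['https://www.eskom.co.za/2023/02/']  # a full URL: ignored, triggers URLWarning
allowed_domains = ['eskom.co.za']                       # a bare domain: accepted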

I wanted to crawl into the 2022 media statements and scrape each statement's description for a small project.

Below is the original code for the crawler:

# imports for the initial page fetch and link extraction
from bs4 import BeautifulSoup as bs
import requests

# Scrapy imports for crawling the extracted hyperlinks
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# request the starting page; the response holds the raw HTML
response = requests.get('https://www.eskom.co.za/category/news/')

# parse the HTML so hyperlinks can be extracted
soup = bs(response.text, 'html.parser')

# find all <a> tags
archives = soup.find_all('a')

# collect each tag's href, skipping anchors that have none
hrefs = []
for link in archives:
    href = link.get('href')
    if href:
        hrefs.append(href)


# spider class; allowed_domains = hrefs is what triggers the URLWarning,
# because hrefs holds full URLs while Scrapy expects bare domains
class Crawler(CrawlSpider):
    name = 'link_crawler'
    allowed_domains = hrefs
    start_urls = ['https://www.eskom.co.za/category/news/']

    rules = (
        Rule(LinkExtractor(allow='2022'), callback='parse_item'),
    )

    # parse each linked page (media statement) and yield its description
    def parse_item(self, response):
        yield {
            'description': response.css('.entry-content-wrap h2::text').get()
        }


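To actually run the spider from the same script, Scrapy's CrawlerProcess can be used. A minimal sketch (the FEEDS setting and output file name are illustrative assumptions, not from the question):

from scrapy.crawler import CrawlerProcess

# run the spider in-process and write scraped items to a JSON file
process = CrawlerProcess(settings={'FEEDS': {'items.json': {'format': 'json'}}})
process.crawl(Crawler)
process.start()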
python-3.x web-scraping web-crawler
1 Answer

Use urlparse to extract the domain from the URL and add it to the allowed_domains list:

from urllib.parse import urlparse
import scrapy
from scrapy.spiders import CrawlSpider


class Crawler(CrawlSpider):
    name = 'link_crawler'

    def add_dmn(self):
        url = "your url"
        # netloc is the bare domain part of the URL, e.g. 'www.eskom.co.za'
        dmn = urlparse(url).netloc
        self.allowed_domains = [dmn]
        yield scrapy.Request(url=url, callback=self.parse)
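
Applied to the question's code, the same idea can be used up front: derive bare domains from the collected links before defining the spider. A sketch, assuming hrefs holds the links gathered with BeautifulSoup above:

from urllib.parse import urlparse

# keep only entries that parse to a non-empty domain (skips relative links)
domains = {urlparse(h).netloc for h in hrefs if urlparse(h).netloc}

class Crawler(CrawlSpider):
    name = 'link_crawler'
    allowed_domains = list(domains)  # bare domains only, so no URLWarning
    start_urls = ['https://www.eskom.co.za/category/news/']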