URLWarning: allowed_domains accepts only domains, not URLs

Question (votes: 0, answers: 1)

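I'm new to Python and web crawling, and I need help understanding the following warning, which is raised for every link I try to extract from the start URL ['https://www.eskom.co.za/category/news/']: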

2024-01-31 17:02:42 [py.warnings] WARNING: C:\Users\27671\PycharmProjects\Web crawling\venv\Lib\site-packages\scrapy\spidermiddlewares\offsite.py:74: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry
 https://www.eskom.co.za/2023/02/ in allowed_domains.
  warnings.warn(message, URLWarning)

I want to crawl the 2022 media statements and scrape the description of each statement for a small project.


#import libraries
from bs4 import BeautifulSoup as bs
import requests
import re

#to crawl extracted hyperlinks
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

#requesting with get from url, assign to output which is response
response = requests.get('https://www.eskom.co.za/category/news/')

#assign results to variable from response
soup = bs(response.text, 'html.parser')

#find all a tags for hyperlinks
archives = soup.find_all('a')

# collect the href of every link
hrefs = []
for link in archives:
    hrefs.append(link.get('href'))

# assign class for crawler
class Crawler(CrawlSpider):
    name = 'link_crawler'
    allowed_domains = hrefs
    start_urls = ['https://www.eskom.co.za/2022/07/']

    rules = (
        Rule(LinkExtractor(allow='2022'), callback='parse_item'),
    )

    # define parse method for results, yield for scraping from linked page (media statements)
    def parse_item(self, response):
        yield {
            'description': response.css('entry-content-wrap h2::text').get()
        }

from scrapy.crawler import CrawlerProcess

class Crawler(CrawlSpider):
    name = 'link_crawler'
    allowed_domains = hrefs
    start_urls = ['https://www.eskom.co.za/category/news/']

    rules = (
        Rule(LinkExtractor(allow='2022'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'description': response.css('entry-content-wrap h2::text').get()
        }

python-3.x web-scraping web-crawler




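import scrapy
from urllib.parse import urlparse  # Python 3 location of urlparse


# class wrapper assumed; the snippet below is the body of the spider class
class LinkCrawler(scrapy.Spider):
    name = "link crawler"

    def add_dmn(self):
        url = "your url"
        # allowed_domains expects bare host names, so keep only the netloc part
        dmn = urlparse(url).netloc
        self.allowed_domains = [dmn]
        yield scrapy.Request(url=url, callback=self.parse)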
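
The warning in the question appears because full URLs were passed to allowed_domains. The following is a minimal sketch, using only the standard library, of one way to reduce a list of scraped hrefs to bare host names first. The helper name `domains_from_hrefs` and the sample links are hypothetical and not part of the original post.

```python
from urllib.parse import urlparse


def domains_from_hrefs(hrefs):
    """Return the unique host names found in a list of hrefs, in first-seen order."""
    domains = []
    for href in hrefs:
        if not href:
            continue  # skip <a> tags that have no href attribute
        host = urlparse(href).hostname  # None for relative or mailto: links
        if host and host not in domains:
            domains.append(host)
    return domains


# Hypothetical sample of links scraped from the news page
sample_hrefs = [
    "https://www.eskom.co.za/2023/02/",
    "https://www.eskom.co.za/2022/07/",
    None,
    "/category/news/",
    "mailto:media@example.com",
]

allowed_domains = domains_from_hrefs(sample_hrefs)
print(allowed_domains)  # ['www.eskom.co.za']
```

For a single-site crawl like this one, hard-coding allowed_domains = ['eskom.co.za'] is probably simpler, since Scrapy also accepts subdomains such as www.eskom.co.za for a listed domain.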