Scraping with a loop


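I want to scrape information from http://www.stfrancismedical.org/asp/job-summary.asp?cat=4, but I don't know how, because I only know how to scrape recursively. Is there a way to use a loop to scrape all the information for each job?

Any other ideas are welcome too.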

loops web-scraping scrapy
1 Answer

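The page has a somewhat odd structure: a single table whose rows all sit at the same depth. That makes it harder to extract all the data for one job with a single XPath expression. My approach is to use the modulo operator: each job spans seven consecutive rows, so I start a new item object every seventh row and fill in one field per iteration of the loop.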
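
As a minimal sketch of the modulo idea on its own, independent of Scrapy: the function below groups a flat sequence of values into one record per seven values. The name `group_rows` and the shortened `desc`/`req` placeholder strings are made up for illustration; only the field names come from the answer.

```python
# Field names in the order the rows appear for each job.
FIELDS = ('department', 'employment', 'shift', 'weekends_holidays',
          'biweekly_hours', 'description', 'requirements')


def group_rows(values, fields=FIELDS):
    """Group a flat sequence of values into one dict per len(fields) values."""
    records = []
    record = None
    for i, value in enumerate(values):
        idx = i % len(fields)
        if idx == 0:
            # A new record starts every len(fields) values.
            record = {}
            records.append(record)
        record[fields[idx]] = value
    return records


# Hypothetical flat input standing in for the table's value cells.
sample = ['Cardiac Care', 'Pool', 'Day', 'No', 'Varied', 'desc A', 'req A',
          'Critical Care Unit', 'Full-Time', 'Evening', 'Yes', '72',
          'desc B', 'req B']
jobs = group_rows(sample)
print(len(jobs))
print(jobs[1]['department'])
```

Appending each record as soon as it is created means the last job needs no special handling after the loop, and an empty input simply yields an empty list.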

That said, the page has no links to follow, so the spider itself is very simple.

First step, create the project:

scrapy startproject stfrancismedical
cd stfrancismedical

Second step, create the spider:

scrapy genspider -t basic stfrancismedical_spider 'stfrancismedical.org'

Third step, create an item with all the fields of a job:

vim stfrancismedical/items.py

with new content such as:

from scrapy.item import Item, Field

class StfrancismedicalItem(Item):
    department = Field()
    employment = Field()
    shift = Field()
    weekends_holidays = Field()
    biweekly_hours = Field()
    description = Field()
    requirements = Field()

Fourth step, edit the spider:

vim stfrancismedical/spiders/stfrancismedical_spider.py

with the content:

import scrapy
from stfrancismedical.items import StfrancismedicalItem

# Field names in the order the rows appear for each job on the page.
rn = ('department', 'employment', 'shift', 'weekends_holidays',
      'biweekly_hours', 'description', 'requirements')

class StfrancismedicalSpiderSpider(scrapy.Spider):
    name = "stfrancismedical_spider"
    allowed_domains = ["stfrancismedical.org"]
    start_urls = (
        'http://www.stfrancismedical.org/asp/job-summary.asp?cat=4',
    )

    def parse(self, response):
        item = None
        # Only the two-cell rows carry data; the value is in the second cell.
        rows = response.xpath('/html/body/div/table//tr[count(./td)=2]')
        for i, tr in enumerate(rows):
            idx = i % len(rn)
            if idx == 0:
                # Every seventh row starts a new job; emit the previous one.
                if item is not None:
                    yield item
                item = StfrancismedicalItem()
            # Default to '' so an empty cell does not raise an error.
            item[rn[idx]] = tr.xpath('./td[2]//text()').get(default='')
        # Emit the last job, if the page had any matching rows at all.
        if item is not None:
            yield item
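
The row filter in the spider, `tr[count(./td)=2]`, keeps only the rows that have exactly two cells. The following is a rough standard-library sketch of that same filter, done in plain Python rather than in XPath. The markup in `SAMPLE` is hypothetical and much simpler than the real page, and `data_rows` is an illustrative name. Unlike the spider, which takes the first text node of the second cell, this sketch joins all text inside each cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page's HTML is not reproduced here.
SAMPLE = """
<table>
  <tr><td colspan="2">Job 1</td></tr>
  <tr><td>Department</td><td>Cardiac Care</td></tr>
  <tr><td>Employment</td><td>Pool</td></tr>
  <tr><td>spacer</td></tr>
</table>
"""


def data_rows(table):
    """Return (label, value) pairs for rows with exactly two <td> cells."""
    pairs = []
    for tr in table.iter('tr'):
        cells = tr.findall('td')
        if len(cells) == 2:
            label = ''.join(cells[0].itertext()).strip()
            value = ''.join(cells[1].itertext()).strip()
            pairs.append((label, value))
    return pairs


rows = data_rows(ET.fromstring(SAMPLE.strip()))
print(rows)
```

Rows with one cell (headings, spacers) are skipped, which is what keeps the seven-rows-per-job counting in step.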

And run it like this:

scrapy crawl stfrancismedical_spider -o stfrancismedical.json

This creates a new file, stfrancismedical.json, with the data:

[{"requirements": "Skilled in Cath Lab nursing, 2 years experience and patient recovery experience. A Current valid NJ RN license with a current ACLS certification.", "description": "Responsible for the delivery of individualized patient care to assigned patients utilizing the nursing process of assessment, planning, implementation and evaluation.", "shift": "Day - Evening - Night", "biweekly_hours": "Varied", "weekends_holidays": "No", "department": "Cardiac Care", "employment": "Pool"},
{"requirements": "Requirements: A Current valid NJ RN license with a current ACLS & BLS certification.", "description": "Responsible for the delivery of individualized patient care to assigned critical care patients utilizing the nursing process of assessment, planning, implementation and evaluation. ", "shift": "Evening", "biweekly_hours": "72", "weekends_holidays": "Yes", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "ACLS, NJ License required.\u00a0 Balloon pump certification preferred.", "description": "Provide comprehensive Nursing care to critically ill patients.\u00a0 ", "shift": "Day", "biweekly_hours": "72 - 11am - 11pm", "weekends_holidays": "Yes", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "ACLS, NJ License required.\u00a0 Balloon pump certification preferred.", "description": "Provide comprehensive Nursing care to critically ill patients. ", "shift": "Evening - Night", "biweekly_hours": "72 - 7pm - 7am", "weekends_holidays": "No", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "Associates Degree in Nursing, Healthcare, or equivalent experience: BSN preferred.", "description": "Must be detail oriented and able to follow detailed procedures to ensure accuracy.\u00a0 Must demonstrate excellent follow up skills.\u00a0 Ability to coordinate and priortize multiple duties.\u00a0 Understands interactions amongst clinical areas and their roles within hospital.\u00a0 Advanced knowledge in computer skills, including knowledge of Microsoft Word, Excel and PowerPoint.\u00a0", "shift": "Day", "biweekly_hours": "80", "weekends_holidays": "No", "department": "Nursing Education", "employment": "Full-Time"},
...
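
Because the grouping relies on every job having exactly seven rows, it may be worth checking the output for records with missing fields. Below is a small sketch of such a check, assuming the output is a JSON list of objects like the one above. The helper name `incomplete_records` is illustrative, and the second sample record is deliberately truncated and hypothetical.

```python
import json

EXPECTED = {'department', 'employment', 'shift', 'weekends_holidays',
            'biweekly_hours', 'description', 'requirements'}


def incomplete_records(records, expected=EXPECTED):
    """Return (index, missing_fields) for each record lacking a field."""
    problems = []
    for i, rec in enumerate(records):
        missing = sorted(expected - set(rec))
        if missing:
            problems.append((i, missing))
    return problems


# Inline stand-in for the output file. With the real file one would use:
#   with open('stfrancismedical.json') as f:
#       records = json.load(f)
sample_json = """
[{"department": "Cardiac Care", "employment": "Pool",
  "shift": "Day - Evening - Night", "weekends_holidays": "No",
  "biweekly_hours": "Varied", "description": "...", "requirements": "..."},
 {"department": "Critical Care Unit", "employment": "Full-Time"}]
"""
records = json.loads(sample_json)
problems = incomplete_records(records)
print(problems)
```

An empty result suggests the row count per job matched the assumption; a non-empty one points at the record where the counting went out of step.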