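I have a Scrapy spider hosted on Zyte, using smart proxy.
My spider is fairly simple in that it starts crawling from a list of URLs.
The parse method uses a simple link extractor to extract the links on the domain and then crawls those links.
Simplified parse method: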
def parse(self, response):
    internal_le = LinkExtractor(
        allow_domains=tld_t,  # try to stay on domain (this is a tldextract of response.url)
        unique=True,  # de-dup
        # deny_extensions=self.deny_extensions
    )
    in_links = internal_le.extract_links(response)
    for link in in_links:
        if link.url:
            yield Request(
                link.url,
                callback=self.parse,
            )
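Because deny_extensions defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which includes PDF files, I assumed it would not crawl PDF links. However, I have internal links that redirect to externally hosted PDF files.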
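To illustrate why an extension filter on extracted links does not help here, the following is a minimal standard-library sketch, not Scrapy's actual implementation. The helper `has_denied_extension`, the two-entry `DENIED` set, and the internal link URL are hypothetical; the redirect target is one of the URLs from the logs in this post. The point is that the check only ever sees the link as it appears on the page, not where it redirects to.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Illustrative subset only; Scrapy's real default list is much longer.
DENIED = {".pdf", ".docx"}


def has_denied_extension(url: str) -> bool:
    """Return True if the path component of the URL ends in a denied extension."""
    ext = splitext(urlparse(url).path)[1].lower()
    return ext in DENIED


# Hypothetical internal link that the server answers with a redirect.
internal_link = "https://west.usd262.net/fs/resource-manager/view/example-id"
# Redirect target as it appears in the logs.
redirect_target = (
    "https://resources.finalsite.net/images/v1676649887/usd262net/"
    "adlo2wuxxpqa7pmnxmkx/MiddleSchoolBellSchedule22_23docx.pdf"
)

print(has_denied_extension(internal_link))    # False: nothing to filter on
print(has_denied_extension(redirect_target))  # True: but only known after the redirect
```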
Here are some log excerpts and examples:
2023-11-27 23:41:01 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf> (referer: https://west.usd262.net/about)
2023-11-27 23:41:02 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx> (referer: https://west.usd262.net/about)
2023-11-27 23:41:05 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1676649887/usd262net/adlo2wuxxpqa7pmnxmkx/MiddleSchoolBellSchedule22_23docx.pdf> (referer: https://vcms.usd262.net/about)
2023-11-27 23:41:10 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073617/usd262net/zjuysts6fymaf5gjumlc/VCMSStudentHandbook23-24Finaldocx.pdf> (referer: https://vcms.usd262.net/about)
Here is a traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/defer.py", line 279, in iter_errback
    yield next(it)
          ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/sh_scrapy/middlewares.py", line 30, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 352, in <genexpr>
    return (self._set_referer(r, response) for r in result or ())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 27, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 31, in <genexpr>
    return (r for r in result or () if self._filter(r, response, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/tmp/unpacked-eggs/__main__.egg/edtech/spiders/edcrawler.py", line 117, in parse
    ex_links = external_le.extract_links(response)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
    base_url = get_base_url(response)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/response.py", line 26, in get_base_url
    text = response.text[0:4096]
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/http/response/__init__.py", line 137, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text
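I have tried various changes to my link extractor, but presumably the link itself looks fine to the link extractor. It is the redirect that leads to a PDF file, which gets downloaded and then produces the error.
Example start URL: start URL
Link on that page that is extracted into "in_links": extracted internal link
The only way I can think of to solve this is a custom middleware that replaces the redirect middleware and looks for r".pdf$" in request.url.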
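One detail worth noting about the pattern quoted above: an unescaped dot matches any character, and the match is case-sensitive by default. The short sketch below compares it with an escaped, case-insensitive variant. The names `LOOSE` and `STRICT` and the example.com URLs are made up for illustration.

```python
import re

# Pattern as quoted above: the dot is a wildcard, matching is case-sensitive.
LOOSE = re.compile(r".pdf$")
# Escaped dot and case-insensitive matching.
STRICT = re.compile(r"\.pdf$", re.IGNORECASE)

urls = [
    "https://example.com/handbook.pdf",
    "https://example.com/HANDBOOK.PDF",
    "https://example.com/notapdf",
]

for url in urls:
    # Print whether each pattern matches the end of the URL.
    print(url, bool(LOOSE.search(url)), bool(STRICT.search(url)))
```

The loose pattern accepts the third URL and misses the second; the strict one does the opposite.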
Am I missing something? I am using the latest Scrapy, 2.11.0. I have also filed an issue on the Scrapy GitHub: github/6159.
1: Scrapy docs, redirect middleware
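I think your best option in this case is to subclass RedirectMiddleware and simply add a few lines that check the Location header of the initial response for a .pdf extension, raising an IgnoreRequest exception when one is found.
It can all be done in just a few lines.
Example: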
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.exceptions import IgnoreRequest


class PDFRedirect(RedirectMiddleware):
    def process_response(self, request, response, spider):
        # Non-redirect responses have no Location header, so this is "" for them.
        location = response.headers.get("Location", b"").decode()
        if location.lower().endswith((".pdf", ".docx")):
            print(f"IGNORING PDF {location}")
            # Drop the request instead of following the redirect.
            raise IgnoreRequest(f"redirect to document: {location}")
        return super().process_response(request, response, spider)


class PdfRedirectSpider(scrapy.Spider):
    name = 'nopdfs'
    allowed_domains = ['west.usd262.net']
    start_urls = ['https://west.usd262.net/about']
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            # Disable the stock middleware and register the subclass in its place.
            "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
            PDFRedirect: 600,
        }
    }

    def parse(self, response):
        internal_le = LinkExtractor(unique=True)
        in_links = internal_le.extract_links(response)
        for link in in_links:
            if link.url:
                yield scrapy.Request(link.url, callback=self.parse)
Output:
2023-11-30 15:00:35 [scrapy.core.engine] INFO: Spider opened
2023-11-30 15:00:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-30 15:00:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-30 15:00:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about> (referer: None)
2023-11-30 15:00:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://west.usd262.net/about> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.usd262.net': <GET https://www.usd262.net/staff-links1>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'abilene.usd262.net': <GET https://abilene.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'wheatland.usd262.net': <GET https://wheatland.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vcis.usd262.net': <GET https://vcis.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vcms.usd262.net': <GET https://vcms.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vchs.usd262.net': <GET https://vchs.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tlc.usd262.net': <GET https://tlc.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/profile.php?id=100061273524317>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/USD262>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.youtube.com': <GET https://www.youtube.com/channel/UCD8AdyKpM44gpFzqIqBG9tw>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net-22-us-central1-01.preview.finalsitecdn.com': <GET https://usd262net-22-us-central1-01.preview.finalsitecdn.com/about/calendar1>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.finalsite.com': <GET https://www.finalsite.com>
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about#fsPageContent> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/privacy-policy> (referer: https://west.usd262.net/about)
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1686234716/usd262net/hdkhsv6qg1jzbobmkrxs/23-24elementaryschoolsupplylist8511in.pdf
2023-11-30 15:00:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/contact645-clone> from <GET https://west.usd262.net/fs/pages/3813>
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/accessibility-statement> (referer: https://west.usd262.net/about)
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.valleycenterhornets.net': <GET https://www.valleycenterhornets.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sideline.bsnsports.com': <GET https://sideline.bsnsports.com/schools/kansas/valleycenter/valley-center-high-school/design/picker>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net-34-us-central1-01.preview.finalsitecdn.com': <GET https://usd262net-34-us-central1-01.preview.finalsitecdn.com/about>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'calendar.google.com': <GET https://calendar.google.com/calendar/embed?src=usd262.net_b07qmrijq7dq09a7s93u4qq7u0%40group.calendar.google.com&ctz=America%2FChicago>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'datacentral.ksde.org': <GET https://datacentral.ksde.org/accountability.aspx>
IGNORING PDF https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.w3.org': <GET http://www.w3.org/TR/WCAG/>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'accessibilitystatementgenerator.com': <GET http://accessibilitystatementgenerator.com>
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/parent756> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/pto> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/site-map> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/footer-links> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.infinitecampus.org': <GET https://usd262.infinitecampus.org/campus/portal/valleycenter.jsp>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net.finalsite.com': <GET https://usd262net.finalsite.com/fs/resource-manager/view/383a8f18-5ef9-4f48-815e-030300759293>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'docs.google.com': <GET https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRi840waukqIIVzL9eM4X9EoxwIsGKyuwsu83A852Mv6dMnPmjQSF0HKFRrMmpw1g/pubhtml>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.incidentiq.com': <GET https://usd262.incidentiq.com/>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'educatekansas.org': <GET https://educatekansas.org/>
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/volunteering> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/ymca-childcare> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'ymcawichita.org': <GET https://ymcawichita.org/programs/child-care-and-camps/before-and-after-school>
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/emergency-safety-interventions-bullying> (referer: https://west.usd262.net/about)
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/librarymedia-center> (referer: https://west.usd262.net/about)
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/volunteer-information> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'search.follettsoftware.com': <GET https://search.follettsoftware.com/metasearch/ui/43691>
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bookfairs.scholastic.com': <GET https://bookfairs.scholastic.com/bf/westelementaryschool11>
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.commonsensemedia.org': <GET https://www.commonsensemedia.org/>
2023-11-30 15:00:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/news> from <GET https://west.usd262.net/fs/pages/3814>
IGNORING PDF https://resources.finalsite.net/images/v1680193574/usd262net/skenieqeiwealjrpl210/33023ActivationInstructionforCampusPortal3.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673804004/usd262net/i0mi93dw4rp63jsem0jt/PTOMeetingMinutes1220docx.pdf
2023-11-30 15:00:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/contact645-clone> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/sraff-directory> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/schools> from <GET https://west.usd262.net/fs/pages/2799>
IGNORING PDF https://resources.finalsite.net/images/v1673803989/usd262net/bvokssior5jikny5ggwk/PTOMeetingMinutes2120docx.pdf
2023-11-30 15:00:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/report-bullying-safety-concerns> from <GET https://west.usd262.net/fs/pages/3560>
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/counseling> (referer: https://west.usd262.net/about)
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.ksde.org': <GET http://www.ksde.org/Default.aspx?tabid=149>
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.homeworkkansas.org': <GET http://www.homeworkkansas.org/>
IGNORING PDF https://resources.finalsite.net/images/v1673803943/usd262net/okuntylyovx2hn260gmt/PTOMeetingMinutes1919docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/nurses-page> (referer: https://west.usd262.net/about)
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.kidshealth.org': <GET http://www.kidshealth.org/parent/firstaid_safe/>
IGNORING PDF https://resources.finalsite.net/images/v1673803972/usd262net/s8sipel9qrbd1kwqrklg/FebPTOMeetingMinutes1120docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/document-library> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1673803909/usd262net/zcygtqo4nk94alxapei2/PTOMeetingMinutes1719.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803957/usd262net/kpvyrmpdxbbwic1o9mkw/1-21-20PTOMeetingMinutes21201docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/administration> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1673803928/usd262net/aprcr3g9v0x76agcz81m/PTOMeetingMinutes2219docx.pdf
2023-11-30 15:00:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://west.usd262.net/about/sraff-directory> from <GET https://west.usd262.net/staff-directory>
2023-11-30 15:00:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://west.usd262.net> from <GET https://west.usd262.net/fs/resource-manager/view/446cdd83-e743-495f-b0f1-91318deef052>
IGNORING PDF https://resources.finalsite.net/images/v1673803888/usd262net/sun8frlao9rk4gftotnp/PTOMeetingMinutes2719.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803867/usd262net/cev2livmjpacfgyq0qrc/4202021PTOMeetingminutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803137/usd262net/km3nodsbggl5taziszk3/MicrosoftWord-TotallyCoolElementarySchool_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803819/usd262net/k5xboy8whfnymanvvuyk/MeetingminutesFeb.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803121/usd262net/u5ctbelnubnhgz9gw6wa/WestElementaryCounselingBrochurefinal-2008_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673784917/usd262net/lzwphtnhcoqjp9thds6n/FactSheet-TitleI-ParentInvolvement.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803778/usd262net/rwb5tlbdaap8e1wjiizl/NovemberPTOMeetingMinutes.pdf
2023-11-30 15:00:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/enrollment/student-health-information> from <GET https://west.usd262.net/fs/pages/3541>
IGNORING PDF https://resources.finalsite.net/images/v1673784914/usd262net/dkc6smzfpcylihjyl0mx/ESIBoardPolicies-19.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803487/usd262net/eyojl1bd1qdj3lp8bjki/RICE-RestIceCompresionElevation_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673784913/usd262net/mlrg9xwsotm3a6ccmazy/ESI-DocumentsforWebsite-19.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803799/usd262net/rczvldr6kah713hisfwx/JanuaryPTOMeetingMinutes.pdf
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/report-bullying-safety-concerns> (referer: https://west.usd262.net/about/emergency-safety-interventions-bullying)
IGNORING PDF https://resources.finalsite.net/images/v1673784915/usd262net/u4efohzm82jnbzzsqxd3/FERPANotificationofRights.pdf
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.p3tips.com': <GET https://www.p3tips.com/tipform.aspx?ID=217>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.crisistextline.org': <GET https://www.crisistextline.org/texting-in/>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.kbi.ks.gov': <GET https://www.kbi.ks.gov/sar>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.onlinesafetyhub.io': <GET https://usd262.onlinesafetyhub.io/>
IGNORING PDF https://resources.finalsite.net/images/v1673803764/usd262net/yswpmxj1ivn5dr4onfue/OctoberPTOmeeting.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803749/usd262net/cfcylzqvhzvvsacorltx/SeptemberPTOmeetingnotes.pdf
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/news> (referer: https://west.usd262.net/)
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://west.usd262.net/fs/pages/3508> (referer: https://west.usd262.net/about/document-library)
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/schools> (referer: https://west.usd262.net/)
IGNORING PDF https://resources.finalsite.net/images/v1673803706/usd262net/doln0ockhdm39lkfntxm/NovPTOmeetingminutes162021.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803735/usd262net/zltnhhnyt2jz1fi8k8gy/MarchPTOMeetingMinutes222022.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803693/usd262net/fel78cnko0opxf96lefx/OctPTOMeetingminutes192021.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803721/usd262net/idowphs1sgrl2xrnellg/JanPTOMeetingMinutes1820221.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803679/usd262net/e6jc2mep0odspayjzxmo/SeptPTOMeetingMinutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803649/usd262net/eultmjehz33n29yf5nqt/PTOMeetingMinutes2020221.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803664/usd262net/x6uh9b9s0lxpm3h8nmdx/AugustthPTOMinutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803634/usd262net/izumknwsghgbzuouu4ui/PTOMeetingMinutes2320221.pdf
2023-11-30 15:00:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/enrollment/student-health-information> (referer: https://west.usd262.net/about/nurses-page)
2023-11-30 15:00:44 [scrapy.core.engine] INFO: Closing spider (finished)
2023-11-30 15:00:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 38365,
'downloader/request_count': 65,
'downloader/request_method_count/GET': 65,
'downloader/response_bytes': 248536,
'downloader/response_count': 65,
'downloader/response_status_count/200': 24,
'downloader/response_status_count/301': 6,
'downloader/response_status_count/302': 34,
'downloader/response_status_count/404': 1,
'dupefilter/filtered': 402,
'elapsed_time_seconds': 8.931808,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 11, 30, 23, 0, 44, 907376),
'httpcompression/response_bytes': 795436,
'httpcompression/response_count': 25,
'log_count/DEBUG': 69,
'log_count/INFO': 10,
'offsite/domains': 35,
'offsite/filtered': 962,
'request_depth_max': 3,
'response_received_count': 25,
'scheduler/dequeued': 65,
'scheduler/dequeued/memory': 65,
'scheduler/enqueued': 65,
'scheduler/enqueued/memory': 65,
'start_time': datetime.datetime(2023, 11, 30, 23, 0, 35, 975568)}
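The suffix test in the middleware above compares against the whole Location value, so a target such as `file.pdf?download=1` would presumably slip through. As a possible refinement, here is a minimal standard-library sketch that inspects only the path component. The helper name `points_to_document` and the `SKIPPED_EXTENSIONS` set are hypothetical, and the query-string URL is made up for illustration; it is one way to write the check, not part of Scrapy.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical set of extensions to skip; extend as needed.
SKIPPED_EXTENSIONS = frozenset({".pdf", ".docx"})


def points_to_document(location) -> bool:
    """Return True if a Location header value (bytes or str) targets a skipped file type."""
    if not location:
        # Missing or empty header: not a redirect to a document.
        return False
    if isinstance(location, bytes):
        # latin-1 decodes any byte sequence without raising.
        location = location.decode("latin-1")
    # Only the path matters; query strings and fragments are ignored.
    path = urlparse(location).path
    return splitext(path)[1].lower() in SKIPPED_EXTENSIONS


print(points_to_document(b"https://example.com/files/Handbook.PDF?download=1"))
print(points_to_document("https://www.usd262.net/about/news"))
print(points_to_document(b""))
```

Inside `process_response`, the condition could then become `if points_to_document(response.headers.get("Location")):`, leaving the rest of the middleware unchanged.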