Getting "Crawled (302)" when scraping laptop data with Scrapy


I want to scrape data such as the screen type and title from https://www.newegg.com/tools/laptop-finder, but I'm stuck: my spider crawls the page but scrapes no items.

The site's HTML looks like this:

<tr>
    <td class="td-item">
        <a class="goods-info" href="https://www.newegg.com/p/N82E16834156430?Item=N82E16834156430" data-toggle="modal" data-target="#modal-pc-builder-pdp">
            <div class="goods-img">
                <img src="https://c1.neweggimages.com/ProductImageCompressAll125/34-156-430-03.jpg" alt="MSI Katana 15 B12VGK-082US 15.6&quot; Gaming Laptop">
            </div>
            <div class="goods-title">
                <div class="goods-title-content">MSI Katana 15 B12VGK-082US 15.6" Gaming Laptop</div>
                <div class="goods-rating">
                    <i class="rating rating-4" aria-label="rated 4 out of 5"></i>
                    <span class="goods-rating-num font-s text-gray">(31)</span>
                </div>
            </div>
        </a>
    </td>
    <td class="td-spec"><div class="hid-text">Screen Size</div><span>15.6"</span></td>
    <td class="td-spec"><div class="hid-text">CPU type</div><span>Intel Core i7 12th Gen</span></td>
    <td class="td-spec"><div class="hid-text">Memory</div><span>16GB</span></td>
    <td class="td-spec"><div class="hid-text">Storage</div><span>1 TB PCIe</span></td>
    <td class="td-spec"><div class="hid-text">GPU</div><span>NVIDIA GeForce RTX 4070 Laptop GPU</span></td>
    <td class="td-spec"><div class="hid-text">Resolution</div><span>1920 x 1080</span></td>
    <td class="td-spec"><div class="hid-text">Weight</div><span>4 - 4.9 lbs.</span></td>
    <td class="td-spec"><div class="hid-text">Backlit Keyboard</div><span>Backlit</span></td>
    <td class="td-spec"><div class="hid-text">Touchscreen</div><span>No</span></td>
    <td class="td-spec"><div class="hid-text">CPU Speed</div><span>12650H (2.30GHz)</span></td>
    <td class="td-spec"><div class="hid-text">Number of Cores</div><span>10-core (6P+4E) Processor</span></td>
    <td class="td-spec"><div class="hid-text">Color</div><span>Black</span></td>
    <td class="td-spec"><div class="hid-text">Display Type</div><span>Full HD</span></td>
    <td class="td-spec"><div class="hid-text">Graphic Type</div><span>Dedicated Card</span></td>
    <td class="td-spec"><div class="hid-text">Operating System</div><span>Windows 11 Home</span></td>
    <td class="td-spec"><div class="hid-text">Webcam</div><span>Yes</span></td>
    <td class="td-action">
        <div class="item-action grid col-w-3">
            <div class="goods-price-current hide-click-for-details">
                <div class="goods-price font-s">
                    <div class="goods-price-current">
                        <span class="goods-price-label"></span>
                        <span class="goods-price-symbol">$</span>
                        <span class="goods-price-value"><strong>1,159</strong><sup>.00</sup></span>
                    </div>
                </div>
            </div>
            <div class="goods-operate xxs-hide">
                <div class="goods-button-area">
                    <label class="input-check input-check-s compare-check">
                        <input type="checkbox" autocomplete="off" aria-label="checkbox">
                        <span class="input-check-mark text-hide">checkmark</span>
                        <div class="input-check-text">Compare</div>
                    </label>
                    <button title="Add MSI Katana 15 B12VGK-082US 15.6&quot; 144 Hz IPS Intel Core i7 12th Gen 12650H (2.30GHz) NVIDIA GeForce RTX 4070 Laptop GPU 16GB Memory 1 TB NVMe SSD Windows 11 Home 64-bit Gaming Laptop to cart" class="button button-s bg-orange">Add to cart</button>
                </div>
            </div>
        </div>
    </td>
</tr>

Since I'm just learning to scrape, I'm only extracting the title and screen size of a laptop for now. Here is my Scrapy code:

import scrapy

class LaptopSpider(scrapy.Spider):

    name = "laptop"
    headers = {
        "authority": "ssl.doas.state.ga.us",
        "pragma": "no-cache",
        "cache-control": "no-cache",
        "sec-ch-ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"",
        "accept": "application/json, text/javascript, */*; q=0.01",
        "x-requested-with": "XMLHttpRequest",
        "sec-ch-ua-mobile": "?0",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "origin": "https://ssl.doas.state.ga.us",
        "sec-fetch-site": "same-origin",
        "sec-fetch-mode": "cors",
        "sec-fetch-dest": "empty",
        "referer": "https://ssl.doas.state.ga.us/gpr/",
        "accept-language": "en-US,en;q=0.9"
    }
    start_urls = ['https://www.newegg.com/tools/laptop-finder']
    custom_settings = {'REDIRECT_ENABLED': False}
    handle_httpstatus_list = [302]

    def parse(self, response):
        product = response.css('tr td.td-item')

        for item in product:
            yield {
                'Title': item.css('.goods-title-content::text').get(),
                'Screen Size': item.xpath('.//div[text()="Screen Size"]/following-sibling::span/text()').get(),
            }

My log output is:

2023-09-10 10:12:28 [scrapy.core.engine] INFO: Spider opened
2023-09-10 10:12:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-09-10 10:12:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-09-10 10:12:29 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.newegg.com/tools/laptop-finder> (referer: None)
2023-09-10 10:12:29 [scrapy.core.engine] INFO: Closing spider (finished)
2023-09-10 10:12:29 [scrapy.extensions.feedexport] INFO: Stored json feed (0 items) in: j.json
2023-09-10 10:12:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

Please help.

python web-scraping scrapy screen-scraping
1 Answer

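There is no single clear, direct answer to this question.

You need to build on the following:

  1. A 302 status indicates a redirect. Servers often use it to set cookies, so there is no need to disable redirects and enable handling of 302 responses. With REDIRECT_ENABLED set to False and 302 in handle_httpstatus_list, the spider receives the redirect response itself instead of the page it points to, which is why the log shows "Crawled (302)" and 0 items.

  2. You can debug your code with the techniques described at https://docs.scrapy.org/en/latest/topics/debug.html

  3. I suggest using the start_requests method to pass the headers with the first request to https://www.newegg.com/tools/laptop-finder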
