Scrapy: scraping an element behind an unknown number of elements

Question

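I want to scrape a list of websites on Shopee. Some examples include dudesgadget and 2ubest. Each of these Shopee shops has its own design, structures its web elements differently, and uses a different domain. They look like independent websites, but they actually are not.

The main problem is that I am trying to scrape the product details. I will summarize a few of the different structures: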

2ubest

<html>
    <body>
        <div id="shopify-section-announcement-bar">
            <main class="wrapper main-content" role="main">
                <div class="grid">
                    <div class="grid__item">
                        <div id="shopify-section-product-template" class="shopify-section">
                            <script id="ProductJson-product-template" type="application/json">
                                //Things I am looking for
                            </script>
                        </div>
                    </div>
                </div>
            </main>
        </div>
    </body>
</html>

littleplayland

<html>
    <body id="adjustable-ergonomic-laptop-stand" class="template-product">
        <script>
            //Things I am looking for
        </script>
    </body>
</html>

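There are a few others, and I found a pattern among them.

The thing I am looking for is always inside <body>
The thing I am looking for is inside a <script>
The only thing I am unsure of is the distance from <body> to <script>

My solution is: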

def parse(self, response):
    body = response.xpath("//body")
    for script in body.xpath("//script/text()").extract():
        # Manipulate the script with js2xml here
        pass

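This works for littleplayland, dailysteals, and many others where the distance from <body> to <script> is very small. It does not work for 2ubest, which has many other HTML elements between <body> and the thing I am looking for. Is there a solution that ignores all the HTML elements in between and only looks for <script> tags?

I need a generic solution that, if possible, works across all Shopee sites, since they all share the characteristics described above.

This means the solution should not filter on <div>, because each site has a different number of <div> elements.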

Answer 1

Here is how to get the scripts in an HTML document with Scrapy:

# `text` holds the page's HTML as a string (see the full example further down)
scriptTagSelector = scrapy.Selector(text=text, type="html")
theScripts = scriptTagSelector.xpath("//script/text()").extract()

for script in theScripts:
    # Manipulate the script with js2xml here
    print("------->A SCRIPT STARTS HERE<--------")
    print(script)
    print("------->A SCRIPT ENDS HERE<--------")

Here is an example using the HTML from your question (I added an extra script :)):

import scrapy

text="""<html>
    <body>
        <div id="shopify-section-announcement-bar">
            <main class="wrapper main-content" role="main">
                <div class="grid">
                    <div class="grid__item">
                        <div id="shopify-section-product-template" class="shopify-section">
                            <script id="ProductJson-product-template" type="application/json">
                                //Things I am looking for
                            </script>
                        </div>
                        <script id="script 2">I am another script</script>
                    </div>
                </div>
            </main>
        </div>
    </body>
</html>"""

scriptTagSelector = scrapy.Selector(text=text, type="html")
theScripts = scriptTagSelector.xpath("//script/text()").extract()

for script in theScripts:
    # Manipulate the script with js2xml here
    print("------->A SCRIPT STARTS HERE<--------")
    print(script)
    print("------->A SCRIPT ENDS HERE<--------")

Answer 2

Try this:

//body//script/text()
