Web Scraping Elements by Class and Tag Name

Question (votes: 4, answers: 3)

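I am trying to copy data from the website mentioned below. I need the full range of sizes, prices, amenities, specials, and reserve. I put together the code below, but I am not able to copy the elements properly. First, only three elements are being copied, and with duplication. I am also not getting any results for amenities and reserve. Can anybody please look into this?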

Sub text()


Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Unit Data")
With ie
    .Visible = True
    .Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955" 

    While .Busy Or .readyState < 4: DoEvents: Wend

    Sheets("Unit Data").Select


    Dim listings As Object, listing As Object, headers(), results()
    Dim r As Long, list As Object, item As Object
    headers = Array("size", "features", "Specials", "Price", "Reserve")
    Set list = .document.getElementsByClassName("units_table")
    '.unit_size medium, .features, .Specials, .price, .Reserve
    Dim rowCount As Long
    rowCount = .document.querySelectorAll(".tab_container li").Length
    ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
    For Each listing In list
            For Each item In listing.getElementsByClassName("unitinfo even")
            r = r + 1

          results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
          results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
           results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
        results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
          results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText





        Next
    Next
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    .Quit
End With

  Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub
html excel vba web-scraping screen-scraping
3 Answers
2 votes

tl;dr

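Apologies in advance (to some) for the answer length, but I thought I would use this teaching moment to detail what is going on.

The overall approach I use is the same as in your code: find a CSS selector to isolate the rows (despite being shown in different tabs, the small, medium, and large units are all actually still present on the page):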

Set listings = html.querySelectorAll(".unitinfo")

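The above generates the rows. As before, we dump each row into a new HTMLDocument so that we can leverage the querySelector/querySelectorAll methods.


Rows:

Let's take a look at the first row of html we are retrieving. The subsequent sections use this row as a case study for how to retrieve the information: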

5x5</TD> <TD class=features>
<DIV id=a5x5-1 class="icon a5x5">
<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>
<DIV class=display>
<P>More Information</P></DIV></DIV>
<SCRIPT type=text/javascript>
                // Refine Search
                //
                $(function() {
                    $("#a5x5-1").tooltip({
                        track: false,
                        delay: 0,
                        showURL: false,
                        left: 5,
                        top: 5,
                        bodyHandler: function () {
                            return "        <div class=\"tooltip\">            <div class=\"tooltop\"></div>            <div class=\"toolmid clearfix\">                <div class=\"toolcontent\">                    <div style=\"text-align:center;width:100%\">                        <img alt=\"5 x 5 storage unit\" src=\"/core/resources/images/units/5x5.png\" />                    </div>                    <div class=\"display\">5 x 5</div>                    <div class=\"description\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div>                </div>                <div class=\"clearfix\"></div>            </div>            <div class=\"toolfoot\"></div>            <div class=\"clearfix\"></div>        </div>        "}
                    });
                });
        </SCRIPT>
</TD><TD class=rates>
<DIV class="discount_price secondary-color-text standard_price--left">
<DIV class=price_text>Web Rate: </DIV>
<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>
<SCRIPT>
$( document ).ready(function() {
    $('.units_table tr.unitinfo').each(function(index, el) {
        if ($(this).find('.standard_price').length != '' && $(this).find('.discount_price').length != '') {
            $(this).parents('.units_table').addClass('both');
            $(this).addClass('also-both');
            $(this).find('.rates').addClass('rates_two_column');
        }
    });
});
</SCRIPT>
</TD><TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>
<DIV class=offer1>Call for Specials </DIV>
<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>

Each row we work with will have similar html inside the html2 variable. If you are in doubt, look at the javascript in the function shown above:

$('.units_table tr.unitinfo').each(function(index, el) 

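It uses the same selector (but also specifies the parent table class and the element type, tr). Basically, that function is being called for each row in the table.


Size:

For some reason the opening td tag is dropped for size, so rather than grabbing by class, I look for the start of the closing tag and extract the string up to that point. I do this by passing the return value of InStr (the position where < is found in the string), minus 1, to the Left$ (typed) function.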


results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)

This returns 5x5
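As a rough sketch of the same idea in isolation, the helper below wraps the Left$/InStr step and guards against the case where no "<" is present (InStr would return 0 and Left$ would then raise an error on the negative length). The name TextBeforeFirstTag and the added Trim$ are my own assumptions for illustration, not part of the answer's code.

```vba
' Hypothetical helper: returns the text that precedes the first "<" in a fragment.
Public Function TextBeforeFirstTag(ByVal fragment As String) As String
    Dim pos As Long
    pos = InStr(fragment, "<")
    If pos > 0 Then
        ' Take everything before the first tag and drop surrounding spaces.
        TextBeforeFirstTag = Trim$(Left$(fragment, pos - 1))
    Else
        ' No tag found: fall back to the whole (trimmed) fragment.
        TextBeforeFirstTag = Trim$(fragment)
    End If
End Function

Public Sub DemoTextBeforeFirstTag()
    ' The start of the row html used as the case study above.
    Debug.Print TextBeforeFirstTag("5x5</TD> <TD class=features>")
End Sub
```

Running DemoTextBeforeFirstTag should print 5x5 in the Immediate window.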


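Description:

The description column is populated by the function we saw above (which is applied to each row, remember).

This bit - $("#a5x5-1").tooltip - tells it where to target, and the return statement of the function then provides html containing a div, with class description, that holds the text we want. As we are not using a browser, and I am on 64-bit Windows, I cannot evaluate this script, but I can use Split to extract the string (the description) between "description\"> and the start of the associated closing div tag: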

results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)

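This returns:

"Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle."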
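The nested Split call raises a "subscript out of range" error if the marker is missing from a row's script. A minimal alternative sketch, assuming a hypothetical helper named TextBetween, does the same extraction with InStr and Mid$ and returns an empty string when either marker is absent. The sample string is a shortened stand-in for the real script text.

```vba
' Hypothetical helper: returns the text between startMarker and endMarker,
' or an empty string when either marker cannot be found.
Public Function TextBetween(ByVal source As String, ByVal startMarker As String, _
                            ByVal endMarker As String) As String
    Dim startPos As Long, endPos As Long
    startPos = InStr(source, startMarker)
    If startPos = 0 Then Exit Function              ' start marker missing
    startPos = startPos + Len(startMarker)          ' first character after the marker
    endPos = InStr(startPos, source, endMarker)
    If endPos = 0 Then Exit Function                ' end marker missing
    TextBetween = Mid$(source, startPos, endPos - startPos)
End Function

Public Sub DemoTextBetween()
    Dim script As String
    ' Shortened stand-in for the tooltip script; it keeps the escaped quotes (\").
    script = "<div class=\""display\"">5 x 5</div>" & _
             "<div class=\""description\"">Think of it like a standard closet.</div>"
    Debug.Print TextBetween(script, """description\"">", "</div>")
End Sub
```

Running DemoTextBetween should print: Think of it like a standard closet.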


Price type and price:

These are straightforward and use class names to target:

results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)

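Returns (respectively):

Web Rate, $39.00


Amenities:

This is where things get a little trickier.

Let's re-examine the html shown above that pertains to this row: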

<TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD>

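We can see that the parent td has a class of amenities and contains child div elements with compound class names; in each case the latter serves as an identifier for the amenity type, e.g. icon_climate.

When you hover over these icons on the page, tooltip info is presented:

[Image: tooltip shown on hover]

We can trace where this tooltip sits in the html of the actual page:

[Image: tooltip markup in the page html]

As you hover over different amenities, this content is updated.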

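To cut a long story short (he says, halfway down the page!), this content is being updated from a php file on the server. We can request that file and construct a dictionary that maps the class name of each amenity, e.g. amenity_icon icon_climate, to the associated descriptions. Because these are compound classes, each " " needs replacing with "." to produce the appropriate css selector, .amenity_icon.icon_climate. You can explore the php file at https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php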
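The class-name-to-selector step can be sketched on its own as follows. ClassNameToSelector is a hypothetical helper name, and the dictionary entry is a hand-typed sample rather than data fetched from the server. The Exists check is there because reading a missing key from a Scripting.Dictionary silently adds that key with an empty value.

```vba
' Hypothetical helper: "amenity_icon icon_climate" -> ".amenity_icon.icon_climate"
Public Function ClassNameToSelector(ByVal className As String) As String
    Dim cleaned As String
    cleaned = Trim$(className)
    ' Collapse repeated spaces so no empty class ends up in the selector.
    Do While InStr(cleaned, "  ") > 0
        cleaned = Replace$(cleaned, "  ", " ")
    Loop
    If Len(cleaned) = 0 Then Exit Function          ' nothing to convert
    ClassNameToSelector = "." & Replace$(cleaned, " ", ".")
End Function

Public Sub DemoClassNameToSelector()
    Dim lookup As Object, key As String
    Set lookup = CreateObject("Scripting.Dictionary")
    ' Sample entry only; the real values come from the php file.
    lookup(".amenity_icon.icon_climate") = _
        "Temperature Controlled: Units are heated and/or cooled. See manager for details."

    key = ClassNameToSelector("amenity_icon icon_climate")
    If lookup.Exists(key) Then Debug.Print lookup(key)
End Sub
```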

The php file:

Let's look at just the start of the file, so as to dissect the basic unit of what is a repeating pattern:

function LoadTooltips() {
        $(".units_table .amenity_icon.icon_climate").tooltip({
        track: false,
        delay: 0,
        showURL: false,
        left: -126,
        top: -100,
        bodyHandler: function () {
            return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
        }
    });

The function responsible for updating the tooltips is LoadTooltips. CSS class selectors are used to target each icon:

$(".units_table .amenity_icon.icon_climate").tooltip

And we have the bodyHandler specifying the return text:

bodyHandler: function () {
            return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"

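We have three useful bits of information, which appear in repeating groups: the class name selector for the element, the short description, and the long description, e.g.

  1. .amenity_icon.icon_climate: the CSS selector. We use this to map the php file descriptions to the class name of the amenity icon in our row.
  2. Temperature Controlled: the short description, found in the h4 tag of the tooltip function's return text.
  3. Units are heated and/or cooled. See manager for details.: the long description, found in the p tag of the tooltip function's return text.

I wrote two functions, GetMatches and GetAmenitiesDescriptions, which use a regex to extract all the repeating items for each icon and return a dictionary with the css selector as the key and short description : long description as the value.

When I gather all the icons in each row: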

Set icons = html2.querySelectorAll(".amenity_icon")

I use the dictionary to return the tooltip description based on the class name of the icon:

For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
    amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next        

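I then join the descriptions with vbNewLine so that each one appears on its own line within the output cell.

You can explore the regex at https://regex101.com/r/bII5AL/1

The regex uses | (Or) syntax, so all matched patterns come back in a single list.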

arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")

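Because I want a different submatch each time (0, 1, or 2, a.k.a. css class selector, short desc, long desc), I use Select Case i Mod 3, with a counter variable i, to extract the appropriate submatch.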
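To see how the alternation behaves, here is a small sketch that runs the same pattern over a hand-built stand-in for one repeating unit of the php file. Each match fills only the capture group of the alternative that matched, so this version prints whichever submatch is non-empty instead of relying on i Mod 3. That is a variation of my own (DemoAlternationSubMatches is a hypothetical name); the Mod 3 approach works as long as the matches arrive in strict selector, h4, p order.

```vba
Public Sub DemoAlternationSubMatches()
    Dim re As Object, matches As Object, m As Object, k As Long
    Dim sample As String

    ' Stand-in for one repeating unit of the php file (escaped quotes kept).
    sample = "$("".units_table .amenity_icon.icon_climate"").tooltip({" & vbLf & _
             "return ""<div class=\""sidebar_tooltip\""><h4>Temperature Controlled</h4>" & _
             "<p>Units are heated and/or cooled. See manager for details.</p></div>"""

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.MultiLine = True
    re.Pattern = "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>"

    Set matches = re.Execute(sample)
    For Each m In matches
        ' Print the index and text of each populated capture group.
        For k = 0 To m.SubMatches.Count - 1
            If Len(m.SubMatches(k) & vbNullString) > 0 Then
                Debug.Print k, m.SubMatches(k)
            End If
        Next k
    Next m
End Sub
```

On this sample the pattern should yield three matches: group 0 holds .amenity_icon.icon_climate, group 1 holds the h4 text, and group 2 holds the p text.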

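Example of the matches for the mappings in the php file:

[Image: regex matches in the php file]


Specials:

We are back to class selectors. Offer2 is not populated, so you could remove it.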

results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText

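Returns (respectively):

Call for Specials, empty string


Closing remarks:

So, the above walks you through one row. It is simply rinse and repeat in the loop over all rows. For efficiency, the data is added to an array, results, which is then written to Sheet1 in one go. I can see some minor improvements, but this is fast.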


VBA:

Option Explicit
Public Sub GetInfo()
    Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object
    Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    Set amenitiesDescriptions = GetAmenitiesDescriptions

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText

        html.body.innerHTML = s

        Dim headers(), results(), listings As Object, amenities As String

        headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")
        Set listings = html.querySelectorAll(".unitinfo")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long

        rowCount = listings.Length
        numColumns = UBound(headers) + 1

        ReDim results(1 To rowCount, 1 To numColumns)
        Dim html2 As HTMLDocument
        Set html2 = New HTMLDocument
        For item = 0 To listings.Length - 1
            r = r + 1
            html2.body.innerHTML = listings.item(item).innerHTML
            results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
            results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
            results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
            results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)

            Set icons = html2.querySelectorAll(".amenity_icon")
            ReDim amenitiesInfo(0 To icons.Length - 1)

            For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
                amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
            Next

            amenities = Join$(amenitiesInfo, vbNewLine) 'place each amenity description on a new line within cell when written out

            results(r, 5) = amenities
            results(r, 6) = html2.querySelector(".offer1").innerText
            results(r, 7) = html2.querySelector(".offer2").innerText
        Next

        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetAmenitiesDescriptions() As Object 'retrieve amenities descriptions from php file on server
    Dim s As String, dict As Object, re As Object, i As Long, arr() 'keys based on classname, short desc, full desc
    ' view regex here: https://regex101.com/r/bII5AL/1
    Set dict = CreateObject("Scripting.Dictionary")
    Set re = CreateObject("vbscript.regexp")

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText

        arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
        For i = LBound(arr) To UBound(arr) Step 3  'build up lookup dictionary for amenities descriptions
            dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)
        Next
    End With
    Set GetAmenitiesDescriptions = dict
End Function

Public Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant
    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long

    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = sPattern
        If .test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                Select Case i Mod 3
                Case 0
                    arrMatches(i) = iMatch.SubMatches.item(0)
                Case 1
                    arrMatches(i) = iMatch.SubMatches.item(1)
                Case 2
                    arrMatches(i) = iMatch.SubMatches.item(2)
                End Select
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

Output:

[Image: output written to the sheet]


References (VBE > Tools > References):

  1. Microsoft HTML Object Library

1 vote

Here is one way of doing it:

Sub test()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim targetTable As HTMLTable
Dim tableRow As HTMLTableRow
Dim tableCell As HTMLTableCell
Dim element As HTMLDivElement
Dim sht As Worksheet
Dim amenitiesString As String
Dim i As Long
Dim j As Long
Set sht = ThisWorkbook.Worksheets("Sheet1")
With req
    .Open "GET", "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955", False
    .send
    doc.body.innerHTML = .responseText
End With

Set targetTable = doc.getElementById("units_small_units") 'You can use units_medium_units or units_large_units to get the info from the other tabs
i = 0
For Each tableRow In targetTable.Rows
    i = i + 1
    j = 0
    For Each tableCell In tableRow.Cells
    amenitiesString = ""
    j = j + 1
        If tableCell.className = "amenities" And tableCell.innerText <> "Amenities" Then
            For Each element In tableCell.getElementsByTagName("div")
                amenitiesString = amenitiesString & element.Title & ","
            Next element
            sht.Cells(i, j).Value = amenitiesString
        ElseIf tableCell.className <> "features" Then
            sht.Cells(i, j).Value = tableCell.innerText
        End If
    Next tableCell
Next tableRow

End Sub

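I am using an HTTP request instead of Internet Explorer to get the HTML. Other than that, I think you can get an idea of how to access the elements you want.

Here is a screenshot of the result.

[Image: screenshot of the result]

The presentation is a bit primitive, but you get the idea :-P

Basically this: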

listing.getElementsByClassName("amenities")(0).innerText

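will return a blank, because there is no inner text in these elements. The information is produced by a script, but it can also be found in the title attribute of the div elements.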
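A minimal, late-bound sketch of reading those title attributes is shown below. AmenityTitles is a hypothetical helper name, and the sketch assumes that CreateObject("htmlfile") is available on the machine (it is the late-bound counterpart of the HTMLDocument used above). The input is the inner html of the amenities cell, without the td wrapper itself.

```vba
' Hypothetical helper: joins the title attributes of all div elements in a fragment.
Public Function AmenityTitles(ByVal cellInnerHtml As String) As String
    Dim doc As Object, divs As Object, i As Long
    Dim title As String, result As String

    Set doc = CreateObject("htmlfile")              ' late-bound MSHTML document
    doc.body.innerHTML = cellInnerHtml
    Set divs = doc.getElementsByTagName("div")

    For i = 0 To divs.Length - 1
        ' Appending vbNullString turns a Null (missing attribute) into "".
        title = divs.Item(i).getAttribute("title") & vbNullString
        If Len(title) > 0 Then
            If Len(result) > 0 Then result = result & ", "
            result = result & title
        End If
    Next i
    AmenityTitles = result
End Function

Public Sub DemoAmenityTitles()
    Debug.Print AmenityTitles( _
        "<DIV title=""Temperature Controlled"" class=""amenity_icon icon_climate""></DIV>" & _
        "<DIV title=""Interior Storage"" class=""amenity_icon icon_interior""></DIV>")
End Sub
```

Unlike the loop in the answer, this joins with ", " and leaves no trailing separator.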

References used:

Microsoft HTML Object Library, WinHTTP Services Version 5.1


0 votes

You can try the jQuery get method, as below:

$.get('url', function(data){

// Loop through elements
$(data).find("ul").find("li").each( function(){

    var text = $(this).text();

} )

} );
