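I am trying to copy data from the website mentioned below. I need the various sizes, prices, amenities, specials, and reserve values. I put together the code below, but I am not able to copy the elements correctly. First, only three elements are being copied, and they are duplicated; I am also not getting any results for amenities and reserve. Could anyone take a look?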
Sub text()
    Dim ie As New InternetExplorer, ws As Worksheet
    Set ws = ThisWorkbook.Worksheets("Unit Data")
    With ie
        .Visible = True
        .Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"
        While .Busy Or .readyState < 4: DoEvents: Wend
        Sheets("Unit Data").Select
        Dim listings As Object, listing As Object, headers(), results()
        Dim r As Long, list As Object, item As Object
        headers = Array("size", "features", "Specials", "Price", "Reserve")
        Set list = .document.getElementsByClassName("units_table")
        '.unit_size medium, .features, .Specials, .price, .Reserve
        Dim rowCount As Long
        rowCount = .document.querySelectorAll(".tab_container li").Length
        ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
        For Each listing In list
            For Each item In listing.getElementsByClassName("unitinfo even")
                r = r + 1
                results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
                results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
                results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
                results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
                results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText
            Next
        Next
        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
        .Quit
    End With
    Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub
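tl;dr
Apologies in advance (to some) for the length of this answer, but I thought I would use this teachable moment to explain in detail what is going on.
The overall approach I use is the same as in your code: find a CSS selector that isolates the rows (the small, medium, and large rows sit in different tabs, but they are all actually still present on the page):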
Set listings = html.querySelectorAll(".unitinfo")
The above yields the rows. As before, we load each one into a new HTMLDocument so that we can use the querySelector/querySelectorAll methods.
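Rows:
Let's look at the first row of html we are retrieving. The sections that follow use this row as a case study in how the information is retrieved: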
5x5</TD> <TD class=features>
<DIV id=a5x5-1 class="icon a5x5">
<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>
<DIV class=display>
<P>More Information</P></DIV></DIV>
<SCRIPT type=text/javascript>
// Refine Search
//
$(function() {
$("#a5x5-1").tooltip({
track: false,
delay: 0,
showURL: false,
left: 5,
top: 5,
bodyHandler: function () {
return " <div class=\"tooltip\"> <div class=\"tooltop\"></div> <div class=\"toolmid clearfix\"> <div class=\"toolcontent\"> <div style=\"text-align:center;width:100%\"> <img alt=\"5 x 5 storage unit\" src=\"/core/resources/images/units/5x5.png\" /> </div> <div class=\"display\">5 x 5</div> <div class=\"description\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div> </div> <div class=\"clearfix\"></div> </div> <div class=\"toolfoot\"></div> <div class=\"clearfix\"></div> </div> "}
});
});
</SCRIPT>
</TD><TD class=rates>
<DIV class="discount_price secondary-color-text standard_price--left">
<DIV class=price_text>Web Rate: </DIV>
<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>
<SCRIPT>
$( document ).ready(function() {
$('.units_table tr.unitinfo').each(function(index, el) {
if ($(this).find('.standard_price').length != '' && $(this).find('.discount_price').length != '') {
$(this).parents('.units_table').addClass('both');
$(this).addClass('also-both');
$(this).find('.rates').addClass('rates_two_column');
}
});
});
</SCRIPT>
</TD><TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>
<DIV class=offer1>Call for Specials </DIV>
<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>
Every row we work with will have similar html in the html2 variable. If you are in doubt, look at the javascript in the function shown above:
$('.units_table tr.unitinfo').each(function(index, el)
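It uses the same selector (but additionally specifies the parent table class and the element type, tr). Basically, that function is being called for every row in the table.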
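Size:
For some reason the opening td tag is dropped for the size cell, so rather than grabbing it by class I look for the start of the closing tag and extract the string up to that point. I do this by passing the value returned by InStr (the position where < is found in the string), minus 1, to the Left$ (typed) function.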
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
This returns 5x5.
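The one-liner above assumes a < is always present; if InStr returns 0, Left$ receives -1 and raises a run-time error. The sketch below wraps the same Left$/InStr idea in a guarded helper. TextBeforeFirstTag is a hypothetical name introduced here for illustration, not part of the original code.

```vba
' Hypothetical helper (not in the original answer): returns the text that
' precedes the first "<" in an HTML fragment, or the whole trimmed string
' when no tag is present.
Public Function TextBeforeFirstTag(ByVal fragment As String) As String
    Dim pos As Long
    pos = InStr(fragment, "<")              ' 0 when "<" is not found
    If pos > 0 Then
        TextBeforeFirstTag = Trim$(Left$(fragment, pos - 1))
    Else
        TextBeforeFirstTag = Trim$(fragment)
    End If
End Function

Public Sub DemoTextBeforeFirstTag()
    ' Sample input taken from the start of the row html shown earlier
    Debug.Print TextBeforeFirstTag("5x5</TD> <TD class=features>")
End Sub
```

In the main loop this could stand in for the direct Left$ call, e.g. results(r, 1) = TextBeforeFirstTag(html2.body.innerHTML).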
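Description:
The description column is populated by the function we saw above (which, remember, is applied to every row). This bit, $("#a5x5-1").tooltip, tells it which element to target, and the return statement of the function supplies html containing a div with class description that holds the text we want. As we are not using a browser, and I am on 64-bit Windows, I cannot evaluate this script, but I can use Split to extract the string (the description) between "description\"> and the start of the closing tag of the associated div: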
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
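Returns:
"Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle."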
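To make the nested Split easier to follow, here is a minimal sketch of the same extraction on a hard-coded sample string. ExtractDescription is a hypothetical helper name, and the guards for a missing marker are an addition for safety rather than something the original answer does.

```vba
' Hypothetical helper: pulls the text between  "description\">  and the next
' </div> out of the tooltip script text. Returns an empty string when the
' marker is not found.
Public Function ExtractDescription(ByVal scriptText As String) As String
    Const START_MARKER As String = """description\"">"   ' i.e.  "description\">
    Dim parts() As String
    parts = Split(scriptText, START_MARKER)
    If UBound(parts) >= 1 Then
        If Len(parts(1)) > 0 Then
            ExtractDescription = Split(parts(1), "</div>")(0)
        End If
    End If
End Function

Public Sub DemoExtractDescription()
    Dim sample As String
    ' Shortened sample modelled on the bodyHandler return string in the row html
    sample = "<div class=\""description\"">Think of it like a standard closet.</div> </div>"
    Debug.Print ExtractDescription(sample)
End Sub
```

The first Split cuts the script at the marker, so element 1 begins with the description; the second Split cuts that piece at the first closing div and keeps what comes before it.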
Rate type and price:
These are straightforward and are targeted by class name:
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
Returns (respectively):
Web Rate, $39.00
Amenities:
This is where things get a little trickier.
Let's re-examine the html shown above that relates to this row:
<TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD>
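We can see that the parent td has a class of amenities and that it has child div elements with compound class names; in each case the latter part, e.g. icon_climate, serves as the identifier for the type of amenity.
When you hover over these on the page, tooltip information is displayed:
We can trace the location of this tooltip in the html of the actual page:
As you hover over the different amenities, this content updates.
Long story short (he says, halfway down the page!), this content is being updated from a php file on the server. We can request that file and build a dictionary that maps the class name of each amenity, e.g. amenity_icon icon_climate, to the associated descriptions. Because it is a compound class, the spaces need replacing with "." when converting to the corresponding CSS selector, .amenity_icon.icon_climate. You can explore the php file at https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php.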
The php file:
Let's look at the start of the file in order to dissect the basic unit of the repeating pattern:
function LoadTooltips() {
$(".units_table .amenity_icon.icon_climate").tooltip({
track: false,
delay: 0,
showURL: false,
left: -126,
top: -100,
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
}
});
The function responsible for updating the tooltips is LoadTooltips. CSS class selectors are used to target each icon:
$(".units_table .amenity_icon.icon_climate").tooltip
And we have the bodyHandler specifying the return text:
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
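We have three useful bits of information that appear in repeating groups: the class name selector for the element, the short description, and the long description, e.g.
.amenity_icon.icon_climate : the CSS selector. We use this to map the php file descriptions to the class names of the amenity icons in our row.
Temperature Controlled : the short description; found inside the h4 tag of the tooltip function's return text.
Units are heated and/or cooled. See manager for details. : the long description; found inside the p tag of the tooltip function's return text.
I wrote two functions, GetMatches and GetAmenitiesDescriptions, which use a regex to extract all the repeating items for each icon and return a dictionary with the CSS selector as the key and short description : long description as the value.
When I gather all the icons in each row: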
Set icons = html2.querySelectorAll(".amenity_icon")
I use the dictionary to return the tooltip description based on the class name of each icon:
For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next
I then join the descriptions with vbNewLine so that each one appears on its own line within the output cell.
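The lookup and join steps can be tried in isolation with a hand-built dictionary. In the sketch below, ClassNameToSelector is a hypothetical helper name, icon_unknown is a made-up class used only to show a missing key, and the Exists check is an addition: indexing a Scripting.Dictionary with a missing key, as the loop above does, silently adds an empty entry rather than raising an error.

```vba
' Hypothetical helper: turns a compound class name such as
' "amenity_icon icon_climate" into the selector ".amenity_icon.icon_climate".
Public Function ClassNameToSelector(ByVal className As String) As String
    ClassNameToSelector = "." & Replace$(Trim$(className), " ", ".")
End Function

Public Sub DemoAmenityLookup()
    Dim dict As Object, classNames As Variant
    Dim info() As String, i As Long, key As String

    ' Stand-in for the dictionary returned by GetAmenitiesDescriptions
    Set dict = CreateObject("Scripting.Dictionary")
    dict(".amenity_icon.icon_climate") = _
        "Temperature Controlled: Units are heated and/or cooled. See manager for details."

    ' Second class name is invented to demonstrate the missing-key branch
    classNames = Array("amenity_icon icon_climate", "amenity_icon icon_unknown")
    ReDim info(LBound(classNames) To UBound(classNames))

    For i = LBound(classNames) To UBound(classNames)
        key = ClassNameToSelector(classNames(i))
        If dict.Exists(key) Then
            info(i) = dict(key)
        Else
            info(i) = key & " (no description found)"
        End If
    Next i

    ' One description per line, as they would appear inside a single cell
    Debug.Print Join(info, vbNewLine)
End Sub
```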
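You can explore the regex here: https://regex101.com/r/bII5AL/1
The regex uses the | (Or) syntax so that all matched patterns come back in a single list.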
arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
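As I want a different submatch each time (0, 1 or 2, a.k.a. CSS class selector, short desc, long desc), I use Select Case i Mod 3, with a counter variable i, to extract the appropriate submatch.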
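As a rough illustration of why the counter works, the sketch below runs the same pattern over a two-line sample modelled on the php file. It assumes the matches always arrive in the order selector, h4, p, which is what the Select Case relies on. Indexing SubMatches with i Mod 3 directly is a compact equivalent of the three Case branches, not the answer's own wording.

```vba
Public Sub DemoSubMatchCycle()
    Dim re As Object, matches As Object, s As String, i As Long

    ' Two-line sample modelled on one repeating unit of the php file
    s = "$("".units_table .amenity_icon.icon_climate"").tooltip({" & vbLf & _
        "return ""<h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p>"""

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>"

    Set matches = re.Execute(s)
    For i = 0 To matches.Count - 1
        ' Each match fills only one of the three groups; i Mod 3 picks it:
        ' 0 = CSS selector, 1 = short description, 2 = long description
        Debug.Print i Mod 3, matches(i).SubMatches(i Mod 3)
    Next i
End Sub
```

If the file ever contained an h4 or p outside this pattern, the cycle would drift out of step, so checking which submatch is non-empty would be a more defensive design.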
Example of the matches mapped from the php file:
Specials:
We are back to class selectors. offer2 is not populated, so you could remove it.
results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText
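Returns (respectively):
Call for Specials, empty string
Closing remarks:
So, the above walks you through one row. It is simply rinse and repeat in a loop over all the rows. For efficiency, the data is added to an array, results, which is then written to Sheet1 in one go. There are some minor improvements I can see, but this is fast.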
VBA:
Option Explicit

Public Sub GetInfo()
    Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object
    Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    Set amenitiesDescriptions = GetAmenitiesDescriptions

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
        html.body.innerHTML = s

        Dim headers(), results(), listings As Object, amenities As String
        headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")
        Set listings = html.querySelectorAll(".unitinfo")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long
        rowCount = listings.Length
        numColumns = UBound(headers) + 1
        ReDim results(1 To rowCount, 1 To numColumns)

        Dim html2 As HTMLDocument
        Set html2 = New HTMLDocument

        For item = 0 To listings.Length - 1
            r = r + 1
            html2.body.innerHTML = listings.item(item).innerHTML
            results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
            results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
            results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
            results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)

            Set icons = html2.querySelectorAll(".amenity_icon")
            ReDim amenitiesInfo(0 To icons.Length - 1)
            For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
                amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
            Next
            amenities = Join$(amenitiesInfo, vbNewLine) 'place each amenity description on a new line within cell when written out
            results(r, 5) = amenities

            results(r, 6) = html2.querySelector(".offer1").innerText
            results(r, 7) = html2.querySelector(".offer2").innerText
        Next

        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetAmenitiesDescriptions() As Object 'retrieve amenities descriptions from php file on server
    Dim s As String, dict As Object, re As Object, i As Long, arr() 'keys based on classname, short desc, full desc
    ' view regex here: https://regex101.com/r/bII5AL/1
    Set dict = CreateObject("Scripting.Dictionary")
    Set re = CreateObject("vbscript.regexp")

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
        arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
        For i = LBound(arr) To UBound(arr) Step 3 'build up lookup dictionary for amenities descriptions
            dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)
        Next
    End With
    Set GetAmenitiesDescriptions = dict
End Function

Public Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant
    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long

    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = sPattern
        If .test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                Select Case i Mod 3
                Case 0
                    arrMatches(i) = iMatch.SubMatches.item(0)
                Case 1
                    arrMatches(i) = iMatch.SubMatches.item(1)
                Case 2
                    arrMatches(i) = iMatch.SubMatches.item(2)
                End Select
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function
Output:
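References (VBE > Tools > References): the early-bound HTMLDocument type used above requires Microsoft HTML Object Library.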
Here is one way to do it:
Sub test()
    Dim req As New WinHttpRequest
    Dim doc As New HTMLDocument
    Dim targetTable As HTMLTable
    Dim tableRow As HTMLTableRow
    Dim tableCell As HTMLTableCell
    Dim element As HTMLDivElement
    Dim sht As Worksheet
    Dim amenitiesString As String
    Dim i As Long
    Dim j As Long

    Set sht = ThisWorkbook.Worksheets("Sheet1")
    With req
        .Open "GET", "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955", False
        .send
        doc.body.innerHTML = .responseText
    End With

    Set targetTable = doc.getElementById("units_small_units") 'You can use units_medium_units or units_large_units to get the info from the other tabs
    i = 0
    For Each tableRow In targetTable.Rows
        i = i + 1
        j = 0
        For Each tableCell In tableRow.Cells
            amenitiesString = ""
            j = j + 1
            If tableCell.className = "amenities" And tableCell.innerText <> "Amenities" Then
                For Each element In tableCell.getElementsByTagName("div")
                    amenitiesString = amenitiesString & element.Title & ","
                Next element
                sht.Cells(i, j).Value = amenitiesString
            ElseIf tableCell.className <> "features" Then
                sht.Cells(i, j).Value = tableCell.innerText
            End If
        Next tableCell
    Next tableRow
End Sub
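I am using an HTTP request instead of Internet Explorer to get the HTML. Apart from that, I think this gives you an idea of how to access the elements you want.
Here is a screenshot of the result.
The presentation is a bit primitive, but you get the idea :-P
Basically, this: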
listing.getElementsByClassName("amenities")(0).innerText
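will return a blank, because those elements contain no inner text. The information is generated by a script, but it can also be found in the title attribute of the div elements.
References used:
Microsoft HTML Object Library and WinHTTP Services Version 5.1
You could try the jQuery get method, as below: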
$.get('url', function (data) {
    // Loop through elements
    $(data).find("ul").find("li").each(function () {
        var text = $(this).text();
    });
});