使用带有动态/部分已知 URL 循环的 MS XML、6.0 库发送 HTTP 请求

问题描述 投票:0回答:1

我正在尝试在 www.statmuse.com 上循环浏览投手的比赛日志。主要问题是,由于我试图循环执行此操作,因此部分 URL 目前未知。

例如,查看 Martín Pérez 2024 年的比赛日志,网址为:https://www.statmuse.com/mlb/player/martin-perez-46483/game-log

现在,在尝试循环不同的投手时,这个 5 位数字序列(在我的示例中为 46483)是可变的,并且在每个投手的比赛日志之间发生变化。

我整理了以下代码。当然,问题是循环遍历 10000 和 99999,试图找到正确的 5 位数字序列,这导致我的 Excel 崩溃并且没有响应。谁能建议一种更有效的方法来实现这一目标?我很抱歉,这是我第一个使用 HTTP 请求和类似内容的项目,所以我确信代码是一团糟。

代码:

Dim ws As Worksheet, PLws As Worksheet
Set ws = ThisWorkbook.Sheets("Sheet1")
Set PLws = ThisWorkbook.Sheets("Pitcher List T")

Set rng = PLws.Range("B1:B1")
For Each cc In rng

    Dim httpRequest As MSXML2.XMLHTTP60:    Set httpRequest = New MSXML2.XMLHTTP60
    Dim htmldoc As HTMLDocument:            Set htmldoc = New HTMLDocument
    
    playerName = CStr(cc.Value)
    Dim baseURL As String
    baseURL = "https://www.statmuse.com/mlb/player/" & playerName & "-"

    Dim lastRow As Long
    Dim startNumber As Long
    Dim endNumber As Long
    startNumber = 10000 ' this loop is the issue (i think)
    endNumber = 99999   ' this loop is the issue (i think)

Dim i As Long, target As Long
For i = startNumber To endNumber ' this loop is the issue (i think)

    Dim url As String
    url = baseURL & CStr(i) & "/game-log"
    
    If CheckUrlExists(url) Then
        target = i
        Debug.Print "the target i is: " & target
    End If

Next i
    
    Dim Murl As String
    Murl = baseURL & target & "/game-log"
    
    httpRequest.Open "GET", Murl, False
    httpRequest.send
    htmldoc.body.innerHTML = httpRequest.responseText

这是功能:

Public Function CheckUrlExists(url) As Boolean
        
    On Error GoTo CheckUrlExists_Error
    
    Dim xmlhttp As MSXML2.XMLHTTP60:    Set xmlhttp = New MSXML2.XMLHTTP60
    Dim htmldoc As HTMLDocument:            Set htmldoc = New HTMLDocument
    Dim H2el As Object
    
    xmlhttp.Open "GET", url, False
    xmlhttp.send
    htmldoc.body.innerHTML = xmlhttp.responseText
    
    If xmlhttp.Status = 200 Then
        For Each H2el In htmldoc.getElementsByTagName("h2")
            If InStr(1, ChangeAccent(H2el.innerText), CStr(cc.Offset(0, -1).Value)) > 0 Then
                CheckUrlExists = True
            End If
        Next H2el
    Else
        CheckUrlExists = False
    End If
    
    Exit Function
    
CheckUrlExists_Error:
    CheckUrlExists = False
    
End Function

谢谢你

excel vba http xmlhttprequest msxml
1个回答
0
投票
  • URL中的序列号并不总是5位数字。例如。

https://www.statmuse.com/nba/player/lebron-james-1780

  • 对于每个 POST 请求,远程服务器对 200(正常)或 502(错误网关)状态代码的响应时间各不相同。在我的测试中,通常需要半秒到一秒。使用此方法验证 90,000 个 URL 是不切实际的。

  • 确实,尝试验证大量 URL 可能会给网站服务器带来巨大负担。事实上,它可以被解释为一种小型 DDoS(分布式拒绝服务)攻击。

  • 下面的代码使用

    search
    模拟 https://www.statmuse.com/ 上的
    Internet Explorer
    ,虽然有点过时,但仍然可以完成工作。 Selenium WebDriver可能是更好的选择。

Sub Demo()
    Dim t: t = Timer
    Debug.Print GetURL("martin perez")
    Debug.Print GetURL("lebron james")
    Debug.Print Timer - t
End Sub
Function GetURL(ByVal sFullName As String) As String
    Dim IE As Object
    Dim doc As Object
    ' Initialize Internet Explorer
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = False
    ' Navigate to the webpage
    IE.Navigate "https://www.statmuse.com/ask?q=" & Replace(sFullName, " ", "-")
    ' Wait for IE to finish loading the page
    Do While IE.Busy Or IE.readyState <> 4
        DoEvents
    Loop
    ' Get the document object
    Set doc = IE.Document
    ' Get the HTML content of the webpage
    'Dim htmlContent As String:  htmlContent = doc.DocumentElement.outerHTML
    Dim oLink As Object
    Set oLink = doc.getElementsByTagName("link")(0)
    ' Extract the href attribute value from the link tag
    If Not oLink Is Nothing Then
        GetURL = oLink.getAttribute("href")
    Else
        GetURL = ""
    End If
    ' Close IE
    IE.Quit
    Set IE = Nothing
End Function

输出:

https://www.statmuse.com/mlb/player/mart%C3%ADn-p%C3%A9rez-46483
https://www.statmuse.com/nba/player/lebron-james-1780
© www.soinside.com 2019 - 2024. All rights reserved.