Scraping a single-page website

Question Votes: -4 Answers: 2

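I want to get data from bet365.com, but when I download the page source it does not contain that data. From what I have read, a single-page application does not load all of its content at once. I tried the code below but did not get the data I need. Can anyone help?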

    public string GetGeneratedHTML(string url)
    {
        URL = url;
        Thread t = new Thread(new ThreadStart(WebBrowserThread));
        t.SetApartmentState(ApartmentState.STA);
        t.Start();
        t.Join();

        return GeneratedSource;
    }

    private void WebBrowserThread()
    {
        WebBrowser wb = new WebBrowser();
        wb.Navigate(URL);

        wb.DocumentCompleted +=
            new WebBrowserDocumentCompletedEventHandler(
                wb_DocumentCompleted);

        while (wb.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        //Added this line, because the final HTML takes a while to show up
        GeneratedSource = wb.Document.Body.InnerHtml;

        wb.Dispose();
    }

    private void wb_DocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser wb = (WebBrowser)sender;
        GeneratedSource = wb.Document.Body.InnerHtml;
    }
c# web-scraping
2 Answers
1
vote

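Use the Network tab of your browser's developer tools to see which REST endpoints the site calls to fetch its data. Then, instead of scraping the HTML, call those endpoints directly and read the data from the response.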
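
As a rough illustration of that suggestion, here is a minimal sketch using `HttpClient`. It assumes the Network tab shows a plain HTTP GET that returns JSON; the class name `EndpointClient`, the endpoint URL you pass in, and the header values are all placeholders, not anything taken from the site in question. If the site loads its data some other way (a WebSocket, or requests that need session cookies or tokens), this will not be enough on its own.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class EndpointClient
{
    // A single shared HttpClient is reused for all requests.
    private static readonly HttpClient Http = new HttpClient();

    // Builds a GET request resembling what the browser sends.
    // The header values are placeholders (assumptions); copy the real
    // ones from the request you see in the Network tab.
    public static HttpRequestMessage BuildRequest(string endpointUrl)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, endpointUrl);
        request.Headers.Accept.ParseAdd("application/json");
        request.Headers.UserAgent.ParseAdd("Mozilla/5.0");
        return request;
    }

    // Sends the request and returns the raw response body as a string.
    // Throws HttpRequestException on a non-success status code.
    public static async Task<string> FetchAsync(string endpointUrl)
    {
        using (var request = BuildRequest(endpointUrl))
        using (var response = await Http.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

You would call it with whatever URL the Network tab shows, for example `string json = await EndpointClient.FetchAsync(url);`, and then parse the result with the JSON library of your choice.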


0
votes

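You could set up a delayed event or timer that periodically checks whether the new data/HTML has appeared on the page, and then handle it in a function of your own, much like your wb_DocumentCompleted. It is not very efficient, but it is very accurate. Good luck!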

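    // Requires: using System.Timers;
    // wbProfile is the browser control that is displaying the page.
    protected System.Timers.Timer MonitorTimer = new System.Timers.Timer();
    protected object TimerLock = new object();

    public void Initialize()
    {
        MonitorTimer.Elapsed += new ElapsedEventHandler(UpdateEvent);
        MonitorTimer.Interval = 1000; // check once per second
        MonitorTimer.Enabled = true;
    }

    // Note: Elapsed is raised on a thread-pool thread unless
    // MonitorTimer.SynchronizingObject is set, and the browser control
    // must be accessed from the thread that created it.
    public void UpdateEvent(object source, ElapsedEventArgs e)
    {
        lock (TimerLock)
        {
            var doc = (mshtml.HTMLDocument)wbProfile.Document;
            if (doc == null || doc.body == null) return; // page not ready yet

            // What you are looking for that only appears later. -->
            string html = doc.body.innerHTML;
            if (html != null && html.IndexOf("foo") != -1)
            {
                // Do something useful..
            }
        }
    }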
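
A variation on the same idea, sketched under the assumption that you keep the dedicated STA thread from the question: poll on that thread with a timeout instead of using a separate timer, so the browser control is only ever touched from the thread that created it. `HtmlPoller` and `WaitForMarker` are hypothetical names made up for this example, and "foo" stands for whatever text only shows up once the data has loaded.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class HtmlPoller
{
    // Calls readHtml repeatedly until its result contains marker or the
    // timeout elapses. Returns the matching HTML, or null on timeout.
    // pump runs between checks; on a WinForms STA thread, pass
    // Application.DoEvents so the browser can keep loading the page.
    public static string WaitForMarker(
        Func<string> readHtml, string marker, TimeSpan timeout,
        Action pump = null, int pollIntervalMs = 50)
    {
        var clock = Stopwatch.StartNew();
        do
        {
            string html = readHtml();
            if (html != null && html.IndexOf(marker, StringComparison.Ordinal) >= 0)
                return html;

            if (pump != null) pump();
            Thread.Sleep(pollIntervalMs);
        } while (clock.Elapsed < timeout);

        return null; // marker never appeared within the timeout
    }
}
```

Inside `WebBrowserThread`, after `wb.Navigate(URL)`, the call might look like `GeneratedSource = HtmlPoller.WaitForMarker(() => wb.Document?.Body?.InnerHtml, "foo", TimeSpan.FromSeconds(30), Application.DoEvents);`. A null result means the marker never showed up, which is worth handling separately from an empty page.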