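I want to fetch data from bet365.com, but when I download the page source, it does not contain that data. From what I have read, a single-page application does not load all of its content up front. I tried the code below, but it does not return the data I need. Can anyone help?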
// Requires: using System.Threading; using System.Windows.Forms;
private string URL;
private string GeneratedSource;

public string GetGeneratedHTML(string url)
{
    URL = url;
    // The WebBrowser control must run on an STA thread.
    Thread t = new Thread(new ThreadStart(WebBrowserThread));
    t.SetApartmentState(ApartmentState.STA);
    t.Start();
    t.Join();
    return GeneratedSource;
}

private void WebBrowserThread()
{
    using (WebBrowser wb = new WebBrowser())
    {
        // Subscribe before navigating so the event cannot be missed.
        wb.DocumentCompleted +=
            new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);
        wb.Navigate(URL);
        // Pump messages until the initial document has finished loading.
        while (wb.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();
        // Added this line, because the final HTML takes a while to show up
        GeneratedSource = wb.Document.Body.InnerHtml;
    }
}

private void wb_DocumentCompleted(object sender,
    WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    GeneratedSource = wb.Document.Body.InnerHtml;
}
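Use the Network tab of your browser's developer tools to see which REST endpoints the page calls to get its data. Then, instead of scraping the HTML, call those endpoints directly and read the data from the response.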
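As a rough illustration of calling an endpoint directly, here is a minimal sketch using `HttpClient`. The class name `EndpointClient`, its methods, and the example URL are all hypothetical; the real endpoint has to be taken from the Network tab, and it may well expect extra headers, cookies, or tokens that this sketch does not send.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper; not part of any library.
public static class EndpointClient
{
    // One shared HttpClient instance for all requests.
    private static readonly HttpClient Http = new HttpClient();

    // Builds a GET request that asks for JSON and sends a browser-like User-Agent.
    public static HttpRequestMessage BuildRequest(string endpointUrl)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, endpointUrl);
        request.Headers.Accept.ParseAdd("application/json");
        request.Headers.UserAgent.ParseAdd("Mozilla/5.0");
        return request;
    }

    // Sends the request and returns the response body as a string.
    // Throws HttpRequestException on a non-success status code.
    public static async Task<string> FetchAsync(string endpointUrl)
    {
        using (var request = BuildRequest(endpointUrl))
        using (var response = await Http.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Usage would look something like `string json = await EndpointClient.FetchAsync("https://example.com/api/data");`, where the URL is a placeholder for whatever endpoint you actually observe.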
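You can try setting up a delayed event or a timer that checks whether the new data/HTML has become available on the page, and then handle it in a function of your own, just as you already do with wb_DocumentCompleted. It is not very efficient, but it is reliable. Good luck!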
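// Requires: using System.Timers; and a reference to the mshtml interop assembly.
protected System.Timers.Timer MonitorTimer = new System.Timers.Timer();
protected object TimerLock = new object();

public void Initialize()
{
    MonitorTimer.Elapsed += new ElapsedEventHandler(UpdateEvent);
    MonitorTimer.Interval = 1000; // check once per second
    MonitorTimer.Enabled = true;
}

// Note: Elapsed is raised on a thread-pool thread unless SynchronizingObject is set.
public void UpdateEvent(object source, ElapsedEventArgs e)
{
    lock (TimerLock)
    {
        // wbProfile is the browser control being monitored (declared elsewhere).
        mshtml.HTMLDocument doc = (mshtml.HTMLDocument)wbProfile.Document;
        string html = doc.body != null ? doc.body.innerHTML : null;
        // What you are looking for that only appears later. -->
        if (html != null && html.IndexOf("foo") != -1)
        {
            // Do something useful..
        }
    }
}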
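The same polling idea can also be written as a small helper with a timeout, so the check cannot run forever if the content never appears. This is only a sketch: `Poller`, `WaitUntil`, and the 50 ms pause are made-up names and values, not part of any framework.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper; not part of any library.
public static class Poller
{
    // Evaluates `condition` repeatedly until it returns true or `timeout` elapses.
    // `betweenChecks` (optional) runs after each failed check; on a WinForms STA
    // thread you could pass Application.DoEvents so the browser keeps loading.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout,
                                 Action betweenChecks = null)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (clock.Elapsed < timeout)
        {
            if (condition())
                return true;
            if (betweenChecks != null)
                betweenChecks();
            Thread.Sleep(50); // short pause between checks
        }
        // One last check in case the condition became true during the final pause.
        return condition();
    }
}
```

In the question's WebBrowserThread, this could be called after the ReadyState loop, for example `Poller.WaitUntil(() => wb.Document.Body.InnerHtml.Contains("foo"), TimeSpan.FromSeconds(10), Application.DoEvents);`, where "foo" stands for some marker text that only shows up once the data has loaded.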