Subscript out of bounds error when web scraping

Question (votes: 1, answers: 1)

I want to scrape some data from the web, but I get the following error message when running the code below: Error in html_table(nodes_wp)[[1]] : subscript out of bounds

###Loading packages###

library(stringr) # build the URL
library(RCurl)
library(haven)
library(readr)
library(plyr)
library(magrittr)
library("rvest")
library("tictoc")

###Web scraping###

TABLE_BIG=as.data.frame(0)

tic()

for(nr in 1:203540){
     link1=paste0("https://ted.europa.eu/udl?uri=TED:NOTICE:", nr, "-2020:DATA:EN:HTML&src=0&tabId=3")
     webpage=read_html(link1)
     #html info for the table
     nodes_wp=html_nodes(webpage, "div#main.container-fluid div.row div#middle-column.col-md-9.col-md-push-3.col-sm-8.col-sm-push-4 div.main-container div.container-fluid div.row div.col-sm-12 div#noticeDisplayFrame.documentDiv.noBg.overflow-dashboard div#mainContent div#docContent table.data")  
    rs=html_nodes(nodes_wp, "tr")
     tab=html_table(nodes_wp)[[1]]
     tab_transp=as.data.frame(t(tab$X3))
     names(tab_transp)=tab$X1
     tab_transp$ID=paste0(nr,"-2020")
     #STORE INFO
     TABLE_BIG=rbind.fill(TABLE_BIG,tab_transp )
     #count time
     if(nr%in%seq(5,300000, by=500))
       toc()
     tic()
} #ending loop

toc()

###Exporting to CSV###

write_csv(TABLE_BIG, "C:TED_202001-202004")

Since I am new to R, I don't understand how to fix this.

Can anyone offer some advice?

r rvest
1 answer
0 votes

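The problem is caused by empty node sets. When the selector matches nothing on a page, html_table(nodes_wp) returns an empty list, and subsetting an empty list with [[1]] throws the "subscript out of bounds" error. If you don't need to keep track of which pages were empty, you can simply add an if condition that checks whether the rs variable is empty and, if it is, skips the rest of the loop body and moves on to the next iteration. The code below does exactly that.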

TABLE_BIG=as.data.frame(0)

tic()

for(nr in 1:203540){
  link1=paste0("https://ted.europa.eu/udl?uri=TED:NOTICE:", nr, "-2020:DATA:EN:HTML&src=0&tabId=3")
  webpage=read_html(link1)
  #html info for the table
  nodes_wp=html_nodes(webpage, "div#main.container-fluid div.row div#middle-column.col-md-9.col-md-push-3.col-sm-8.col-sm-push-4 div.main-container div.container-fluid div.row div.col-sm-12 div#noticeDisplayFrame.documentDiv.noBg.overflow-dashboard div#mainContent div#docContent table.data")  
  rs=html_nodes(nodes_wp, "tr")
  if (length(rs) == 0) next #start with next iteration if rs is empty
  tab=html_table(nodes_wp)[[1]]
  tab_transp=as.data.frame(t(tab$X3))
  names(tab_transp)=tab$X1
  tab_transp$ID=paste0(nr,"-2020")
  #STORE INFO
  if (nr == 282) print(nr)
  TABLE_BIG=rbind.fill(TABLE_BIG,tab_transp )
  #count time
  #count time: report the elapsed time every 500 notices, then restart the timer
  if (nr %% 500 == 5) {
    toc()
    tic()
  }
} #ending loop

toc()
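
To see the check in isolation, here is a minimal sketch that runs without any network access. The helper name first_table_or_null, the shortened "table.data" selector, and the two in-memory pages are assumptions made up for illustration; they are not taken from the TED site or from the code above.

```r
library(rvest)

# Hypothetical helper: return the first matching table as a data frame,
# or NULL when the selector matches nothing on the page.
first_table_or_null <- function(page, css = "table.data") {
  nodes <- html_nodes(page, css)
  if (length(nodes) == 0) return(NULL)  # empty node set: nothing to subset
  html_table(nodes)[[1]]
}

# Two tiny in-memory pages stand in for real notice pages (made-up content).
page_with_table <- read_html(
  "<html><body><table class='data'>
     <tr><td>TI</td><td>Title</td><td>Example notice</td></tr>
     <tr><td>CY</td><td>Country</td><td>BE</td></tr>
   </table></body></html>"
)
page_without_table <- read_html("<html><body><p>No notice here</p></body></html>")

tab <- first_table_or_null(page_with_table)             # data frame with 2 rows
missing_tab <- first_table_or_null(page_without_table)  # NULL, no error

# The original error comes from subsetting an empty list with [[1]]:
err <- tryCatch(list()[[1]], error = function(e) conditionMessage(e))
print(err)
```

Inside the loop, a NULL result plays the same role as the length(rs) == 0 check: call next when the helper returns NULL and only transpose and bind the table otherwise.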

Does this solve your problem?
