I am trying to scrape some data from the web, but I get the following error message: Error in html_table(nodes_wp)[[1]] : subscript out of bounds
when running
###Loading packages###
library(stringr) # build the URL
library(RCurl)
library(haven)
library(readr)
library(plyr)
library(magrittr)
library("rvest")
library("tictoc")
###Web scraping###
TABLE_BIG=as.data.frame(0)
tic()
for(nr in 1:203540){
link1=paste0("https://ted.europa.eu/udl?uri=TED:NOTICE:", nr, "-2020:DATA:EN:HTML&src=0&tabId=3")
webpage=read_html(link1)
#html info for the table
nodes_wp=html_nodes(webpage, "div#main.container-fluid div.row div#middle-column.col-md-9.col-md-push-3.col-sm-8.col-sm-push-4 div.main-container div.container-fluid div.row div.col-sm-12 div#noticeDisplayFrame.documentDiv.noBg.overflow-dashboard div#mainContent div#docContent table.data")
rs=html_nodes(nodes_wp, "tr")
tab=html_table(nodes_wp)[[1]]
tab_transp=as.data.frame(t(tab$X3))
names(tab_transp)=tab$X1
tab_transp$ID=paste0(nr,"-2020")
#STORE INFO
TABLE_BIG=rbind.fill(TABLE_BIG,tab_transp )
#count time
if(nr%in%seq(5,300000, by=500))
toc()
tic() } #ending loop
toc()
###Exporting to CSV###
write_csv(TABLE_BIG, "C:TED_202001-202004")
Since I am not familiar with R, I don't understand how to fix this.
Can anyone give me some advice?
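The problem is caused by empty nodes. For some notice numbers the selector matches nothing, so html_table(nodes_wp) returns an empty list, and subsetting an empty list with [[1]] throws "subscript out of bounds". If you don't need to keep track of which nodes are empty, you can simply add an if condition that checks whether the rs variable is empty and, if it is, skips the rest of the loop body and starts the next iteration. The code below does exactly that.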
TABLE_BIG=as.data.frame(0)
tic()
for(nr in 1:203540){
link1=paste0("https://ted.europa.eu/udl?uri=TED:NOTICE:", nr, "-2020:DATA:EN:HTML&src=0&tabId=3")
webpage=read_html(link1)
#html info for the table
nodes_wp=html_nodes(webpage, "div#main.container-fluid div.row div#middle-column.col-md-9.col-md-push-3.col-sm-8.col-sm-push-4 div.main-container div.container-fluid div.row div.col-sm-12 div#noticeDisplayFrame.documentDiv.noBg.overflow-dashboard div#mainContent div#docContent table.data")
rs=html_nodes(nodes_wp, "tr")
if (length(rs) == 0) next #start with next iteration if rs is empty
tab=html_table(nodes_wp)[[1]]
tab_transp=as.data.frame(t(tab$X3))
names(tab_transp)=tab$X1
tab_transp$ID=paste0(nr,"-2020")
#STORE INFO
if (nr == 282) print(nr)
TABLE_BIG=rbind.fill(TABLE_BIG,tab_transp )
#count time: report and restart the timer every 500 iterations
if(nr%in%seq(5,300000, by=500)){
toc()
tic()
}
} #ending loop
toc()
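To see why the unguarded line fails, here is a minimal offline sketch. The two HTML snippets are made up for illustration (they are not real TED pages), and the selector is shortened to "table.data" to keep the example small.

```r
library(rvest)

# Hypothetical pages: one contains a table.data element, the other does not.
page_with_table <- read_html(
  '<html><body><table class="data"><tr><td>TI</td><td>Title</td><td>Supplies</td></tr></table></body></html>'
)
page_without_table <- read_html("<html><body><p>Notice not found</p></body></html>")

nodes_ok    <- html_nodes(page_with_table, "table.data")
nodes_empty <- html_nodes(page_without_table, "table.data")

length(nodes_ok)     # 1 match
length(nodes_empty)  # 0 matches

# html_table() on an empty node set gives an empty list ...
tables_empty <- html_table(nodes_empty)
length(tables_empty) # 0

# ... so [[1]] has nothing to pick and raises an error, captured here as text.
err <- tryCatch(tables_empty[[1]], error = function(e) conditionMessage(e))
print(err)
```

Checking length() before subsetting, as in the loop above, avoids the error for pages where nothing matches.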
Does this solve your problem?
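If you do want to keep track of which notices were empty, one possible variation is sketched below. It is only a sketch under assumptions: extract_notice is a hypothetical helper I made up, the pages are inline HTML strings rather than real requests, and the selector is shortened to "table.data".

```r
library(rvest)

# Hypothetical helper: returns a one-row data frame for a page,
# or NULL when the page has no table.data element.
extract_notice <- function(webpage, id) {
  nodes_wp <- html_nodes(webpage, "table.data")
  if (length(nodes_wp) == 0) return(NULL)
  tab <- html_table(nodes_wp)[[1]]
  row <- as.data.frame(t(tab$X3), stringsAsFactors = FALSE)
  names(row) <- tab$X1
  row$ID <- id
  row
}

# Made-up stand-ins for downloaded pages, keyed by notice ID.
pages <- list(
  "1-2020" = '<table class="data"><tr><td>TI</td><td>Title</td><td>Supplies</td></tr><tr><td>CY</td><td>Country</td><td>DE</td></tr></table>',
  "2-2020" = "<p>Notice not found</p>",
  "3-2020" = '<table class="data"><tr><td>TI</td><td>Title</td><td>Services</td></tr><tr><td>CY</td><td>Country</td><td>FR</td></tr></table>'
)

rows <- list()             # collected one-row data frames
empty_ids <- character(0)  # IDs of pages without a table

for (id in names(pages)) {
  row <- extract_notice(read_html(pages[[id]]), id)
  if (is.null(row)) {
    empty_ids <- c(empty_ids, id)  # remember the empty notice and move on
    next
  }
  rows[[length(rows) + 1]] <- row
}

# All rows share the same columns here, so base rbind is enough;
# with differing columns, plyr::rbind.fill(rows) would be the analogue.
result <- do.call(rbind, rows)
print(result)
print(empty_ids)
```

Collecting the rows in a list and binding once at the end also avoids growing TABLE_BIG on every iteration, which tends to get slow over a couple of hundred thousand pages.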