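I am following this tutorial, RSelenium and scraping, and everything worked fine until I tested the navigate_click() function. (The set_names arguments differ from the tutorial because my source website is different.)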
navigate_click <- function() {
  webElem <- remDr$findElement(using = "class name",
                               "google-visualization-table-div-page")
  Sys.sleep(0.5)
  webElem$clickElement()
  remDr$getPageSource()[[1]] %>%
    read_xml() %>%
    xml_ns_strip() %>%
    xml_find_all(xpath = '//td') %>%
    xml_text() %>%
    set_names(c("PublicationTitle", "County", "Place_of_Publication", "Library")) %>%
    as.list() %>% as_tibble()
}
It returns an error from read_xml: xmlParseEntityRef: no name [68]
Here is the traceback:
> navigate_click()
Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html, :
xmlParseEntityRef: no name [68]
11. read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,
options = options)
10. read_xml.character(.)
9. read_xml(.)
8. function_list[[i]](value)
7. freduce(value, `_function_list`)
6. `_fseq`(`_lhs`)
5. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
2. remDr$getPageSource()[[1]] %>% read_xml() %>% xml_ns_strip() %>%
xml_find_all(xpath = "//td") %>% xml_text() %>% set_names(c("PublicationTitle",
"County", "Place_of_Publication", "Library")) %>% as.list() %>%
as_tibble()
1. navigate_click()
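I find the blog you are following a little confusing; it is not clear to me how the navigate_click function is supposed to work, because it takes the HTML source and calls read_xml() on it. Although some HTML pages happen to be strictly valid XML, most are not well-formed XML. In those cases read_xml will throw an error.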
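As a minimal sketch of that difference, the snippet below feeds the same string to both parsers. The HTML string is a made-up example (not taken from the site in the question); it contains a bare `&`, which is a common reason for strict XML parsing to fail on real-world HTML. I am assuming that is similar to what triggers the error in the question, but I have not checked the page source.

```r
library(xml2)

# Hypothetical snippet: valid enough for a browser, but the bare "&"
# is not allowed in well-formed XML.
html <- "<html><body><table><tr><td>Fish & Chips</td></tr></table></body></html>"

# read_xml() is strict: capture the error message instead of stopping.
strict <- tryCatch(read_xml(html), error = function(e) conditionMessage(e))
print(strict)

# read_html() uses the forgiving HTML parser and copes with the same input.
lenient <- read_html(html)
cell_text <- xml_text(xml_find_first(lenient, "//td"))
print(cell_text)
```

If `strict` comes back as a character string rather than a parsed document, the strict parser rejected the input, while `read_html()` still returns a usable document.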
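Fortunately, the xml2 package also has a read_html function, which will parse your page without any problems. However, that alone will not fix your function: when you select the td elements and extract their text, you get back a single flat character vector holding every cell, so the four column names passed to set_names do not line up with it.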
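To illustrate the flat-vector problem, here is a small sketch on a made-up two-row table (the cell values are placeholders, not data from the site). It shows that `xml_text()` over all `td` nodes returns one vector with every cell in it, and one possible way to fold it back into rows if you know the number of columns. The column names are the ones from the question.

```r
library(xml2)

# Hypothetical table with 2 rows and 4 columns.
page <- read_html('<table>
  <tr><td>Title A</td><td>County A</td><td>Place A</td><td>Library A</td></tr>
  <tr><td>Title B</td><td>County B</td><td>Place B</td><td>Library B</td></tr>
</table>')

# One flat vector: 8 cells, not 4 named columns.
cells <- xml_text(xml_find_all(page, "//td"))
print(length(cells))

# Reshape row by row, assuming every row has exactly length(cols) cells.
cols <- c("PublicationTitle", "County", "Place_of_Publication", "Library")
rows <- matrix(cells, ncol = length(cols), byrow = TRUE,
               dimnames = list(NULL, cols))
df <- as.data.frame(rows, stringsAsFactors = FALSE)
print(df)
```

This reshaping breaks as soon as a row has a different number of cells, which is one reason a dedicated table reader is usually the safer choice.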
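In any case, the rvest package makes it much easier to read tables from parsed HTML. Assuming you have run install.packages("rvest") and created remDr as in the example, you should be able to do the following: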
library(xml2)    # read_html(), xml_find_all()
library(rvest)   # html_table(), %>%
library(tibble)  # as_tibble()

remDr$navigate("https://view-awesome-table.com/-Lz90gtPDhIyGUzmdMrE/view")
webElem <- remDr$findElement(using = "class name", "google-visualization-table-div-page")
Sys.sleep(0.5)
webElem$clickElement()

# Parse the rendered page as HTML, read the table, and keep columns 1, 2, 3 and 7
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  xml_find_all(xpath = "//*[@class = 'google-visualization-table-table']") %>%
  rvest::html_table() %>%
  `[[`(1) %>%
  `[`(c(1, 2, 3, 7)) %>%
  as_tibble()
#> # A tibble: 15 x 4
#> PublicationTitle County Place_of_Publicati~ Library
#> <chr> <chr> <chr> <chr>
#> 1 ALFRETON AND DISTRICT ADVERT~ Derbyshi~ "Alfreton and Ripl~ British Library
#> 2 ALFRETON AND DISTRICT ADVERT~ Derbyshi~ "Alfreton and Ripl~ Derbyshire: County Ha~
#> 3 ALFRETON AND DISTRICT COMING~ Derbyshi~ "Alfreton" British Library
#> 4 ALFRETON AND DISTRICT COMING~ Derbyshi~ "Alfreton" Derbyshire: County Ha~
#> 5 ALFRETON AND DISTRICT ECHO Derbyshi~ "Alfreton" British Library
#> 6 ALFRETON AND DISTRICT ECHO Derbyshi~ "Alfreton" Derbyshire: County Ha~
#> 7 ALFRETON AND RIPLEY ECHO Derbyshi~ "Chesterfield" British Library
#> 8 ALFRETON AND RIPLEY ECHO Derbyshi~ "Chesterfield" Derbyshire: Alfreton
#> 9 ALFRETON ARGUS Derbyshi~ "Alfreton" British Library
#> 10 ALFRETON ARGUS Derbyshi~ "Alfreton" Derbyshire: County Ha~
#> 11 ALFRETON JOURNAL Derbyshi~ "" British Library
#> 12 ALFRETON JOURNAL Derbyshi~ "" Derbyshire: Alfreton
#> 13 ALFRETON JOURNAL Derbyshi~ "" Derbyshire: County Ha~
#> 14 ALFRETON JOURNAL Derbyshi~ "" Derbyshire: Magic Att~
#> 15 ALFRETON TRADER Derbyshi~ "" British Library
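If you want to try the parsing step without a running Selenium session, the sketch below applies the same pipeline to an inline HTML string. Everything in `page_source` is a hypothetical stand-in for `remDr$getPageSource()[[1]]`: the three middle column names (Start, End, Notes) and all cell values are invented for illustration, and only the table class and the four kept column names come from the code above.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for remDr$getPageSource()[[1]].
page_source <- '
<html><body>
<table class="google-visualization-table-table">
  <tr><th>PublicationTitle</th><th>County</th><th>Place_of_Publication</th>
      <th>Start</th><th>End</th><th>Notes</th><th>Library</th></tr>
  <tr><td>Title A</td><td>County A</td><td>Place A</td>
      <td>1900</td><td>1910</td><td>n1</td><td>Library A</td></tr>
  <tr><td>Title B</td><td>County B</td><td>Place B</td>
      <td>1920</td><td>1930</td><td>n2</td><td>Library B</td></tr>
</table>
</body></html>'

wanted <- read_html(page_source) %>%
  xml_find_all(xpath = "//*[@class = 'google-visualization-table-table']") %>%
  html_table() %>%      # one table per matched node
  `[[`(1) %>%           # take the first (only) table
  `[`(c(1, 2, 3, 7))    # keep columns 1, 2, 3 and 7

print(wanted)
```

Because `html_table()` takes the header row from the `th` cells, there is no need for set_names at all; selecting columns by position is enough.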