R warning "'html' is deprecated" when scraping a table from a website

Question

I am trying to scrape a table from a website.

I used the code below:


library("rvest")

url <- "http://sabap2.birdmap.africa/coverage/pentad/2945_3100"

population <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="coverage_species"]/div/div/table') %>%
  html_table()

But I get the following warning:

Warning message:
'html' is deprecated.
Use 'xml2::read_html' instead.
See help("Deprecated")

Can anyone suggest the correct way to use xml2? I am also not sure whether I am using the right XPath in the html_nodes step.

Answer 1

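The table you are looking for is generated dynamically by JavaScript after the browser has loaded the page, so it is not present in the HTML that R downloads. You can use string manipulation to extract the relevant JavaScript snippets and parse them in R to build the table.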

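This is unrelated to the warning you received. The warning only tells you to use read_html() instead of html(): read_html() is the newer function that does the same job, and html() is being phased out.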
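For the deprecation warning on its own, the fix is a one-word swap. The sketch below shows the question's pipeline with read_html() in place of html(), run against a small inline HTML string rather than the live site. The HTML fragment and the names `html_text`, `tables` and `demo_tbl` are made up for illustration; the XPath is the one from the question.

```r
library(rvest)  # provides read_html(), html_nodes(), html_table() and %>%

# Hypothetical static HTML (an assumption, not the real page source) in which
# the table really is part of the markup
html_text <- '
<div id="coverage_species"><div><div>
  <table>
    <tr><th>Year</th><th>Cards</th></tr>
    <tr><td>2019</td><td>18</td></tr>
    <tr><td>2020</td><td>3</td></tr>
  </table>
</div></div></div>'

# Same pipeline as in the question, with html() replaced by read_html()
tables <- read_html(html_text) %>%
  html_nodes(xpath = '//*[@id="coverage_species"]/div/div/table') %>%
  html_table()

# html_table() returns one data frame per matched table node
demo_tbl <- tables[[1]]
print(demo_tbl)
```

On the real page this route would likely match no table at all, because the table only exists after the JavaScript has run, which is why the code below parses the script text instead.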

url       <- "http://sabap2.birdmap.africa/coverage/pentad/2945_3100"
page      <- httr::content(httr::GET(url), "text")
json      <- strsplit(strsplit(page, "summarydata.addRows[(]")[[1]][2], "[)]")[[1]][1]
df        <- data.frame(rbind(jsonlite::fromJSON(json)), stringsAsFactors = FALSE)
lines     <- strsplit(page, "[\r\n]+")[[1]]
linelist  <- strsplit(grep("summarydata[.]addCol", lines, value = TRUE), "'")
names(df) <- sapply(linelist, `[`, 4)

This gives the result in a nice data frame:

df
#>        Year no cards 1 card 2 cards 3 cards 4 or more Pentads covered
#> 1  AllYears        0      0       0       0         1               1
#> 2      2020        0      0       0       1         0               1
#> 3      2019        0      0       0       0         1               1
#> 4      2018        0      0       0       0         1               1
#> 5      2017        0      0       0       0         1               1
#> 6      2016        0      0       0       0         1               1
#> 7      2015        0      0       0       0         1               1
#> 8      2014        0      0       0       0         1               1
#> 9      2013        0      0       0       0         1               1
#> 10     2012        0      0       0       0         1               1
#> 11     2011        0      0       0       0         1               1
#> 12     2010        0      0       0       0         1               1
#> 13     2009        0      0       0       0         1               1
#> 14     2008        0      0       0       0         1               1
#> 15     2007        0      0       0       0         1               1
#>    Pentads in area Total Cards (FP) Total species (FP)
#> 1                1              361                284
#> 2                1                3                 44
#> 3                1               18                158
#> 4                1               21                165
#> 5                1               51                172
#> 6                1               45                198
#> 7                1               25                178
#> 8                1               12                149
#> 9                1               26                165
#> 10               1               34                163
#> 11               1               46                189
#> 12               1               36                181
#> 13               1               22                146
#> 14               1               17                173
#> 15               1                5                131
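To see what each step of the extraction produces without hitting the network, here is a minimal sketch that runs the same splitting logic on a tiny made-up script fragment. The fragment, its values, and the names `after_call`, `rows`, `col_defs`, `labels` and `toy_df` are assumptions for illustration; the real page source may differ in detail.

```r
# Hypothetical page fragment imitating the addColumn/addRows calls
page <- paste(
  "summarydata.addColumn('string', 'Year');",
  "summarydata.addColumn('string', 'Total Cards (FP)');",
  "summarydata.addRows([[\"2019\",\"18\"],[\"2020\",\"3\"]]);",
  sep = "\n"
)

# Step 1: keep the text between "addRows(" and the next ")"
after_call <- strsplit(page, "summarydata.addRows[(]")[[1]][2]
json       <- strsplit(after_call, "[)]")[[1]][1]

# Step 2: parse the JSON array of rows (here a 2 x 2 character matrix)
rows <- jsonlite::fromJSON(json)

# Step 3: the column label is the 4th piece of each addColumn line
#         when that line is split on single quotes
lines    <- strsplit(page, "[\r\n]+")[[1]]
col_defs <- strsplit(grep("summarydata[.]addCol", lines, value = TRUE), "'")
labels   <- sapply(col_defs, `[`, 4)

# Step 4: assemble the data frame and attach the labels
toy_df <- as.data.frame(rows, stringsAsFactors = FALSE)
names(toy_df) <- labels
print(toy_df)
```

One caveat of this approach: step 1 cuts at the first ")" after "addRows(", so it would truncate the JSON if any cell value itself contained a closing parenthesis. Splitting on ");" instead is a little more robust in that case.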

Addendum

The OP asked for a different table on the page to be parsed. This can be done in a similar way, like this:

species_json <- strsplit(page, "carddataspeciesmonthly[.]addRows[(]")[[1]][2]
species_tab <- jsonlite::fromJSON(strsplit(species_json, "[)];")[[1]][1])
species_df <- as.data.frame(species_tab)
species_cols <- strsplit(page, "carddataspeciesmonthly[.]addColumn[(]")[[1]][-1]
names(species_df) <- sapply(strsplit(species_cols, "'"), `[`, 4)

The resulting data frame is too large to show here, so I will show it as a tibble:

dplyr::as_tibble(species_df)
# A tibble: 284 x 20
   Ref   Common_group Common_species Genus Species Jan   Feb   Mar   Apr   May   Jun   Jul  
   <fct> <fct>        <fct>          <fct> <fct>   <fct> <fct> <fct> <fct> <fct> <fct> <fct>
 1 8     Albatross    Black-browed   Thal~ melano~ 0     0     0     0     0     0     3.2  
 2 1079  Albatross    Indian Yellow~ Thal~ carteri 0     0     0     0     0     0     3.2  
 3 4150  Albatross    Shy            Thal~ cauta   0     0     0     0     0     0     3.2  
 4 622   Apalis       Bar-throated   Apal~ thorac~ 22    31.8  38.9  33.3  30.8  46.7  38.7 
 5 625   Apalis       Yellow-breast~ Apal~ flavida 9.8   9.1   22.2  11.1  12.8  26.7  16.1 
 6 432   Barbet       Acacia Pied    Tric~ leucom~ 0     0     0     0     2.6   0     0    
 7 431   Barbet       Black-collared Lybi~ torqua~ 63.4  77.3  72.2  50    89.7  73.3  67.7 
 8 439   Barbet       Crested        Trac~ vailla~ 14.6  40.9  38.9  16.7  25.6  13.3  9.7  
 9 433   Barbet       White-eared    Stac~ leucot~ 17.1  18.2  11.1  44.4  35.9  40    35.5 
10 672   Batis        Cape           Batis capens~ 2.4   13.6  11.1  5.6   0     0     0    
# ... with 274 more rows, and 8 more variables: Aug <fct>, Sep <fct>, Oct <fct>, Nov <fct>,
#   Dec <fct>, RepRate <fct>, Records <fct>, Cards <fct>
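The `<fct>` column types above suggest that every value, including the monthly figures, came through as text rather than numbers. If numeric columns are wanted, one possible follow-up is to convert each column with base R's type.convert(). This is only a sketch on a small stand-in data frame: `toy_species` is a hypothetical name, and its values are copied from two of the rows shown above.

```r
# Stand-in for a few columns of species_df, stored as factors as in the output
toy_species <- data.frame(
  Common_group = c("Apalis", "Barbet"),
  Jan          = c("22", "63.4"),
  Feb          = c("31.8", "77.3"),
  stringsAsFactors = TRUE
)

# Go through character first so that factor level codes are not mistaken for
# values, then let type.convert() choose numeric or character per column
toy_species[] <- lapply(toy_species, function(col) {
  type.convert(as.character(col), as.is = TRUE)
})

str(toy_species)
```

The same lapply() call should apply unchanged to the full data frame, assuming its columns hold either plain numbers or plain text.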

Created on 2020-05-19 by the reprex package (v0.3.0)
