I would like to extract the following information from the link https://www.betashares.com.au/fund/high-interest-cash-etf/
I wrote the following code:
library(rvest)
link <- "https://www.betashares.com.au/fund/high-interest-cash-etf"
read_html(link) %>%
html_nodes('div') %>%
html_nodes('script') %>%
.[5] %>%
html_text() -> data
When I try something similar to the approach in "Extract declared variable from html in R", such as:
library(V8)
ctx <- v8()
ctx$eval(data)
ctx$get("navdata")
I get an error. We could split the string on ";" and do some manual cleanup, but is there a more elegant way to handle this?
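It's a fairly heavy JS block with external dependencies (e.g. anychart), which is why evaluating the whole thing in V8 fails. Since you only need a single line, you can extract it either by a hard-coded index or by locating the line that contains "var navdata". From there you can use V8 to evaluate that single assignment expression: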
library(dplyr, warn.conflicts = FALSE)
library(rvest)
library(V8)
#> Using V8 engine 9.1.269.38
library(stringr)
link <- "https://www.betashares.com.au/fund/high-interest-cash-etf"
navdata_js <-
read_html(link) %>%
html_element("#performance > div:nth-child(5) > script:nth-child(8)") %>%
html_text() %>%
# read only a single line, the 5th
readr::read_lines(skip = 4, n_max = 1)
# start:
str_trunc(navdata_js, 80) %>% str_view()
#> [1] │ {\t\t\t\t}var navdata = [["2012-03-06",50,100],["2012-03-07",49.9998,99.9995],["201...
# end:
str_trunc(navdata_js, 80, side = "left") %>% str_view()
#> [1] │ ...32,130.2664],["2023-09-21",50.189,130.2813],["2023-09-22",50.1947,130.2962]];
ctx <- v8()
ctx$eval(navdata_js)
ctx$get("navdata") %>%
head()
#> [,1] [,2] [,3]
#> [1,] "2012-03-06" "50" "100"
#> [2,] "2012-03-07" "49.9998" "99.9995"
#> [3,] "2012-03-08" "50.003" "100.0061"
#> [4,] "2012-03-09" "50.0099" "100.0198"
#> [5,] "2012-03-12" "50.0235" "100.047"
#> [6,] "2012-03-13" "50.0271" "100.0542"
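The hard-coded skip = 4 above depends on the script's current line layout. As a rough sketch of the other option mentioned earlier (locating the line by its "var navdata" prefix), something like the following could work. Note that script_text here is a made-up stand-in for the real script content, and the names script_lines and navdata_line are only illustrative:

```r
library(stringr)

# Hypothetical stand-in for the text of the <script> element; the real block is
# much longer and its line order may differ.
script_text <- paste(
  "anychart.onDocumentReady(function () {",
  "\t\t\t\tvar chart = anychart.stock();",
  "\t\t\t\tvar navdata = [[\"2012-03-06\",50,100],[\"2012-03-07\",49.9998,99.9995]];",
  "\t\t\t\tvar other = [];",
  "});",
  sep = "\n"
)

# Split into lines and keep the first one that declares navdata,
# regardless of where in the block it appears.
script_lines <- str_split(script_text, "\n")[[1]]
navdata_line <- str_subset(script_lines, "^\\s*var navdata\\s*=")[1]

# Leading tabs removed for display; the line can be passed to ctx$eval() as-is.
str_trim(navdata_line)
```

This assumes the whole assignment sits on one line, as it does in the output shown above. If the site ever splits the array across lines, this would need a different approach.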
Alternatively, extract the array string by removing the leading "var navdata = " and the trailing ";", then parse it as JSON:
str_extract(navdata_js, "(?<=var navdata \\= )[^;]+") %>%
jsonlite::parse_json(simplifyVector = TRUE) %>%
head()
#> [,1] [,2] [,3]
#> [1,] "2012-03-06" "50" "100"
#> [2,] "2012-03-07" "49.9998" "99.9995"
#> [3,] "2012-03-08" "50.003" "100.0061"
#> [4,] "2012-03-09" "50.0099" "100.0198"
#> [5,] "2012-03-12" "50.0235" "100.047"
#> [6,] "2012-03-13" "50.0271" "100.0542"
Created on 2023-09-22 with reprex v2.0.2
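Both approaches return a character matrix, since each inner array mixes a date string with numbers and a matrix can only hold one type. If you need typed columns, a minimal sketch of the conversion could look like this. The matrix below is just the first two rows of the output above, and the column names series1 and series2 are placeholders, since the script itself does not label the two numeric series:

```r
# Character matrix in the same shape as the result above
# (first two rows copied from the printed output).
navdata_mat <- matrix(
  c("2012-03-06", "50", "100",
    "2012-03-07", "49.9998", "99.9995"),
  ncol = 3, byrow = TRUE
)

# Convert each column to an appropriate type.
# Column names are placeholders, not taken from the page.
navdata_df <- data.frame(
  date    = as.Date(navdata_mat[, 1]),
  series1 = as.numeric(navdata_mat[, 2]),
  series2 = as.numeric(navdata_mat[, 3])
)

str(navdata_df)
```

In practice you would pass the full result of ctx$get("navdata") or parse_json() instead of the hand-built navdata_mat.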