Using the rvest package to get the content assigned to a var


I want to extract the following information from the link https://www.betashares.com.au/fund/high-interest-cash-etf/

I wrote the following code:

library(rvest)

link <- "https://www.betashares.com.au/fund/high-interest-cash-etf"
# take the text of the 5th <script> nested inside a <div>
read_html(link) %>% 
  html_nodes('div') %>% 
  html_nodes('script') %>%
  .[5] %>%
  html_text() -> data

When I try something similar to the approach in "Extracting a declared variable from html in R", i.e.:

library(V8)
ctx <- v8()
ctx$eval(data)
ctx$get("navdata")

I get an error. We could split the string on ";" and do some cleanup, but is there a more elegant way to handle this?

r web-scraping rvest
1 answer

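This is a fairly heavy JS block with external dependencies (e.g. anychart), so evaluating all of it in V8 fails. Since you only need a single line, you can extract it either by hard-coding its index or by locating var navdata. From there you can use V8 to evaluate that single assignment expression:
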
library(dplyr, warn.conflicts = FALSE)
library(rvest)
library(V8)
#> Using V8 engine 9.1.269.38
library(stringr)

link <- "https://www.betashares.com.au/fund/high-interest-cash-etf"

navdata_js <- 
  read_html(link) %>% 
  html_element("#performance > div:nth-child(5) > script:nth-child(8)") %>% 
  html_text() %>% 
  # read only a single line, the 5th
  readr::read_lines(skip = 4, n_max = 1)

# start:
str_trunc(navdata_js, 80) %>% str_view()
#> [1] │ {\t\t\t\t}var navdata = [["2012-03-06",50,100],["2012-03-07",49.9998,99.9995],["201...
# end:
str_trunc(navdata_js, 80, side = "left") %>% str_view()
#> [1] │ ...32,130.2664],["2023-09-21",50.189,130.2813],["2023-09-22",50.1947,130.2962]];

ctx <- v8()
ctx$eval(navdata_js)
ctx$get("navdata") %>% 
  head()
#>      [,1]         [,2]      [,3]      
#> [1,] "2012-03-06" "50"      "100"     
#> [2,] "2012-03-07" "49.9998" "99.9995" 
#> [3,] "2012-03-08" "50.003"  "100.0061"
#> [4,] "2012-03-09" "50.0099" "100.0198"
#> [5,] "2012-03-12" "50.0235" "100.047" 
#> [6,] "2012-03-13" "50.0271" "100.0542"
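
The code above picks the line by its position (the 5th), which would break if the page's script layout changed. Locating the line by its content, as mentioned earlier, could be sketched as follows. The script_text value is a hypothetical stand-in for the output of html_text(), not the real page content:

```r
# Hypothetical stand-in for the <script> text returned by html_text();
# the real block is much longer and depends on external libraries.
script_text <- paste(
  "// chart setup code that relies on an external charting library",
  "var somethingElse = 1;",
  "\t\t\t\tvar navdata = [[\"2012-03-06\",50,100],[\"2012-03-07\",49.9998,99.9995]];",
  "var moreCode = 2;",
  sep = "\n"
)

# Split the block into lines and keep the first one containing the assignment
script_lines <- strsplit(script_text, "\n", fixed = TRUE)[[1]]
navdata_js <- grep("var navdata", script_lines, fixed = TRUE, value = TRUE)[1]

navdata_js
```

The resulting navdata_js string can then be passed to ctx$eval() in the same way as the line selected by index.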

Or extract the array string by stripping the leading var navdata = and the trailing ; and parse it as JSON:

# keep everything between "var navdata = " and the first ";"
str_extract(navdata_js, "(?<=var navdata = )[^;]+") %>% 
  jsonlite::parse_json(simplifyVector = TRUE) %>% 
  head()
#>      [,1]         [,2]      [,3]      
#> [1,] "2012-03-06" "50"      "100"     
#> [2,] "2012-03-07" "49.9998" "99.9995" 
#> [3,] "2012-03-08" "50.003"  "100.0061"
#> [4,] "2012-03-09" "50.0099" "100.0198"
#> [5,] "2012-03-12" "50.0235" "100.047" 
#> [6,] "2012-03-13" "50.0271" "100.0542"
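
Both approaches return a character matrix, because each inner array mixes a date string with numbers. If typed columns are needed, a conversion along these lines may help. This is only a sketch: the matrix below copies the first three rows of the output above, and the column names are hypothetical, since the page data does not label the two numeric series:

```r
# Character matrix shaped like the parsed result above (first three rows)
navdata <- matrix(
  c("2012-03-06", "50",      "100",
    "2012-03-07", "49.9998", "99.9995",
    "2012-03-08", "50.003",  "100.0061"),
  ncol = 3, byrow = TRUE
)

# Column names are assumptions made for illustration only
nav_df <- data.frame(
  date    = as.Date(navdata[, 1]),     # ISO dates parse without a format string
  value_1 = as.numeric(navdata[, 2]),
  value_2 = as.numeric(navdata[, 3])
)

str(nav_df)
```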

Created on 2023-09-22 with reprex v2.0.2
