Need to use rvest to scrape dynamic content

Question

I have to scrape data from an auction website called Unicorn Auctions.


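When I try to do this with rvest, all I can get is the auction title and URL, but I also need each auction's start and end dates. When I looked for a CSS class to target, all I found was a few lines of code, shown in this screenshot:

[image: screenshot of the lines of code]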

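I was going to use RSelenium to scrape it, as suggested in answers I found on Stack Overflow. But my boss wants it done with rvest only. He says it is possible, but I could not find any useful YouTube videos or articles.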

I am not asking anyone to hand me a solution; I just need some pointers!

Answer 1

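Start by searching the page source and/or the network responses in your browser's developer tools for some of those values; searching for "2023", for example, will lead you to the right place.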

The page's view variables are embedded in one of the <script>...</script> elements, like this:

<script type="text/javascript">
    viewVars = {"escaper":{},...,"auctions": {"result_page":[{...,"time_start": "2023-11-06T01:00:00Z",...};
</script>

That JavaScript expression is just a single assignment. Once a few bits are removed from the start (viewVars =) and the end (;), the remaining string can be parsed as JSON:

library(rvest)
library(stringr)
library(dplyr)

upcoming <- 
  read_html("https://bid.unicornauctions.com/") |>
  # pick the <script> element that holds the viewVars assignment
  html_element(xpath = "//script[contains(text(),'viewVars =')]") |>
  html_text() |>
  # strip the assignment prefix and the trailing semicolon so the rest parses as JSON
  str_remove("^\\s*viewVars =") |>
  str_remove(";\\s*$") |>
  jsonlite::fromJSON() |>
  # extract result_page from the parsed list
  purrr::pluck("upcomingAuctions", "result_page") |>
  as_tibble()

select(upcoming, title, contains("time")) |> glimpse()
#> Rows: 4
#> Columns: 9
#> $ title                    <chr> "November 'No Reserves' Unicorn Auction 2023"…
#> $ time_start               <chr> "2023-11-06T01:00:00Z", "2023-11-13T01:00:00Z…
#> $ time_start_live_auction  <lgl> NA, NA, NA, NA
#> $ time_start_proxy_bidding <lgl> NA, NA, NA, NA
#> $ timezone                 <chr> "America/Chicago", "America/Chicago", "Americ…
#> $ effective_end_time       <chr> "2023-11-13T00:00:00Z", "2023-11-20T00:00:00Z…
#> $ extended_end_time        <lgl> NA, NA, NA, NA
#> $ realtime_server_url      <lgl> NA, NA, NA, NA
#> $ is_times_the_money       <lgl> FALSE, FALSE, FALSE, FALSE

Created on 2023-11-12 with reprex v2.0.2
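The start and end values above come back as character strings in UTC (note the trailing Z), while the timezone column reports America/Chicago. As a follow-up, here is a minimal sketch, in base R only, of turning them into date-times and showing them in the auction's local time. The four timestamps are copied from the output above; the data frame name `auctions`, the helper `parse_utc`, and the added column names are illustrative assumptions, not part of the scraped data.

```r
# Timestamps copied from the glimpse() output above; the object name is illustrative.
auctions <- data.frame(
  time_start         = c("2023-11-06T01:00:00Z", "2023-11-13T01:00:00Z"),
  effective_end_time = c("2023-11-13T00:00:00Z", "2023-11-20T00:00:00Z"),
  stringsAsFactors   = FALSE
)

# Hypothetical helper: parse strings shaped like "2023-11-06T01:00:00Z" as UTC instants.
parse_utc <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

auctions$start <- parse_utc(auctions$time_start)
auctions$end   <- parse_utc(auctions$effective_end_time)

# Render the same instants in the timezone reported by the site.
auctions$start_local <- format(auctions$start, tz = "America/Chicago", usetz = TRUE)
auctions$end_local   <- format(auctions$end,   tz = "America/Chicago", usetz = TRUE)

# Length of each auction in days.
auctions$duration_days <- as.numeric(
  difftime(auctions$end, auctions$start, units = "days")
)

print(auctions[, c("start_local", "end_local", "duration_days")])
```

In a real pipeline the same two conversions could be applied to the `time_start` and `effective_end_time` columns of the scraped tibble instead of the hand-built data frame.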
