How to avoid getting character(0) when web scraping with rvest


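I am scraping the names of different weed strains from a dataset and keep getting character(0). I am using the SelectorGadget tool to pick out the strain names. Every strain has a name; there are no NAs.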

library(dplyr)
library(rvest)
weed_data<-read_html("https://www.kaggle.com/datasets/corykjar/leafly-cannabis-strains-dataset?resource=download")
strain_name<-weed_data%>%
html_nodes(".hdFafr div:nth-child(2)")%>%
html_text()
strain_name

Tags: r, web-scraping, rvest

2 answers

1 vote

As suggested in the comments, it is probably easiest to create an account and download the dataset, but if you want to scrape it, here is one way to do it.

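The problem appears to be that the table contents are delivered by scripts that run after the page loads, so the page rvest sees is not the page a user sees when navigating to it. One possible way around this is to use RSelenium to drive a web browser: navigate to the page, wait for the scripts to deliver the table, then read the HTML and extract the nodes we want.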
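
To see the symptom in isolation, here is a minimal sketch. It assumes the HTML that read_html() receives holds only an empty container plus a script; the markup below is made up for illustration and is not Kaggle's actual source. The selector from the question then matches nothing:

```r
library(rvest)

# Made-up stand-in for a page whose table rows are injected by JavaScript.
# The static HTML has an empty container and no table rows at all.
static_page <- minimal_html(
  '<div id="app"></div><script>/* rows are added here at runtime */</script>'
)

# Same selector as in the question: nothing matches in the static HTML,
# so html_text() is applied to an empty node set.
strain_names <- static_page %>%
  html_elements(".hdFafr div:nth-child(2)") %>%
  html_text()

length(strain_names)
```

An empty result like this usually means the selector is fine in the browser but the nodes do not exist in the HTML that was downloaded.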

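A second problem is that the table only loads as the user scrolls down, so I had to add a while loop that keeps scrolling until the bottom of the page is reached.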

Here is an example:


# load libraries
library(RSelenium)
library(rvest)
library(magrittr)

# define target url
url <- "https://www.kaggle.com/datasets/corykjar/leafly-cannabis-strains-dataset?resource=download"


# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
remDr <- rD[["client"]]

# open the remote driver-------------------------------------------------------
remDr$open()

# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)

# pull the webpage html
# then read it
page_html <- remDr$getPageSource()[[1]] %>% 
  read_html()



# figure out how many records there are: 

number_of_total_records <- page_html %>% 
                          html_nodes(".sc-cmwCue") %>% 
                          html_text() %>% 
                          as.numeric()


# Another problem is that the whole table doesn't load 
# all at once, it loads dynamically as the user scrolls
# down. So we're going to replicate that using a while
# loop

# define a variable for the number of records we've pulled
number_of_pulled_records <- 0


# now we execute the code inside the while loop
# until the number of pulled records is equal to the number
# of total records
while (number_of_pulled_records  < number_of_total_records) {
  
  
  # first we find the element that corresponds to the table
  webElem <- remDr$findElement("css selector", ".sc-cXzqcO")
  
  # then we tell RSelenium to scroll to the bottom of that element
  webElem$sendKeysToElement(list(key = "end"))
  
  # unfortunately that only scrolls to the bottom of the
  # currently loaded table; the rows that load afterwards extend
  # the table, so we have to keep scrolling.
  # The while loop is how we decide when to stop scrolling
  
  # pull the new html
  page_html <- remDr$getPageSource()[[1]] %>% 
    read_html()
  

  # find the nodes we care about
  # the first node is the class for each row
  # then pull the child node corresponding to the second column or the strain name
  
  pulled_results <- page_html %>% 
    html_nodes("span.sc-cCYyox") %>% 
    html_nodes("div:nth-child(1) > div:nth-child(2)") %>%
    html_text()
  
  
  
  # now we see how many results we pulled
  # if the number is equal to the total number
  # we stop pulling, otherwise the while loop keeps
  # scrolling down the page
  
  
  number_of_pulled_records <- length(pulled_results)
  
  
}
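
One caveat about the loop above: it re-reads the page as fast as it can, and it never exits if the two counts fail to match. A hedged sketch of a bounded alternative follows. poll_until() is a hypothetical helper (not part of RSelenium), and the counter below merely simulates rows arriving as the page scrolls:

```r
# Hypothetical helper: call check() repeatedly until it returns TRUE
# or the timeout expires. Returns TRUE on success, FALSE on timeout.
poll_until <- function(check, timeout_sec = 60, interval_sec = 0.5) {
  deadline <- Sys.time() + timeout_sec
  repeat {
    if (isTRUE(check())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval_sec)  # pause so the page has time to load new rows
  }
}

# Simulated stand-in for "scroll, re-read the page, count the rows":
# every call pretends that 50 more rows have loaded.
loaded <- 0
finished <- poll_until(
  function() {
    loaded <<- loaded + 50
    loaded >= 200
  },
  timeout_sec = 5,
  interval_sec = 0.01
)
```

In the real script, the function passed as check would scroll, pull the page source, and compare number_of_pulled_records with number_of_total_records.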


Result

> head(pulled_results, 30)
 [1] "Ice Cream Cake"       "Gelato"               "Blue Dream"           "Sour Diesel"          "Dosidos"             
 [6] "Apple Fritter"        "GSC"                  "Zkittlez"             "OG Kush"              "Mac 1"               
[11] "Biscotti"             "Purple Punch"         "White Widow"          "Jack Herer"           "Mimosa"              
[16] "Sherbert"             "Pineapple Express"    "Cereal Milk"          "GMO Cookies"          "Durban Poison"       
[21] "White Runtz"          "Peanut Butter Breath" "Kush Mints"           "Northern Lights"      "Green Crack"         
[26] "Gushers"              "MAC"                  "Slurricane"           "Sundae Driver"        "Pink Runtz"          


> tail(pulled_results, 30)
 [1] "Slice of Heaven"          "Brian Berry Citrus Blend" "Rigger Kush"              "White Empress"           
 [5] "Crosswalker"              "Golden Calyx"             "Altoyd"                   "White Master"            
 [9] "Ozma"                     "Short and Sweet"          "Sergerbloom Haze"         "Somaui"                  
[13] "Diabla"                   "Afghooey"                 "Medikit"                  "Sweet Nina"              
[17] "Beckwourth Bud"           "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"  
[21] "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"  
[25] "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"  
[29] "Crockett’s Sour Tangie"   "Crockett’s Sour Tangie"  


> length(pulled_results)
[1] 5120


> length(unique(pulled_results))
[1] 5106

It looks like the last result is repeated several times, but that should be easy to clean up.
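
A minimal sketch of one way to do that cleanup, assuming the repeated run at the end is a scrolling artifact rather than real data. drop_trailing_repeats() is a hypothetical helper; it collapses only the final run of identical values, so duplicate names elsewhere in the vector are left alone (unlike unique()):

```r
# Hypothetical helper: collapse the final run of identical values to one.
drop_trailing_repeats <- function(x) {
  if (length(x) == 0) return(x)
  runs <- rle(x)                               # run-length encoding of x
  last_run <- runs$lengths[length(runs$lengths)]
  x[seq_len(length(x) - last_run + 1)]         # keep one copy of the last value
}

# Small made-up example in place of pulled_results
example_names <- c("Diabla", "Medikit", "Medikit", "Tangie", "Tangie", "Tangie")
cleaned_names <- drop_trailing_repeats(example_names)
```

Applied to the scrape, this would be drop_trailing_repeats(pulled_results).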


0 votes

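Kaggle provides an API for accessing datasets, but you need to create an account to get an access token. You can then download kaggle.json, which contains the token, from https://www.kaggle.com/me/account, and accessing the dataset might look something like this: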

library(httr2)
kaggle_token <- jsonlite::read_json("kaggle.json")
request("https://www.kaggle.com/api/v1") %>% 
  req_url_path_append("datasets/download/corykjar/leafly-cannabis-strains-dataset/strains_cleaned.csv") %>% 
  req_auth_basic(kaggle_token$username, kaggle_token$key) %>% 
  req_perform() %>% 
  resp_body_string() %>% 
  readr::read_csv()

#> # A tibble: 5,120 × 10
#>     ...1 Name           Type  Alias Rating Num_Reviews `THC%` Other_Cannabinoids
#>    <dbl> <chr>          <chr> <chr>  <dbl>       <dbl> <chr>  <chr>             
#>  1     0 Ice Cream Cake Indi… <NA>     4.6        1039 THC 2… CBG 1%            
#>  2     1 Gelato         Hybr… aka …    4.6        2219 THC 1… CBD 0%            
#>  3     2 Blue Dream     Hybr… <NA>     4.3       14300 THC 1… CBD 0%            
#>  4     3 Sour Diesel    Sati… aka …    4.3        8264 THC 1… CBD 0%            
#>  5     4 Dosidos        Hybr… aka …    4.6        1073 THC 2… CBG 1%            
#>  6     5 Apple Fritter  Hybr… <NA>     4.5         346 THC 2… CBD 0%            
#>  7     6 GSC            Hybr… aka …    4.4        7409 THC 1… CBG 1%            
#>  8     7 Zkittlez       Indi… aka …    4.5         857 THC 2… CBG 1%            
#>  9     8 OG Kush        Hybr… aka …    4.3        5476 THC 1… CBD 0%            
#> 10     9 Mac 1          Hybr… aka …    4.7         353 THC 2… CBG 1%            
#> # ℹ 5,110 more rows
#> # ℹ 2 more variables: Main_Effect <chr>, Terpene <chr>
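
The request above reads kaggle_token$username and kaggle_token$key, so kaggle.json is expected to hold at least those two fields. A small sketch for checking a token file before sending a request; it uses a temporary file with made-up placeholder credentials, not a real token:

```r
library(jsonlite)

# Write a placeholder token file; the values are made up for illustration.
token_path <- tempfile(fileext = ".json")
writeLines('{"username": "example-user", "key": "example-key"}', token_path)

# Read it the same way as in the answer and confirm both fields exist.
kaggle_token <- read_json(token_path)
has_credentials <- all(c("username", "key") %in% names(kaggle_token))
```

With a real file, token_path would point at the downloaded kaggle.json.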

There is also the kaggler package, as well as the option of using the official Python API package.

Created on 2023-04-03 with reprex v2.0.2
