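I am scraping the names of different weed strains from a dataset and keep getting character(0). I am using the SelectorGadget tool to pick out the strain names. All of the names are present; there are no NAs.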
library(dplyr)
library(rvest)
weed_data<-read_html("https://www.kaggle.com/datasets/corykjar/leafly-cannabis-strains-dataset?resource=download")
strain_name<-weed_data%>%
html_nodes(".hdFafr div:nth-child(2)")%>%
html_text()
strain_name
As suggested in the comments, creating an account and downloading the dataset is probably easiest, but if you want to scrape it, here is one way to do it.
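The problem seems to be that the table content is delivered only after the page's scripts have run, so the page rvest sees differs from the page a user sees when navigating to it. One possible way around this is to use RSelenium to drive a web browser: navigate to the page, wait for the scripts to deliver the table, then read the html and extract the nodes we want.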
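To see why the original call returns character(0), here is a minimal sketch that parses a small static page with rvest. The HTML string and the #table selector are made up for illustration; they only mimic a page whose rows are added later by JavaScript and are not Kaggle's actual markup.

```r
library(rvest)

# Hypothetical static HTML, similar in spirit to what a server sends before
# any script has run: the table container exists but has no rows yet.
static_html <- '<html><body><div id="table"></div><script>/* rows would be injected here */</script></body></html>'

page <- read_html(static_html)

# The selector matches nothing in the static source, so the result is empty
strain_name <- page %>%
  html_nodes("#table div:nth-child(2)") %>%
  html_text()

print(strain_name)
```

An empty result like this usually means the selector did not match anything in the html that was actually downloaded, not that the selector is wrong for the rendered page.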
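Another problem is that the table only loads as the user scrolls down, so I had to add a while loop that keeps scrolling until it reaches the bottom of the page.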
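One thing to watch with a loop like this is that it never ends if the page stops delivering rows before the expected total is reached. The sketch below shows only the stopping logic, separated from the browser. pull_until_done, pull_fn and max_stalls are hypothetical names, and pull_fn stands in for "scroll, re-read the page, extract the names".

```r
# pull_fn is a hypothetical stand-in for "scroll, re-read the page, extract names"
pull_until_done <- function(pull_fn, total, max_stalls = 3) {
  pulled <- character(0)
  stalls <- 0
  while (length(pulled) < total && stalls < max_stalls) {
    latest <- pull_fn()
    # count a stall whenever a scroll produced no additional rows
    stalls <- if (length(latest) > length(pulled)) 0 else stalls + 1
    pulled <- latest
  }
  pulled
}

# Fake page that reveals 2 more rows per call but only ever holds 5 rows
rows <- paste("strain", 1:5)
shown <- 0
fake_pull <- function() {
  shown <<- min(shown + 2, length(rows))
  rows[seq_len(shown)]
}

# The advertised total (8) is never reached, yet the loop still terminates
result <- pull_until_done(fake_pull, total = 8)
print(length(result))
```

With a real browser, pull_fn would wrap the scroll and getPageSource steps, and a short Sys.sleep between iterations may be needed to give new rows time to load.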
For example:
# load libraries
library(RSelenium)
library(rvest)
library(magrittr)
# define target url
url <- "https://www.kaggle.com/datasets/corykjar/leafly-cannabis-strains-dataset?resource=download"
# start RSelenium ------------------------------------------------------------
rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
remDr <- rD[["client"]]
# open the remote driver-------------------------------------------------------
remDr$open()
# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)
# pull the webpage html
# then read it
page_html <- remDr$getPageSource()[[1]] %>%
read_html()
# figure out how many records there are:
number_of_total_records <- page_html %>%
html_nodes(".sc-cmwCue") %>%
html_text() %>%
as.numeric()
# Another problem is that the whole table doesn't load
# all at once, it loads dynamically as the user scrolls
# down. So we're going to replicate that using a while
# loop
# define a variable for the number of records we've pulled
number_of_pulled_records <- 0
# now we execute the code inside the while loop
# until the number of pulled records is equal to the number
# of total records
while (number_of_pulled_records < number_of_total_records) {
  # first we find the element that corresponds to the table
  webElem <- remDr$findElement("css selector", ".sc-cXzqcO")
  # then we tell RSelenium to scroll to the bottom of that element
  webElem$sendKeysToElement(list(key = "end"))
  # unfortunately that only scrolls to the bottom of the
  # current table and doesn't count the rows that just loaded
  # as the bottom of the table, so we have to keep scrolling
  # the while loop is how we decide when to stop scrolling
  # pull the new html
  page_html <- remDr$getPageSource()[[1]] %>%
    read_html()
  # find the nodes we care about
  # the first node is the class for each row
  # then pull the child node corresponding to the second column or the strain name
  pulled_results <- page_html %>%
    html_nodes("span.sc-cCYyox") %>%
    html_nodes("div:nth-child(1) > div:nth-child(2)") %>%
    html_text()
  # now we see how many results we pulled
  # if the number is equal to the total number
  # we stop pulling, otherwise the while loop keeps
  # scrolling down the page
  number_of_pulled_records <- length(pulled_results)
}
> head(pulled_results, 30)
[1] "Ice Cream Cake" "Gelato" "Blue Dream" "Sour Diesel" "Dosidos"
[6] "Apple Fritter" "GSC" "Zkittlez" "OG Kush" "Mac 1"
[11] "Biscotti" "Purple Punch" "White Widow" "Jack Herer" "Mimosa"
[16] "Sherbert" "Pineapple Express" "Cereal Milk" "GMO Cookies" "Durban Poison"
[21] "White Runtz" "Peanut Butter Breath" "Kush Mints" "Northern Lights" "Green Crack"
[26] "Gushers" "MAC" "Slurricane" "Sundae Driver" "Pink Runtz"
> tail(pulled_results, 30)
[1] "Slice of Heaven" "Brian Berry Citrus Blend" "Rigger Kush" "White Empress"
[5] "Crosswalker" "Golden Calyx" "Altoyd" "White Master"
[9] "Ozma" "Short and Sweet" "Sergerbloom Haze" "Somaui"
[13] "Diabla" "Afghooey" "Medikit" "Sweet Nina"
[17] "Beckwourth Bud" "Crockett’s Sour Tangie" "Crockett’s Sour Tangie" "Crockett’s Sour Tangie"
[21] "Crockett’s Sour Tangie" "Crockett’s Sour Tangie" "Crockett’s Sour Tangie" "Crockett’s Sour Tangie"
[25] "Crockett’s Sour Tangie" "Crockett’s Sour Tangie" "Crockett’s Sour Tangie" "Crockett’s Sour Tangie"
[29] "Crockett’s Sour Tangie" "Crockett’s Sour Tangie"
> length(pulled_results)
[1] 5120
> length(unique(pulled_results))
[1] 5106
It looks like the last result is repeated several times, but that should be easy to clean up.
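One possible cleanup, sketched on a toy vector rather than the real scrape, is to collapse runs of consecutive identical values with base R's rle(). The helper name drop_consecutive_repeats and the sample values are assumptions for illustration. Unlike unique(), this keeps a name that legitimately appears twice in different places in the table.

```r
# Toy stand-in for pulled_results (hypothetical values, not the real scrape)
pulled <- c("Diabla", "Afghooey", "Medikit",
            "Sour Tangie", "Sour Tangie", "Sour Tangie")

# Collapse each run of consecutive identical values into a single entry
drop_consecutive_repeats <- function(x) {
  runs <- rle(x)
  runs$values
}

cleaned <- drop_consecutive_repeats(pulled)
print(cleaned)
```

If every strain name is expected to be unique anyway, unique(pulled_results) would be the shorter option.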
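Kaggle provides an API for accessing datasets, but you need to create an account to get an access token. You can then download kaggle.json, which contains the token, from https://www.kaggle.com/me/account, and accessing the dataset might look something like this: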
library(httr2)
kaggle_token <- jsonlite::read_json("kaggle.json")
request("https://www.kaggle.com/api/v1") %>%
req_url_path_append("datasets/download/corykjar/leafly-cannabis-strains-dataset/strains_cleaned.csv") %>%
req_auth_basic(kaggle_token$username, kaggle_token$key) %>%
req_perform() %>%
resp_body_string() %>%
readr::read_csv()
#> # A tibble: 5,120 × 10
#> ...1 Name Type Alias Rating Num_Reviews `THC%` Other_Cannabinoids
#> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 0 Ice Cream Cake Indi… <NA> 4.6 1039 THC 2… CBG 1%
#> 2 1 Gelato Hybr… aka … 4.6 2219 THC 1… CBD 0%
#> 3 2 Blue Dream Hybr… <NA> 4.3 14300 THC 1… CBD 0%
#> 4 3 Sour Diesel Sati… aka … 4.3 8264 THC 1… CBD 0%
#> 5 4 Dosidos Hybr… aka … 4.6 1073 THC 2… CBG 1%
#> 6 5 Apple Fritter Hybr… <NA> 4.5 346 THC 2… CBD 0%
#> 7 6 GSC Hybr… aka … 4.4 7409 THC 1… CBG 1%
#> 8 7 Zkittlez Indi… aka … 4.5 857 THC 2… CBG 1%
#> 9 8 OG Kush Hybr… aka … 4.3 5476 THC 1… CBD 0%
#> 10 9 Mac 1 Hybr… aka … 4.7 353 THC 2… CBG 1%
#> # ℹ 5,110 more rows
#> # ℹ 2 more variables: Main_Effect <chr>, Terpene <chr>
There are also the kaggler package and the option of using the official Python API package.
Created on 2023-04-03 with reprex v2.0.2