rvest: scraping all values of specific classes from a website


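I am trying to scrape all location numbers, street addresses, and city/state/ZIP codes from this website. I have tried several different approaches without success, including extracting values from specific classes. By digging into the source code, I found that the three values I need have the classes css-dhj0hp, css-8er82g, and css-n2nvu7. Ideally I would like a data frame with store number, street address, city, state, and ZIP code, but I cannot even get the text returned. Can anyone help?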

library(rvest)
library(tidyverse)

url <- "https://locations.wafflehouse.com"

page <- read_html(url)

store_data <- 
  page |> 
  html_nodes("div.css-dhj0hp")

Tags: r, web-scraping, rvest

1 Answer

/../ By digging into the source code /../

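You are probably referring to the Element Inspector, which happens to be a completely different beast. It lets you browse the DOM tree, and on a dynamic site the DOM tree is rendered, or at least heavily modified, by JavaScript. It can therefore differ a lot from the content that rvest tries to parse.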
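
To illustrate the difference, here is a minimal sketch using a made-up page rather than the real site. The markup, the "app" id, and the script body are assumptions for demonstration only; the class name is the one from the question. The script would create the element in a browser, but rvest never executes it.

```r
library(rvest)

# Hypothetical stand-in for a JavaScript-rendered page: the server sends an
# empty container, and the script would fill it in only inside a browser.
static_source <- minimal_html('
  <div id="app"></div>
  <script>
    var el = document.createElement("div");
    el.className = "css-dhj0hp";
    el.textContent = "2842 PANOLA RD";
    document.getElementById("app").appendChild(el);
  </script>
')

# rvest parses the markup as delivered and does not run the script,
# so the class-based selector matches nothing.
matches <- html_elements(static_source, "div.css-dhj0hp")
n_matches <- length(matches)

# The container is present, but it is still empty.
app_text <- html_text(html_element(static_source, "#app"))
```

An empty node set from a selector that clearly matches in the Element Inspector is a common hint that the content is rendered client-side.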

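In the actual page source, the web app data and the locations are embedded as JSON in the <script id="__NEXT_DATA__" type="application/json"> .. </script> element. We can extract it with rvest and parse it with jsonlite; the locations sit a bit deeper in the resulting nested list, at props > pageProps > locations: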

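library(rvest)
library(dplyr)
library(tidyr)

url <- "https://locations.wafflehouse.com"

page <- read_html(url)
store_data <- 
  page |> 
  # the embedded JSON lives in the __NEXT_DATA__ script element
  html_element("script#__NEXT_DATA__") |> 
  html_text() |> 
  jsonlite::fromJSON() |>
  # drill down to the locations entry of the nested list
  purrr::pluck("props", "pageProps", "locations") |>
  # flatten the nested addressLines and custom columns
  unnest(addressLines) |>
  unnest(custom) |> 
  as_tibble()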

glimpse(store_data)
#> Rows: 1,978
#> Columns: 19
#> $ storeCode              <chr> "100", "1000", "1001", "1002", "1003", "1004", …
#> $ businessName           <chr> "Waffle House #100", "Waffle House #1000", "Waf…
#> $ addressLines           <chr> "2842 PANOLA RD", "2840 E. COLLEGE AVE.", "1292…
#> $ city                   <chr> "LITHONIA", "DECATUR", "LOUISVILLE", "NORMAN", …
#> $ state                  <chr> "GA", "GA", "KY", "OK", "MS", "AL", "GA", "MO",…
#> $ country                <chr> "US", "US", "US", "US", "US", "US", "US", "US",…
#> $ operated_by            <chr> "WAFFLE HOUSE, INC", "WAFFLE HOUSE, INC", "FULL…
#> $ online_order_link      <chr> NA, "https://order.wafflehouse.com/menu/waffle-…
#> $ postalCode             <chr> "30058", "30030", "40243", "73072", "39520", "3…
#> $ latitude               <dbl> 33.70471, 33.77522, 38.24359, 35.23244, 30.3132…
#> $ longitude              <dbl> -84.16985, -84.27374, -85.51321, -97.48904, -89…
#> $ phoneNumbers           <list> "(770) 981-1914", "(404) 294-8758", "(502) 244…
#> $ websiteURL             <chr> "https://locations.wafflehouse.com///lithonia-g…
#> $ businessHours          <list> <"00:00", "00:00", "00:00", "00:00", "00:00", …
#> $ specialHours           <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>…
#> $ formattedBusinessHours <list> "Monday - Sunday| 24 hours", "Monday - Sunday|…
#> $ slug                   <chr> "lithonia-ga-100", "decatur-ga-1000", "louisvil…
#> $ localPageUrl           <chr> "/lithonia-ga-100", "/decatur-ga-1000", "/louis…
#> $ `_status`              <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A…

Created on 2023-08-20 with reprex v2.0.2
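
To get from that result to the data frame the question asks for (store number, street address, city, state, ZIP code), one possible follow-up is a select() with renaming. The sketch below is not part of the original answer: it uses a hypothetical two-row stand-in for store_data, with values copied from the glimpse() output above, so that it runs without network access. The new column names are arbitrary choices.

```r
library(dplyr)

# Hypothetical stand-in for the scraped store_data (first two rows only)
store_data <- tibble(
  storeCode    = c("100", "1000"),
  businessName = c("Waffle House #100", "Waffle House #1000"),
  addressLines = c("2842 PANOLA RD", "2840 E. COLLEGE AVE."),
  city         = c("LITHONIA", "DECATUR"),
  state        = c("GA", "GA"),
  postalCode   = c("30058", "30030")
)

# Keep only the requested fields; select() renames with new_name = old_name
stores <- store_data |>
  select(
    store_number   = storeCode,
    street_address = addressLines,
    city,
    state,
    zip            = postalCode
  )
```

Because unnest(addressLines) produces one row per address line, it may be worth checking the real data for repeated storeCode values before treating each row as one store.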
