Extracting latitude and longitude with rvest

Problem description

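I am trying to get the latitude and longitude of all stores listed at https://www.wellcome.com.hk/sc/our-store

When I inspect the page, I can see that the lat and lon values are contained in a div.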

library(dplyr)
library(rvest)

url_company <- rvest::read_html("https://www.wellcome.com.hk/en/our-store") 
url_company %>%
 html_elements("div") %>% # extract all div elements
 html_elements("p") # extract all p elements nested within them

How can I access the data-lat and data-lng attributes?

html r web-scraping rvest
1 Answer

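You may want to check the page source rather than the Inspector. Alternatively, disable JavaScript first, reload the page, and then inspect it. What you see in the Inspector is the DOM tree after JavaScript has modified it, whereas rvest can only work with the actual page source. The same part of the source looks like this: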

<div class="wellcome_map_shop">
<div class="wellcome_map_filter js-filter"><select aria-label="Location" name="location"></select> <select aria-label="District" name="district"></select></div>

<div class="wellcome_map_loc"><span class="js-loc_on" style="display:none">Location on</span> <span class="js-loc_off">Location permission is denied</span></div>

<div class="wellcome_map_shop_item js-shop_template js-shop_item" style="display:none">
<div class="wellcome_map_shop_detail">
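A quick way to see the difference from R is sketched below. Both HTML fragments are hypothetical stand-ins, not copies of the live page: the first imitates a rendered item carrying the data-lat / data-lng attributes named in the question (the values are invented), and the second imitates the static template item from the excerpt above. html_attr() is how such attributes would be read if they were present in the source.

```r
library(rvest)

# Hypothetical rendered DOM: attribute names come from the question,
# the values are made up for illustration
rendered <- minimal_html(
  '<div class="wellcome_map_shop_item" data-lat="22.41" data-lng="113.97"></div>'
)

# Stand-in for the static source: the template item has no coordinates
static <- minimal_html(
  '<div class="wellcome_map_shop_item js-shop_template js-shop_item" style="display:none"></div>'
)

# Select elements that have a data-lat attribute, then read the attributes
rendered_items <- html_elements(rendered, "[data-lat]")
html_attr(rendered_items, "data-lat")
html_attr(rendered_items, "data-lng")

# The same selector finds nothing in the static fragment
static_items <- html_elements(static, "[data-lat]")
length(static_items)
```

If the selector returns an empty node set on the real page, the attributes are most likely added later by JavaScript.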

The coordinates are present in the source, though, together with all the other map data, embedded in one of the <script> elements:

<script type="text/javascript">
<!--//--><![CDATA[// ><!--

var googleMap        = null;
...
var googleMapData = [
{"name":"Ching Tin","addr":"Shop No. G6, G/F Ching Tin Shopping Centre, Ching Tin Estate, Tuen Mun, N.T.","name_zh":"菁田","addr_zh":"屯門菁田邨菁田購物中心地下G6室","tel":"2317 6863","time":"08:00-22:00","time_zh":"08:00-22:00","region":"32","district":"24","lat":22.4123694,"lng":113.9714609},
{"name":"Lei King Wan","addr":"Shop GC19-21. Site C. Lei King Wan, 35 Tai Hong Street, Sai Wan Ho, Hong Kong","name_zh":"鯉景灣","addr_zh":"香港西灣河太康街35號鯉景灣C 期GC19-21號舖","tel":"2815 6029","time":"07:30-22:00","time_zh":"07:30-22:00","region":"30","district":"161","lat":22.2851255,"lng":114.2233381},
...

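We could use rvest to extract the element's content and process the resulting string to get the lat/lon values. Or, a bit smarter, apply only minimal processing to get the right-hand side of the var googleMapData = assignment, which can then be parsed as JSON with jsonlite to get a nice data.frame. Or, if we are feeling super lazy, we can throw the whole <script> element content at V8 (a JavaScript engine), cross our fingers, and fetch the value of the googleMapData JS variable: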

library(rvest)
library(dplyr)
library(V8)
#> Using V8 engine 11.8.172.13

ctx <- v8()

# load the page, use XPath to select the right script element
# (the one containing the text "googleMapData"), get its text and evaluate it as JavaScript
read_html("https://www.wellcome.com.hk/en/our-store") %>% 
  html_elements(xpath =  "//script[contains(text(),'googleMapData')]") %>% 
  html_text() %>% 
  ctx$eval() 

# we only care about `var googleMapData = [...]` assignment, rest of the script
# might as well fail; 
# extract googleMapData value from v8
ctx$get("googleMapData") %>% 
  as_tibble() %>% 
  # format lat/lon columns
  mutate(across(where(is.numeric), ~ tibble::num(.x, digits = 2))) %>% 
  select(name, addr, lat, lng)
#> # A tibble: 279 × 4
#>    name             addr                                              lat    lng
#>    <chr>            <chr>                                           <num> <num:>
#>  1 Ching Tin        Shop No. G6, G/F Ching Tin Shopping Centre, Ch… 22.41 113.97
#>  2 Lei King Wan     Shop GC19-21. Site C. Lei King Wan, 35 Tai Hon… 22.29 114.22
#>  3 Garden Estate    Shop No. 15-18, G/F Lotus Tower 3, 297 Kwun To… 22.32 114.22
#>  4 Tsuen Wan        57-61, Lo Tak Court, G/F, Tsuen Wan, NT         22.37 114.12
#>  5 Pak Tin Estate   Shop LG201, Lower Ground Level 2, Pak Tin Comm… 22.34 114.17
#>  6 Tak Bo Garden    Shop 138, G/F, TBG Mall, Tak Bo Garden, No. 3 … 22.33 114.21
#>  7 Dor Hei Building Shop No.2-3, G/F, Dor Hei Building, Nos.9-17 T… 22.32 114.22
#>  8 Chevalier House  Shop C and Portion of Shop D on Ground Floor, … 22.30 114.18
#>  9 Shan King 2      Stall No. T-SK73, G/F, Shan King Shopping Cent… 22.40 113.97
#> 10 Shek Mun         Shop No. 28, G/F, 1 On Ping Street, Shatin, NT  22.39 114.21
#> # ℹ 269 more rows

Created on 2023-11-24 with reprex v2.0.2
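For comparison, the jsonlite route mentioned in the answer might look roughly like the sketch below. To keep it runnable offline, script_txt is a shortened stand-in for the text that html_text() would return from the script element, with only a few fields from the two records shown earlier. The regular expression assumes that the array literal is valid JSON and that no "]" occurs inside the data; neither has been checked against the live page.

```r
library(jsonlite)

# Stand-in for the <script> text; on the live page this would come from
# html_elements(xpath = "//script[contains(text(),'googleMapData')]") %>% html_text()
script_txt <- '
var googleMap = null;
var googleMapData = [
{"name":"Ching Tin","tel":"2317 6863","lat":22.4123694,"lng":113.9714609},
{"name":"Lei King Wan","tel":"2815 6029","lat":22.2851255,"lng":114.2233381}
];
var somethingElse = 1;
'

# Keep only the right-hand side of the assignment: from the opening "["
# to the first "]" (assumption: the data itself contains no "]")
json_txt <- sub(
  "(?s).*var googleMapData\\s*=\\s*(\\[.*?\\]).*",
  "\\1",
  script_txt,
  perl = TRUE
)

# A JSON array of objects is parsed into a data.frame
stores <- fromJSON(json_txt)
stores[, c("name", "lat", "lng")]
```

Compared with V8, this avoids evaluating the page's JavaScript, at the cost of depending on how the assignment is written in the source.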
