I am trying to get the latitude and longitude of all the stores listed at https://www.wellcome.com.hk/sc/our-store
library(dplyr)
library(rvest)
url_company <- rvest::read_html("https://www.wellcome.com.hk/en/our-store")
url_company %>%
html_elements("div") %>% # extract all div elements
html_elements("p") # extract all p elements nested inside them
How can I access the data-lat and data-lng attributes?
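You might want to check the page source rather than the Inspector. Alternatively, disable JavaScript first, reload, and then inspect. What you see in the Inspector is the DOM tree after JavaScript has modified it, whereas rvest only has access to the actual page source. The same section in the source looks like this: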
<div class="wellcome_map_shop">
<div class="wellcome_map_filter js-filter"><select aria-label="Location" name="location"></select> <select aria-label="District" name="district"></select></div>
<div class="wellcome_map_loc"><span class="js-loc_on" style="display:none">Location on</span> <span class="js-loc_off">Location permission is denied</span></div>
<div class="wellcome_map_shop_item js-shop_template js-shop_item" style="display:none">
<div class="wellcome_map_shop_detail">
The coordinates are in fact present, along with all the other map data, embedded in one of the <script> elements:
<script type="text/javascript">
<!--//--><![CDATA[// ><!--
var googleMap = null;
...
var googleMapData = [
{"name":"Ching Tin","addr":"Shop No. G6, G/F Ching Tin Shopping Centre, Ching Tin Estate, Tuen Mun, N.T.","name_zh":"菁田","addr_zh":"屯門菁田邨菁田購物中心地下G6室","tel":"2317 6863","time":"08:00-22:00","time_zh":"08:00-22:00","region":"32","district":"24","lat":22.4123694,"lng":113.9714609},
{"name":"Lei King Wan","addr":"Shop GC19-21. Site C. Lei King Wan, 35 Tai Hong Street, Sai Wan Ho, Hong Kong","name_zh":"鯉景灣","addr_zh":"香港西灣河太康街35號鯉景灣C 期GC19-21號舖","tel":"2815 6029","time":"07:30-22:00","time_zh":"07:30-22:00","region":"30","district":"161","lat":22.2851255,"lng":114.2233381},
...
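We could use rvest to extract the element's content and process the resulting string to get the lat/lng values. Or, a bit smarter, apply only minimal processing to isolate the right-hand side of the var googleMapData = assignment, which could then be parsed as JSON with jsonlite to get a nice data.frame. Or, if we are feeling super lazy, we can throw the whole content of that <script> element into V8 (a JavaScript engine), cross our fingers, and read back the value of the googleMapData JS variable: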
library(rvest)
library(dplyr)
library(V8)
#> Using V8 engine 11.8.172.13
ctx <- v8()
# load page, use xpath to extract correct script element,
# the one containing text "googleMapData", get text and evaluate as JavaScript
read_html("https://www.wellcome.com.hk/en/our-store") %>%
html_elements(xpath = "//script[contains(text(),'googleMapData')]") %>%
html_text() %>%
ctx$eval()
# we only care about `var googleMapData = [...]` assignment, rest of the script
# might as well fail;
# extract googleMapData value from v8
ctx$get("googleMapData") %>%
as_tibble() %>%
# format lat/lng columns to 2 decimal places for display
mutate(across(where(is.numeric), ~ tibble::num(.x, digits = 2))) %>%
select(name, addr, lat, lng)
#> # A tibble: 279 × 4
#> name addr lat lng
#> <chr> <chr> <num:.2!> <num:.2!>
#> 1 Ching Tin Shop No. G6, G/F Ching Tin Shopping Centre, Ch… 22.41 113.97
#> 2 Lei King Wan Shop GC19-21. Site C. Lei King Wan, 35 Tai Hon… 22.29 114.22
#> 3 Garden Estate Shop No. 15-18, G/F Lotus Tower 3, 297 Kwun To… 22.32 114.22
#> 4 Tsuen Wan 57-61, Lo Tak Court, G/F, Tsuen Wan, NT 22.37 114.12
#> 5 Pak Tin Estate Shop LG201, Lower Ground Level 2, Pak Tin Comm… 22.34 114.17
#> 6 Tak Bo Garden Shop 138, G/F, TBG Mall, Tak Bo Garden, No. 3 … 22.33 114.21
#> 7 Dor Hei Building Shop No.2-3, G/F, Dor Hei Building, Nos.9-17 T… 22.32 114.22
#> 8 Chevalier House Shop C and Portion of Shop D on Ground Floor, … 22.30 114.18
#> 9 Shan King 2 Stall No. T-SK73, G/F, Shan King Shopping Cent… 22.40 113.97
#> 10 Shek Mun Shop No. 28, G/F, 1 On Ping Street, Shatin, NT 22.39 114.21
#> # ℹ 269 more rows
Created on 2023-11-24 with reprex v2.0.2
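For completeness, the jsonlite route mentioned above could look roughly like the sketch below. It is not the method used in the reprex. It assumes the array literal is terminated by "];" and is also valid JSON (double-quoted keys, no trailing comma), which the excerpt suggests but does not guarantee. extract_map_data is a hypothetical helper name, and sample_script is a shortened stand-in for the real script text.

```r
library(jsonlite)

# Hypothetical helper: isolate the array literal assigned to
# `var googleMapData` in a script's text and parse it as JSON.
extract_map_data <- function(script_text) {
  # (?s) lets "." match newlines; \K drops the "var googleMapData =" prefix
  # from the match; the lazy ".*?" stops at the first "]" followed by ";".
  pattern <- "(?s)var googleMapData\\s*=\\s*\\K\\[.*?\\](?=\\s*;)"
  m <- regmatches(script_text, regexpr(pattern, script_text, perl = TRUE))
  if (length(m) == 0) stop("googleMapData assignment not found")
  # An array of flat JSON objects is simplified to a data.frame
  fromJSON(m)
}

# Shortened stand-in for the script element's text (assumed shape)
sample_script <- '
var googleMap = null;
var googleMapData = [
{"name":"Ching Tin","lat":22.4123694,"lng":113.9714609},
{"name":"Lei King Wan","lat":22.2851255,"lng":114.2233381}
];
'

stores <- extract_map_data(sample_script)
stores[, c("name", "lat", "lng")]
```

On the live page, the input would be the html_text() of the script element selected with the same XPath as in the V8 example. If the array turns out not to be strict JSON, fromJSON() will raise an error, and the V8 approach is the more forgiving option.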