Scraping dl, dt, dd HTML data

Question

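I am trying to use rvest and SelectorGadget to scrape publicly available property descriptions from an online search. I have followed several online tutorials on web scraping, but I get nothing back. If anyone could point me in the right direction, I would greatly appreciate it!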

library(rvest)

Site <- "https://paol.snb.ca/paol.html?lang=en&pan=00100004"
snb <- read_html(Site)
snb %>% html_nodes("dd") %>% html_text()

Tags: html, r, web-scraping, html-parsing, rvest

1 Answer

You don't need RSelenium. The page gets its data from a hidden API, and calling that API directly is much faster.

Open the developer tools in Chrome and look in the Network tab to find the API URL.

Instead of the original URL, request the hidden API: https://paol.snb.ca/pas-shim/api/paol/dossier/00100004

library(rvest)
library(httr)

myurl <- "https://paol.snb.ca/pas-shim/api/paol/dossier/00100004"
# Any user agent string works here
ua <- user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36")
# Copy the cookie from your browser; without it the request fails with a "no cookie available" error
my_cookie <- "copy_your_cookie_from_browser"
# Note: html_session() was renamed to session() in rvest 1.0.0
my_session <- html_session(myurl, ua,
                           add_headers(Cookie = my_cookie))

# The response body is a JSON string; httr::content() parses it into a list
result_list <- httr::content(my_session$response, as = "parsed")
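
The API URL above ends in the same account number that appears as the `pan` query parameter of the public page. If the same pattern holds for other properties (an assumption, not something the site documents), the API URL could be derived from the page URL. A minimal sketch in base R, where `build_dossier_url` is a hypothetical helper name:

```r
# Hypothetical helper: build the dossier API URL from a public page URL.
# Assumes every property follows the /dossier/<pan> pattern seen above.
build_dossier_url <- function(page_url) {
  # Pull the value of the `pan` query parameter out of the page URL
  pan <- sub(".*[?&]pan=([^&#]*).*", "\\1", page_url)
  # sub() returns its input unchanged when the pattern does not match
  if (identical(pan, page_url)) {
    stop("No 'pan' parameter found in: ", page_url)
  }
  paste0("https://paol.snb.ca/pas-shim/api/paol/dossier/", pan)
}

site <- "https://paol.snb.ca/paol.html?lang=en&pan=00100004"
api_url <- build_dossier_url(site)
print(api_url)
```

The resulting string can be passed in place of `myurl`; the cookie and user agent are still needed for the request itself.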

Sample result:

> result_list$summary
$`taxAuth`
[1] "137 - HAUT-MADAWASKA"

$currAsst
[1] 7500

$curLevy
[1] 156.64

$pan
[1] "00100004"

$asstYear
[1] 2018

$imageKey
[1] ""

$description
[1] "Recreational Lot"

$location
[1] "1036 RTE 215"
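
Since every field in the summary is a single value, the list can be flattened into a one-row data frame for further work. The sketch below does not call the API: `summary_list` is a stand-in copied from the sample output above, and the names `summary_list` and `summary_df` are only illustrative.

```r
# Stand-in for result_list$summary, copied from the sample output above
summary_list <- list(
  taxAuth = "137 - HAUT-MADAWASKA",
  currAsst = 7500,
  curLevy = 156.64,
  pan = "00100004",
  asstYear = 2018,
  imageKey = "",
  description = "Recreational Lot",
  location = "1036 RTE 215"
)

# Each element has length 1, so the list maps onto a single row;
# stringsAsFactors = FALSE keeps text columns (including pan) as character
summary_df <- as.data.frame(summary_list, stringsAsFactors = FALSE)
str(summary_df)
```

With a live response, `result_list$summary` would take the place of `summary_list`. If the API ever returns a null field, that element would likely come back as `NULL` and the conversion would fail, so replacing such elements with `NA` first may be necessary.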