Web scraping an interactive map (JavaScript) with R and PhantomJS

Question

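I am trying to scrape data from an interactive map (to get crime data for a county). I am using R (rvest) and trying to use PhantomJS as well. I am new to web scraping, so I don't yet understand how all the pieces work together (I'm trying to get there).

I think the problem is that, after running PhantomJS and loading the resulting HTML with R's rvest package, I end up with more scripts and no clear data in the HTML. My code is below.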

writeLines("var url = 'http://www.google.com';
var page = new WebPage();
var fs = require('fs');

page.open(url, function (status) {
    just_wait();
});

function just_wait() {
    setTimeout(function() {
        fs.write('cool.html', page.content, 'w');
        phantom.exit();
    }, 2500);
}
", con = "scrape.js")

A function that takes the URL I want to scrape:

js_scrape <- function(url = "https://gis.adacounty.id.gov/apps/crimemapper/",
                      js_path = "scrape.js",
                      phantompath = "/Users/alihoop/Documents/phantomjs/bin/phantomjs") {

  # Replace the URL on the first line of scrape.js with the one passed in
  lines <- readLines(js_path)
  lines[1] <- paste0("var url ='", url, "';")
  writeLines(lines, js_path)

  # Run PhantomJS on the script; it writes the rendered page to cool.html
  command <- paste(phantompath, js_path, sep = " ")
  system(command)
}

Run the js_scrape() function to get the HTML file saved as "cool.html":

js_scrape()

Where I don't understand what to do next is the R code below:

library(rvest)

map_data <- read_html('cool.html') %>%
            html_nodes('script')

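The output I get in the HTML via PhantomJS is, again, just scripts. I am looking for help on how to proceed when the data is (as it seems to me) nested inside JavaScript scripts.

Thanks!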

Answer 1

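This site uses JavaScript to query the server. One solution is to reproduce the REST request and read the returned JSON directly. This avoids the need for PhantomJS.

In your browser's developer tools, look through the XHR requests and you will find one named "query" with a link similar to: https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000

Read this JSON response directly and convert it to a list with the jsonlite package: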

library(jsonlite)
output<-jsonlite::fromJSON("https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000")
output$features
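
To make the shape of output$features easier to work with, here is a minimal sketch that runs offline. It assumes the response follows the usual ArcGIS layout, where each feature has an "attributes" object and a "geometry" object; the sample JSON and its field names (OBJECTID, Offense) are made up for illustration and are not taken from the real service.

```r
library(jsonlite)

# Hypothetical sample mimicking the assumed shape of a "query" response;
# the field names and values are invented for illustration only.
sample_json <- '{
  "features": [
    {"attributes": {"OBJECTID": 1, "Offense": "A"}, "geometry": {"x": 10.5, "y": 20.5}},
    {"attributes": {"OBJECTID": 2, "Offense": "B"}, "geometry": {"x": 11.5, "y": 21.5}}
  ]
}'

output <- fromJSON(sample_json)

# features is a data frame whose columns (attributes, geometry) are
# themselves data frames; jsonlite::flatten() expands them into plain columns.
crimes <- jsonlite::flatten(output$features)
names(crimes)
```

If the real response has the same layout, the same flatten() call on output$features should give one row per incident with ordinary columns.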
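
Locate the first number in the link (11 in this case): "FeatureServer/11/query?f=json". This number determines which crime type the server is queried for. I found that it can take a value from 0 to 11: use 0 for arson, 4 for drugs, 11 for vandalism, and so on.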
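
Switching between crime types can be wrapped in a small helper. This is only a sketch: crime_query_url() is a hypothetical name, and the offset and count arguments simply expose the resultOffset and resultRecordCount values already present in the link above.

```r
library(jsonlite)

# Hypothetical helper: builds the query URL for one crime layer (0 to 11).
crime_query_url <- function(layer, offset = 0, count = 1000) {
  stopifnot(layer %in% 0:11)
  base <- "https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer"
  paste0(base, "/", as.integer(layer),
         "/query?f=json&where=1%3D1&returnGeometry=true",
         "&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100",
         "&resultOffset=", as.integer(offset),
         "&resultRecordCount=", as.integer(count))
}

# Build the URL for layer 4 (drugs, per the mapping described above)
drugs_url <- crime_query_url(4)

# Not run here because it needs network access:
# drugs <- fromJSON(drugs_url)$features
```

Looping this helper over 0:11 and reading each URL with fromJSON() would collect every crime type, assuming the service still responds at this address.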
