Extracting details from a Google Earth KML file in R

Question. Votes: 0, Answers: 2

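I am trying to get details for a series of locations from a Google Earth KML file.

Getting the ID and coordinates works, but the location name, which sits in the first table cell (td tag) of the description, comes back with the same value for every location when I run it over all of them (Stratford Road, the name of the first location).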

library(sf)
library(tidyverse)
library(rvest)

removeHtmlTags <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}
getHtmlTableCells <- function(htmlString) {
  # Convert html to html doc
  htmldoc <- read_html(htmlString)
  # get html for each cell (i.e. within <td></td>)
  table_cells_with_tags <- html_nodes(htmldoc, "td")
  # remove the html tags (<td></td>)
  return(removeHtmlTags(table_cells_with_tags))[1]
}

download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1","aqms.kml")
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(longitude = st_coordinates(geometry)[,1],  # X column is longitude
         latitude = st_coordinates(geometry)[,2],   # Y column is latitude
         name = getHtmlTableCells(Description)[1]) %>%
  st_drop_geometry()

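Now, if I use the function on a single location and take the first table cell (td), it works. The two calls below return Stratford Road and Selly Oak (Bristol Road) respectively.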

getHtmlTableCells(locations$Description[1])[1]
getHtmlTableCells(locations$Description[2])[1]

What am I doing wrong?

r kml rvest sf
2 Answers
0
votes

read_html is not vectorised: it does not accept a vector of HTML strings to parse. We can apply your function to each element of the vector instead:

locations <- st_read("aqms.kml", stringsAsFactors = FALSE) 

locations %>%
  rename(id = Name) %>%
  mutate(longitude = st_coordinates(geometry)[,1],  # X column is longitude
         latitude = st_coordinates(geometry)[,2],   # Y column is latitude
         name = sapply(Description, function(x) getHtmlTableCells(x)[1])) %>%
  st_drop_geometry()

#>    longitude latitude                      name
#> 1  -1.871622 52.45920            Stratford Road
#> 2  -1.934559 52.44513  Selly Oak (Bristol Road)
#> 3  -1.830070 52.43771              Acocks Green
#> 4  -1.898731 52.48180               Colmore Row
#> 5  -1.896764 52.48607        St Chads Queensway
#> 6  -1.891955 52.47990     Moor Street Queensway
#> 7  -1.918173 52.48138       Birmingham Ladywood
#> 8  -1.902121 52.47675       Lower Severn Street
#> 9  -1.786413 52.56815                  New Hall
#> 10 -1.874989 52.47609 Birmingham A4540 Roadside
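
As a variation on the sapply call above, vapply runs the same element-by-element loop but also checks that every result is a single string, which can catch a description with no td cells early. This is only a sketch: the two HTML strings are made-up stand-ins for the Description column, and first_cell_text is a hypothetical helper name, not part of rvest.

```r
library(rvest)

# Made-up stand-ins for the Description column of the KML file (assumption)
descriptions <- c(
  "<table><tr><td>Stratford Road</td><td>Roadside</td></tr></table>",
  "<table><tr><td>Selly Oak (Bristol Road)</td><td>Roadside</td></tr></table>"
)

# Hypothetical helper: parse ONE html string and return the text of its first <td>
first_cell_text <- function(html_string) {
  cells <- html_nodes(read_html(html_string), "td")
  html_text(cells)[1]
}

# vapply calls the helper once per string and insists on one character value each
location_names <- vapply(descriptions, first_cell_text, character(1),
                         USE.NAMES = FALSE)
print(location_names)
```

USE.NAMES = FALSE keeps the result from being named after the input strings, which is what happens by default when sapply or vapply is given a character vector.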

0
votes

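Your getHtmlTableCells function is not vectorised. It works fine if you pass it a single HTML string, but if you pass it several strings it only processes the first one. You have also put the [1] after the return statement, where it does nothing; it needs to go inside the parentheses of return(). Once that is done, the function is easy to vectorise with sapply.

So, with a minor change to your function...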

getHtmlTableCells <- function(htmlString) {
  # Convert html to html doc
  htmldoc <- read_html(htmlString)
  # get html for each cell (i.e. within <td></td>)
  table_cells_with_tags <- html_nodes(htmldoc, "td")
  # remove the html tags (<td></td>)
  return(removeHtmlTags(table_cells_with_tags)[1])
}
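
The misplaced [1] has no effect because return() leaves the function as soon as it is evaluated, so the subsetting wrapped around it never runs. A small base R illustration, with hypothetical function names and a made-up vector of cell values:

```r
# Hypothetical helpers that differ only in where the [1] sits
first_wrong <- function(x) {
  return(x)[1]   # return() exits here; the [1] is never applied
}

first_right <- function(x) {
  return(x[1])   # subset first, then return
}

# Made-up cell values standing in for the parsed <td> contents
cells <- c("Stratford Road", "Roadside", "NO2")

print(length(first_wrong(cells)))  # the whole vector comes back
print(first_right(cells))          # only the first element comes back
```

In the question this went unnoticed because [1] was applied again on the result of each call.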

and vectorising it like this:

download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1","aqms.kml")

locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(longitude = st_coordinates(geometry)[,1],  # X column is longitude
         latitude = st_coordinates(geometry)[,2],   # Y column is latitude
         name = sapply(as.list(Description), getHtmlTableCells)) %>%
  st_drop_geometry()

Which gives the correct result:

locations$name
#>  [1] "Stratford Road"            "Selly Oak (Bristol Road)" 
#>  [3] "Acocks Green"              "Colmore Row"              
#>  [5] "St Chads Queensway"        "Moor Street Queensway"    
#>  [7] "Birmingham Ladywood"       "Lower Severn Street"      
#>  [9] "New Hall"                  "Birmingham A4540 Roadside"
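
For what it's worth, the repeated name in the question can be reproduced without any KML file. Inside mutate() a function receives the whole column at once, and a length-one result is recycled to every row. The sketch below uses a toy data frame and a hypothetical first_only helper that, like the original function, only ever looks at the first element.

```r
library(dplyr)

# Hypothetical non-vectorised helper: it only uses the first element of its input
first_only <- function(x) toupper(x[1])

# Toy data standing in for the locations table (assumption)
toy <- data.frame(desc = c("a", "b", "c"), stringsAsFactors = FALSE)

result <- toy %>%
  mutate(
    recycled = first_only(desc),                 # one value, repeated on every row
    per_row = sapply(as.list(desc), first_only)  # one call per row
  )
print(result)
```

The recycled column holds the first row's value three times, while per_row holds a separate value for each row.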