Extract XML data in a table format


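I have an XML file that I want to extract data from. Ultimately, what I need is a table showing the node names (i.e. NODE36, NODE44) together with the information from each node's embedded table (see the desired output below).

Is there a way to use regex or an XML parser to extract the data as a table?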

<?xml version="1.0" encoding="UTF-8"?>
<Document>
    <name>culverts.XML</name>
    <StyleMap id="m_ylw-pushpin29">
        <Pair>
            <key>normal</key>
            <styleUrl>#s_ylw-pushpin00</styleUrl>
        </Pair>
        <Pair>
            <key>highlight</key>
            <styleUrl>#s_ylw-pushpin_hl25</styleUrl>
        </Pair>
    </StyleMap>
<Folder>
        <name>culverts.XML</name>
        <open>1</open>
        <description>Culvert</description>
        <Placemark>
            <name>NODE36</name>
            <description><![CDATA[<br><br><br>
    <table border="1" padding="0">
    <tr><td>Objectid</td><td>1</td></tr>
    <tr><td>On_route</td><td>Mid Turnpike</td></tr>
    <tr><td>Road_numbe</td><td>54</td></tr>
    <tr><td>Recommenda</td><td>Continue to monitor.</td></tr>]]></description>
            <styleUrl>#m_ylw-pushpin29</styleUrl>
            <Point>
                <extrude>1</extrude>
                <altitudeMode>relativeToGround</altitudeMode>
                <coordinates>-74.249045,45.997986,0</coordinates>
            </Point>
        </Placemark>
        <Placemark>
            <name>NODE44</name>
            <description><![CDATA[<br><br><br>
    <table border="1" padding="0">
    <tr><td>Objectid</td><td>2</td></tr>
    <tr><td>On_route</td><td>Mid Turnpike</td></tr>
    <tr><td>Road_numbe</td><td>54</td></tr>
    <tr><td>Recommenda</td><td>Not Available.</td></tr>]]></description>
            <styleUrl>#m_ylw-pushpin29</styleUrl>
            <Point>
                <extrude>1</extrude>
                <altitudeMode>relativeToGround</altitudeMode>
                <coordinates>-74.24906300000001,45.998057,0</coordinates>
            </Point>
        </Placemark>
    </Folder>
</Document>

Desired output:

Name    Objectid  On_route      Road_numbe  Recommenda
NODE36  1         Mid Turnpike  54          Continue to monitor.
NODE44  2         Mid Turnpike  54          Not Available.

I tried using regex to extract the data between <Placemark> and </Placemark>, to no avail:

library(qdapRegex)
library(stringr)  # for str_extract_all()
my_tbl <- rm_between(file_str, 'Placemark', '/Placemark', extract=TRUE)[[1]]

my_tbl <- str_extract_all(file_str, "Placemark((.|\n)*)/Placemark")
Error in stri_extract_all_regex(string, pattern, simplify = simplify,  : 
  Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)

I can't get this to work in R. And even if I could, the pattern matches from the first occurrence of <Placemark> to the last occurrence of </Placemark>; see here: https://regex101.com/r/bQOdDJ/1
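
The first-to-last behaviour described above is what a greedy quantifier does. A minimal sketch of the difference in base R, assuming a shortened, hypothetical stand-in for file_str (the helper name extract_all is also made up for illustration). A lazy `.*?` combined with the `(?s)` flag stops at the nearest closing tag and avoids the `(.|\n)*` alternation. Regex is still fragile for XML in general, so treat this as an illustration rather than a recommendation:

```r
# Hypothetical, shortened stand-in for the real file contents
file_str <- paste0(
  "<Folder>",
  "<Placemark><name>NODE36</name></Placemark>\n",
  "<Placemark><name>NODE44</name></Placemark>",
  "</Folder>"
)

# Hypothetical helper: return every match of `pattern` in `text`
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# (?s) lets "." match newlines as well
# Greedy: .* runs on to the LAST </Placemark>, so everything is one match
greedy <- extract_all("(?s)<Placemark>.*</Placemark>", file_str)

# Lazy: .*? stops at the FIRST </Placemark> after each <Placemark>
lazy <- extract_all("(?s)<Placemark>.*?</Placemark>", file_str)

length(greedy)
length(lazy)
```

With this input the greedy pattern yields a single match spanning both placemarks, while the lazy pattern yields one match per placemark.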

r regex xml xml-parsing xml2
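3 Answers
1 vote

Here is an approach with a helper function that converts an HTML table into a data frame. Basically, we iterate over the Placemark nodes and parse the HTML embedded in each description.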

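library(xml2)
library(purrr)
library(dplyr)  # provides tibble(), bind_rows(), bind_cols()

# xx is the XML from the question (a file path or a string)
doc <- xml2::read_xml(xx)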

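# Convert a parsed HTML table of key/value rows into a one-row data frame
# (the `_` placeholder below needs R >= 4.2)
table_to_dataframe <- function(x) {
  x |> xml_find_all(".//tr") |> 
    map(function(x) {
      # the two <td> cells of a row: key, then value
      x |> xml_find_all("./td") |> xml_text()
    }) |>
    do.call("rbind", args=_) |>                 # two-column character matrix
    (function(x) setNames(x[,2], x[,1]))() |>   # values named by their keys
    bind_rows()                                 # named vector -> one-row tibble
}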

doc |>
  xml_find_all("//Placemark") |>
  map_df(function(p) {
    name <- p |> xml_find_first("./name") |> xml_text()
    sub <- p |> xml_find_first("./description") |> xml_text() |> read_html()
    bind_cols(tibble(name), table_to_dataframe(sub))
  })

which returns

  name   Objectid On_route     Road_numbe Recommenda          
  <chr>  <chr>    <chr>        <chr>      <chr>               
1 NODE36 1        Mid Turnpike 54         Continue to monitor.
2 NODE44 2        Mid Turnpike 54         Not Available. 
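
The core step of the helper, turning key/value table rows into a single row of columns, can be seen in isolation. This is a minimal sketch using only xml2 and base R, assuming a shortened, hypothetical description string (html_txt is not part of the original answer):

```r
library(xml2)

# Hypothetical stand-in for the text of one <description> element
html_txt <- '<table border="1">
<tr><td>Objectid</td><td>1</td></tr>
<tr><td>On_route</td><td>Mid Turnpike</td></tr>
</table>'

tbl <- read_html(html_txt)
rows <- xml_find_all(tbl, ".//tr")

# The first <td> of each row is the column name, the second is the value
keys <- xml_text(xml_find_first(rows, "./td[1]"))
values <- xml_text(xml_find_first(rows, "./td[2]"))

# A named list becomes a one-row data frame with one column per key
one_row <- as.data.frame(as.list(setNames(values, keys)),
                         stringsAsFactors = FALSE)
one_row
```

Running this once per Placemark and row-binding the results gives the same shape as the output above, minus the name column.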

1 vote
library(rvest)
library(tidyverse)

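read_html(your_page, options = "HUGE") %>%
   html_node('table') %>%
   html_table(fill = TRUE) %>%
   # start a new record each time the key 'Objectid' appears
   mutate(row = cumsum(X1 == 'Objectid')) %>%
   # one column per key, one row per record
   pivot_wider(names_from = X1, values_from = X2) %>%
   type.convert(as.is = TRUE)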

# A tibble: 3 × 5
    row Objectid On_route     Road_numbe Recommenda          
  <int>    <int> <chr>             <int> <chr>               
1     1        1 Mid Turnpike         54 Continue to monitor.
2     2        2 Mid Turnpike         54 Not Available. 
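
The grouping trick in this answer may be easier to follow on a small example. This is a sketch with a hypothetical long-format tibble (named long here) shaped like the two-column table the pipeline works on. Because every record starts with an Objectid row, `cumsum(X1 == "Objectid")` increases by one at the start of each record, which gives pivot_wider() an id column to spread by:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format input: key in X1, value in X2
long <- tibble(
  X1 = c("Objectid", "On_route", "Objectid", "On_route"),
  X2 = c("1", "Mid Turnpike", "2", "Mid Turnpike")
)

wide <- long |>
  # running count of 'Objectid' rows = record number
  mutate(row = cumsum(X1 == "Objectid")) |>
  # one column per key, one row per record
  pivot_wider(names_from = X1, values_from = X2)

wide
```

This relies on Objectid always being the first key of a record. If the key order varied, the record boundaries would have to come from somewhere else.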

0 votes

The nested HTML makes this quite tricky, but here is an XSLT solution that uses the parse-html() function proposed for XSLT 4.0 and implemented in Saxon 12:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="4.0" expand-text="yes">
  
   <xsl:output method="html" indent="yes"/>  
   <xsl:variable name="tables">
      <xsl:for-each select="//Placemark">
         <data>
            <xsl:copy-of select="name"/>
            <description>
               <xsl:sequence select="parse-html(description)//*:table"/>
            </description>
         </data>
      </xsl:for-each>
   </xsl:variable>
   
   <xsl:template match="/">
      <table>
         <thead>
            <tr>
               <th>Name</th>
               <xsl:for-each select="$tables/data[1]//*:tr">
                  <th>{*:td[1]}</th>                 
               </xsl:for-each>
            </tr>
         </thead>
         <tbody>
            <xsl:for-each select="$tables/data">
               <tr>
                  <td>{name}</td>
                  <xsl:for-each select="description//*:tr">
                     <td>{*:td[2]}</td>
                  </xsl:for-each>
               </tr>
            </xsl:for-each>
         </tbody>
      </table>
   </xsl:template>
   
</xsl:stylesheet>

The output is:

<table>
   <thead>
      <tr>
         <th>Name</th>
         <th>Objectid</th>
         <th>On_route</th>
         <th>Road_numbe</th>
         <th>Recommenda</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <td>NODE36</td>
         <td>1</td>
         <td>Mid Turnpike</td>
         <td>54</td>
         <td>Continue to monitor.</td>
      </tr>
      <tr>
         <td>NODE44</td>
         <td>2</td>
         <td>Mid Turnpike</td>
         <td>54</td>
         <td>Not Available.</td>
      </tr>
   </tbody>
</table>