Converting XML tables to a tibble in R


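I am trying to convert XML tables with varying numbers of rows and columns into a data frame. I can do this with well-formed, predictable tables such as these two: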

 <table xml:id="a">
  <row role="label">
   <cell cols="2">Stuff</cell>
  </row>
  <row>
   <cell>Thing</cell>
   <cell>1</cell>
  </row>
  <row>
   <cell>Another thing</cell>
   <cell>2</cell>
  </row>
 </table>
 <table xml:id="b">
  <row role="label">
   <cell cols="2">Nonsense</cell>
  </row>
  <row>
   <cell>Thing</cell>
   <cell>3</cell>
  </row>
  <row>
   <cell>Anything</cell>
   <cell>2</cell>
  </row>
  <row>
   <cell>Another thing</cell>
   <cell>2</cell>
  </row>
 </table>

I can turn them into a tibble like this:

# A tibble: 5 × 4
  id    label    cell.1        cell.2
  <chr> <chr>    <chr>          <dbl>
1 a     Stuff    Thing              1
2 a     Stuff    Another thing      2
3 b     Nonsense Thing              3
4 b     Nonsense Anything           2
5 b     Nonsense Another thing      2

using this code:

library(xml2)
library(dplyr)  # provides %>% and tibble()

x <- "table.xml"
file <- read_xml(x)
cells <- file %>% xml_find_all(".//cell")
output <- lapply(cells, function(d){
  id <- d %>% 
    xml_find_first(".//parent::row/parent::table")%>% 
    xml_attr("id")
  label <- d %>% 
    xml_find_first(".//parent::row/preceding-sibling::row[@role='label']")%>% 
    xml_text()
  cell.1 <- d %>% 
    xml_find_first(".//parent::row/cell")%>% 
    xml_text() 
  cell.2 <- d %>% 
    xml_find_all(".//following-sibling::cell")%>%
    xml_double() 
  tibble(id, label, cell.1, cell.2)
})
answer <- do.call(rbind, output)

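However, this approach depends on a particular attribute being present (@role='label'), on a consistent number of cells per row, and so on. I need to run this script on a batch of irregularly formatted XML tables.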

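If I add an extra cell to one of the rows in the example above, my approach fails. I suspect I may be going about this the wrong way. For example, could I do this with xml2::as_list()? My attempts so far have not succeeded.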

r xml xpath tidyverse xml2
1 Answer

Here is one possible approach that uses html_table from the rvest package together with tidyverse transformations. Note: the XML you show is not valid.
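On the validity point: as posted, the sample has two top-level table elements and no single root element, so it is not a well-formed XML document. Below is a minimal sketch of one way around that, assuming it is acceptable to wrap the content in an arbitrary root element. The name "tables" and the shortened sample are made up for illustration.

```r
library(xml2)

# Hypothetical, shortened sample: two sibling <table> elements with no
# enclosing root, mirroring the structure shown in the question.
fragment <- '<table xml:id="a"><row><cell>Thing</cell><cell>1</cell></row></table>
<table xml:id="b"><row><cell>Thing</cell><cell>3</cell></row></table>'

# Wrap the fragment in a single root element so that it parses.
# For a file on disk, the fragment could be built with
# paste(readLines(path), collapse = "\n") instead.
doc <- read_xml(paste0("<tables>", fragment, "</tables>"))

# Both tables are now reachable from the parsed document
n_tables <- length(xml_find_all(doc, ".//table"))
```

If the real files already have a root element, this step is unnecessary.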

### Packages
library(xml2)
library(rvest)
library(stringr)
library(dplyr)
library(purrr)

### Parse the XML and transform the result as character
a=read_xml("C:/Users/YourName/Downloads/YourFile.xml")
b=as.character(a)

### Replace the content of the XML to conform the tables to HTML tables structure
b=str_replace_all(b,"cell cols","td colspan")
b=str_replace_all(b,"row","tr")
b=str_replace_all(b,"cell","td")

### Parse the result of the transformation
c=read_xml(b)

### Get all ids of the tables
attr=html_elements(c,xpath = "//table") %>% html_attrs() %>% unlist()

### Get all the tables
temp=c%>%
  html_elements(xpath = "//table") %>%
  html_table()

### Declare a function to transform the tables
### Transform the last column from character to numeric
transform=function(x,y){x %>% slice(-1) %>% mutate(id=y,
                                  label=x[1,1][[1]],.before=1,
                                  X2=as.numeric(X2))}

### Apply the function
done=map2(.x = temp,.y = attr,.f = transform)

### Stack the tables and rename the columns
end=bind_rows(done) %>% rename_with(.fn = ~str_replace(.x,"X","cell."),
                                    .cols = starts_with("X"))

Output:

# A tibble: 5 × 4
  id    label    cell.1        cell.2
  <chr> <chr>    <chr>          <dbl>
1 a     Stuff    Thing              1
2 a     Stuff    Another thing      2
3 b     Nonsense Thing              3
4 b     Nonsense Anything           2
5 b     Nonsense Another thing      2
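
The code above assumes every table has a label row and exactly two data columns. For the irregular tables mentioned in the question, a row-by-row variant may cope better. The following is only a sketch, not part of the original answer: the sample document and the helper names row_to_tibble and table_to_tibble are hypothetical. It relies on bind_rows() filling missing columns with NA, and on xml_text() returning NA when no label row is found.

```r
library(xml2)
library(dplyr)
library(tibble)

# Hypothetical irregular input: table "a" has a row with an extra cell,
# table "b" has no label row at all.
doc <- read_xml('<tables>
 <table xml:id="a">
  <row role="label"><cell cols="3">Stuff</cell></row>
  <row><cell>Thing</cell><cell>1</cell></row>
  <row><cell>Another thing</cell><cell>2</cell><cell>extra</cell></row>
 </table>
 <table xml:id="b">
  <row><cell>Anything</cell><cell>2</cell></row>
 </table>
</tables>')

# Turn one <row> into a one-row tibble with columns cell.1, cell.2, ...
row_to_tibble <- function(row) {
  values <- xml_text(xml_find_all(row, "./cell"))
  names(values) <- paste0("cell.", seq_along(values))
  as_tibble(as.list(values))
}

# Turn one <table> into a tibble with one row per non-label <row>
table_to_tibble <- function(tbl) {
  table_id <- xml_attr(tbl, "id")
  # NA when the table has no row with role="label"
  table_label <- xml_text(xml_find_first(tbl, "./row[@role='label']/cell"))
  data_rows <- xml_find_all(tbl, "./row[not(@role='label')]")
  bind_rows(lapply(data_rows, row_to_tibble)) %>%
    mutate(id = table_id, label = table_label, .before = 1)
}

# Stack all tables; rows with fewer cells get NA in the missing columns
result <- bind_rows(lapply(xml_find_all(doc, ".//table"), table_to_tibble))
```

All cell columns come back as character here. Converting selected columns to numeric, as the answer does with as.numeric(), is left out because which columns hold numbers depends on the data.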