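I am trying to convert XML tables with varying numbers of rows and columns into a data frame. I can do this with well-formed, predictable tables, such as these two: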
<table xml:id="a">
<row role="label">
<cell cols="2">Stuff</cell>
</row>
<row>
<cell>Thing</cell>
<cell>1</cell>
</row>
<row>
<cell>Another thing</cell>
<cell>2</cell>
</row>
</table>
<table xml:id="b">
<row role="label">
<cell cols="2">Nonsense</cell>
</row>
<row>
<cell>Thing</cell>
<cell>3</cell>
</row>
<row>
<cell>Anything</cell>
<cell>2</cell>
</row>
<row>
<cell>Another thing</cell>
<cell>2</cell>
</row>
</table>
I can turn them into a tibble like this:
# A tibble: 5 × 4
id label cell.1 cell.2
<chr> <chr> <chr> <dbl>
1 a Stuff Thing 1
2 a Stuff Another thing 2
3 b Nonsense Thing 3
4 b Nonsense Anything 2
5 b Nonsense Another thing 2
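using this code:
library(xml2)    # read_xml(), xml_find_all(), xml_attr(), xml_text(), xml_double()
library(dplyr)   # %>%
library(tibble)  # tibble()
x <- "table.xml"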
file <- read_xml(x)
cells <- file %>% xml_find_all(".//cell")
output <- lapply(cells, function(d){
id <- d %>%
xml_find_first(".//parent::row/parent::table")%>%
xml_attr("id")
label <- d %>%
xml_find_first(".//parent::row/preceding-sibling::row[@role='label']")%>%
xml_text()
cell.1 <- d %>%
xml_find_first(".//parent::row/cell")%>%
xml_text()
cell.2 <- d %>%
xml_find_all(".//following-sibling::cell")%>%
xml_double()
tibble(id, label, cell.1, cell.2)
})
answer <- do.call(rbind, output)
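However, this approach depends on the expected attributes being present (`@role='label'`), on a consistent number of cells, and so on. I need to run this script on a batch of irregularly formatted XML tables.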
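If I add an extra cell to one of the rows in the example above, my approach fails. I suspect I may be going about this the wrong way. For example, could I do this with `xml2::as_list()`? My attempts so far have not worked.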
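Here is one possible approach using `html_table` from the rvest package, followed by some tidyverse transformations. Note: the XML you show is not valid (as posted, the two tables have no single root element).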
### Packages
library(xml2)
library(rvest)
library(stringr)
library(dplyr)
library(purrr)
### Parse the XML and transform the result as character
a=read_xml("C:/Users/YourName/Downloads/YourFile.xml")
b=as.character(a)
### Replace the content of the XML to conform the tables to HTML tables structure
b=str_replace_all(b,"cell cols","td colspan")
b=str_replace_all(b,"row","tr")
b=str_replace_all(b,"cell","td")
### Parse the result of the transformation
c=read_xml(b)
### Get all ids of the tables
attr=html_elements(c,xpath = "//table") %>% html_attrs() %>% unlist()
### Get all the tables
temp=c%>%
html_elements(xpath = "//table") %>%
html_table()
### Declare a function to transform the tables
### Transform the last column from character to numeric
transform=function(x,y){x %>% slice(-1) %>% mutate(id=y,
label=x[1,1][[1]],.before=1,
X2=as.numeric(X2))}
### Apply the function
done=map2(.x = temp,.y = attr,.f = transform)
### Stack the tables and rename the columns
end=bind_rows(done) %>% rename_with(.fn = ~str_replace(.x,"X","cell."),
.cols = starts_with("X"))
Output:
# A tibble: 5 × 4
id label cell.1 cell.2
<chr> <chr> <chr> <dbl>
1 a Stuff Thing 1
2 a Stuff Another thing 2
3 b Nonsense Thing 3
4 b Nonsense Anything 2
5 b Nonsense Another thing 2
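If some rows carry an extra cell, or a table has no `role="label"` row, a variation that stays in xml2 may be easier to adapt. The following is only a sketch under stated assumptions: the sample document, its `<root>` wrapper, and the helper name `table_to_tibble` are made up for illustration and are not part of the question or the code above. It builds one row per non-label `<row>`, names the cells by position, and relies on `bind_rows()` to fill missing columns with `NA`.

```r
library(xml2)
library(dplyr)
library(tibble)

# Assumption: a made-up sample wrapped in a hypothetical <root> element so that
# it parses. One row of table "a" has an extra cell; table "b" has no label row.
doc <- read_xml('<root>
<table xml:id="a">
<row role="label"><cell cols="2">Stuff</cell></row>
<row><cell>Thing</cell><cell>1</cell></row>
<row><cell>Another thing</cell><cell>2</cell><cell>extra</cell></row>
</table>
<table xml:id="b">
<row><cell>Anything</cell><cell>3</cell></row>
</table>
</root>')

# Hypothetical helper: one tibble per table, one row per non-label <row>.
table_to_tibble <- function(tbl) {
  id <- xml_attr(tbl, "id")
  # xml_find_first() yields a missing node when there is no label row,
  # and xml_text() of a missing node is NA.
  label <- xml_text(xml_find_first(tbl, "./row[@role='label']/cell"))
  rows <- xml_find_all(tbl, "./row[not(@role='label')]")
  bind_rows(lapply(rows, function(r) {
    vals <- xml_text(xml_find_all(r, "./cell"))
    # Name cells by position so rows of different widths still line up.
    names(vals) <- paste0("cell.", seq_along(vals))
    as_tibble(c(list(id = id, label = label), as.list(vals)))
  }))
}

# Stack all tables; columns absent from a given row are filled with NA.
result <- bind_rows(lapply(xml_find_all(doc, "//table"), table_to_tibble))
result
```

All cell columns come back as character here, so a column such as `cell.2` would still need an explicit `as.numeric()` if numbers are wanted.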