I have a data frame like this:
wide_df <- data.frame(
kingdom = c("Animalia", "Animalia", "Plantae", "Plantae"),
phylum = c("Chordata", "Chordata", "Angiosperms", "Angiosperms"),
class = c("Mammalia", "Mammalia", "Dicotyledons", "Dicotyledons"),
order = c("Carnivora", "Carnivora", "Rosales", "Solanales"),
family = c("Felidae", "Canidae", "Rosaceae", "Solanaceae"),
count = c(2, 3, 1, 4)
)
> wide_df
kingdom phylum class order family count
1 Animalia Chordata Mammalia Carnivora Felidae 2
2 Animalia Chordata Mammalia Carnivora Canidae 3
3 Plantae Angiosperms Dicotyledons Rosales Rosaceae 1
4 Plantae Angiosperms Dicotyledons Solanales Solanaceae 4
I want to restructure the data so that it looks like this:
hierarchical_df <- data.frame(
name = c("Animalia",
"Animalia",
"Animalia",
"Animalia",
"Animalia",
"Chordata",
"Chordata",
"Chordata",
"Chordata",
"Chordata",
"Mammalia",
"Mammalia",
"Mammalia",
"Mammalia",
"Mammalia",
"Carnivora",
"Carnivora",
"Carnivora",
"Carnivora",
"Carnivora",
"Felidae",
"Felidae",
"Canidae",
"Canidae",
"Canidae",
"Plantae",
"Plantae",
"Plantae",
"Plantae",
"Plantae",
"Angiosperms",
"Angiosperms",
"Angiosperms",
"Angiosperms",
"Angiosperms",
"Dicotyledons",
"Dicotyledons",
"Dicotyledons",
"Dicotyledons",
"Dicotyledons",
"Rosales",
"Solanales",
"Solanales",
"Solanales",
"Solanales",
"Rosaceae",
"Solanaceae",
"Solanaceae",
"Solanaceae",
"Solanaceae"),
parent = c(NA,
NA,
NA,
NA,
NA,
"Animalia",
"Animalia",
"Animalia",
"Animalia",
"Animalia",
"Chordata",
"Chordata",
"Chordata",
"Chordata",
"Chordata",
"Mammalia",
"Mammalia",
"Mammalia",
"Mammalia",
"Mammalia",
"Carnivora",
"Carnivora",
"Carnivora",
"Carnivora",
"Carnivora",
NA,
NA,
NA,
NA,
NA,
"Plantae",
"Plantae",
"Plantae",
"Plantae",
"Plantae",
"Angiosperms",
"Angiosperms",
"Angiosperms",
"Angiosperms",
"Angiosperms",
"Dicotyledons",
"Dicotyledons",
"Dicotyledons",
"Dicotyledons",
"Dicotyledons",
"Rosales",
"Solanales",
"Solanales",
"Solanales",
"Solanales"))
> hierarchical_df
name parent
1 Animalia <NA>
2 Animalia <NA>
3 Animalia <NA>
4 Animalia <NA>
5 Animalia <NA>
6 Chordata Animalia
7 Chordata Animalia
8 Chordata Animalia
9 Chordata Animalia
10 Chordata Animalia
11 Mammalia Chordata
12 Mammalia Chordata
13 Mammalia Chordata
14 Mammalia Chordata
15 Mammalia Chordata
16 Carnivora Mammalia
17 Carnivora Mammalia
18 Carnivora Mammalia
19 Carnivora Mammalia
20 Carnivora Mammalia
21 Felidae Carnivora
22 Felidae Carnivora
23 Canidae Carnivora
24 Canidae Carnivora
25 Canidae Carnivora
26 Plantae <NA>
27 Plantae <NA>
28 Plantae <NA>
29 Plantae <NA>
30 Plantae <NA>
31 Angiosperms Plantae
32 Angiosperms Plantae
33 Angiosperms Plantae
34 Angiosperms Plantae
35 Angiosperms Plantae
36 Dicotyledons Angiosperms
37 Dicotyledons Angiosperms
38 Dicotyledons Angiosperms
39 Dicotyledons Angiosperms
40 Dicotyledons Angiosperms
41 Rosales Dicotyledons
42 Solanales Dicotyledons
43 Solanales Dicotyledons
44 Solanales Dicotyledons
45 Solanales Dicotyledons
46 Rosaceae Rosales
47 Solanaceae Solanales
48 Solanaceae Solanales
49 Solanaceae Solanales
50 Solanaceae Solanales
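Basically, I am trying to get my data into a form I can use to make a Sankey diagram with this package (https://github.com/fbreitwieser/hiervis). I want to visualize the number of individual organisms of the different taxa in a given area. The data set has more than 40,000 observations.
Here is one way.
What you want is the original wide-format df stacked into a single column, plus a second column that is that same column lagged. The lag is the number of rows per rank after expanding by count, so each name lines up with the entry one rank above it.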
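# repeat each row 'count' times so every individual gets its own row
tmp <- wide_df[rep(row.names(wide_df), wide_df$count), ]
# stack the taxonomy columns (everything except 'count') into one long column
long_df <- stack(tmp[-6])
# the parent of each entry sits one rank, i.e. nrow(tmp) rows, earlier
long_df$parent <- dplyr::lag(long_df$values, sum(long_df$ind == "family"))
rm(tmp)
names(long_df)[1L] <- "name"
# drop the 'ind' column, which only records the source rank
long_df <- long_df[-2L]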
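If you would rather not depend on dplyr, the same stack-and-lag idea can be sketched in base R. This is only a minimal sketch: the names ranks, expanded and alt_df are made up for illustration, and it assumes R >= 4.0, where data.frame() keeps strings as character vectors.

```r
# Same sample data as in the question
wide_df <- data.frame(
  kingdom = c("Animalia", "Animalia", "Plantae", "Plantae"),
  phylum  = c("Chordata", "Chordata", "Angiosperms", "Angiosperms"),
  class   = c("Mammalia", "Mammalia", "Dicotyledons", "Dicotyledons"),
  order   = c("Carnivora", "Carnivora", "Rosales", "Solanales"),
  family  = c("Felidae", "Canidae", "Rosaceae", "Solanaceae"),
  count   = c(2, 3, 1, 4)
)

# Taxonomic ranks from most general to most specific (illustrative helper)
ranks <- c("kingdom", "phylum", "class", "order", "family")

# One row per individual: repeat each row 'count' times
expanded <- wide_df[rep(seq_len(nrow(wide_df)), wide_df$count), ranks]

alt_df <- data.frame(
  # all rank columns laid end to end, kingdom first
  name = unlist(expanded, use.names = FALSE),
  # the same vector shifted down by one rank: NA for kingdoms,
  # then every rank except the last one
  parent = c(rep(NA_character_, nrow(expanded)),
             unlist(expanded[-length(ranks)], use.names = FALSE))
)
```

Building parent by prepending one block of NA values is the same lag as above, just written out by hand.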
The result is identical() to the posted expected output, only in a different row order:
# check the result
i <- order(hierarchical_df$name)
j <- order(long_df$name)
tmp1 <- hierarchical_df[i, ]
tmp2 <- long_df[j, ]
row.names(tmp1) <- NULL
row.names(tmp2) <- NULL
identical(tmp1, tmp2)
#> [1] TRUE
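With 40,000+ observations you will not have a hand-written expected result to compare against. One possible sanity check, sketched here with made-up helper names (ranks, edges, totals), is to add up count per taxon straight from the wide data; the number of rows each name gets in the long result should match these totals.

```r
# Same sample data as in the question
wide_df <- data.frame(
  kingdom = c("Animalia", "Animalia", "Plantae", "Plantae"),
  phylum  = c("Chordata", "Chordata", "Angiosperms", "Angiosperms"),
  class   = c("Mammalia", "Mammalia", "Dicotyledons", "Dicotyledons"),
  order   = c("Carnivora", "Carnivora", "Rosales", "Solanales"),
  family  = c("Felidae", "Canidae", "Rosaceae", "Solanaceae"),
  count   = c(2, 3, 1, 4)
)

# Taxonomic ranks from most general to most specific (illustrative helper)
ranks <- c("kingdom", "phylum", "class", "order", "family")

# One (name, parent, count) triple per input row and rank,
# without replicating rows
edges <- do.call(rbind, lapply(seq_along(ranks), function(k) {
  data.frame(
    name   = wide_df[[ranks[k]]],
    parent = if (k == 1L) NA_character_ else wide_df[[ranks[k - 1L]]],
    count  = wide_df$count
  )
}))

# Individuals per taxon name, summed over the input rows
totals <- tapply(edges$count, edges$name, sum)
```

Comparing totals against table(long_df$name) gives a quick check that no rows were lost or duplicated during the reshape.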