Advanced data wrangling: how to handle odd formatting and blanks in historical Excel datasets using R


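I inherited a 20-year study covering more than 100 sampling sites, with multiple plots within each site. Each site was entered into its own Excel file, with a new tab for each year of sampling. There is some header data that needs to be brought into the data frame. In addition, instead of entering repeated data (for example, multiple observations for a single date or time period), they left those cells blank. Here is a toy example: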

data1<-structure(list(X = c("", "", "", "Year", "Plot number 1", "", 
                            "", "", "", "10 mins", "", "", "", "", "", "", "", "", "", "", 
                            "", "Plot number 2", "", "", "", "", "", "", "", "", "", "10 mins", 
                            "", ""), X.1 = c("", "", "", "2010", "DateJulian", "141", "", 
                                             "", "", "", "", "", "", "", "", "", "", "", "", "", "", "Date - Julian", 
                                             "141", "", "", "", "", "", "", "", "", "", "", ""), Rocky.Ridge = c("Site B", 
                                                                                                                 "6 plots", "", "", "Species Code", "BBBB", "DDDD", "CCCC", "BBBB", 
                                                                                                                 "AAAA", "DDDD", "BBBB", "AAAA", "BBBB", "BBBB", "BBBB", "DDDD", 
                                                                                                                 "AAAA", "DDDD", "CCCC", "", "Species Code", "BBBB", "DDDD", "BBBB", 
                                                                                                                 "AAAA", "BBBB", "BBBB", "DDDD", "CCCC", "BBBB", "DDDD", "AAAA", 
                                                                                                                 "DDDD"), X.2 = c("", "", "", "", "audio", "2", "", "", "2", "1", 
                                                                                                                                  "2", "3", "2", "2", "2", "1", "1", "", "2", "1", "", "audio", 
                                                                                                                                  "2", "2", "3", "2", "2", "2", "", "", "1", "1", "", "2"), X.3 = c("", 
                                                                                                                                                                                                    "spring summer", "", "", "visual", "1", "1", "1", "", "", "", 
                                                                                                                                                                                                    "", "", "", "", "", "", "1", "", "", "", "visual", "1", "", "", 
                                                                                                                                                                                                    "", "", "", "1", "1", "", "", "1", "")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                                 -34L))

I am looking for a way to format these data so that they are arranged in a single table, like this:

data_clean<-structure(list(Study.Area = c("Rocky Ridge", "Rocky Ridge", "Rocky Ridge", 
                                          "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", 
                                          "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", 
                                          "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", 
                                          "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", 
                                          "Rocky Ridge", "Rocky Ridge", "Rocky Ridge", "Rocky Ridge"), 
                           Site = c("Site B", "Site B", "Site B", "Site B", "Site B", 
                                    "Site B", "Site B", "Site B", "Site B", "Site B", "Site B", 
                                    "Site B", "Site B", "Site B", "Site B", "Site B", "Site B", 
                                    "Site B", "Site B", "Site B", "Site B", "Site B", "Site B", 
                                    "Site B", "Site B", "Site B", "Site B"), Year = c(2010L, 
                                                                                      2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
                                                                                      2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
                                                                                      2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L), 
                           Plot.number = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                           1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                                           2L), DateJulian = c(141L, 141L, 141L, 141L, 141L, 141L, 141L, 
                                                               141L, 141L, 141L, 141L, 141L, 141L, 141L, 141L, 141L, 141L, 
                                                               141L, 141L, 141L, 141L, 141L, 141L, 141L, 141L, 141L, 141L
                                           ), Time = c(5L, 5L, 5L, 5L, 10L, 10L, 10L, 10L, 10L, 10L, 
                                                       10L, 10L, 10L, 10L, 10L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
                                                       5L, 10L, 10L, 10L), Species.Code = c("BBBB", "DDDD", "CCCC", 
                                                                                            "BBBB", "AAAA", "DDDD", "BBBB", "AAAA", "BBBB", "BBBB", "BBBB", 
                                                                                            "DDDD", "AAAA", "DDDD", "CCCC", "BBBB", "DDDD", "BBBB", "AAAA", 
                                                                                            "BBBB", "BBBB", "DDDD", "CCCC", "BBBB", "DDDD", "AAAA", "DDDD"
                                                       ), audio = c(2L, 0L, 0L, 2L, 1L, 2L, 3L, 2L, 2L, 2L, 1L, 
                                                                    1L, 0L, 2L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 0L, 0L, 1L, 1L, 0L, 
                                                                    2L), visual = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
                                                                                    0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 
                                                                                    0L)), class = "data.frame", row.names = c(NA, -27L))


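I would like to do this in R, but I'm not sure whether there is any reasonable way to automate it, or whether I will have to manually fill in the values that were left blank in Excel. I've tried playing around in the tidyverse but haven't found a fill function or a way to "move" data from the headers into the data frame. Any help, even if it's just to say I'm better off re-entering the data, is appreciated.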

r excel data-manipulation data-wrangling
1 Answer

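The general idea for problems like this is to pull every value out of the "header" area into its own column and essentially rebuild a rectangular data frame. Before dropping the extra header rows, I reference them by position and create new columns from them. Once all the header content is in columns, we can use fill() to complete the columns and then filter out the header/label rows. Note that fill() needs NA values rather than empty strings "" to work, which is why I use na_if for DateJulian.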
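
To see the na_if/fill() step in isolation, here is a minimal sketch on a made-up two-column table (the data frame toy and its values are hypothetical, not taken from the question). It only shows that blanks must be turned into NA before fill() will carry the last entered value down:

```r
library(dplyr)
library(tidyr)

# Hypothetical mini-sheet: the date is typed once, then left blank below it.
toy <- data.frame(
  date    = c("141", "", "", "152", ""),
  species = c("AAAA", "BBBB", "CCCC", "AAAA", "DDDD"),
  stringsAsFactors = FALSE
)

filled <- toy %>%
  mutate(date = na_if(date, "")) %>%  # "" -> NA so fill() can see the gaps
  fill(date)                          # carry the last non-missing value down

print(filled)
```

Without the na_if step, fill() treats "" as a real value and leaves the column unchanged.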

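If the position of something like Study Area is not consistent between files, you can use regular expressions or string logic to find those values instead of looking in the same position every time.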
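
As a rough illustration of that idea, the sketch below looks up header values by pattern instead of by fixed row number. The helper value_next_to() and the small sheet data frame are hypothetical names made up for this example, and it assumes each label sits directly to the left of its value, as "Year" does in the toy data:

```r
# Hypothetical helper: find the first cell in column `label_col` matching
# `label`, and return the value `offset` columns to its right.
value_next_to <- function(df, label, label_col = 1, offset = 1) {
  hit <- grep(label, df[[label_col]])[1]  # first matching row, NA if none
  if (is.na(hit)) return(NA_character_)
  as.character(df[[label_col + offset]][hit])
}

# Made-up sheet where the header rows are not where the toy example had them.
sheet <- data.frame(
  X   = c("", "", "Year", "Plot number 1"),
  X.1 = c("", "", "2010", "DateJulian"),
  X.2 = c("", "Site B", "", "Species Code"),
  stringsAsFactors = FALSE
)

# Year: the cell to the right of the "Year" label, wherever that row is.
year <- value_next_to(sheet, "^Year$")

# Site: the first cell anywhere in the sheet that starts with "Site ".
site <- unname(grep("^Site ", unlist(sheet), value = TRUE)[1])
```

The patterns would need adjusting to whatever labels the real files use; the point is only that grep()/grepl() can stand in for hard-coded positions such as .[4,2].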

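library(tidyverse)

data1 %>% 
    mutate(
        # Header values, pulled in by position and recycled down every row
        Study.Area = gsub("\\.", " ",names(.)[3]), 
        Site = .[1,3], 
        Year = .[4,2],
        # Data columns, copied under their final names
        Species.Code = .[,3],
        audio = X.2, 
        visual = X.3, 
        # Blank strings become NA so fill() can carry values down
        DateJulian = na_if(X.1,""), 
        Plot = ifelse(grepl("Plot", X), X, NA)) %>% 
    fill(Plot, DateJulian) %>% 
    # Drop the per-plot label rows, the rows above the first plot,
    # and the blank separator row between plots
    filter(!grepl("Plot number", X), !is.na(Plot), Species.Code != "") %>% 
    # Drop the raw X* columns, then the original third column (now first)
    select(-starts_with("X")) %>%
    select(-1)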


    Study.Area   Site Year Species.Code audio visual DateJulian          Plot
1  Rocky Ridge Site B 2010         BBBB     2      1        141 Plot number 1
2  Rocky Ridge Site B 2010         DDDD            1        141 Plot number 1
3  Rocky Ridge Site B 2010         CCCC            1        141 Plot number 1
4  Rocky Ridge Site B 2010         BBBB     2               141 Plot number 1
5  Rocky Ridge Site B 2010         AAAA     1               141 Plot number 1
6  Rocky Ridge Site B 2010         DDDD     2               141 Plot number 1
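
The output above still lacks the Time column, and blank counts are empty strings rather than 0. The sketch below is one way the same approach could be extended toward the layout asked for in the question. It is not part of the original answer: the function clean_sheet(), the internal column names, and the miniature sheet are made up for illustration. It assumes exactly five columns in the order of data1, and it assumes each plot starts in a 5-minute period until a "10 mins" label appears, which is inferred from the Time values in the desired output.

```r
library(dplyr)
library(tidyr)

# Hypothetical cleaner for one sheet laid out like data1 (five columns).
clean_sheet <- function(sheet) {
  study_area <- gsub("\\.", " ", names(sheet)[3])  # e.g. "Rocky Ridge"
  names(sheet) <- c("label", "date", "species", "audio", "visual")
  site <- sheet$species[1]
  year <- as.integer(sheet$date[which(sheet$label == "Year")[1]])

  sheet %>%
    mutate(
      is_header   = grepl("^Plot number", label),
      Plot.number = ifelse(is_header, sub("^Plot number\\s*", "", label), NA),
      # Assumption: a plot starts at 5 minutes; "10 mins" switches the period.
      Time        = case_when(is_header ~ 5L,
                              grepl("^10 mins", label) ~ 10L),
      DateJulian  = if_else(is_header, NA_character_, na_if(date, ""))
    ) %>%
    fill(Plot.number, Time, DateJulian) %>%
    # Drop label rows, rows above the first plot, and blank separator rows.
    filter(!is_header, !is.na(Plot.number), species != "") %>%
    transmute(
      Study.Area   = study_area,
      Site         = site,
      Year         = year,
      Plot.number  = as.integer(Plot.number),
      DateJulian   = as.integer(DateJulian),
      Time,
      Species.Code = species,
      # Blank counts are treated as 0, as in the desired output.
      audio        = coalesce(as.integer(na_if(audio, "")), 0L),
      visual       = coalesce(as.integer(na_if(visual, "")), 0L)
    )
}

# Miniature sheet with the same layout as data1 (values are made up).
mini <- data.frame(
  X           = c("", "Year", "Plot number 1", "", "10 mins", "",
                  "Plot number 2", "", "10 mins"),
  X.1         = c("", "2010", "DateJulian", "141", "", "",
                  "Date - Julian", "141", ""),
  Rocky.Ridge = c("Site B", "", "Species Code", "BBBB", "AAAA", "",
                  "Species Code", "DDDD", "CCCC"),
  X.2         = c("", "", "audio", "2", "", "", "audio", "", "1"),
  X.3         = c("", "", "visual", "", "1", "", "visual", "1", ""),
  stringsAsFactors = FALSE
)

clean <- clean_sheet(mini)
print(clean)
```

Since the function takes one sheet and returns one tidy data frame, it could in principle be applied to each tab of each file and the results row-bound together, but it would need checking against the real sheets first, since any deviation from this layout would break the positional assumptions.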