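The data I'm trying to tidy has all the characteristics and characteristic categories in one column, and I don't know how to separate them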


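I have been at this for several hours and I can't figure it out.

I am using the following datasets from the U.S. Census Bureau's Household Pulse Survey. I picked the most recent data they have for each year: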

2022 - https://www2.census.gov/programs-surveys/demo/tables/hhp/2022/wk51/food1_week51.xlsx

2021 - https://www2.census.gov/programs-surveys/demo/tables/hhp/2021/wk40/food1_week40.xlsx

2020 - https://www2.census.gov/programs-surveys/demo/tables/hhp/2020/wk20/food1_week20.xlsx

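I first combined all three files to get one large data frame covering all three years. I also merged every sheet into that data frame, because I want to look at all the different regions. I added columns that tell me which region and year each row came from.

I used the following code: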

# bind_rows() comes from dplyr; readxl is called with the readxl:: prefix
library(dplyr)

# Listing file names
file_names <- c("food_sufficiency_2020.xlsx", "food_sufficiency_2021.xlsx", "food_sufficiency_2022.xlsx")

# Function to read and clean every sheet of one file
read_and_clean_sheet <- function(file) {

  # Extracting the year from the file name (keeps only the digits)
  year <- as.numeric(gsub("\\D", "", file))

  # Listing all sheets in the Excel file
  sheets <- readxl::excel_sheets(file)

  # Reading each sheet and cleaning it
  dfs <- lapply(sheets, function(sheet) {

    # Reading the sheet
    df <- readxl::read_excel(file, sheet = sheet)

    # Removing the first few rows that explain the sheet's purpose
    df <- df[-(1:5), ]

    # Renaming the first seven columns
    names(df)[1:7] <- c(
      "Select Characteristics",
      "Total",
      "Enough of the kinds of food wanted",
      "Enough food, but not always the kinds wanted",
      "Sometimes not enough to eat",
      "Often not enough to eat",
      "Did not report"
    )

    # Adding a new column for the sheet name
    df$Region <- sheet

    # Adding a new column for the year
    df$Year <- year

    # Returning the cleaned data frame
    df
  })

  # Combining all cleaned sheets into one data frame
  bind_rows(dfs)
}

# Reading and cleaning all sheets from all Excel files
all_data <- lapply(file_names, read_and_clean_sheet)

# Combining all data frames into one big data frame
final_df <- bind_rows(all_data)

# View() is the base R viewer; lowercase view() needs tibble or tidyverse attached
View(final_df)

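If you run the code, you will see that the "Select Characteristics" column holds both the characteristic category names and the characteristics that belong to them.

I want to split these into two different columns, one called "Characteristic Category" and the other called "Select Characteristics". For example, "Characteristic Category" would say "Age", and on the same row the "Select Characteristics" column would say "18-24". That would make the data much easier to work with, because you could see which group each row belongs to.

This is the output I get when I run my code: [image: the somewhat tidied data]

This is the original data, in case you can't open the links I posted: [image: snippit of og data in excel]. As you can see, the categories are in bold and the corresponding characteristics are listed underneath them. I want the categories and their characteristics in separate columns.

This is an example of what I would like the final product to look like (the two columns side by side): [image: example data frame]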

r excel dataframe
1 answer

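The rows that hold the characteristic headings don't contain any data, so we can use that fact to find the positions of those heading rows, and then use the positions to build a vector of characteristic headings. Here is an example using one of the data files: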

library(tidyverse)
library(readxl)

# Read and fix column names
x=read_excel("~/downloads/food1_week51.xlsx", skip=5)

x = x %>% 
  rename("Select Characteristics"=...1, Total=...2) %>% 
  filter(!grepl("^\\*", `Select Characteristics`))

# Add a "Total" characteristic
x$`Select Characteristics`[1] = "Total"

# Get row ranges for each characteristic
b = which(is.na(x$Total))
e = c(b[-1], nrow(x) + 1)

# Create a vector of characteristics
characteristic = x$`Select Characteristics`[b]
characteristic = rep(characteristic, e - b)
x$characteristic = characteristic

# Move characteristic to first column
x = x %>% relocate(characteristic)

# Remove characteristic heading rows (since they don't contain data)
x = x %>% filter(!is.na(Total))

x
#> # A tibble: 155 × 8
#>    characteristic `Select Characteristics`     Total Enough of the kinds of fo…¹
#>    <chr>          <chr>                        <dbl>                       <dbl>
#>  1 Total          Total                    252481011                   122834767
#>  2 Age            18 - 24                   24389118                    10671381
#>  3 Age            25 - 39                   65021039                    30456884
#>  4 Age            40 - 54                   63862495                    29086982
#>  5 Age            55 - 64                   44053480                    21963272
#>  6 Age            65 and above              55154878                    30656247
#>  7 Sex at birth   Male                     123010040                    62804115
#>  8 Sex at birth   Female                   129470971                    60030652
#>  9 Gender         Cisgender male           117980069                    61204194
#> 10 Gender         Cisgender female         124260885                    58704795
#> # ℹ 145 more rows
#> # ℹ abbreviated name: ¹​`Enough of the kinds of food wanted`
#> # ℹ 4 more variables: `Enough food, but not always the kinds wanted` <dbl>,
#> #   `Sometimes not enough to eat` <dbl>, `Often not enough to eat` <dbl>,
#> #   `Did not report` <chr>

Created on 2024-04-16 with reprex v2.1.0
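The same idea can be tried without downloading anything. The following is a minimal base R sketch, not part of the original answer: `toy` is made-up data standing in for one sheet, and `add_category()` is a hypothetical helper name. It assumes, as above, that a heading row is any row whose `Total` is `NA`. Unlike the answer's code, rows that appear before the first heading simply reuse their own label as the category.

```r
# Made-up stand-in for one sheet: heading rows have NA in the data column.
toy <- data.frame(
  label = c("Total", "Age", "18 - 24", "25 - 39",
            "Sex at birth", "Male", "Female"),
  Total = c(100, NA, 40, 60, NA, 45, 55),
  stringsAsFactors = FALSE
)

# Hypothetical helper: tag each row with the label of the nearest
# heading row above it, then drop the heading rows themselves.
add_category <- function(df, label_col = "label", value_col = "Total") {
  is_header <- is.na(df[[value_col]])

  # cumsum() is 0 before the first heading, then 1, 2, ... for each block
  block <- cumsum(is_header)
  headers <- df[[label_col]][is_header]

  # Rows before any heading (block == 0) fall back to their own label
  category <- ifelse(block == 0, df[[label_col]], headers[pmax(block, 1)])

  out <- data.frame(category = category, df,
                    stringsAsFactors = FALSE, check.names = FALSE)
  out <- out[!is_header, , drop = FALSE]
  rownames(out) <- NULL
  out
}

tidy <- add_category(toy)
print(tidy)
```

Because the helper takes the column names as arguments, it could in principle be called on each sheet inside the `lapply()` from the question, for example `add_category(df, "Select Characteristics", "Total")`, before the sheets are bound together. That usage is untested against the real files.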
