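I've been at this for hours and can't figure it out.
I'm using the following datasets from the U.S. Census Bureau's Household Pulse Survey. I picked the most recent data available for each year: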
2022 - https://www2.census.gov/programs-surveys/demo/tables/hhp/2022/wk51/food1_week51.xlsx
2021 - https://www2.census.gov/programs-surveys/demo/tables/hhp/2021/wk40/food1_week40.xlsx
2020 - https://www2.census.gov/programs-surveys/demo/tables/hhp/2020/wk20/food1_week20.xlsx
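I started by combining all three files into one large data frame so I can look at all three years together. I also merged every sheet into that data frame, because I want to look at all the different regions. I added columns that record which region and year each row comes from.
This is the code I used: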
library(dplyr)  # for bind_rows()

# Listing file names
file_names <- c("food_sufficiency_2020.xlsx", "food_sufficiency_2021.xlsx", "food_sufficiency_2022.xlsx")

# Function to read and clean every sheet of one file
read_and_clean_sheet <- function(file) {
  # Extracting the year from the file name
  year <- as.numeric(gsub("\\D", "", file))
  # Listing all sheets in the Excel file
  sheets <- readxl::excel_sheets(file)
  # Reading each sheet and cleaning it
  dfs <- lapply(sheets, function(sheet) {
    # Reading the sheet
    df <- readxl::read_excel(file, sheet = sheet)
    # Removing the first few rows that explain the sheet's purpose
    df <- df[-(1:5), ]
    # Renaming the columns
    names(df)[1] <- "Select Characteristics"
    names(df)[2] <- "Total"
    names(df)[3] <- "Enough of the kinds of food wanted"
    names(df)[4] <- "Enough food, but not always the kinds wanted"
    names(df)[5] <- "Sometimes not enough to eat"
    names(df)[6] <- "Often not enough to eat"
    names(df)[7] <- "Did not report"
    # Adding a new column for the sheet name
    df$Region <- sheet
    # Adding a new column for the year
    df$Year <- year
    # Returning the cleaned data frame
    df
  })
  # Combining all cleaned sheets into one data frame
  bind_rows(dfs)
}

# Reading and cleaning all sheets from all Excel files
all_data <- lapply(file_names, read_and_clean_sheet)
# Combining all data frames into one big data frame
final_df <- bind_rows(all_data)
View(final_df)
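If you run the code, you'll see that the "Select Characteristics" column holds both the characteristic category names and the characteristics themselves.
I want to split these into two columns, one called "Characteristic Category" and the other "Select Characteristics". For example, "Characteristic Category" would show "Age", and in the same row the "Select Characteristics" column would show "18-24". That would make the data easier to work with, because you could see which group each row belongs to.
In the raw data, the categories appear in bold with their corresponding characteristics listed beneath them. I want the categories and the corresponding characteristics in separate columns, side by side.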
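The rows that hold the characteristic headings contain no data, so we can use that fact to find the positions of those heading rows, and then use the positions to build a vector of characteristic headings. Here's an example using one of the data files: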
library(tidyverse)
library(readxl)
# Read and fix column names
x=read_excel("~/downloads/food1_week51.xlsx", skip=5)
x = x %>%
rename("Select Characteristics"=...1, Total=...2) %>%
filter(!grepl("^\\*", `Select Characteristics`))
# Add a "Total" characteristic
x$`Select Characteristics`[1] = "Total"
# Get row ranges for each characteristic
b = which(is.na(x$Total))
e = c(b[-1], nrow(x) + 1)
# Create a vector of characteristics
characteristic = x$`Select Characteristics`[b]
characteristic = rep(characteristic, e - b)
x$characteristic = characteristic
# Move characteristic to first column
x = x %>% relocate(characteristic)
# Remove characteristic heading rows (since they don't contain data)
x = x %>% filter(!is.na(Total))
x
#> # A tibble: 155 × 8
#>    characteristic `Select Characteristics`     Total Enough of the kinds of fo…¹
#>    <chr>          <chr>                        <dbl>                       <dbl>
#>  1 Total          Total                    252481011                   122834767
#>  2 Age            18 - 24                   24389118                    10671381
#>  3 Age            25 - 39                   65021039                    30456884
#>  4 Age            40 - 54                   63862495                    29086982
#>  5 Age            55 - 64                   44053480                    21963272
#>  6 Age            65 and above              55154878                    30656247
#>  7 Sex at birth   Male                     123010040                    62804115
#>  8 Sex at birth   Female                   129470971                    60030652
#>  9 Gender         Cisgender male           117980069                    61204194
#> 10 Gender         Cisgender female         124260885                    58704795
#> # ℹ 145 more rows
#> # ℹ abbreviated name: ¹`Enough of the kinds of food wanted`
#> # ℹ 4 more variables: `Enough food, but not always the kinds wanted` <dbl>,
#> #   `Sometimes not enough to eat` <dbl>, `Often not enough to eat` <dbl>,
#> #   `Did not report` <chr>
Created on 2024-04-16 with reprex v2.1.0
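
Since the question combines many sheets and files, it may help to wrap the same idea in a function that can be called once per sheet. The sketch below is one possible way to do that, not the only one. It swaps the `rep()` row-range approach for `tidyr::fill()`, which carries each heading down to the rows beneath it. The helper name `add_characteristic()` and the toy data are made up for illustration (the toy counts are copied from the output above), and the sketch assumes heading rows are exactly the rows where `Total` is `NA`.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: label each data row with the heading row above it.
# Assumption: heading rows are exactly the rows where `Total` is NA.
add_characteristic <- function(df) {
  df %>%
    # Copy the heading text into a new column on heading rows only
    mutate(characteristic = if_else(is.na(Total),
                                    `Select Characteristics`,
                                    NA_character_)) %>%
    # Carry each heading down to the data rows below it
    fill(characteristic, .direction = "down") %>%
    # Drop the heading rows, which hold no data
    filter(!is.na(Total)) %>%
    # Move the new column to the front
    relocate(characteristic)
}

# Toy data that mimics the sheet layout: heading rows have no Total
toy <- tibble(
  `Select Characteristics` = c("Age", "18 - 24", "25 - 39",
                               "Sex at birth", "Male", "Female"),
  Total = c(NA, 24389118, 65021039, NA, 123010040, 129470971)
)

result <- add_characteristic(toy)
result
```

In the real file the first heading row has no text, so it would still need to be set to "Total" before calling the helper, as in the code above. The helper could then be called inside the question's `lapply()` over sheets, assuming every sheet and year follows the same layout, which I haven't checked for the 2020 and 2021 files.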