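I'm a beginner in R. I'm doing some data manipulation exercises with dplyr and ran into something I don't understand.
The exercise I'm working on uses the Titanic training dataset from the tidyverse. The task is to impute the sex-specific mean age for every passenger observation whose age is NA.
This is the snippet that works: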
`#' Exercise 6.5
#' Use the case_when function to create a new column in the titanic dataset called imputed_age_of_passenger.
#' In this column we should have, wherever the value of the sex_of_passenger is “male” and age of passenger value is
#' missing the imputed value should be the mean age_of_passenger of only the male passengers. wherever the value
#' sex_of_passenger is “female” and age of passenger value is missing the imputed value should be imputed with the mean
#' age_of_passenger of only the female passengers. Otherwise, take the value of the age_of_passenger.
mean_age_male <- titanic %>%
  filter(sex_of_passenger == "male") %>%
  pull(age_of_passenger) %>%
  mean(na.rm = TRUE)
mean_age_female <- titanic %>%
  filter(sex_of_passenger == "female") %>%
  pull(age_of_passenger) %>%
  mean(na.rm = TRUE)
titanic <- titanic %>%
  mutate(imputed_age_of_passenger = case_when(
    sex_of_passenger == "male" & is.na(age_of_passenger) ~ mean_age_male,
    sex_of_passenger == "female" & is.na(age_of_passenger) ~ mean_age_female,
    TRUE ~ age_of_passenger))`
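I tried to make the code smoother and more concise. I don't see the point of keeping two helper variables in memory for the per-sex means, since I need them only to create the new column in the dataset. So I tried to do everything in a single pipeline, like this: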
titanic <- titanic %>%
mutate(imputed_age_of_passenger = case_when(
sex_of_passenger == "male" & is.na(age_of_passenger) ~ mean(filter(sex_of_passenger == "male")$age_of_passenger, na.rm = TRUE),
sex_of_passenger == "female" & is.na(age_of_passenger) ~ mean(filter(sex_of_passenger == "female")$age_of_passenger, na.rm = TRUE)
TRUE ~ age_of_passenger))
However, I get the following error:
Error: unexpected numeric constant in: " sex_of_passenger == "female" & is.na(age_of_passenger) ~ mean(filter(sex_of_passenger == "female")$age_of_passenger, na.rm = TRUE) TRUE"
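Similarly, I tried to simplify the definition of the two helper variables a little by combining pull() and mean() on one line instead of chaining them as separate pipeline steps, like this: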
mean_age_male <- titanic %>%
filter(sex_of_passenger == "male") %>%
mean(pull(age_of_passenger), na.rm = TRUE)
However, although the above does run, it stores NA in the mean_age_male variable and shows the following warning:
Warning message:
In mean.default(., na.rm = TRUE) :
argument is not numeric or logical: returning NA
Can someone tell me why neither of the snippets above works as expected? Thanks in advance!
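For the first snippet, I think the problem is the way you try to filter the data while assigning values. filter() expects a data frame as its first argument, so it cannot be called on a bare condition inside mutate(). The error message itself comes from the missing comma at the end of the "female" line, just before TRUE ~ age_of_passenger.
You can try subsetting the column directly instead, using the pattern df$column_to_average[df$filter_column == "filter value"]:
titanic <- titanic %>%
  mutate(imputed_age_of_passenger = case_when(
    sex_of_passenger == "male" & is.na(age_of_passenger) ~ mean(titanic$age_of_passenger[titanic$sex_of_passenger == "male"], na.rm = TRUE),
    sex_of_passenger == "female" & is.na(age_of_passenger) ~ mean(titanic$age_of_passenger[titanic$sex_of_passenger == "female"], na.rm = TRUE),
    TRUE ~ age_of_passenger))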
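A minimal sketch of the same single-pipeline idea, run on a small made-up data frame (`titanic_toy` and its values are assumptions for illustration, not the real dataset). Inside mutate() the columns are visible by name, so the subset can be written without repeating the data frame name:

```r
library(dplyr)

# Toy stand-in for the real data; only the column names come from the question.
titanic_toy <- data.frame(
  sex_of_passenger = c("male", "male", "male", "female", "female", "female"),
  age_of_passenger = c(20, 40, NA, 30, NA, 50)
)

titanic_toy <- titanic_toy %>%
  mutate(imputed_age_of_passenger = case_when(
    # Missing male age: mean over the male rows only
    sex_of_passenger == "male" & is.na(age_of_passenger) ~
      mean(age_of_passenger[sex_of_passenger == "male"], na.rm = TRUE),
    # Missing female age: mean over the female rows only
    sex_of_passenger == "female" & is.na(age_of_passenger) ~
      mean(age_of_passenger[sex_of_passenger == "female"], na.rm = TRUE),
    # Every case, including this one, is separated by a comma
    TRUE ~ age_of_passenger
  ))

print(titanic_toy)
```

On this toy data the two missing ages are filled with the mean of the observed ages for the same sex, and all other rows keep their original age.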
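If you want to avoid storing the mean ages in variables, you can compute the mean age per group and then assign the values you need, as follows:
titanic %>%
  group_by(sex_of_passenger) %>%
  summarise(Avg_Age = mean(age_of_passenger, na.rm = TRUE))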
This gives you two mean age values, one row per sex.
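Since group_by() already splits the rows by sex, a grouped mutate() can fill in the missing ages directly, with no helper variables and no values copied by hand. A minimal sketch, assuming a made-up data frame `titanic_toy` (not the real dataset) and using if_else() in place of case_when():

```r
library(dplyr)

# Toy stand-in for the real data; only the column names come from the question.
titanic_toy <- data.frame(
  sex_of_passenger = c("male", "male", "male", "female", "female", "female"),
  age_of_passenger = c(20, 40, NA, 30, NA, 50)
)

titanic_toy <- titanic_toy %>%
  group_by(sex_of_passenger) %>%
  # Within each sex group, mean() sees only that group's ages
  mutate(imputed_age_of_passenger = if_else(
    is.na(age_of_passenger),
    mean(age_of_passenger, na.rm = TRUE),
    age_of_passenger
  )) %>%
  ungroup()

print(titanic_toy)
```

This would also cover any additional group value without a new condition, which may or may not be what the exercise wants, since it asks for case_when explicitly.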
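You can then assign the values manually, as follows. Note that for character values the summary lists the groups in alphabetical order, so the "female" row comes first and the "male" row second. Here avg_age_male and avg_age_female are placeholders for the two Avg_Age values from that summary:
titanic$imputed_age_of_passenger <- titanic$age_of_passenger  # start from the observed ages
titanic$imputed_age_of_passenger[is.na(titanic$age_of_passenger) & titanic$sex_of_passenger == "male"] <- avg_age_male  # Avg_Age from the "male" row
titanic$imputed_age_of_passenger[is.na(titanic$age_of_passenger) & titanic$sex_of_passenger == "female"] <- avg_age_female  # Avg_Age from the "female" row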
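As for the second snippet in the question, the pipe always passes its left-hand side as the first argument of the next call. So `... %>% mean(pull(age_of_passenger), na.rm = TRUE)` hands the filtered data frame itself to mean(), which is why it warns and returns NA. A minimal sketch of two ways around this, again on a made-up data frame (`titanic_toy` and the name `mean_age_male_alt` are assumptions for illustration):

```r
library(dplyr)

# Toy stand-in for the real data; only the column names come from the question.
titanic_toy <- data.frame(
  sex_of_passenger = c("male", "male", "male", "female", "female", "female"),
  age_of_passenger = c(20, 40, NA, 30, NA, 50)
)

# Option 1: keep pull() as its own step, so mean() receives a numeric vector
mean_age_male <- titanic_toy %>%
  filter(sex_of_passenger == "male") %>%
  pull(age_of_passenger) %>%
  mean(na.rm = TRUE)

# Option 2: wrap the call in braces and refer to the piped data as `.`;
# the braces stop the pipe from also inserting the data as the first argument
mean_age_male_alt <- titanic_toy %>%
  filter(sex_of_passenger == "male") %>%
  { mean(pull(., age_of_passenger), na.rm = TRUE) }

print(c(mean_age_male, mean_age_male_alt))
```

Option 1 is the form the question already had working; Option 2 is shorter by one line but arguably harder to read.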