I have a data frame named output.
For each distinct zipcode, I want the mode (the most frequently repeated value) of code, together with the count of unique patientID values that have that code.
I tried this:
ddply(output,~zipcode,summarize,max=mode(code))
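This code was meant to produce the mode of code for each distinct zipcode, but what I want is the mode of code across the distinct patientID values within each distinct zipcode.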
output = data.frame(
  code      = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode   = c(00718, 00718, 00718, 00718, 00718, 00719),
  city      = c("NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO")
)
My desired output:
  zipcode most_rep_code patient_count
1     718         E78.5             1
2     719         D64.9             1
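If I understand correctly, you need to find, for each zipcode, the combination of code and patientID with the highest frequency, in which case dplyr might be useful. I think you need to use the three columns above as grouping variables and then use summarise to get the count for each group. The group with the highest count is the mode, and the new column gives the count of that mode.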
# Your reprex data
output = data.frame(
  code      = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode   = c(00718, 00718, 00718, 00718, 00718, 00719),
  city      = c("NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO")
)
library(dplyr)
output %>%
  dplyr::group_by(patientID, code, zipcode) %>%
  dplyr::summarise(mode_freq = n())
# A tibble: 5 x 4
# Groups:   patientID, code [5]
  patientID code   zipcode mode_freq
  <fct>     <fct>    <dbl>     <int>
1 34423     E78.5      718         2
2 34423     I65.29     718         1
3 34423     N08        718         1
4 34424     Z68.29     718         1
5 34425     D64.9      719         1
I have included the dplyr:: prefix because I assume you have plyr loaded, in which case the function names would clash.
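As an aside on the original ddply attempt: base R's mode() reports an object's storage mode (for example "numeric" or "character"), not the most frequent value, so a small helper is needed for the statistical mode. The following is only a minimal sketch; stat_mode is a hypothetical name introduced here for illustration, and on ties it simply returns the first value in sorted order.

```r
# Hypothetical helper (not part of base R, plyr, or dplyr):
# returns the most frequent value of x as a character string.
stat_mode <- function(x) {
  counts <- table(x)                 # frequency of each distinct value
  names(counts)[which.max(counts)]   # first value with the highest count
}

# Same example data as in the question
output <- data.frame(
  code      = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode   = c(00718, 00718, 00718, 00718, 00718, 00719),
  city      = c("NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO")
)

# Most frequent code within each zipcode, using base R only
mode_by_zip <- tapply(output$code, output$zipcode, stat_mode)
```

Here mode_by_zip is a named array with one entry per zipcode, so the same helper could in principle be passed to other grouping tools as well.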
Update:
To get the suggested mode output, which by definition should be the row with the highest frequency:
output %>%
  dplyr::group_by(patientID, code, zipcode) %>%
  dplyr::summarise(mode_freq = n()) %>%
  dplyr::ungroup() %>%
  dplyr::group_by(zipcode) %>%
  dplyr::filter(mode_freq == max(mode_freq))
# A tibble: 2 x 4
# Groups:   zipcode [2]
  patientID code  zipcode mode_freq
  <fct>     <fct>   <dbl>     <int>
1 34423     E78.5     718         2
2 34425     D64.9     719         1
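If the exact three-column layout from the question (zipcode, most_rep_code, patient_count) is wanted, one possible variation is to group by zipcode and code only and count distinct patients alongside the frequency. This is a sketch under the assumption that patient_count means the number of distinct patientID values having the modal code within a zipcode; code_freq and result are names introduced here for illustration. Ties are kept, so a zipcode with several equally frequent codes yields several rows.

```r
library(dplyr)

# Same example data as in the question
output <- data.frame(
  code      = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode   = c(00718, 00718, 00718, 00718, 00718, 00719),
  city      = c("NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO", "NAGUABO")
)

result <- output %>%
  # one row per (zipcode, code) pair
  dplyr::group_by(zipcode, code) %>%
  dplyr::summarise(
    code_freq     = dplyr::n(),                   # how often the code occurs
    patient_count = dplyr::n_distinct(patientID)  # distinct patients with it
  ) %>%
  # keep the most frequent code(s) within each zipcode
  dplyr::group_by(zipcode) %>%
  dplyr::filter(code_freq == max(code_freq)) %>%
  dplyr::ungroup() %>%
  dplyr::select(zipcode, most_rep_code = code, patient_count)

print(result)
```

For the sample data this gives one row per zipcode: E78.5 with one patient for 718 and D64.9 with one patient for 719.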