Dataframe manipulation using ddply

Question (votes: 2, answers: 1)

I have a dataframe named output.

For each distinct zipcode, I want to generate the mode (most repeated value) of code, together with the count of unique patientIDs that have that code.

I tried this:

ddply(output,~zipcode,summarize,max=mode(code))

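This code would generate the mode of code for each distinct zipcode... but I want the mode of code computed across the distinct patientIDs within each distinct zipcode.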
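A note on the attempt above: base R's `mode()` reports the storage mode of an object (such as "character" or "numeric"), not its most frequent value, so the `ddply` call cannot return the most repeated code. A minimal sketch of one workaround follows. The helper `stat_mode` is a hypothetical name introduced here, not part of base R or plyr, and it breaks ties by taking the first value in `table()` order.

```r
library(plyr)

output <- data.frame(
  code      = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode   = c(718, 718, 718, 718, 718, 719)
)

# Hypothetical helper: statistical mode (most frequent value) of a vector.
# as.character() makes it behave the same for factors and character vectors.
stat_mode <- function(x) {
  counts <- table(as.character(x))
  names(counts)[which.max(counts)]
}

# One row per zipcode holding its most frequent code
by_zip <- plyr::ddply(output, "zipcode", function(d) {
  data.frame(most_rep_code = stat_mode(d$code))
})
by_zip
```

Passing an anonymous function to `ddply` instead of `summarize` keeps the helper lookup independent of where the code is run.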

output=data.frame(code=c("E78.5","N08","E78.5","I65.29","Z68.29","D64.9"),patientID=c("34423","34423","34423","34423","34424","34425"),zipcode=c(00718,00718,00718,00718,00718,00719),city=c("NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO"))

Desired output:
zipcode most_rep_code patient_count
1     718         E78.5             1
2     719         D64.9             1
r dataframe plyr
1 Answer
Votes: 0

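If I understand correctly, you need to find which code occurs most often for each patientID and zipcode, in which case dplyr may be useful. I think you need to use all three columns (patientID, code, zipcode) as grouping variables and then use summarise to get the row count for each group. The highest count is the mode; the new column gives the count for each group.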

# Your reprex data
output=data.frame(code=c("E78.5","N08","E78.5","I65.29","Z68.29","D64.9"),patientID=c("34423","34423","34423","34423","34424","34425"),zipcode=c(00718,00718,00718,00718,00718,00719),city=c("NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO")) 

library(dplyr)
output %>% 
  dplyr::group_by(patientID, code, zipcode) %>% 
  dplyr::summarise(mode_freq = n())

# A tibble: 5 x 4
# Groups:   patientID, code [5]
  patientID code   zipcode mode_freq
  <fct>     <fct>    <dbl>     <int>
1 34423     E78.5      718         2
2 34423     I65.29     718         1
3 34423     N08        718         1
4 34424     Z68.29     718         1
5 34425     D64.9      719         1

I've included the dplyr:: prefix because I assume you have plyr loaded, in which case the function names would conflict.
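As a side note, the same tally can probably be written more compactly with `dplyr::count()`. This is only a sketch of an equivalent form, assuming a dplyr version that supports the `name` argument. The explicit prefix matters here as well, since plyr exports its own `count()`.

```r
library(dplyr)

output <- data.frame(
  code      = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode   = c(718, 718, 718, 718, 718, 719)
)

# count() groups by the listed columns, tallies the rows in each group,
# and returns an ungrouped result with the tally stored in `mode_freq`
freq <- dplyr::count(output, patientID, code, zipcode, name = "mode_freq")
freq
```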

Update:

To get the mode output you suggested, which by definition should be the group with the highest frequency:

output %>% 
  dplyr::group_by(patientID, code, zipcode) %>% 
  dplyr::summarise(mode_freq = n()) %>%
  dplyr::ungroup() %>% 
  dplyr::group_by(zipcode) %>% 
  dplyr::filter(mode_freq == max(mode_freq))

# A tibble: 2 x 4
# Groups:   zipcode [2]
  patientID code  zipcode mode_freq
  <fct>     <fct>   <dbl>     <int>
1 34423     E78.5     718         2
2 34425     D64.9     719         1
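The result above still carries patientID rather than the patient_count column from the question. A minimal sketch that aims for the requested layout (zipcode, most_rep_code, patient_count) follows. It assumes dplyr 1.0.0 or later for `slice_max()` and the `.groups` argument, and it assumes ties should be broken by keeping the first code, which the question does not specify. The name `code_freq` is introduced here only for illustration.

```r
library(dplyr)

output <- data.frame(
  code      = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode   = c(718, 718, 718, 718, 718, 719)
)

result <- output %>%
  # Tally rows and distinct patients for every (zipcode, code) pair
  dplyr::group_by(zipcode, code) %>%
  dplyr::summarise(
    code_freq     = dplyr::n(),
    patient_count = dplyr::n_distinct(patientID),
    .groups       = "drop"
  ) %>%
  # Within each zipcode keep the single most frequent code
  dplyr::group_by(zipcode) %>%
  dplyr::slice_max(code_freq, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup() %>%
  dplyr::select(zipcode, most_rep_code = code, patient_count)

result
```

For the sample data this should give E78.5 for zipcode 718 and D64.9 for zipcode 719, each with one patient, which matches the table in the question.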