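I have a dataset of code-string pairs. In other words, there is one column of codes and a corresponding column of strings that describe those codes.
The problem is threefold...
A made-up example: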
+-------+------------------+
| CODE | STRING |
+-------+------------------+
| A1 | broken bones |
| A1 | broken bones |
| NA | broken bones |
| A1 | bones, broken |
| A1 | bones, fracture |
| A1 | NA |
| B1 | red blood cells |
| B1 | red blood cells |
| B1 | blood cells, red|
| B1 | NA |
| B1 | erythrocytes |
| NA | broken bones |
| C1 | liver disease |
| C1 | liver disease |
| C1 | hepatic illness |
| C1 | NA |
| C1 | disease, liver |
| NA | liver disease |
+-------+------------------+
My question is...
Here is one approach:
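# Return the most frequent non-missing value of `string` among rows where `code == level`.
get_commonest <- function(level, code, string)
{
  names(rev(sort(table(na.omit(string[code == level])))))[1]
}
# Distinct non-missing codes and strings.
codes <- na.omit(unique(df$CODE))
strings <- na.omit(unique(df$STRING))
# Lookup tables: row names are the keys, the single column holds the commonest match.
default_strings <- as.data.frame(sapply(codes, get_commonest, df$CODE, df$STRING))
default_codes <- as.data.frame(sapply(strings, get_commonest, df$STRING, df$CODE))
# Fill each missing value from the lookup table keyed by the other column.
df$CODE[is.na(df$CODE)] <- as.character(default_codes[df$STRING[is.na(df$CODE)],])
df$STRING[is.na(df$STRING)] <- as.character(default_strings[df$CODE[is.na(df$STRING)],])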
This gives you the following result:
df
#> CODE STRING
#> 2 A1 broken bones
#> 3 A1 broken bones
#> 4 A1 broken bones
#> 5 A1 bones, broken
#> 6 A1 bones, fracture
#> 7 A1 broken bones
#> 8 B1 red blood cells
#> 9 B1 red blood cells
#> 10 B1 blood cells, red
#> 11 B1 red blood cells
#> 12 B1 erythrocytes
#> 13 A1 broken bones
#> 14 C1 liver disease
#> 15 C1 liver disease
#> 16 C1 hepatic illness
#> 17 C1 liver disease
#> 18 C1 disease, liver
#> 19 C1 liver disease
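To make the counting step inside `get_commonest` easier to follow, here is a minimal sketch that unrolls it for a single code. The toy vectors `code` and `string` are hypothetical and only mirror the `A1` rows of the example; `sort(..., decreasing = TRUE)` is used in place of `rev(sort(...))`, which should pick the same top entry here.

```r
# Hypothetical toy vectors mirroring the A1 rows of the example data.
code   <- c("A1", "A1", NA, "A1", "A1", "A1")
string <- c("broken bones", "broken bones", "broken bones",
            "bones, broken", "bones, fracture", NA)

# Keep the strings recorded against "A1". Rows where code is NA index as NA,
# and na.omit() drops those along with the genuinely missing string.
candidates <- na.omit(string[code == "A1"])

# Count each distinct string and put the largest count first.
counts <- sort(table(candidates), decreasing = TRUE)

# The name of the first entry is the most frequent string.
commonest <- names(counts)[1]
commonest
#> [1] "broken bones"
```

If two strings are tied for the highest count, which one comes first depends on the sort order, so it may be worth deciding on an explicit tie-breaking rule for real data.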
The data from the question in a reproducible format:
df <- structure(list(CODE = c("A1", "A1", NA, "A1", "A1", "A1", "B1",
"B1", "B1", "B1", "B1", NA, "C1", "C1", "C1", "C1", "C1", NA),
STRING = c("broken bones", "broken bones", "broken bones",
"bones, broken", "bones, fracture", NA, "red blood cells",
"red blood cells", "blood cells, red", NA, "erythrocytes",
"broken bones", "liver disease", "liver disease", "hepatic illness",
NA, "disease, liver", "liver disease")), row.names = 2:19, class = "data.frame")
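One edge case that may matter on real data: a string that never appears alongside a code has no row in the lookup table, so it cannot be filled. The sketch below illustrates this with a hypothetical two-row lookup table shaped like `default_codes`; the column name `code` and the string "unknown condition" are made up for illustration.

```r
# Hypothetical lookup table: row names are strings, the single column holds codes.
default_codes <- data.frame(code = c("A1", "C1"),
                            row.names = c("broken bones", "liver disease"))

# Look up one known string and one that was never recorded with a code.
filled <- as.character(default_codes[c("broken bones", "unknown condition"), ])

# The unmatched string comes back as NA, so that row would stay missing.
is.na(filled)
#> [1] FALSE  TRUE
```

A final `anyNA(df)` check after filling is a cheap way to see whether any such rows remain.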