In R, is it possible to impute strings from their corresponding codes, and vice versa?


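I have a data set of code-string pairs. In other words, there is a column of codes and a corresponding column of strings that describe those codes.

The problem is threefold...

  • Problem 1: Sometimes the code is missing but the string is present.
  • Problem 2: Sometimes the code is present but the string is missing.
  • Problem 3: Sometimes the strings for the same code differ but mean the same thing ("synonyms").

Made-up example: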

 +-------+------------------+
 |  CODE |  STRING          |
 +-------+------------------+
 |   A1  |  broken bones    |
 |   A1  |  broken bones    |
 |   NA  |  broken bones    |
 |   A1  |  bones, broken   |
 |   A1  |  bones, fracture |
 |   A1  |  NA              |
 |   B1  |  red blood cells |
 |   B1  |  red blood cells |
 |   B1  |  blood cells, red|
 |   B1  |  NA              |
 |   B1  |  erythrocytes    |
 |   NA  |  broken bones    |
 |   C1  |  liver disease   |
 |   C1  |  liver disease   |
 |   C1  |  hepatic illness |
 |   C1  |  NA              |
 |   C1  |  disease, liver  |
 |   NA  |  liver disease   |
 +-------+------------------+ 

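My questions are...

  1. If the code is present, can the string be imputed? And vice versa?
  2. If the strings for a code differ but recur (for example, the liver disease variants), can they be imputed as well?
  3. If so, is there an R package that can do this kind of imputation?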
r string package imputation
1 Answer

Here is one approach:

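# Return the most frequent non-missing value of `string` among the rows
# where `code == level`.
get_commonest <- function(level, code, string)
{
  names(rev(sort(table(na.omit(string[code == level])))))[1]
}

# Distinct non-missing codes and strings
# (`df` is the data frame given at the end of this answer).
codes <- na.omit(unique(df$CODE))
strings <- na.omit(unique(df$STRING))

# Lookup tables: commonest string for each code, commonest code for each string.
default_strings <- as.data.frame(sapply(codes, get_commonest, df$CODE, df$STRING))
default_codes <- as.data.frame(sapply(strings, get_commonest, df$STRING, df$CODE))

# Fill missing codes from their strings, then missing strings from their codes.
df$CODE[is.na(df$CODE)] <- as.character(default_codes[df$STRING[is.na(df$CODE)],])
df$STRING[is.na(df$STRING)] <- as.character(default_strings[df$CODE[is.na(df$STRING)],])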

Which gives you this result:

df
#>    CODE           STRING
#> 2    A1     broken bones
#> 3    A1     broken bones
#> 4    A1     broken bones
#> 5    A1    bones, broken
#> 6    A1  bones, fracture
#> 7    A1     broken bones
#> 8    B1  red blood cells
#> 9    B1  red blood cells
#> 10   B1 blood cells, red
#> 11   B1  red blood cells
#> 12   B1     erythrocytes
#> 13   A1     broken bones
#> 14   C1    liver disease
#> 15   C1    liver disease
#> 16   C1  hepatic illness
#> 17   C1    liver disease
#> 18   C1   disease, liver
#> 19   C1    liver disease
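
Note that the result above fills the missing values (problems 1 and 2) but leaves the synonyms as they are, for example "bones, broken" in row 5. If you also want to collapse the synonyms (problem 3), one option is to overwrite every string with the most frequent string for its code. The following is only a minimal sketch under that assumption; the `demo` data frame, the `commonest` helper and the lookup names are made up for illustration and are not part of the code above.

```r
# Small hypothetical data set with the same columns as the question's data.
demo <- data.frame(
  CODE   = c("A1", "A1", "A1", NA, "C1", "C1", "C1", "C1"),
  STRING = c("broken bones", "broken bones", "bones, broken", "broken bones",
             "liver disease", "liver disease", "hepatic illness", NA),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value of x; NA if x has no non-missing values.
commonest <- function(x) {
  counts <- table(x)                      # table() ignores NA by default
  if (length(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]        # first maximum wins on ties
}

# Named lookup vectors; split() drops the rows whose grouping value is NA.
string_for_code <- sapply(split(demo$STRING, demo$CODE), commonest)
code_for_string <- sapply(split(demo$CODE, demo$STRING), commonest)

# Step 1: fill missing codes from their strings.
missing_code <- is.na(demo$CODE)
demo$CODE[missing_code] <- unname(code_for_string[demo$STRING[missing_code]])

# Step 2: replace every string with the canonical string for its code.
# This fills the missing strings and collapses the synonyms in one go.
demo$STRING <- unname(string_for_code[demo$CODE])

print(demo)
```

Whether the most frequent spelling is the right canonical form is a judgment call; a hand-written lookup table per code would work the same way in step 2.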

The data from the question in a reproducible format:

df <- structure(list(CODE = c("A1", "A1", NA, "A1", "A1", "A1", "B1", 
"B1", "B1", "B1", "B1", NA, "C1", "C1", "C1", "C1", "C1", NA), 
    STRING = c("broken bones", "broken bones", "broken bones", 
    "bones, broken", "bones, fracture", NA, "red blood cells", 
    "red blood cells", "blood cells, red", NA, "erythrocytes", 
    "broken bones", "liver disease", "liver disease", "hepatic illness", 
    NA, "disease, liver", "liver disease")), row.names = 2:19, class = "data.frame")
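
Before imputing anything, it can help to see which codes carry more than one spelling, since those are the ones where a "most common string" rule actually makes a choice. A small sketch of such a check, using a hypothetical `demo` data frame with the same two columns (the names `variants`, `n_variants` and `ambiguous` are illustrative only):

```r
# Hypothetical data with the same columns as the question's data.
demo <- data.frame(
  CODE   = c("A1", "A1", "A1", "B1", "B1"),
  STRING = c("broken bones", "bones, broken", "broken bones", "erythrocytes", NA),
  stringsAsFactors = FALSE
)

# For each code, the sorted set of distinct non-missing strings.
variants <- lapply(split(demo$STRING, demo$CODE),
                   function(x) sort(unique(x[!is.na(x)])))

# Number of distinct spellings per code, and the codes with more than one.
n_variants <- lengths(variants)
ambiguous <- names(n_variants)[n_variants > 1]

print(variants[ambiguous])
```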