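I have a list of the "actual" top 25 authors that I want to compare with a list of the predicted top 25 authors. I want to compare both the proportion of predicted authors that also appear in the "actual" list and the order of the two lists. What is the best way to do this?
Example data below: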
actual_authors <- list("Tom", "Dick", "Harry", "Edward", "Fred")
predicted_authors <- list("Ian", "Liam", "Harry", "Toby", "Tom")
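Overall, the predicted list contains 40% of the actual list, but I also want to look at the order of the lists: Harry and Tom both appear in the predicted list, but not necessarily in the same positions as in the actual list (here Harry matches, Tom does not). If possible, a % similarity would be nice.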
What is the best way to go about this?
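For reference, the 40% figure quoted above can be reproduced with a couple of lines of base R. This is only a minimal sketch of the overlap part of the question; the variable names `shared` and `overlap_pct` are made up for illustration, and it assumes no name is repeated within a list.

```r
actual_authors <- list("Tom", "Dick", "Harry", "Edward", "Fred")
predicted_authors <- list("Ian", "Liam", "Harry", "Toby", "Tom")

# Names that occur in both lists (Harry and Tom for this data).
shared <- intersect(unlist(predicted_authors), unlist(actual_authors))

# Share of the actual list that was recovered, as a percentage.
overlap_pct <- 100 * length(shared) / length(actual_authors)
overlap_pct
# 40
```

This says nothing about order yet, which is the part the question is really after.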
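Are you looking for something like this? See the inline comments for a step-by-step explanation of what the code does.
This is a very basic solution. It only works when both lists have the same length (the names are placed side by side in a data.frame), and it assumes that no name occurs more than once within a list.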
library(dplyr)
library(magrittr) #Provides the %<>% assignment pipe.
library(tidyr) #Provides pivot_longer().
#Your data.
actual_authors <- list("Tom", "Dick", "Harry", "Edward", "Fred")
predicted_authors <- list("Ian", "Liam", "Harry", "Toby", "Tom")
#Put everything into a data.frame.
df <- data.frame(act = unlist(actual_authors),
                 pre = unlist(predicted_authors),
                 stringsAsFactors = FALSE)
#Store the position of each name in a separate column.
df %<>% mutate(ord = row_number())
#Pivot the data longer to get the source of the name into a column ("cat")
#and the name itself into another ("val")
df %<>%
  pivot_longer(cols = -ord, names_to = "cat", values_to = "val")
#Group the data by the names
df %<>%
  group_by(val) %>%
  mutate(
    #Score 1 if the name appears in both lists (i.e., in both "cats").
    id = ifelse(n() == 2, 1, 0),
    #Score 1 if the name appears at the same position in both lists (i.e., same "ord").
    pos = ifelse((n_distinct(ord) == 1) & n() > 1, 1, 0)
  )
#Calculate the similarity score as the sum of the "id" and "pos" scores calculated above
#divided by the maximum possible score for the data at hand (all "id" and "pos" have a value of 1).
simscore <- ((sum(df$id) + sum(df$pos)) / (2*nrow(df)))
simscore
#0.3
#Scale by 100 to get a percentage.
simscore <- simscore*100
simscore
#30
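If you would rather see the two ingredients of the score separately, the following base R sketch may help. It is not the code above, just an alternative way to get at the same numbers: `compare_rankings` and its return fields are hypothetical names invented here, and it again assumes that no name is repeated within a list. Under that assumption the score above equals the average of the overlap share and the same-position share. Because it uses `match()` instead of a data.frame, it should also tolerate lists of different lengths, in which case the shares are relative to the length of the predicted list.

```r
# Hypothetical helper, not part of the answer above.
compare_rankings <- function(actual, predicted) {
  actual <- unlist(actual)
  predicted <- unlist(predicted)

  # Position of each predicted name in the actual list (NA if absent).
  pos_in_actual <- match(predicted, actual)
  shared <- !is.na(pos_in_actual)

  # How far each shared name sits from its position in the actual list.
  shift <- abs(pos_in_actual - seq_along(predicted))[shared]

  list(
    # Share of predicted names that occur anywhere in the actual list.
    overlap = mean(shared),
    # Share of predicted names that sit at exactly the same position.
    same_position = mean(shared & pos_in_actual == seq_along(predicted)),
    # Average displacement of the shared names (NA if nothing is shared).
    mean_shift = if (any(shared)) mean(shift) else NA_real_
  )
}

actual_authors <- list("Tom", "Dick", "Harry", "Edward", "Fred")
predicted_authors <- list("Ian", "Liam", "Harry", "Toby", "Tom")

res <- compare_rankings(actual_authors, predicted_authors)
res$overlap        # 0.4 (Harry and Tom out of 5)
res$same_position  # 0.2 (only Harry)
res$mean_shift     # 2   (Harry moved 0 places, Tom moved 4)

# Average of the two shares, matching the 0.3 computed above.
combined <- (res$overlap + res$same_position) / 2
combined
```

The `mean_shift` value is an extra that the score above does not capture: it tells you how far off the shared names are, rather than only whether they match exactly.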