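I have a data frame with multiple rows and columns. I want to write a conditional loop that finds the longest string in a column and labels it group 1, then compares its contents against the other strings in the column to look for matches. The rule is: if a string in the column matches the longest string (all of its elements already appear in it), it is labeled group 1. If it contains any new element, it is labeled as the next group.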
STR
"G,D,E","F"
"D,E,F","G"
"D,E","F"
"A,B","C"
"C","D"
"A","B"
The output should look like:
STR            Group
"G,D,E","F"    1
"D,E,F","G"    1
"D,F","E"      1
"D,E","F"      1
"A,B","C"      2
"C","D"        3
"A","B"        2
Here is a first attempt. I am assuming the data is a data frame with two columns, as it appears to be in your example. This is what I did:
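# Example data: two character columns, as in the question
str_1 <- c("G,D,E", "D,E,F", "D,E", "A,B", "C", "A")
str_2 <- c("F", "G", "F", "C", "D", "B")
str_df <- data.frame(str_1, str_2, stringsAsFactors = FALSE)

# Join both columns row by row, then split each row into its elements
merged_str <- paste(str_df[, 1], str_df[, 2], sep = ",")
str_list <- strsplit(merged_str, ",")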

# TRUE if every element of `letters` already appears in `group`
# (note: the argument name shadows R's built-in `letters` inside this function)
is_in_group <- function(letters, group) {
  all(letters %in% group)
}
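
# The first row seeds group 1; this relies on the longest row coming first
groups <- list()
groups[[1]] <- str_list[[1]]
group_vec <- rep(0, length(str_list))
group_vec[1] <- 1

# seq_along(...)[-1] is empty for a single row, unlike 2:length(str_list)
for (i in seq_along(str_list)[-1]) {
  curr_letters <- str_list[[i]]
  new_group <- TRUE
  # Assign the row to the first existing group that contains all its elements
  for (g in seq_along(groups)) {
    if (is_in_group(curr_letters, groups[[g]])) {
      group_vec[i] <- g
      new_group <- FALSE
      break
    }
  }
  # Otherwise the row starts a new group
  if (new_group) {
    groups[[length(groups) + 1]] <- curr_letters
    group_vec[i] <- length(groups)
  }
}

# Attach the result to the data frame
str_df$Group <- group_vec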
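
The loop above treats the first row as the longest one, which happens to hold for the example data. If the rows are not already ordered that way, one option is to visit them from most elements to fewest. The following is only a sketch under that assumption: `assign_groups` and `visit_order` are names introduced here for illustration, and "longest" is taken to mean the largest number of comma-separated elements.

```r
# Hypothetical helper (not part of the original answer): assigns a group to
# each row, visiting rows from the most elements to the fewest.
assign_groups <- function(str_list) {
  group_vec <- integer(length(str_list))
  groups <- list()
  # Decreasing element count; ties keep their original order
  visit_order <- order(-lengths(str_list))
  for (i in visit_order) {
    curr <- str_list[[i]]
    hit <- 0L
    # Use the first existing group that contains every element of the row
    for (g in seq_along(groups)) {
      if (all(curr %in% groups[[g]])) {
        hit <- g
        break
      }
    }
    # No existing group fits, so this row starts a new one
    if (hit == 0L) {
      groups[[length(groups) + 1]] <- curr
      hit <- length(groups)
    }
    group_vec[i] <- hit
  }
  group_vec
}

# Same example data as above
str_df <- data.frame(
  str_1 = c("G,D,E", "D,E,F", "D,E", "A,B", "C", "A"),
  str_2 = c("F", "G", "F", "C", "D", "B"),
  stringsAsFactors = FALSE
)
str_list <- strsplit(paste(str_df$str_1, str_df$str_2, sep = ","), ",")
str_df$Group <- assign_groups(str_list)
print(str_df)
```

On the example data this gives the same groups as the loop above (1, 1, 1, 2, 3, 2). The difference only shows up when a shorter row appears before the longer row that contains it.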