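I have a dataset of article titles and abstracts that I want to classify based on matching words.
"This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text"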
Topic 1    Topic 2    Topic (X)
word1      word4      word(a)
word2      word5      word(b)
word3      word6      word(c)
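Since the text above matches words under Topic 2, I want to assign that label in a new column. Preferably, this would be done with the "tidyverse" packages.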
Given the sentence as a string and the topics in a data frame, you can do something like this:
input <- c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text")
df <- data.frame(Topic1 = c("word1", "word2", "word3"), Topic2 = c("word4", "word5", "word6"))
## This splits on spaces and punctuation (only , and .)
input <- unlist(strsplit(input, " |,|\\."))
## Keep the name of every topic column that contains at least one of the input words
newcol <- paste(names(df)[apply(df, 2, function(x) any(input %in% x))], collapse = ", ")
Since I'm not sure which data frame you want to add this to, I have stored the result in a vector, newcol.
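The same matching step can be wrapped in a small function so it is easier to reuse and check. This is only a minimal sketch: `match_topics` is a hypothetical helper name that does not appear in the answer above, and it assumes an exact, case-sensitive word match after splitting on spaces, commas, and periods.

```r
# Hypothetical helper (not from the original answer): returns the names of all
# topic columns that share at least one word with `text`, joined by ", ".
match_topics <- function(text, topics) {
  # Split on runs of spaces, commas, and periods so no empty tokens are produced
  tokens <- unlist(strsplit(text, "[ ,.]+"))
  # One TRUE/FALSE per topic column: does any token appear in that column?
  hits <- vapply(topics, function(words) any(tokens %in% words), logical(1))
  paste(names(topics)[hits], collapse = ", ")
}

topics <- data.frame(Topic1 = c("word1", "word2", "word3"),
                     Topic2 = c("word4", "word5", "word6"))

match_topics("Some text. word4, word5, text", topics)  # "Topic2"
match_topics("no keywords here", topics)               # ""
```

A text with no matching words yields an empty string rather than `NA`, which is the same behaviour as the `paste(..., collapse = ", ")` call above.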
If you have a data frame of long sentences, you can use a similar approach.
inputdf <- data.frame(title = c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text", "word2", "word3, word4"))
## Split every title into words; the result is a list with one character vector per row
input <- strsplit(as.character(inputdf$title), " |,|\\.")
## For each row, collect the names of the topic columns that contain any of its words
inputdf$newcolmn <- unlist(lapply(input, function(x) paste(names(df)[apply(df, 2, function(y) any(x %in% y))], collapse = ", ")))
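Since the question mentions a preference for the tidyverse, the same idea could be sketched with dplyr and tidyr. This is an assumption-laden illustration rather than part of the original answer: it assumes dplyr (1.0 or later) and tidyr (1.0 or later) are installed, the column names `id`, `word`, and `topic` and the objects `topics`, `lookup`, `labels`, and `result` are made up for the example, and the titles are shortened stand-ins for real data.

```r
library(dplyr)
library(tidyr)

topics <- data.frame(Topic1 = c("word1", "word2", "word3"),
                     Topic2 = c("word4", "word5", "word6"),
                     stringsAsFactors = FALSE)
inputdf <- data.frame(title = c("Some text. word4, word5, text",
                                "word2",
                                "word3, word4"),
                      stringsAsFactors = FALSE)

# Long lookup table: one row per (topic, word) pair
lookup <- pivot_longer(topics, everything(),
                       names_to = "topic", values_to = "word")

# One row per word per title, joined to the lookup, then collapsed per title
labels <- inputdf %>%
  mutate(id = row_number(),
         word = strsplit(title, "[ ,.]+")) %>%
  unnest(word) %>%
  inner_join(lookup, by = "word") %>%
  distinct(id, topic) %>%
  arrange(id, topic) %>%
  group_by(id) %>%
  summarise(topic = paste(topic, collapse = ", "), .groups = "drop")

# Attach the labels back to the original rows
result <- inputdf %>%
  mutate(id = row_number()) %>%
  left_join(labels, by = "id") %>%
  select(-id)
```

One difference from the base R version: a title with no matching words gets `NA` in `topic` here, because the `left_join` finds no label row for it, whereas the `paste()` approach produces an empty string.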