Ranking tf-idf scores in an array in R

Question

I wrote the following functions to compute the tf-idf of a document:

Computing tf

tf <- function(specific_word, text){
 count = 0
 list = unlist(strsplit(text, " "))

 for(word in (list)){
  if(word == specific_word){
   count = count + 1
   }
  }
 hit_rate <- count/length(list)
 return(hit_rate)
}

Computing the idf value

idf <- function(specific_word, texts){

  times_a_word_appears <- 0
  total_number_of_documents <- length(texts)

  for(document in texts){
    list = strsplit(document, " ")
    list = unlist(list)

    for(word in list){
      if(word == specific_word){
        times_a_word_appears = times_a_word_appears + 1
        break
      }
    }

  }
  times_a_word_appears = times_a_word_appears + 1

  idf = log(total_number_of_documents/ times_a_word_appears)
  return(idf)
 }

Finally, computing tf-idf

tfidf <- function(specific_word, text, texts){

  x = tf(specific_word, text)
  y = idf(specific_word, texts)
  z = x * y

   print(paste0("The tf-idf value is: ", z))
}

I can now use it to compute tf-idf values for these documents:

document1 = c("films is a 2000 made-for-TV horror movie directed by Richard Clabaugh. The film features several cult favorite actors, including William Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy, Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean Whalen. The film concerns a genetically engineered snake, a python, that escapes and unleashes itself on a small town. It includes the classic final girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
California and Malibu, California. Python was followed by two sequels: Python II (2002) and Boa vs. Python (2004), both also made-for-TV films")

document2 = c("Python, from the Greek word, is a genus of nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are recognised.[2] A member of this genus, P. reticulatus, is among the longest snakes known.")

document3 = c("The Colt Python is a .357 Magnum caliber revolver formerly manufactured by Colt's Manufacturing Company of Hartford, Connecticut. It is sometimes referred to as a Combat Magnum It was first introduced in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy Thompson, Renee Smeets and Martin Dougherty have described the Python as the finest production revolver ever made")

texts = c(document1, document2, document3)

and find the tf-idf value of "films" in document1:

word = "films"
relevant_text = document1
tfidf(word, relevant_text, texts)

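However, what I want now is to loop over all the words in each document to find the document's highest-scoring words.

So for document 1, something like:

words = unique(unlist(strsplit(document1, " ")))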

for(word in words){
  tfidf(word, document1, texts)
  }

But the values should be stored in an array and ranked. In Python it would look something like this:

scores = {word: tfidf(word, document1, texts) for word in document1.words}
sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)

Any ideas on how to do this most efficiently in R?

Answer 1

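I suggest you take a look at the packages listed in the CRAN Task View: Natural Language Processing. Several packages cover the whole process of creating a document-term matrix, including normalization or tf-idf weighting. Their vignettes also show many downstream tasks, such as regression models, or clustering for classification, topic modeling, etc.

Below I use one of these packages, text2vec, to create a tf-idf weighted document-term matrix.

I hope this helps.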

library(text2vec)

#with your documents...
#texts = c(document1, document2, document3)

#create iterator to split texts into tokens
iterator <- itoken(texts, 
                  preprocessor = tolower, 
                  tokenizer = word_tokenizer, 
                  progressbar = FALSE)

#create the vocabulary of tokens
vocabulary <- create_vocabulary(iterator)

#combine tokens into a document term matrix
#this will be a sparse matrix (see the package "Matrix" for details)
#you might need to convert your dtm objects with as.matrix() to "normal" matrices 
#depending on your downstream task (although most packages accept sparse matrices)
#note that when converting with as.matrix(), you will lose the memory advantage of sparse matrices
dtm <- create_dtm(iterator, vocab_vectorizer(vocabulary))
str(dtm)
# Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
# ..@ i       : int [1:192] 1 2 2 0 2 2 0 1 2 0 ...
# ..@ p       : int [1:173] 0 1 2 3 4 5 6 7 8 9 ...
# ..@ Dim     : int [1:2] 3 172
# ..@ Dimnames:List of 2
# .. ..$ : chr [1:3] "1" "2" "3"
# .. ..$ : chr [1:172] "recognised" "premium" "finest" "william" ...
# ..@ x       : num [1:192] 1 1 1 1 1 1 1 1 1 1 ...
# ..@ factors : list()

#set up basic tfidf model
tfidf <- TfIdf$new()

#apply model to your dtm
dtm_tfidf <-  fit_transform(dtm, tfidf)

str(dtm_tfidf)
# Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
# ..@ i       : int [1:192] 1 2 2 0 2 2 0 1 2 0 ...
# ..@ p       : int [1:173] 0 1 2 3 4 5 6 7 8 9 ...
# ..@ Dim     : int [1:2] 3 172
# ..@ Dimnames:List of 2
# .. ..$ : chr [1:3] "1" "2" "3"
# .. ..$ : chr [1:172] "recognised" "premium" "finest" "william" ...
# ..@ x       : num [1:192] 0.01126 0.00461 0.00461 0.00322 0.00461 ...
# ..@ factors : list()
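
To get from weighted scores to the ranking the question asks for, the scores for one document need to end up in a named numeric vector that can be passed to sort(). With the matrix above, something along the lines of sort(dtm_tfidf[1, ], decreasing = TRUE) should do that for the first document, though I have not run it against these exact texts. If you would rather keep the hand-written definitions from the question, the following is a minimal base R sketch of the same idea. It assumes tf and idf are rewritten to return their values, and it uses three short made-up texts (not the documents from the question) so that it runs on its own; the names token_list, doc_tokens and sorted_scores are only illustrative.

```r
# Term frequency: share of tokens in one document that equal `word`.
tf <- function(word, tokens) {
  sum(tokens == word) / length(tokens)
}

# Inverse document frequency, keeping the "+ 1" in the denominator
# that the question's idf() uses.
idf <- function(word, token_list) {
  docs_with_word <- sum(vapply(token_list,
                               function(tokens) word %in% tokens,
                               logical(1)))
  log(length(token_list) / (docs_with_word + 1))
}

# Hypothetical stand-in texts, kept short for illustration only.
texts <- c("python films are horror films",
           "python is a genus of snakes",
           "the colt python is a revolver")

# Tokenise every document once instead of once per word.
token_list <- strsplit(texts, " ", fixed = TRUE)

# Score every unique word of the first document.
doc_tokens <- token_list[[1]]
words <- unique(doc_tokens)
scores <- vapply(words,
                 function(w) tf(w, doc_tokens) * idf(w, token_list),
                 numeric(1))

# vapply() names the result by word, so sort() gives a ranked, named vector.
sorted_scores <- sort(scores, decreasing = TRUE)
print(sorted_scores)
```

Here the named vector plays the role of the Python dictionary, and sort(..., decreasing = TRUE) plays the role of sorted(..., reverse=True). Note that with the "+ 1" in the denominator, a word that occurs in every document gets a negative idf and therefore ends up at the bottom of the ranking.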
