Remove words in R that occur only once and have a low IDF


I have a data frame with a column of text, and I want to perform three preprocessing steps:

1) Remove words that occur only once.
2) Remove words with a low inverse document frequency (IDF).
3) Remove the words that occur most frequently.

Here is an example of the data:

head(stormfront_data$stormfront_self_content)

Output:

[1] "        , ,    stormfront!  thread       members  post  introduction,     \".\"     stumbled   white networking site,    reading & decided  register  account,      largest networking site     white brothers,  sisters!    read : : guidelines  posting - stormfront introduction  stormfront - stormfront  main board consists   forums,  -forums   : newslinks & articles - stormfront ideology  philosophy - stormfront activism - stormfront       network   local level: local  regional - stormfront international - stormfront  ,  .  addition   main board   supply  social groups    utilized  networking.  final note:      steps    sustaining member,  core member      site online,   affords  additional online features. sf: shopping cart   stormfront!"
[2] "bonjour      warm  brother !   forward  speaking     !"                                                                                                                      
[3] " check   time  time   forums.      frequently    moved  columbia   distinctly  numbered.    groups  gatherings         "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4] "  !  site  pretty nice.    amount  news articles.  main concern   moment  islamification."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
[5] " , discovered  site   weeks ago.  finally decided  join   found  article  wanted  share  .   proud   race   long time    idea  site    people  shared  views existed."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
[6] "  white brothers,  names jay      member   years,        bit  info    ?    stormfront meet ups     ? stay strong guys    jay, uk"                                                                                                           

Any help would be greatly appreciated, as I am not very familiar with R.

r nlp data-cleaning tf-idf word-frequency
Answers
2 votes

Here is an approach using tidytext:

library(tidytext)
library(dplyr)
word_count <- tibble(document = seq_len(nrow(data)), text = as.character(data)) %>%
  unnest_tokens(word, text) %>%
  count(document, word, sort = TRUE)

total_count <- tibble(document = seq_len(nrow(data)), text = as.character(data)) %>%
  unnest_tokens(word, text) %>%
  group_by(word) %>%
  summarize(total = n())

words <- left_join(word_count, total_count, by = "word")

words %>%
  bind_tf_idf(word, document, n)
# A tibble: 111 x 7
   document word             n total     tf   idf tf_idf
      <int> <chr>        <int> <int>  <dbl> <dbl>  <dbl>
 1        1 stormfront      10    11 0.139  1.10  0.153 
 2        1 networking       3     3 0.0417 1.79  0.0747
 3        1 site             3     6 0.0417 0.693 0.0289
 4        1 board            2     2 0.0278 1.79  0.0498
 5        1 forums           2     3 0.0278 1.10  0.0305
 6        1 introduction     2     2 0.0278 1.79  0.0498
 7        1 local            2     2 0.0278 1.79  0.0498
 8        1 main             2     3 0.0278 1.10  0.0305
 9        1 member           2     3 0.0278 1.10  0.0305
10        1 online           2     2 0.0278 1.79  0.0498
# … with 101 more rows

From here it is easy to filter with dplyr::filter, but since you have not defined any concrete criteria beyond "occurs only once", I will leave that part to you.
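As a hedged sketch of that filtering step (the idf cutoff of 0.5 and the "drop everything tied for the maximum total" rule are assumptions on my part, since the question only pins down the occurs-once criterion), one could write:

```r
library(dplyr)
library(tidytext)

# Toy corpus standing in for the real data
docs <- tibble(document = 1:2,
               text = c("stormfront stormfront stormfront networking networking site",
                        "site brothers"))

words <- docs %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  add_count(word, wt = n, name = "total")   # corpus-wide count per word

idf_cutoff <- 0.5   # assumed threshold for "low IDF"

kept <- words %>%
  bind_tf_idf(word, document, n) %>%
  filter(total > 1,           # 1) drop words occurring only once
         idf >= idf_cutoff,   # 2) drop low-IDF words (assumed cutoff)
         total < max(total))  # 3) drop the most frequent word(s)
# here only "networking" survives all three filters
```

Note that `max(total)` is evaluated over the whole column before any rows are dropped, so words tied for the corpus-wide maximum are removed everywhere.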

Data

data <- structure(c("        , ,    stormfront!  thread       members  post  introduction,     \".\"     stumbled   white networking site,    reading & decided  register  account,      largest networking site     white brothers,  sisters!    read : : guidelines  posting - stormfront introduction  stormfront - stormfront  main board consists   forums,  -forums   : newslinks & articles - stormfront ideology  philosophy - stormfront activism - stormfront       network   local level: local  regional - stormfront international - stormfront  ,  .  addition   main board   supply  social groups    utilized  networking.  final note:      steps    sustaining member,  core member      site online,   affords  additional online features. sf: shopping cart   stormfront!", 
"bonjour      warm  brother !   forward  speaking     !", " check   time  time   forums.      frequently    moved  columbia   distinctly  numbered.    groups  gatherings         ", 
"  !  site  pretty nice.    amount  news articles.  main concern   moment  islamification.", 
" , discovered  site   weeks ago.  finally decided  join   found  article  wanted  share  .   proud   race   long time    idea  site    people  shared  views existed.", 
"  white brothers,  names jay      member   years,        bit  info    ?    stormfront meet ups     ? stay strong guys    jay, uk"
), .Dim = c(6L, 1L))

2 votes

Here are a few steps to address Q1 (removing words that occur only once).

Step 1: Clean the data, removing everything that is not alphanumeric (\\W):

data2 <- trimws(paste0(gsub("\\W+", " ", data), collapse = " "))

Step 2: Build a frequency-sorted table of the words:

fw <- as.data.frame(sort(table(strsplit(data2, "\\s{1,}")), decreasing = T))

Step 3: Define a pattern matching all words that occur only once, making sure to wrap them in word-boundary markers (\\b) so that only exact matches are removed (e.g. network but not networking):

pattern <- paste0("\\b(", paste0(fw$Var1[fw$Freq==1], collapse = "|"), ")\\b")

Step 4: Remove the matching words:

data3 <- gsub(pattern, "", data2)

Step 5: Clean up by collapsing the extra whitespace:

data4 <- trimws(gsub("\\s{1,}", " ", data3))

Result:

[1] "stormfront introduction white networking site decided networking site white brothers stormfront introduction stormfront stormfront main board forums forums articles stormfront stormfront stormfront local local stormfront stormfront main board groups networking member member site online online stormfront time time forums groups site articles main site decided time site white brothers jay member stormfront jay"

0 votes

A base R solution:

# Remove double spacing and punctuation at the start of strings: 
# cleaned_str => character vector
cstr <- trimws(gsub("\\s*[[:punct:]]+", "", trimws(gsub('\\s+|^\\s*[[:punct:]]+|"',
                    ' ', data), "both")), "both")

# Calculate the document frequency: document_freq => data.frame
document_freq <- data.frame(table(unlist(sapply(cstr, function(x){
  unique(unlist(strsplit(x, "[^a-z]+")))}))))

# Store the inverse document frequency as a vector: idf => double vector: 
document_freq$idf <- log(length(cstr)/document_freq$Freq)

# For each record remove terms that occur only once, occur the maximum number 
# of times a word occurs in the dataset, or words with a "low" idf: 
# pp_records => character vector
pp_records <- do.call("rbind", lapply(cstr, function(x){
    # Store the term and corresponding term frequency as a data.frame: tf_dataf => data.frame
    tf_dataf <- data.frame(table(as.character(na.omit(gsub("^$", NA_character_, 
                                                           unlist(strsplit(x, "[^a-z]+")))))),
                           stringsAsFactors = FALSE)

    # Store a vector containing each term's idf: idf => double vector
    tf_dataf$idf <- document_freq$idf[match(tf_dataf$Var1, document_freq$Var1)]

    # Explicitly return the preprocessed record alongside the cleaned input: => data.frame
    return(
      data.frame(
        cleaned_record = x,
        pp_records =
          paste0(unique(unlist(
            strsplit(gsub("\\s+", " ",
                          trimws(
                            gsub(paste0("\\b(",
                                        paste0(tf_dataf$Var1[tf_dataf$Freq == 1 |
                                                               tf_dataf$idf < (quantile(tf_dataf$idf, .25) - (1.5 * IQR(tf_dataf$idf))) |
                                                               tf_dataf$Freq == max(tf_dataf$Freq)],
                                               collapse = "|"), ")\\b"), "", x), "both"
                          )), "\\s")
          )), collapse = " "),
        row.names = NULL,
        stringsAsFactors = FALSE
      )
    )
  }
))

# Column bind cleaned strings with the original records: ppd_cleaned_df => data.frame 
ppd_cleaned_df <- cbind(orig_record = data, pp_records)

# Output to console: ppd_cleaned_df => stdout (console)
ppd_cleaned_df
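The "low idf" rule in the block above is a Tukey lower-fence outlier test on the idf values. As an isolated sketch (the 1.5 multiplier and the sample idf values are this answer's convention and my toy data, not something the question specifies):

```r
# Flag idf values below Q1 - 1.5 * IQR as "low"
idf <- c(0.01, 0.69, 0.69, 1.10, 1.10, 1.79)
fence <- unname(quantile(idf, .25) - 1.5 * IQR(idf))
low_idf <- idf < fence
# only the 0.01 entry falls below the fence here
```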