Keeping the document number in tidy text


When I run unnest_tokens on a list that I typed in by hand, the output includes the row number each word came from.

library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)


# test data
text <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")

# break text file into single words and list which row they are in
text_df <- tibble(text = text)

tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

The result looks like this, which is what I want.

   row_name word    
      <int> <chr>   
 1        1 furlough
 2        2 work    
 3        2 more    
 4        2 for     
 5        2 less    
 6        2 pai     
 7        3 total   
 8        3 burnout 
 9        3 and     
10        3 exhaust

But when I read the real responses from a CSV file instead:

#Import data  
 text <- read.csv("TextSample.csv", stringsAsFactors=FALSE)

and otherwise use the same code:

# break text file into single words and list which row they are in
text_df <- tibble(text = text)

tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

I get the entire token list assigned to row 1, then the whole list again assigned to row 2, and so on.

   row_name word    
      <int> <chr>   
 1        1 c       
 2        1 furlough
 3        1 work    
 4        1 more    
 5        1 for     
 6        1 less    
 7        1 pai     
 8        1 total   
 9        1 burnout 
10        1 and   

Or, if I move mutate(row_name = row_number()) to after the unnest_tokens call, I get a separate row number for every token.

   word     row_name
   <chr>       <int>
 1 c               1
 2 furlough        2
 3 work            3
 4 more            4
 5 for             5
 6 less            6
 7 pai             7
 8 total           8
 9 burnout         9
10 and            10

What am I missing?

r row-number tidytext unnest
1 Answer

I think that when you use text <- read.csv("TextSample.csv", stringsAsFactors=FALSE), text is a data frame, whereas when you type the values in by hand it is a character vector.
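The difference can be checked directly. The sketch below is only an illustration: it writes a temporary CSV as a stand-in for TextSample.csv, and the column name `response` is an assumption, since the question does not show the real file. It also suggests where the stray `c` token comes from: calling as.character on a whole data frame deparses each column into a single string that begins with `c(`, and that one string is then recycled onto every row.

```r
# Hypothetical stand-in for TextSample.csv; the column name "response" is assumed.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("response",
    "furloughs",
    "Working MORE for less pay",
    "total burnout and exhaustion"),
  csv_path
)

text_vec <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")
text_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

class(text_vec)  # "character"
class(text_csv)  # "data.frame"

# as.character() on a data frame returns one deparsed string per column,
# so every response ends up inside a single string.
collapsed <- as.character(text_csv)
length(collapsed)
print(collapsed)
```

Because that single string is repeated for each row, unnest_tokens emits the full token list once per row, which matches the output in the question.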

In the CSV case, if you change the code to text_df <- tibble(text = text$col_name), which selects the column from the data frame (a vector), I think you should get the same result as before.
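A minimal sketch of that fix, under the same assumptions as the question's code: the temporary CSV and the column name `response` are hypothetical placeholders for TextSample.csv and its real column, so substitute your own names. The mutate_all(as.character) step is dropped here because the column is already character when stringsAsFactors = FALSE.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical stand-in for TextSample.csv; the column name "response" is assumed.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("response",
    "furloughs",
    "Working MORE for less pay",
    "total burnout and exhaustion"),
  csv_path
)
responses <- read.csv(csv_path, stringsAsFactors = FALSE)

# Pull out the column (a character vector) before building the tibble,
# so that each response gets its own row and its own row number.
tidy_text <- tibble(text = responses$response) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

print(tidy_text)
```

With the row number assigned before unnest_tokens, each token keeps the number of the response it came from.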
