When I run unnest_tokens on a list I typed in by hand, the output includes the row number each word came from.
library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)
# test data
text <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")
# break text into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
The result looks like this, which is what I want.
row_name word
<int> <chr>
1 1 furlough
2 2 work
3 2 more
4 2 for
5 2 less
6 2 pai
7 3 total
8 3 burnout
9 3 and
10 3 exhaust
But when I try to read the real responses from a csv file, the result is different.
#Import data
text <- read.csv("TextSample.csv", stringsAsFactors=FALSE)
Otherwise I use the same code:
# break text file into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
I get the entire list of tokens assigned to row 1, then the whole list again assigned to row 2, and so on.
row_name word
<int> <chr>
1 1 c
2 1 furlough
3 1 work
4 1 more
5 1 for
6 1 less
7 1 pai
8 1 total
9 1 burnout
10 1 and
Or, if I move mutate(row_name = row_number()) to after the unnest_tokens call, I get a separate row number for every token.
word row_name
<chr> <int>
1 c 1
2 furlough 2
3 work 3
4 more 4
5 for 5
6 less 6
7 pai 7
8 total 8
9 burnout 9
10 and 10
What am I missing?
I think that if you use text <- read.csv("TextSample.csv", stringsAsFactors=FALSE), then text is a data frame, whereas if you type it in by hand it is a vector.
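A quick way to check this is to compare the two objects directly. The sketch below uses only base R; the temporary file and the column name `response` are made up for illustration and stand in for TextSample.csv, whose actual column name I don't know.

```r
# Typed in by hand: a plain character vector
typed <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")

# Hypothetical stand-in for TextSample.csv: write the same text to a
# temporary csv with an assumed column name `response`, then read it back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(response = typed), tmp, row.names = FALSE)
from_csv <- read.csv(tmp, stringsAsFactors = FALSE)

is.character(typed)              # the hand-typed object is a vector
is.data.frame(from_csv)          # read.csv() returns a data frame
is.character(from_csv$response)  # the column inside it is the vector
```

Running str() on the object you get from read.csv should show the same thing for the real file, including the actual column name.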
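If, in the csv case, you change the code to text_df <- tibble(text = text$col_name), where col_name is the name of the text column in your csv, you select that column from the data frame (which is a vector), and I think you should get the same result as before.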
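Here is a minimal sketch of that fix. It builds a small data frame in place of the csv so it runs on its own; the names `responses` and `response` are assumptions, and the stemming step is left out to keep the example short. As far as I can tell, the stray "c" token in your output appears because as.character() turns the whole data frame column into a single string like c("furloughs", ...), which is then repeated on every row.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for read.csv("TextSample.csv", stringsAsFactors = FALSE);
# the column name `response` is an assumption
responses <- data.frame(
  response = c("furloughs", "Working MORE for less pay", "total burnout and exhaustion"),
  stringsAsFactors = FALSE
)

# Pull the column out as a character vector before building the tibble
text_df <- tibble(text = responses$response)

tidy_text <- text_df %>%
  mutate(row_name = row_number()) %>%  # number the responses before tokenizing
  unnest_tokens(word, text)            # one row per word, row_name carried along

tidy_text
```

Because row_name is assigned before unnest_tokens, each word keeps the number of the response it came from.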