Pairing the gsub function with a text file for corpus cleaning

Question (votes: 2, answers: 1)

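I have mined a large number of tweets and am trying to analyze them. I have the tweets in a data frame where each cell holds the content of one tweet (e.g. "i love san francisco" and "proud member of the air force"). However, when I analyze the text in a network visualization, there are words in each bio that should be combined. I also want to combine common two-word phrases (e.g. "new york", "san francisco", and "air force"). I have compiled a list of the terms that need to be combined and used gsub to combine some of them with the line of code below: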

twitterdata_cleaning$bio = gsub('air force','airforce',twitterdata_cleaning$bio)

The line above turns "proud member of the air force" into "proud member of the airforce". I have been able to do this successfully for dozens of two-word phrases.

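However, I have hundreds of two-word phrases in the bios and want to keep better track of them, so I moved all of the terms into two columns in an Excel file. I would like a way to apply the formula above using a txt or Excel file: find the terms in the data frame that match the first column of the file and change them to the corresponding terms in the second column.

For example, I have xlsx and txt files that look like this: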

    **column1**          **column2**
    san francisco        sanfrancisco
    new york             newyork
    las vegas            lasvegas
    san diego            sandiego
    new hampshire        newhampshire
    good bye             goodbye
    air force            airforce
    video game           videogame
    high school          school
    middle school        school
    elementary school    school

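I would like to use the gsub command in a formula that searches the data frame for all the terms in column1 and converts them to the terms in column2, using something like the following: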

twitterdata_df$tweet = gsub('textfile$column1','textfile$column2',twitterdata_df$tweet)

to get something like this in the cells:

i love sanfrancisco
can not wait to go to newyork
what happens in lasvegas stays there
at the beach in sandiego
can beat the autumn leave in newhampshire
so done with all the drama goodbye
proud member of the airforce
love this videogame so much
playing at the school tonight 
so sick of school
school was the best and i miss it

Any help would be greatly appreciated.

r text text-files gsub data-cleaning
1 answer (3 votes)

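General solution

You can do this by passing a named vector to str_replace_all() from the stringr package. In my example, df has a column of old values, each of which is replaced with the matching new value. I assume this is what you mean by keeping track of them in an Excel file.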

library(stringr)

df <- data.frame(old = c("five", "six", "seven"),
                 new = as.character(5:7),
                 stringsAsFactors = FALSE)

text <- c("I am a vector with numbers six and other text five",
          "another vector seven six text five")

str_replace_all(text, setNames(df$new, df$old))

Result:

[1] "I am a vector with numbers 6 and other text 5"
[2] "another vector 7 6 text 5"
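For readers who would rather stay with gsub, as in the question, the same lookup-table idea can be sketched in base R by looping over the rows of the table. This is only an illustration under the same example data; replace_from_table() is a hypothetical helper name, and fixed = TRUE is an assumption that the terms should be matched as literal strings rather than regular expressions.

```r
# Same lookup table and text as in the stringr example above
df <- data.frame(old = c("five", "six", "seven"),
                 new = as.character(5:7),
                 stringsAsFactors = FALSE)

text <- c("I am a vector with numbers six and other text five",
          "another vector seven six text five")

# Hypothetical helper: apply one gsub() per row of the lookup table
replace_from_table <- function(x, old, new) {
  for (i in seq_along(old)) {
    # fixed = TRUE matches old[i] literally, not as a regular expression
    x <- gsub(old[i], new[i], x, fixed = TRUE)
  }
  x
}

result <- replace_from_table(text, df$old, df$new)
print(result)
```

The replacements are applied one after another in table order, so a term that appears inside an earlier replacement's output would be replaced again.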

Specific example

Data

Read in the text file with the replacements.

textfile <- read.csv(textConnection("column1, column2
san francisco, sanfrancisco
new york, newyork
las vegas, lasvegas
san diego, sandiego
new hampshire, newhampshire
good bye, goodbye
air force, airforce
video game, videogame
high school, school
middle school, school
elementary school, school"), stringsAsFactors = FALSE)
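The textConnection() call above stands in for a real file. A minimal sketch of reading the same table from disk might look like the following; the path is hypothetical (a temporary file is written first so the snippet is self-contained), and the file is assumed to be comma-separated with a header row. For an .xlsx file, a reader such as readxl::read_excel() could play the same role.

```r
# Hypothetical lookup file; a temporary CSV stands in for the real one
path <- tempfile(fileext = ".csv")
writeLines(c("column1,column2",
             "san francisco,sanfrancisco",
             "new york,newyork",
             "high school,school"), path)

# Read the two-column lookup table as plain character columns
lookup <- read.csv(path, stringsAsFactors = FALSE)
str(lookup)
```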

Load the data frame with the tweets in the tweet column.

twitterdata_df <- data.frame(id = 1:11)
twitterdata_df$tweet <- c("i love san francisco",
                          "can not wait to go to new york",
                          "what happens in las vegas stays there",
                          "at the beach in san diego",
                          "can beat the autumn leave in new hampshire",
                          "so done with all the drama goodbye",
                          "proud member of the air force",
                          "love this video game so much",
                          "playing at the high school tonight",
                          "so sick of middle school",
                          "elementary school was the best and i miss it")

Replacement

twitterdata_df$tweet2 <- str_replace_all(twitterdata_df$tweet, setNames(textfile$column2, textfile$column1))
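One caveat may be worth a sketch: str_replace_all() treats the names of the vector as regular expressions, and plain substring matching can also hit inside longer words. The base R example below is an illustration, not part of the original answer. It wraps each term in \b word boundaries; the second sample sentence and the helper name replace_whole_terms() are made up for the example, and the terms are assumed to contain no regex metacharacters.

```r
# Two lookup rows taken from the table above
old <- c("air force", "new york")
new <- c("airforce", "newyork")

# Made-up sample text: "fair forces" contains "air force" as a substring
x <- c("proud member of the air force", "the fair forces of new york")

# Hypothetical helper, not a library function
replace_whole_terms <- function(x, old, new) {
  for (i in seq_along(old)) {
    # \\b anchors the match at a word boundary on both sides
    x <- gsub(paste0("\\b", old[i], "\\b"), new[i], x, perl = TRUE)
  }
  x
}

plain   <- gsub("air force", "airforce", x, fixed = TRUE)  # substring match
bounded <- replace_whole_terms(x, old, new)                # whole terms only
print(plain)
print(bounded)
```

With plain substring matching the second sentence becomes "the fairforces of new york"; with word boundaries "fair forces" is left alone and only "new york" is merged.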

Result

As you can see, the replacements are made in tweet2.

   id                                        tweet                                     tweet2
1   1                         i love san francisco                       i love  sanfrancisco
2   2               can not wait to go to new york             can not wait to go to  newyork
3   3        what happens in las vegas stays there      what happens in  lasvegas stays there
4   4                    at the beach in san diego                  at the beach in  sandiego
5   5   can beat the autumn leave in new hampshire can beat the autumn leave in  newhampshire
6   6           so done with all the drama goodbye         so done with all the drama goodbye
7   7                proud member of the air force              proud member of the  airforce
8   8                 love this video game so much               love this  videogame so much
9   9           playing at the high school tonight             playing at the  school tonight
10 10                     so sick of middle school                         so sick of  school
11 11 elementary school was the best and i miss it          school was the best and i miss it
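The doubled space in front of each replacement in tweet2 (for example "i love  sanfrancisco") comes from the lookup data: the example lines have a space after each comma, and read.csv() keeps that space as part of the column2 value by default. A minimal sketch of one way to avoid it, using two rows of the table, is to read with strip.white = TRUE; calling trimws() on the column afterwards would be an alternative.

```r
raw <- "column1, column2
san francisco, sanfrancisco
air force, airforce"

# Default: the space after each comma stays in the character value
loose <- read.csv(text = raw, stringsAsFactors = FALSE)

# strip.white = TRUE trims leading and trailing whitespace from unquoted fields
tight <- read.csv(text = raw, stringsAsFactors = FALSE, strip.white = TRUE)

print(loose$column2)
print(tight$column2)

# With the trimmed value the replacement no longer adds an extra space
fixed_tweet <- gsub(tight$column1[1], tight$column2[1],
                    "i love san francisco", fixed = TRUE)
print(fixed_tweet)
```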