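I have collected a large number of tweets and am trying to analyze them. The tweets are in a data frame in which each cell holds the text of one tweet (for example "i love san francisco" and "proud member of the air force"). However, when I analyze the text in a network visualization, each bio contains words that should be combined. I also want to merge common two-word phrases (for example "new york", "san francisco", and "air force"). I have compiled a list of the terms that need to be merged and have used gsub to combine some of them with the following line of code: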
twitterdata_cleaning$bio = gsub('air force','airforce',twitterdata_cleaning$bio)
The line above converts "proud member of the air force" to "proud member of the airforce". I have done this successfully for dozens of two-word phrases.
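However, the bios contain hundreds of two-word phrases and I want to keep better track of them, so I moved all of the terms into two columns in an Excel file. I would like a way to apply the formula above using a txt or Excel file: find every term in the data frame that matches an entry in the first column of the file and change it to the corresponding entry in the second column.
For example, I have xlsx and txt files that look like this: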
column1            column2
san francisco      sanfrancisco
new york           newyork
las vegas          lasvegas
san diego          sandiego
new hampshire      newhampshire
good bye           goodbye
air force          airforce
video game         videogame
high school        school
middle school      school
elementary school  school
I would like to use gsub in a formula that searches the data frame for every term in column 1 and converts it to the matching term in column 2, using something like this:
twitterdata_df$tweet = gsub('textfile$column1','textfile$columnb',twitterdata_df$tweet)
so that the cells end up looking like this:
i love sanfrancisco
can not wait to go to newyork
what happens in lasvegas stays there
at the beach in sandiego
can beat the autumn leave in newhampshire
so done with all the drama goodbye
proud member of the airforce
love this videogame so much
playing at the school tonight
so sick of school
school was the best and i miss it
Any help would be greatly appreciated.
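You can do this by passing a named vector to str_replace_all() from the stringr package. In my example, df has one column with the old values, which are replaced by the values in the new column. I assume this is what you mean by keeping track of them in an Excel file.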
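library(stringr)
# Lookup table: each value in `old` is replaced by the matching value in `new`.
df <- data.frame(old = c("five", "six", "seven"),
                 new = as.character(5:7),
                 stringsAsFactors = FALSE)
text <- c("I am a vector with numbers six and other text five",
          "another vector seven six text five")
# The names of the vector are the patterns; its values are the replacements.
str_replace_all(text, setNames(df$new, df$old))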
Result:
[1] "I am a vector with numbers 6 and other text 5" "another vector 7 6 text 5"
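Since the question asks about gsub, here is a minimal base-R sketch of roughly the same idea: loop over the lookup table and apply one gsub call per row. The helper name replace_terms and the data frame name lookup are made up for this illustration, and fixed = TRUE is an assumption that the terms should be matched literally rather than as regular expressions.

```r
# Hypothetical helper: apply each old -> new replacement in turn with gsub.
replace_terms <- function(x, old, new) {
  for (i in seq_along(old)) {
    # fixed = TRUE treats old[i] as a literal string, not a regex
    x <- gsub(old[i], new[i], x, fixed = TRUE)
  }
  x
}

lookup <- data.frame(old = c("five", "six", "seven"),
                     new = as.character(5:7),
                     stringsAsFactors = FALSE)
text <- c("I am a vector with numbers six and other text five",
          "another vector seven six text five")

result <- replace_terms(text, lookup$old, lookup$new)
print(result)
```

Because the replacements are applied one after another, a later term can match text produced by an earlier replacement, so the order of the rows can matter for overlapping terms.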
Data
Read in the text file with the replacements.
# strip.white = TRUE drops the space after each comma, so column2 holds
# "sanfrancisco" rather than " sanfrancisco".
textfile <- read.csv(textConnection("column1, column2
san francisco, sanfrancisco
new york, newyork
las vegas, lasvegas
san diego, sandiego
new hampshire, newhampshire
good bye, goodbye
air force, airforce
video game, videogame
high school, school
middle school, school
elementary school, school"), stringsAsFactors = FALSE, strip.white = TRUE)
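In practice the replacement list would live in a file on disk rather than in an inline string. The sketch below writes a small comma-separated file to a temporary path and reads it back; the path, the variable names lookup_path and lookup, and the three sample rows are placeholders for this illustration, not your actual file.

```r
# Write a small stand-in lookup file to a temporary location.
lookup_path <- tempfile(fileext = ".csv")
writeLines(c("column1, column2",
             "san francisco, sanfrancisco",
             "air force, airforce",
             "high school, school"),
           lookup_path)

# strip.white = TRUE removes the blank after each comma in the data rows.
lookup <- read.csv(lookup_path, stringsAsFactors = FALSE, strip.white = TRUE)
print(lookup)

# For an .xlsx file, readxl::read_excel(path) could be used instead,
# assuming the readxl package is installed.
```

The resulting data frame can then be used in place of textfile in the replacement step.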
Load a data frame with the tweets in the tweet column.
twitterdata_df <- data.frame(id = 1:11)
twitterdata_df$tweet <- c("i love san francisco",
"can not wait to go to new york",
"what happens in las vegas stays there",
"at the beach in san diego",
"can beat the autumn leave in new hampshire",
"so done with all the drama goodbye",
"proud member of the air force",
"love this video game so much",
"playing at the high school tonight",
"so sick of middle school",
"elementary school was the best and i miss it")
Replacement
twitterdata_df$tweet2 <- str_replace_all(twitterdata_df$tweet, setNames(textfile$column2, textfile$column1))
Result
As you can see, the replacements were made in tweet2.
   id tweet                                         tweet2
1  1  i love san francisco                          i love sanfrancisco
2  2  can not wait to go to new york                can not wait to go to newyork
3  3  what happens in las vegas stays there         what happens in lasvegas stays there
4  4  at the beach in san diego                     at the beach in sandiego
5  5  can beat the autumn leave in new hampshire    can beat the autumn leave in newhampshire
6  6  so done with all the drama goodbye            so done with all the drama goodbye
7  7  proud member of the air force                 proud member of the airforce
8  8  love this video game so much                  love this videogame so much
9  9  playing at the high school tonight            playing at the school tonight
10 10 so sick of middle school                      so sick of school
11 11 elementary school was the best and i miss it  school was the best and i miss it
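One thing to watch for: the names of the vector passed to str_replace_all() are treated as regular expressions, so a term can also match inside longer words. If that matters for your data, a possible variation is to wrap each term in \\b word boundaries. The sketch below uses a made-up two-row lookup table and two made-up tweets purely to show the difference; it is an optional refinement, not part of the approach above.

```r
library(stringr)

# Hypothetical lookup table and tweets, for illustration only.
lookup <- data.frame(old = c("air force", "new york"),
                     new = c("airforce", "newyork"),
                     stringsAsFactors = FALSE)
tweets <- c("proud member of the air force",
            "the chair forced me to stand")

# Plain terms: "air force" also matches inside "chair forced".
plain <- str_replace_all(tweets, setNames(lookup$new, lookup$old))

# Terms wrapped in \\b only match whole words.
bounded_patterns <- setNames(lookup$new, paste0("\\b", lookup$old, "\\b"))
bounded <- str_replace_all(tweets, bounded_patterns)

print(plain)
print(bounded)
```

With the boundaries in place the second tweet is left unchanged, while the first is still converted to "proud member of the airforce".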