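I am trying to count occurrences of certain words in an RDD of strings. I am about halfway there, but the result is not quite what I am looking for.
I am working with wine reviews, for example: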
var aa = dataset.map(c => c(2))
Array[String] = Array("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, "Ripe aromas of fig, "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, "This spent 20 months in 30% new French oak, "This is the top wine from La Bégude, "Deep,
I am trying to count how many of the words in a list appear in each review:
var positive_list = List("tremendously", "delicious")
var sum = 0
var rr = aa.map(column =>
  for (i <- positive_list) yield {
    if (column.contains(i)) {
      sum = sum + 1
      (column, sum)
    } else {
      (column, 0)
    }
  })
rr.take(50)
Result:
Array(List(("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate,0), ("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate,0)), List(("Ripe aromas of fig,0), ("Ripe aromas of fig,0)), List(("Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious,1), ("Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious,2))
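As you can see, I get duplicated entries that I do not need. I know this happens because yield returns a result on every iteration of the loop, but I cannot remove it, or I end up with nothing in the list.
Is there anything I can do?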
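Your for loop creates one record for every element of positive_list. I assume you want to map each review to the number of positive words it contains (so that there is only one record per review). You can do this by calling count on positive_list: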
var rr=aa.map(column => column -> positive_list.count(column.contains))
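To make the difference concrete, here is a minimal sketch of the same idea on a plain Scala Seq rather than an RDD, so it can be run without Spark. The sample reviews are shortened from the question, and the names reviews, positiveList and counts are only illustrative assumptions; the map and count calls should look the same on an RDD[String].

```scala
// Hypothetical sample data standing in for the review column; a plain Seq is
// used instead of a Spark RDD so the sketch runs without a cluster.
val reviews = Seq(
  "This tremendous 100% varietal wine hails from Oakville",
  "Ripe aromas of fig",
  "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious"
)
val positiveList = List("tremendously", "delicious")

// One (review, count) pair per review: count how many of the positive words
// occur in the review as substrings.
val counts = reviews.map(review => review -> positiveList.count(w => review.contains(w)))

counts.foreach { case (review, n) => println(s"$n\t$review") }
```

Note that contains is a substring test, so "tremendous" does not match "tremendously", and each word in the list is counted at most once per review. The sketch also avoids the mutable sum from the question: a var updated inside an RDD closure is generally not reliable on a cluster, because each task works on its own copy of the variable.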