Spark Scala: how do I avoid getting duplicated arrays from a for loop with an embedded if-else?


I am trying to count certain words in an RDD of strings. I am about halfway there, but the result is not quite what I am looking for.

I am working with wine reviews, for example:

var aa = dataset.map(c => c(2))

Array[String] = Array("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, "Ripe aromas of fig, "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, "This spent 20 months in 30% new French oak, "This is the top wine from La Bégude, "Deep,

I am trying to count how many of the words in a list appear in each review:

var positive_list= List( "tremendously","delicious")
var sum=0

var rr=aa.map(column =>
                 for (i <- positive_list) yield { 
                    if(column.contains(i)){
                      sum=sum+1
                      (column,sum)
                    } else {
                      (column,0)
                    }
                 })

rr.take(50)

Result:

Array(List(("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate,0), ("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate,0)), List(("Ripe aromas of fig,0), ("Ripe aromas of fig,0)), List(("Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious,1), ("Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious,2))

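As you can see, every review comes back as a list with one duplicated entry per word, which I do not need. I know this happens because yield returns a result on every iteration of the loop, but I cannot remove it, otherwise I end up with nothing in the list.

Is there anything I can do?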

arrays scala apache-spark rdd
1 answer

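Your for loop creates one record for every element of positive_list. I assume you want to map each review to the number of positive words it contains, so that there is only one record per review. You can do that by calling count on positive_list: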

var rr=aa.map(column => column -> positive_list.count(column.contains))
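
To see the difference without a cluster, here is a minimal sketch that uses plain Scala collections in place of the RDD. The object name PositiveWordCount, the helpers perWord and perReview, and the three sample reviews are assumptions made for illustration; they are not part of the original code. The same functions could be passed to map on the RDD.

```scala
object PositiveWordCount {
  // Hypothetical sample data standing in for the review column `aa`.
  val reviews: Seq[String] = Seq(
    "Ripe aromas of fig",
    "made by his mother in this tremendously delicious",
    "This tremendous 100% varietal wine hails from Oakville"
  )

  val positiveList: List[String] = List("tremendously", "delicious")

  // for/yield over positiveList is a map over positiveList, so it always
  // returns one tuple per word: this is where the duplicates come from.
  def perWord(review: String): List[(String, Int)] =
    for (word <- positiveList)
      yield (review, if (review.contains(word)) 1 else 0)

  // count collapses the list into a single Int, so each review
  // produces exactly one tuple.
  def perReview(review: String): (String, Int) =
    review -> positiveList.count(word => review.contains(word))

  def main(args: Array[String]): Unit = {
    println("One entry per word:")
    reviews.map(perWord).foreach(println)

    println("One entry per review:")
    reviews.map(perReview).foreach(println)
  }
}
```

Using count also removes the need for the shared sum variable. In Spark, a var captured by the function passed to map is copied to each executor, so incrementing it inside the closure is not a reliable way to accumulate a total, even if it appears to work in local mode.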