Removing stop words with Scala


I need to remove stop words from a column of my DataFrame, but I am not getting the expected result. Here is the code:

//input 
val inputDF = Seq(("test1 ab ac"),("test2 ab"),("test3 ab ac ad")).toDF("input")

//stopwords - sample only; the real list has more than 100 entries, including multi-word phrases
val stopwords = Array( "ab", "ab ac", "ab ac cd")

//Scala Function to remove stop words
val removeStopwords = spark.udf.register("removeStopwords", (input: String) => {
  var result = input
  val lowercaseStopwords = stopwords.map(_.toLowerCase())  
  lowercaseStopwords.foreach(stopword => {
    result = result.replaceAll("(?i)\\b" + stopword + "\\b", "").trim()  
  })
  result
})

val outputDF = inputDF.withColumn("output", removeStopwords($"input"))
outputDF.show(false)

The current output is:

| Input          | Output       |
| -------------  | ------------ |
| test1 ab ac    | test1  ac    |
| test2 ab       | test2        |
| test3 ab ac ad | test3  ac ad |

However, the expected output is:

| Input          | Output       |
| -------------  | ------------ |
| test1 ab ac    | test1        |
| test2 ab       | test2        |
| test3 ab ac ad | test3        |

Can you help me solve this?
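
One likely cause: the loop removes the stop words in array order, so the single word "ab" is stripped first and the longer phrase "ab ac" can no longer match what is left. A minimal sketch of one possible fix follows. It builds a single regex with the longest phrases tried first and collapses leftover whitespace. The object name `StopwordRemoval` is hypothetical, and the stop word list is the sample from the question. The sketch is plain Scala with no Spark dependency, so the function can be checked on its own.

```scala
import java.util.regex.Pattern

// Hypothetical helper object; the name is not from the original post.
object StopwordRemoval {
  // Sample stop words copied from the question.
  val stopwords: Seq[String] = Seq("ab", "ab ac", "ab ac cd")

  // One alternation, longest phrase first, so "ab ac" is tried before "ab".
  // Pattern.quote escapes any regex metacharacters inside a stop word.
  private val pattern: Pattern = {
    val alternatives = stopwords
      .sortBy(-_.length)
      .map(Pattern.quote)
      .mkString("|")
    Pattern.compile("\\b(?:" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE)
  }

  // Removes every stop word match, then collapses runs of whitespace.
  def removeStopwords(input: String): String =
    if (input == null) null
    else pattern.matcher(input).replaceAll("").replaceAll("\\s+", " ").trim
}
```

In Spark this could presumably be wrapped with `org.apache.spark.sql.functions.udf`, for example `val removeStopwordsUdf = udf(StopwordRemoval.removeStopwords _)`, and applied as `inputDF.withColumn("output", removeStopwordsUdf($"input"))`. With the sample list, the first two rows become `test1` and `test2`. The third row becomes `test3 ad`, because "ad" is not covered by any listed stop word; getting `test3` there would require an entry such as "ab ac ad" in the list, which is an assumption about the intended data.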

scala apache-spark user-defined-functions stop-words scala-spark