I need to remove stop words from my DataFrame, but I am not getting the expected result. Here is my code:
```scala
// Input DataFrame
val inputDF = Seq(("test1 ab ac"), ("test2 ab"), ("test3 ab ac ad")).toDF("input")

// Stop words - only a sample is shown here; in real use there are more than 100
val stopwords = Array("ab", "ab ac", "ab ac cd")

// Scala UDF to remove stop words
val removeStopwords = spark.udf.register("removeStopwords", (input: String) => {
  var result = input
  val lowercaseStopwords = stopwords.map(_.toLowerCase())
  lowercaseStopwords.foreach(stopword => {
    result = result.replaceAll("(?i)\\b" + stopword + "\\b", "").trim()
  })
  result
})

val outputDF = inputDF.withColumn("output", removeStopwords($"input"))
outputDF.show(false)
```
Actual output:

| Input | Output |
| ------------- | ------------ |
| test1 ab ac | test1 ac |
| test2 ab | test2 |
| test3 ab ac ad | test3 ac ad |
But the expected output is:
| Input | Output |
| ------------- | ------------ |
| test1 ab ac | test1 |
| test2 ab | test2 |
| test3 ab ac ad | test3 |
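I suspect the problem is the replacement order: because `"ab"` appears before `"ab ac"` in the array, the single word `"ab"` is stripped first, so the longer phrase `"ab ac"` can never match afterwards. A minimal sketch of that idea in plain Scala (no Spark, and the object/method names are mine, not from my real code), sorting the stop words longest-first before replacing:

```scala
// Sketch: replace the longest stop phrases first, so "ab ac"
// is removed before "ab" alone gets a chance to break it up.
object StopwordOrderSketch {
  val stopwords: Array[String] = Array("ab", "ab ac", "ab ac cd")

  def remove(input: String): String = {
    // Sort descending by length so multi-word phrases match first.
    // Note: this still assumes stop words contain no regex metacharacters;
    // otherwise they would need Pattern.quote.
    stopwords.sortBy(-_.length).foldLeft(input) { (acc, sw) =>
      acc.replaceAll("(?i)\\b" + sw + "\\b", "").trim()
    }
  }

  def main(args: Array[String]): Unit = {
    println(remove("test1 ab ac")) // prints "test1"
    println(remove("test2 ab"))    // prints "test2"
  }
}
```

Even with this ordering, though, `"test3 ab ac ad"` would still keep `"ad"`, since `"ad"` on its own is not in the stop word list, so I am not sure this alone gets me to the expected output.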
Can you help me fix this?