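I am trying to use dropDuplicates together with a watermark. The problem is that the watermark does not seem to clear the state. My code is: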
import org.apache.log4j.LogManager
import org.apache.spark.sql.{DataFrame, SparkSession}

object RateResource {
  def main(args: Array[String]): Unit = {
    @transient lazy val log = LogManager.getRootLogger
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("RateResource")
      .getOrCreate()
    import spark.implicits._

    // The built-in "rate" source emits (timestamp, value) rows.
    val rateData: DataFrame = spark.readStream.format("rate").load()
    val transData = rateData
      .select(
        $"timestamp" as "wtimestamp",
        $"value",
        $"value" % 1000 % 100 % 10 as "key",
        $"value" % 1000 % 100 / 10 % 2 as "dkey")
      .where("key=0")

    // Deduplicate on (dkey, wtimestamp); the watermark bounds the dedup state.
    val selectData = transData
      .withWatermark("wtimestamp", "20 seconds")
      .dropDuplicates("dkey", "wtimestamp")

    val query = selectData.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()
    query.awaitTermination()
  }
}
And the input records are:
2017-08-09 10:00:10,10
2017-08-09 10:00:20,20
2017-08-09 10:00:30,10
2017-08-09 10:00:10,10
2017-08-09 11:00:30,40
2017-08-09 10:00:10,10
The first "2017-08-09 10:00:10,10" is output, but the second "2017-08-09 10:00:10,10", sent 10 seconds later, is not output.
-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:10|10 |0.0|1.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 2
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 3
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:20|20 |0.0|0.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 4
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 5
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:30|10 |0.0|1.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 6
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 7
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 8
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 11:00:30|40 |0.0|0.0 |
+-------------------+-----+---+----+
I know that a window with a watermark drops state based on the max event time, but for dropDuplicates I don't understand how the state is cleared.
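The dropDuplicates operator clears its state using the watermark. In your code, the watermark defined before dropDuplicates has a delay of 20 seconds. Spark therefore keeps the data between the current max event time and 20 seconds before it, which means each incoming record is compared with the data from the last 20 seconds, and older state is cleared. Note that records whose event time is older than the watermark are themselves dropped as late data, which is why the repeated "2017-08-09 10:00:10,10" produces no output even after its state has been removed.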
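To make the eviction rule concrete, here is a small Spark-free model of it in plain Scala. It is only a sketch under simplifying assumptions, not Spark's actual implementation: every batch holds one record, the watermark is recomputed at the end of each batch as (max event time seen so far) minus 20 seconds, and eviction uses the new watermark immediately. The names `DedupModel`, `Record`, `State`, `processBatch`, `run` and `sample` are hypothetical, and the `dkey` values are taken from the console output in the question.

```scala
import java.time.LocalDateTime

// Simplified model of watermark-bounded deduplication (hypothetical names,
// not Spark code). State holds the (wtimestamp, dkey) pairs still remembered.
object DedupModel {
  final case class Record(ts: LocalDateTime, dkey: Int)
  final case class State(
      seen: Set[Record],
      watermark: Option[LocalDateTime],
      maxTs: Option[LocalDateTime])

  val delaySeconds = 20L

  def processBatch(state: State, batch: Seq[Record]): (State, Seq[Record]) = {
    // 1. Records at or below the current watermark are late and are dropped.
    val onTime = batch.filter(r => state.watermark.forall(w => r.ts.isAfter(w)))
    // 2. Emit only records whose key is not already in the state.
    val emitted = onTime.distinct.filterNot(state.seen.contains)
    // 3. Advance the max event time (all input counts) and the watermark.
    val maxTs = (state.maxTs.toSeq ++ batch.map(_.ts))
      .reduceOption((a, b) => if (a.isAfter(b)) a else b)
    val newWatermark = maxTs.map(_.minusSeconds(delaySeconds))
    // 4. Evict state entries that are no longer newer than the watermark.
    val kept = (state.seen ++ emitted)
      .filter(r => newWatermark.forall(w => r.ts.isAfter(w)))
    (State(kept, newWatermark, maxTs), emitted)
  }

  // Runs all batches in order; returns the final state and per-batch output.
  def run(batches: Seq[Seq[Record]]): (State, Seq[Seq[Record]]) =
    batches.foldLeft((State(Set.empty, None, None), Vector.empty[Seq[Record]])) {
      case ((st, out), batch) =>
        val (next, emitted) = processBatch(st, batch)
        (next, out :+ emitted)
    }

  private def at(s: String): LocalDateTime = LocalDateTime.parse(s)

  // The six input records from the question, one per batch.
  val sample: Seq[Seq[Record]] = Seq(
    Seq(Record(at("2017-08-09T10:00:10"), 1)),
    Seq(Record(at("2017-08-09T10:00:20"), 0)),
    Seq(Record(at("2017-08-09T10:00:30"), 1)),
    Seq(Record(at("2017-08-09T10:00:10"), 1)),
    Seq(Record(at("2017-08-09T11:00:30"), 0)),
    Seq(Record(at("2017-08-09T10:00:10"), 1))
  )
}
```

Calling `DedupModel.run(DedupModel.sample)` in this model emits the 10:00:10, 10:00:20, 10:00:30 and 11:00:30 records once each and nothing for the two repeated 10:00:10 records: by the time they arrive the watermark has already reached 10:00:10, so they are discarded as late rather than compared against state. After the 11:00:30 record the watermark jumps to 11:00:10 and every older entry is evicted, leaving a single key in state.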