I am using Spark SQL to convert a duration to seconds. This is what I tried in Athena, and it works fine.
Athena code:
`SELECT regexp_extract(duration, '^P(?!$)(\d+(?:\.\d+)?Y)?(\d+(?:\.\d+)?M)?(\d+(?:\.\d+)?W)?(\d+(?:\.\d+)?D)?(T(?=\d)(\d+(?:\.\d+)?H)?(\d+(?:\.\d+)?M)?(\d+(?:\.\d+)?S)?)?$') as sec
FROM table 1 where duration = 'PT0.46S'`
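Even with the backslashes escaped, the same code returns null in Spark SQL. I also tried putting an "r" prefix in front of the select statement, but that did not work either.
What do I need to change in my approach to make it work in Spark SQL?
Spark SQL code: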
> spark.sql("SELECT regexp_extract(duration,
> '^P(?!$)(\\\\d+(?:\\\\.\\\\d+)?Y)?(\\\\d+(?:\\\\.\\\\d+)?M)?(\\\\d+(?:\\\\.\\\\d+)?W)?(\\\\d+(?:\\\\.\\\\d+)?D)?(T(?=\\\\d)(\\\\d+(?:\\\\.\\\\d+)?H)?(\\\\d+(?:\\\\.\\\\d+)?M)?(\\\\d+(?:\\\\.\\\\d+)?S)?)?$')
> as sec FROM table where duration = 'PT0.46S'").show()
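It looks like you only need to specify the right capture group index, which for your regex is 8 (the seconds component). Athena's two-argument `regexp_extract` returns the whole match, whereas Spark's `regexp_extract` defaults to group 1 when no index is given, and group 1 (the years component) does not match anything in 'PT0.46S':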
import org.apache.spark.sql.{Row, functions => F}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Same regex as in the question. Assumes an existing SparkSession named `spark` (e.g. in spark-shell).
val pattern = "^P(?!$)(\\d+(?:\\.\\d+)?Y)?(\\d+(?:\\.\\d+)?M)?(\\d+(?:\\.\\d+)?W)?(\\d+(?:\\.\\d+)?D)?(T(?=\\d)(\\d+(?:\\.\\d+)?H)?(\\d+(?:\\.\\d+)?M)?(\\d+(?:\\.\\d+)?S)?)?$"

// Sample durations
val data = Seq(
  ("good", "PT0.46S"),
  ("good", "P23DT23H")
)
val schema = StructType(Seq(
  StructField("Id", StringType, nullable = false),
  StructField("Val", StringType, nullable = false)
))
val rowsRDD = spark.sparkContext.parallelize(data.map { case (id, value) => Row(id, value) })

// Create DataFrame
var df = spark.createDataFrame(rowsRDD, schema)
// Group 8 is the seconds component, e.g. "0.46S"
df = df.withColumn("Extracted", F.regexp_extract(F.col("Val"), pattern, 8))

// Show DataFrame
df.show()
The result is:
+----+--------+---------+
|  Id|     Val|Extracted|
+----+--------+---------+
|good| PT0.46S|    0.46S|
|good|P23DT23H|         |
+----+--------+---------+
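
For the original goal of getting a number of seconds, the extracted components still have to be stripped of their unit letter and scaled. The following is only a rough sketch of that step in plain Python `re`, run outside Spark, to show which group index holds which component. The function name `duration_to_seconds` and the `SECONDS_PER_UNIT` table are made up for illustration, and the fixed lengths for years (365 days) and months (30 days) are assumptions, since calendar years and months do not have a fixed number of seconds. Also note that Python returns `None` for a group that did not participate in the match, whereas Spark's `regexp_extract` returns an empty string in that case.

```python
import re

# Same pattern as in the question, written as a Python raw string.
DURATION_RE = re.compile(
    r"^P(?!$)(\d+(?:\.\d+)?Y)?(\d+(?:\.\d+)?M)?(\d+(?:\.\d+)?W)?(\d+(?:\.\d+)?D)?"
    r"(T(?=\d)(\d+(?:\.\d+)?H)?(\d+(?:\.\d+)?M)?(\d+(?:\.\d+)?S)?)?$"
)

# Capture group index -> seconds per unit.
# Group 5 is the whole "T..." time part, so it is skipped here.
# ASSUMPTION (not from the original post): 1 year = 365 days, 1 month = 30 days.
SECONDS_PER_UNIT = {
    1: 365 * 86400,  # Y (years, assumed length)
    2: 30 * 86400,   # M (months, assumed length)
    3: 7 * 86400,    # W (weeks)
    4: 86400,        # D (days)
    6: 3600,         # H (hours)
    7: 60,           # M (minutes)
    8: 1,            # S (seconds)
}


def duration_to_seconds(duration):
    """Return the duration in seconds, or None if the string does not match."""
    m = DURATION_RE.match(duration)
    if m is None:
        return None
    total = 0.0
    for idx, factor in SECONDS_PER_UNIT.items():
        part = m.group(idx)  # e.g. "0.46S", or None if the component is absent
        if part is not None:
            total += float(part[:-1]) * factor  # drop the trailing unit letter
    return total


m = DURATION_RE.match("PT0.46S")
print(m.group(0))  # PT0.46S  (whole match, what Athena returned)
print(m.group(1))  # None     (years group, which is Spark's default index)
print(m.group(8))  # 0.46S    (seconds group)
print(duration_to_seconds("PT0.46S"))   # 0.46
print(duration_to_seconds("P23DT23H"))  # 2070000.0
```

In Spark itself the same idea would mean calling `regexp_extract` once per group index and summing the scaled values, so this sketch is mainly useful for checking the group numbering before writing the SQL.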