Regular expression works in Athena but not in Spark SQL, even after escaping characters


I am using Spark SQL to convert a duration to seconds. This is what I tried in Athena, and it works well.

Athena code:

`SELECT regexp_extract(duration, '^P(?!$)(\d+(?:\.\d+)?Y)?(\d+(?:\.\d+)?M)?(\d+(?:\.\d+)?W)?(\d+(?:\.\d+)?D)?(T(?=\d)(\d+(?:\.\d+)?H)?(\d+(?:\.\d+)?M)?(\d+(?:\.\d+)?S)?)?$') as sec 
FROM table 1 where duration = 'PT0.46S'`

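Output: 0.46S

In Spark SQL the same code returns null, even when the backslashes are escaped with an extra "\". I also tried putting an "r" prefix in front of the select statement, but that did not work either.

What do I need to change in my approach to make it work in Spark SQL?

Spark SQL code: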

> spark.sql("SELECT regexp_extract(duration,
> '^P(?!$)(\\\\d+(?:\\\\.\\\\d+)?Y)?(\\\\d+(?:\\\\.\\\\d+)?M)?(\\\\d+(?:\\\\.\\\\d+)?W)?(\\\\d+(?:\\\\.\\\\d+)?D)?(T(?=\\\\d)(\\\\d+(?:\\\\.\\\\d+)?H)?(\\\\d+(?:\\\\.\\\\d+)?M)?(\\\\d+(?:\\\\.\\\\d+)?S)?)?$')
> as sec FROM table where duration = 'PT0.46S'").show()

Output: null

regex apache-spark pyspark apache-spark-sql amazon-athena
1 Answer

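It looks like you only need to specify the correct capture group index. When the index is omitted, Spark's `regexp_extract` defaults to group 1 (the optional years group here), which does not take part in the match for `PT0.46S`. The seconds are captured by group 8: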

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{functions => F}

// Groups 1-4 capture Y, M, W, D; group 5 is the whole time part; groups 6-8 capture H, M, S
val pattern = "^P(?!$)(\\d+(?:\\.\\d+)?Y)?(\\d+(?:\\.\\d+)?M)?(\\d+(?:\\.\\d+)?W)?(\\d+(?:\\.\\d+)?D)?(T(?=\\d)(\\d+(?:\\.\\d+)?H)?(\\d+(?:\\.\\d+)?M)?(\\d+(?:\\.\\d+)?S)?)?$"

// Sample durations
val data = Seq(
    ("good", "PT0.46S"),
    ("good", "P23DT23H")
)

val schema = StructType(Seq(
    StructField("Id", StringType, nullable = false),
    StructField("Val", StringType, nullable = false)
))

val rowsRDD = spark.sparkContext.parallelize(data.map { case (id, date) => Row(id, date) })

// Create the DataFrame and extract capture group 8 (the seconds component)
val df = spark.createDataFrame(rowsRDD, schema)
    .withColumn("Extracted", F.regexp_extract(F.col("Val"), pattern, 8))

// Show DataFrame
df.show()

The result is:

+----+--------+---------+
|  Id|     Val|Extracted|
+----+--------+---------+
|good| PT0.46S|    0.46S|
|good|P23DT23H|         |
+----+--------+---------+
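
To check the group numbering without a cluster, the capture groups can be listed with Python's built-in `re` module. This is only a sketch: Spark uses the Java regex engine rather than Python's, but both number capturing groups by their opening parenthesis and skip non-capturing groups and lookaheads, so the indexes should line up for this pattern. The helper names `groups_for` and `build_query` and the table name `durations` are made up for illustration. `build_query` only produces the SQL text that could be passed to `spark.sql(...)`, and it assumes Spark's default handling of string literals, where a backslash has to reach the SQL parser doubled.

```python
import re

# Same pattern as in the question, written as a Python raw string.
PATTERN = (
    r"^P(?!$)(\d+(?:\.\d+)?Y)?(\d+(?:\.\d+)?M)?(\d+(?:\.\d+)?W)?(\d+(?:\.\d+)?D)?"
    r"(T(?=\d)(\d+(?:\.\d+)?H)?(\d+(?:\.\d+)?M)?(\d+(?:\.\d+)?S)?)?$"
)


def groups_for(duration):
    """Return {group index: text} for every group that took part in the match."""
    m = re.match(PATTERN, duration)
    if m is None:
        return {}
    return {i: g for i, g in enumerate(m.groups(), start=1) if g is not None}


def build_query(table="durations"):
    """Build the Spark SQL text; `durations` is a hypothetical table name."""
    # Assumption: default Spark SQL literal parsing, which unescapes
    # backslashes, so every regex backslash is doubled in the SQL text.
    sql_pattern = PATTERN.replace("\\", "\\\\")
    return (
        f"SELECT regexp_extract(duration, '{sql_pattern}', 8) AS sec "
        f"FROM {table} WHERE duration = 'PT0.46S'"
    )


print(groups_for("PT0.46S"))   # {5: 'T0.46S', 8: '0.46S'}
print(groups_for("P23DT23H"))  # {4: '23D', 5: 'T23H', 6: '23H'}
print(build_query())
```

Group 1 is absent from both dictionaries, which is consistent with the default index returning nothing useful for these inputs, while group 8 holds the seconds whenever a seconds component is present.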