I have a query that runs fine in Presto (when executed in Athena). However, when I run the same query in AWS Glue (via a Python spark.sql DataFrame), I get the error "AnalysisException: Column 'metric_value' does not exist. Did you mean one of the following? []; line 10 pos 30" (the column definitely exists). The error only appears when the cross-join line in the code below is present. If I take that line out, the Glue job runs fine as well (and the full query works fine in Athena/Presto).
An example value in the metric_value column is {14311, 242342134, 13132}.
Here is the query:
```python
df1 = spark.sql("""
    SELECT * FROM database.table
    WHERE date(load_date) = current_date - interval '1' day
""")
df1.createOrReplaceTempView("table1")

df2 = spark.sql("""
    SELECT load_date, load_hr, timestamp,
           ne_name, object, metric_value,
           MIN(CAST((CASE WHEN trim(split_part) = '' THEN NULL ELSE trim(split_part) END) AS DOUBLE))/100 AS MIN_UTIL,
           MAX(CAST((CASE WHEN trim(split_part) = '' THEN NULL ELSE trim(split_part) END) AS DOUBLE))/100 AS MAX_UTIL,
           AVG(CAST((CASE WHEN trim(split_part) = '' THEN NULL ELSE trim(split_part) END) AS DOUBLE))/100 AS AVG_UTIL
    FROM table1
    CROSS JOIN UNNEST(CAST(SPLIT(REGEXP_REPLACE(metric_value, '[{}]', ''), ',') AS ARRAY<VARCHAR(132)>)) AS t(split_part)
    WHERE metric_name = 'util'
    GROUP BY load_date, load_hr, timestamp, ne_name, object, metric_value
""")
```
Expected result:
| load_date | load_hr | timestamp | ne_name | object | metric_value | MIN_UTIL | MAX_UTIL | AVG_UTIL |
|---|---|---|---|---|---|---|---|---|
| 2023-08-10 | 15 | 2023-08-10T09:45 | AP1 | AP1.12.5 | {14311, 242342134, 13132} | 3.4 | 29.1 | 15.8 |
The fix is to remove the cross-join line entirely, move that logic into df1, and swap the UNNEST function for Spark's EXPLODE function. The new code below executes in PySpark SQL on AWS Glue.
```python
df1 = spark.sql("""
    SELECT *,
           EXPLODE(CAST(SPLIT(REGEXP_REPLACE(metric_value, '[{}]', ''), ',') AS ARRAY<VARCHAR(132)>)) AS raw_value
    FROM database.table
    WHERE date(load_date) = current_date - interval '1' day
""")
df1.createOrReplaceTempView("table1")

df2 = spark.sql("""
    SELECT load_date, load_hr, timestamp,
           ne_name, object, metric_value,
           MIN(CAST((CASE WHEN trim(raw_value) = '' THEN NULL ELSE trim(raw_value) END) AS DOUBLE))/100 AS MIN_UTIL,
           MAX(CAST((CASE WHEN trim(raw_value) = '' THEN NULL ELSE trim(raw_value) END) AS DOUBLE))/100 AS MAX_UTIL,
           AVG(CAST((CASE WHEN trim(raw_value) = '' THEN NULL ELSE trim(raw_value) END) AS DOUBLE))/100 AS AVG_UTIL
    FROM table1
    WHERE metric_name = 'util'
    GROUP BY load_date, load_hr, timestamp, ne_name, object, metric_value
""")
```