Unnest 和 split 函数在 pyspark SQL 中返回错误

问题描述 投票:0回答:1

我有一个在 Presto 格式下运行良好的查询(在 Athena 中执行时)。但是,当我在 AWS Glue 中运行相同的查询(通过 Python Spark.SQL 数据帧)时,收到一条错误消息“AnalysisException:列 'metric_value' 不存在。您是指以下之一吗?[]; 第 10 行 pos 30”(它肯定存在)。仅当下面代码中存在交叉连接线时,才会出现此错误。如果我把这条线拿出来(在 athena/presto 中工作得很好),胶水工作也运行得很好。

metric_value 列中的值示例为 {14311, 242342134, 13132}

以下是查询:

df1 = spark.sql("""
select * FROM database.table
WHERE date(load_date) = current_date - interval '1' day
    """)

df1.createOrReplaceTempView("table1")
    
df2 = spark.sql("""  
SELECT load_date, load_hr, timestamp,
ne_name, object, metric_value, 
MIN(CAST((CASE WHEN trim(split_part) = '' THEN NULL ELSE trim(split_part) END) AS DOUBLE))/100 AS MIN_UTIL, 
MAX(CAST((CASE WHEN trim(split_part) = '' THEN NULL ELSE trim(split_part) END) AS DOUBLE))/100 AS 
MAX_UTIL, AVG(CAST((CASE WHEN trim(split_part) = '' THEN NULL ELSE trim(split_part) 
END) AS DOUBLE))/100 AS AVG_UTIL
FROM table1
cross join UNNEST(CAST(SPLIT(REGEXP_REPLACE(metric_value, '[{}]',''), ',') AS ARRAY<VARCHAR(132)>)) AS t(split_part)
WHERE metric_name = 'util'
group by load_date, load_hr, timestamp, ne_name, object_id, metric_value
""")

预期结果:

加载_日期 加载_小时 时间戳 ne_name 物体 度量值 Min_UTIL MAX_UTIL AVG_UTIL
2023-08-10 15 2023-08-10T09:45 AP1 AP1.12.5 {14311, 242342134, 13132} 3.4 29.1 15.8
sql apache-spark-sql aws-glue amazon-athena
1个回答
0
投票

解决此问题的方法是完全取出交叉连接线并将其添加到 df1,将“unnest”函数更改为“explode”函数。下面的新代码在 pyspark sql AWS Glue 中执行。

 df1 = spark.sql("""
 select *, EXPLODE(CAST(SPLIT(REGEXP_REPLACE(metric_value, '[{}]',''), ',') AS ARRAY<VARCHAR(132)>)) AS raw_value
 FROM database.table
 WHERE date(load_date) = current_date - interval '1' day
""")

df1.createOrReplaceTempView("table1")

df2 = spark.sql("""  
SELECT load_date, load_hr, timestamp,
ne_name, object, metric_value, 
MIN(CAST((CASE WHEN trim(raw_value) = '' THEN NULL ELSE trim(raw_value) END) AS DOUBLE))/100 AS MIN_UTIL, 
MAX(CAST((CASE WHEN trim(raw_value) = '' THEN NULL ELSE trim(raw_value) END) AS DOUBLE))/100 AS 
MAX_UTIL, AVG(CAST((CASE WHEN trim(raw_value) = '' THEN NULL ELSE trim(raw_value)END) AS DOUBLE))/100 AS AVG_UTIL
FROM table1
WHERE metric_name = 'util'
group by load_date, load_hr, timestamp, ne_name, object_id, metric_value
""")
© www.soinside.com 2019 - 2024. All rights reserved.