I'm running into a baffling array index error,
Index 1 out of bounds for length 1
that I can't explain, since nothing in the context of an AWS MWAA + EMR Serverless PySpark SQL query appears to reference any array.
Here is the query I'm running:
INSERT INTO TABLE dev.aggregates
PARTITION (p_date='2024-03-03')
BY NAME
SELECT '13x1' AS dimensions
FROM dev.other_aggregates
LIMIT 3
Interestingly, if I modify the query to drop the FROM clause, it executes successfully:
INSERT INTO TABLE dev.aggregates
PARTITION (p_date='2024-03-03')
BY NAME
SELECT '13x1' AS dimensions
Here is the definition of the target table I'm inserting into, from show create table dev.aggregates:
CREATE EXTERNAL TABLE `dev.aggregates`(
`dimensions` string COMMENT '')
PARTITIONED BY (
`p_date` varchar(11) COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://airflow-dev/tables/aggregates/parquet'
TBLPROPERTIES (
" 'parquet.compression'='SNAPPY', "
" 'projection.enabled'='true', "
" 'projection.p_date.format'='yyyy-MM-dd', "
" 'projection.p_date.interval'='1', "
" 'projection.p_date.interval.unit'='DAYS', "
" 'projection.p_date.range'='2023-12-01,NOW', "
" 'projection.p_date.type'='date', "
'storage.location.template'='s3://airflow-dev/tables/aggregates/parquet/p_date=${p_date}')
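For reference, the p_date partitions on this table come from Athena partition projection rather than explicit ADD PARTITION calls, and projected partitions are resolved by Athena at query time without being written to the Glue/Hive metastore. A quick way to list whatever partitions the Spark side has actually registered (a sketch only; it assumes the usual SparkSession handle named spark, which is not how the job runner names it):

from pyspark.sql import SparkSession

# Sketch: list the partitions registered in the Glue/Hive metastore for this
# table. Partitions that exist only through Athena partition projection are
# computed by Athena at query time and will not appear here.
# Table name taken from the DDL above.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SHOW PARTITIONS dev.aggregates").show(truncate=False)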
Here is more of the stack trace:
File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 66, in <module>
run_spark_commands(spark, sql_commands)
File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 41, in run_spark_commands
spark_runner.sql(command_string)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o101.sql.
: java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
at java.base/java.util.Objects.checkIndex(Objects.java:361)
at java.base/java.util.ArrayList.get(ArrayList.java:427)
at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:881)
at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:873)
at org.apache.hadoop.hive.ql.metadata.Partition.getSpec(Partition.java:416)
at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1245)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitions$4(HiveClientImpl.scala:832)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.Iterator.foreach(Iterator.scala:943)
...
Thanks in advance for any leads.
We may have been hitting several issues, but the last piece of the puzzle turned out to involve the WHERE clause (which I had left out of the examples above):
INSERT INTO table1
SELECT ...
FROM ...
WHERE ...
AND id1 IN (SELECT id2 FROM table2 WHERE dt = '2024-03-08')
...
For whatever reason, Spark did not see table2's dt = '2024-03-08' partition, even though we could query it through Athena in DataGrip (possibly related to the table1 projection). So we explicitly re-added the partition before the INSERT query in Spark SQL, and it worked:
ALTER TABLE table2 ADD IF NOT EXISTS
PARTITION (dt='2024-03-08')
LOCATION 's3://airflow-dev/tables/table2/dt=2024-03-08';
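For completeness, here is roughly how the workaround slots into the PySpark job (a minimal sketch; the SparkSession setup, the insert_statement placeholder, and the sequencing are illustrative, while the ALTER TABLE itself is the statement we actually added):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Register the partition explicitly so Spark's Hive/Glue client can see it.
spark.sql("""
    ALTER TABLE table2 ADD IF NOT EXISTS
    PARTITION (dt='2024-03-08')
    LOCATION 's3://airflow-dev/tables/table2/dt=2024-03-08'
""")

# With the partition registered, run the original INSERT ... SELECT
# (the query with the IN (SELECT id2 FROM table2 ...) filter shown above).
insert_statement = "..."  # the unchanged INSERT query from above
spark.sql(insert_statement)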