Baffling PySpark SQL array index error: Index 1 out of bounds for length 1

Problem description

I'm seeing a baffling array index error,

Index 1 out of bounds for length 1

...that I can't explain, because I don't see any relevant array being referenced in the context of an AWS MWAA + EMR Serverless PySpark SQL query.

Here is the query I'm running:

INSERT INTO TABLE dev.aggregates
  PARTITION (p_date='2024-03-03')
  BY NAME
  SELECT '13x1' AS dimensions
  FROM dev.other_aggregates
  LIMIT 3

Interestingly, if I modify the query to drop the FROM clause, it executes successfully:

INSERT INTO TABLE dev.aggregates
  PARTITION (p_date='2024-03-03')
  BY NAME
  SELECT '13x1' AS dimensions

Here is the definition of the target table I'm inserting into, from show create table dev.aggregates:

CREATE EXTERNAL TABLE `dev.aggregates`(
  `dimensions` string COMMENT '')
PARTITIONED BY ( 
  `p_date` varchar(11) COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://airflow-dev/tables/aggregates/parquet'
TBLPROPERTIES (
"  'parquet.compression'='SNAPPY', "
"  'projection.enabled'='true', "
"  'projection.p_date.format'='yyyy-MM-dd', "
"  'projection.p_date.interval'='1', "
"  'projection.p_date.interval.unit'='DAYS', "
"  'projection.p_date.range'='2023-12-01,NOW', "
"  'projection.p_date.type'='date', "
  'storage.location.template'='s3://airflow-dev/tables/aggregates/parquet/p_date=${p_date}')

Here is more of the stack trace:

  File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 66, in <module>
    run_spark_commands(spark, sql_commands)
  File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 41, in run_spark_commands
    spark_runner.sql(command_string)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o101.sql.
: java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
    at java.base/java.util.Objects.checkIndex(Objects.java:361)
    at java.base/java.util.ArrayList.get(ArrayList.java:427)
    at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:881)
    at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:873)
    at org.apache.hadoop.hive.ql.metadata.Partition.getSpec(Partition.java:416)
    at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1245)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitions$4(HiveClientImpl.scala:832)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
...
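Reading the trace, the exception is thrown from Hive's Table.createSpec while Spark enumerates the table's partitions (HiveClientImpl.getPartitions), not from anything in the SELECT itself. A quick check I can run (a minimal sketch, assuming the same SparkSession the job runner uses and the table names from above) is to see which partitions Spark itself reports for the tables involved:

# Sketch only: run in the same PySpark session that issues the INSERT.
# `spark` is assumed to be the active SparkSession from the job runner.
for table in ("dev.aggregates", "dev.other_aggregates"):
    try:
        # SHOW PARTITIONS lists the partitions Spark/Hive knows about, which
        # can differ from what Athena's partition projection exposes.
        spark.sql(f"SHOW PARTITIONS {table}").show(truncate=False)
    except Exception as exc:
        # Non-partitioned tables (or other metastore issues) raise here.
        print(f"Could not list partitions for {table}: {exc}")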

Thanks in advance for any clues.

pyspark mwaa emr-serverless
1 Answer

We may have been hitting more than one issue, but the last piece of the puzzle involved the WHERE clause (which I neglected to include in the examples above):

INSERT INTO table1
  SELECT ...
  FROM ...
  WHERE ...
    AND id1 IN (SELECT id2 FROM table2 WHERE dt = '2024-03-08')
  ...

For whatever reason, Spark did not see table2's dt = '2024-03-08' partition, even though we could query it through Athena in DataGrip (possibly related to the table1 projection). So we explicitly re-added the partition before the INSERT query in the Spark SQL, and that made it work.

ALTER TABLE table2 ADD IF NOT EXISTS
  PARTITION (dt='2024-03-08')
  LOCATION 's3://airflow-dev/tables/table2/dt=2024-03-08';
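For completeness, here is a minimal sketch of how the workaround fits into the PySpark job; the table name, date, and S3 location are the placeholders from above, and the SparkSession is assumed to be created with Hive catalog support as in the original job:

from pyspark.sql import SparkSession

# Assumed setup: a Hive-enabled session, as the EMR Serverless job provides.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Register the partition Spark could not see, then run the INSERT
# (the INSERT statement itself is the one shown earlier in this answer).
spark.sql("""
    ALTER TABLE table2 ADD IF NOT EXISTS
      PARTITION (dt='2024-03-08')
      LOCATION 's3://airflow-dev/tables/table2/dt=2024-03-08'
""")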